This presentation can be distributed under a Creative Commons License.

Page 1:

Image: xkcd.com

Dependable Cloud Architecture

@mikewo

Mike Wood

http://mvwood.com

Page 2:

Questions

@mikewo

Mike Wood

http://mvwood.com

Thank you

Page 3:

“Failure is always an option.”

Image: Discovery Channel, Fair Use

Page 4:

What are we looking for?

Protection From:
• Hardware Failure
• Data Corruption
• Network Failure
• Loss of Facilities

Check out: http://bit.ly/wazbizcont

Images: Office ClipArt & Godzilla Releasing Corp (Fair Use)

Page 5:

Image: FOX, Fair Use

Human Error

Page 6:

What we’re trying to achieve

1. Monitoring
2. Resilient Solutions

Image: Cohdra

Page 7:

Image: Office ClipArt

Cost vs Risk

99.999% $1, … ,000.00

To get more 9’s in the availability figure, add more 0’s to the cost.

Page 8:

Image: NASA

Monitoring

Page 9:

Functional Transparency

Image: Office ClipArt

Logging Messages

Hardware Health

Dependent Services Health

Page 10:

Telemetry

Page 11:

Image: NASA

Analyze your Data

Page 12:

Resilience

Image: Office ClipArt

Page 13:

Remember: Failure is always an option.

Common Points of Failure
• Machine/application crashes
• Throttling (exceeding capacity)
• Connectivity/Network
• External service dependencies

Focus less on the uptime of hardware and more on how the solution handles it WHEN something fails!

Page 14:

Try/catch != Resilient

    // Requires: using System; using System.Diagnostics; using System.IO;
    private void createFile()
    {
        string fileName = @"c:\workingDirectory\someFileName.txt";
        try
        {
            // Dispose the returned FileStream so the file handle is released.
            File.Create(fileName).Dispose();
        }
        catch (DirectoryNotFoundException ex)
        {
            // Logging and rethrowing reports the failure; it does not recover from it.
            Trace.WriteLine(String.Format("Unable to create {0}. {1}",
                fileName, ex));
            throw;
        }
    }
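For contrast with the snippet above, here is a minimal sketch of what a more resilient version might look like. It is not from the slides: the `Resilience.WithRetry` helper, its parameters, and the choice to retry on `IOException` with a doubling delay are all assumptions made for illustration, and the file is written under the temp directory rather than `c:\workingDirectory` so the sketch can run anywhere.

```csharp
using System;
using System.IO;
using System.Threading;

public static class Resilience
{
    // Hypothetical helper (an assumption, not part of the deck): runs the action,
    // retrying on IOException with a doubling delay. Once maxAttempts is reached
    // the exception filter no longer matches, so the last exception propagates.
    public static void WithRetry(Action action, int maxAttempts, TimeSpan initialDelay)
    {
        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                action();
                return;
            }
            catch (IOException) when (attempt < maxAttempts)
            {
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }

    // Example use: repair the likely cause (a missing directory) and then retry
    // the file creation instead of only logging and rethrowing.
    public static string CreateFileResiliently()
    {
        string directory = Path.Combine(Path.GetTempPath(), "workingDirectory");
        string fileName = Path.Combine(directory, "someFileName.txt");

        WithRetry(() =>
        {
            Directory.CreateDirectory(directory); // no-op if it already exists
            File.Create(fileName).Dispose();      // release the handle immediately
        }, maxAttempts: 3, initialDelay: TimeSpan.FromMilliseconds(100));

        return fileName;
    }
}
```

`DirectoryNotFoundException` derives from `IOException`, so the failure handled in the original snippet is covered here too. Whether retrying is appropriate depends on the failure: it helps with transient faults and does nothing for permanent ones.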

Page 15:

Image: Michael Wood

Decompose your system…

Page 16:

Capacity Buffering

Content Delivery Networks (CDNs)

Distributed Application Cache

Local Content Cache

Enables recovery during outages or spikes in load

Image: jepler

Page 17:

Always carry a spare

75% Capacity, half of our load
75% Capacity, half of our load

100% of load, 150% Capacity: 50% more capacity than needed
• Can absorb temporary spikes
• Time to react if we need to add capacity

0% Capacity, redirect all load: over-allocated, but still functioning
• Degrade, but don’t fail

SYSTEM FAILURE!!!
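The capacity arithmetic on this slide can be sketched as a small calculation. This is an illustration only: the `CapacityPlan` class, the `Status` names, and the rule that utilization above 100% means "degraded" are assumptions, not something the deck defines. Load and capacity are expressed as fractions of the total expected load (1.0 = 100%).

```csharp
using System;
using System.Linq;

public static class CapacityPlan
{
    public enum Status { Healthy, Degraded, Failed }

    // Utilization = load divided by the capacity that is still available.
    // With no capacity left, utilization is treated as infinite.
    public static double Utilization(double load, double[] nodeCapacities)
    {
        double total = nodeCapacities.Sum();
        return total <= 0.0 ? double.PositiveInfinity : load / total;
    }

    // Healthy: capacity covers the load. Degraded: overloaded but still serving.
    // Failed: nothing left to serve the load.
    public static Status Classify(double load, double[] nodeCapacities)
    {
        double utilization = Utilization(load, nodeCapacities);
        if (double.IsPositiveInfinity(utilization)) return Status.Failed;
        return utilization <= 1.0 ? Status.Healthy : Status.Degraded;
    }
}
```

With two nodes at 0.75 each, the full load is classified as healthy with headroom to spare. With one node at 0 and the other at 0.75, the survivor is over-allocated and the system is degraded. With both at 0, it has failed.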

Image: Kevin Rosseel

Page 18:

Request Buffering

Image: Joe Shlabotnik

Queues

Retry Policies

Async Workloads
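As a rough illustration of request buffering, the sketch below places a bounded in-process queue between the caller and a background worker. It is an assumption-laden stand-in: `RequestBuffer`, `ProcessAll`, and the `handler` parameter are hypothetical names, and .NET's `BlockingCollection<T>` is used only to show the shape of the pattern. A cloud solution would more likely use a durable queue service.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class RequestBuffer
{
    // Hypothetical sketch: producers enqueue requests and a single background
    // worker drains them asynchronously, in arrival order.
    public static List<string> ProcessAll(
        IEnumerable<string> requests, int bufferSize, Func<string, string> handler)
    {
        var results = new List<string>(); // written only by the worker task
        using (var buffer = new BlockingCollection<string>(boundedCapacity: bufferSize))
        {
            Task worker = Task.Run(() =>
            {
                foreach (string request in buffer.GetConsumingEnumerable())
                {
                    try
                    {
                        results.Add(handler(request));
                    }
                    catch (Exception)
                    {
                        // Record the failure and keep draining, so one bad
                        // request cannot stall everything queued behind it.
                        results.Add("failed:" + request);
                    }
                }
            });

            foreach (string request in requests)
            {
                buffer.Add(request); // blocks while the buffer is full (back-pressure)
            }

            buffer.CompleteAdding(); // lets the worker finish once the queue is empty
            worker.Wait();
        }
        return results;
    }
}
```

The bounded capacity is the design choice worth noting: when the worker falls behind, producers slow down instead of the queue growing without limit.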

Page 19:

Dept. of Redundancy Dept.

Have a backup, somewhere else
More than one? Cost to benefit ratio?

Ready State
Hot = full capacity
Warm = scaled down, but ready to grow
Cold = mothballed, starts from zero

Image: Mr. White

Page 20:

Redundancy - It’s about probability

Each box: 95% uptime (assuming the boxes fail independently)

1 box : 5% downtime or 438hrs per year (that’s about 18 ¼ days!)

2 boxes : 5/100 * 5/100 = 25/10,000 = 0.25% downtime or 22hrs per year

4 boxes : 5/100 * 5/100 * 5/100 * 5/100 = 625/100,000,000 = 0.000625% downtime or 3.285 MINUTES per year
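The arithmetic on this slide can be reproduced with a few lines of code. The sketch below assumes, as the slide's multiplication implies, that boxes fail independently and that the system is down only when every box is down at once. The `RedundancyMath` class and its method names are hypothetical.

```csharp
using System;

public static class RedundancyMath
{
    private const double HoursPerYear = 365 * 24; // 8760

    // Probability that all boxes are down at the same time, assuming
    // independent failures: (1 - uptime) raised to the number of boxes.
    public static double CombinedDowntimeFraction(double uptimePerBox, int boxes)
    {
        if (boxes < 1) throw new ArgumentOutOfRangeException(nameof(boxes));
        return Math.Pow(1.0 - uptimePerBox, boxes);
    }

    public static double DowntimeHoursPerYear(double uptimePerBox, int boxes)
    {
        return CombinedDowntimeFraction(uptimePerBox, boxes) * HoursPerYear;
    }

    // Prints the expected yearly downtime for 1, 2 and 4 boxes at 95% uptime.
    public static void Demo()
    {
        foreach (int boxes in new[] { 1, 2, 4 })
        {
            double hours = DowntimeHoursPerYear(0.95, boxes);
            Console.WriteLine("{0} box(es): {1:F3} hours ({2:F3} minutes) per year",
                boxes, hours, hours * 60);
        }
    }
}
```

Real failures are often correlated (shared power, network, or a bad deployment), so these figures are a best case rather than a guarantee.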

Page 21:

Total Outage duration =
Time to Detect
+ Time to Diagnose
+ Time to Decide
+ Time to Act

Image: Office ClipArt

Page 22:

Dynamic Addressing & Configuration

Page 23:

What about your data?

Image: barrymieny

Page 24:

Availability via Degradation

Image: Michael Wood

Page 25:

Images: Gizmodo

Virtualization and Automation

Page 26:

Images: Orion Pictures owns Terminator Franchise

Page 27:

The “HI” Point

Check out: http://bit.ly/wazinternals

Images: Office Clip Art

Page 28:

Image: NASA

Page 29:

“Don't be too proud of this technological terror you've constructed…”

ADMIT:
• Your Solution WILL fail at some point
• You can learn from others just as well as from yourself

DO:
• Root cause analysis
• Read others’ root cause analyses
• Plan for failure

DON’T:
• Get cocky
• Stick your head in the sand

Images: LucasFilm, Fair Use

Page 30:

Questions

@mikewo

Mike Wood

http://mvwood.com

http://bit.ly/CloudFailSafe

Thank you