Dependable Cloud Architecture - Cloud Develop Edition

  • Date posted: 18-Nov-2014
  • Category: Technology

description

This is a talk I gave at Cloud Develop 2013. It was adapted from a workshop session that Brent Stineman and I did in January 2013 for CodeMash.

Transcript of Dependable Cloud Architecture - Cloud Develop Edition

  • 1. Image: xkcd.com Dependable Cloud Architecture @mikewo Mike Wood http://mvwood.com
  • 2. Failure is always an option. Image: Discovery Channel, Fair Use
  • 3. What are we looking for? Check out: http://bit.ly/wazbizcont Images: Office ClipArt & Godzilla Releasing Corp (Fair Use) Hardware Failure Data Corruption Network Failure Loss of Facilities
  • 4. Image: FOX, Fair Use Human Error
  • 5. What we're trying to achieve: 1. Monitoring 2. Resilient Solutions
  • 6. Image: Office ClipArt Cost vs Risk: 99.999% $1, ,000.00 To get more 9s here, add more 0s here.
  • 7. Image: NASA
  • 8. Functional Transparency Image: Office ClipArt Logging Messages Hardware Health Dependent Services Health
  • 9. Telemetry
  • 10. Image: NASA Analyze your Data
  • 11. Remember: Failure is always an option. Common Points of Failure: Machine/application crashes; Throttling (exceeding capacity); Connectivity/Network; External service dependencies
  • 12. Try/catch != Resilient
  • 13. Image: Michael Wood Decompose your system
  • 14. Request buffering Retry Policies Wait and try again Queue until available Queuing Enables Asynchronous workloads Temporal Decoupling Load Levelling
  • 15. Capacity Buffering Content Delivery Networks (CDNs) Distributed Application Cache Local Content Cache Enables recovery during outages or spikes in load
  • 16. Dynamic Addressing & Configuration
  • 17. Dept. of Redundancy Dept. Have a backup, somewhere else More than one? Cost to benefit Ratio? Ready State Hot = full capacity Warm = scaled down, but ready to grow Cold = mothballed, starts from zero Image: Mr. White
  • 18. Redundancy - It's about probability. Four boxes, each with 95% uptime. 1 box: 5% downtime, or 438 hrs per year (that's 18 days!). 2 boxes: 5/100 * 5/100 = 25/10,000 = 0.25% downtime, or 22 hrs per year. 4 boxes: 5/100 * 5/100 * 5/100 * 5/100 = 625/100,000,000 = 0.000625% downtime, or 3.285 MINUTES per year.
  • 19. Always carry a spare. 75% Capacity, half of our load. 75% Capacity, half of our load. 50% more capacity than needed. Can absorb temporary spikes. Time to react if we need to add capacity. 100% of load, 150% Capacity. 0% Capacity, redirect all load. Over allocated, but still functioning. Degrade, but don't fail. SYSTEM FAILURE!!!
  • 20. Accessible vs. Available Image: Twitter, Fair Use
  • 21. Availability via Degradation Image: Michael Wood
  • 22. Total Outage duration = Time to Detect + Time to Diagnose + Time to Decide + Time to Act Image: Office ClipArt
  • 23. Images: Gizmodo Virtualization and Automation
  • 24. Images: Orion Pictures owns Terminator Franchise
  • 25. The HI Point
  • 26. Image: NASA
  • 27. Don't be too proud of this technological terror you've constructed. ADMIT: Your solution WILL fail at some point. You can learn from others just as well as from yourself. DO: Root cause analysis. Read others' root cause analyses. DON'T: Get cocky. Stick your head in the sand.
  • 28. Questions
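
Slide 14 lists "Retry Policies" and "Wait and try again" without showing what one looks like. The following is a minimal sketch, not code from the talk: `TransientError` and `call_with_retry` are hypothetical names, and the exponential backoff with random jitter is one common choice of policy that the slides do not prescribe.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on TransientError, wait and try again.

    The wait doubles after each failed attempt (exponential backoff) and a
    small random jitter is added so many clients do not retry in lockstep.
    The final failure is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Assumed stand-in for a remote call: fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise TransientError("service busy")
        return "ok"

    print(call_with_retry(flaky, base_delay=0.01))
```

Only errors marked as transient are retried here; anything else propagates immediately, which fits the slide 12 point that a bare try/catch is not the same thing as resilience.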
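
The redundancy arithmetic on slide 18 can be checked with a few lines of Python. This sketch assumes, as the slide implicitly does, that the boxes fail independently and that the system is down only when every box is down at once; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def combined_downtime_fraction(per_box_uptime, boxes):
    """Fraction of time all boxes are down together, assuming independent failures."""
    return (1.0 - per_box_uptime) ** boxes


def downtime_hours_per_year(per_box_uptime, boxes):
    """Expected hours per year during which no box is up."""
    return combined_downtime_fraction(per_box_uptime, boxes) * HOURS_PER_YEAR


if __name__ == "__main__":
    for n in (1, 2, 4):
        hours = downtime_hours_per_year(0.95, n)
        print(f"{n} box(es): {hours:.4f} hours/year ({hours * 60:.3f} minutes)")
```

Real failures are often correlated (shared power, network, or a bad deployment), so these figures are best read as an upper bound on what redundancy alone can buy.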
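
Slide 19's "always carry a spare" scenario can be expressed as a load-to-capacity ratio. This is a sketch under one reading of the slide: two nodes, each sized to handle 75% of the total load, normally carry half the load each. The function name and the way capacity is modeled are assumptions, not taken from the talk.

```python
def load_ratio_after_failures(node_capacities, failed, total_load=1.0):
    """Return total_load divided by surviving capacity.

    Capacities and load are fractions of the full workload. A result above
    1.0 means the survivors are over allocated and must degrade; infinity
    means nothing is left to serve the load.
    """
    surviving = sum(
        capacity
        for index, capacity in enumerate(node_capacities)
        if index not in failed
    )
    if surviving == 0:
        return float("inf")
    return total_load / surviving


if __name__ == "__main__":
    nodes = [0.75, 0.75]  # assumed reading: 150% capacity for 100% of load
    print("healthy:", load_ratio_after_failures(nodes, failed=set()))
    print("one node lost:", load_ratio_after_failures(nodes, failed={0}))
    print("both lost:", load_ratio_after_failures(nodes, failed={0, 1}))
```

With one node lost the ratio rises above 1.0, which corresponds to the slide's "over allocated, but still functioning" state, where the system should degrade rather than fail.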