Getting more 9s from your Cloud operations
Uploaded by chamith-kumarage | Category: Technology
Transcript of Getting more 9s from your Cloud operations
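To make the "9s" in the title concrete, the following is a minimal sketch of the usual arithmetic: N nines of availability means the service may be unavailable for a fraction 10^-N of the time. The function name and the 365-day year are assumptions made for illustration and do not come from the slides.

```python
def allowed_downtime_minutes(nines: int, period_hours: float = 365 * 24) -> float:
    """Return the downtime budget in minutes for an availability of N nines.

    Assumes a 365-day year by default; pass another period_hours for a
    month or a quarter.
    """
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001 (99.9% available)
    return period_hours * 60 * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        budget = allowed_downtime_minutes(n)
        print(f"{n} nines: {budget:.1f} minutes of downtime per year")
```

Each extra nine cuts the yearly downtime budget by a factor of ten, which is why the later points on tooling, monitoring and recovery matter more as the target rises.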
● Fail-proof architecture
● Devops tools and utilities
● Monitoring - next level
● Backups and Disaster recovery
● Communication
● Best practices
● Group similar components
● Load distribution is important
● Network level isolation for each group or cluster
● Failover plan for every component
● Someone has to take care of failures
● Design for failures
● Unleash the chaos monkey
“Everything fails all the time” -- Werner Vogels (CTO, Amazon)
Source: http://dev.mysql.com/doc/refman/5.0/en/ha-overview.html
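The points about a failover plan and designing for failure can be sketched roughly as below. This is only an illustration: the backends are hypothetical in-process functions, and the random failure injection is a toy stand-in for a chaos-monkey style tool, not the real thing.

```python
import random


class AllBackendsFailed(Exception):
    """Raised when every backend in the group has failed."""


def call_with_failover(backends, request):
    """Try each backend in order and return the first successful response."""
    errors = []
    for backend in backends:
        try:
            return backend(request)
        except Exception as exc:  # a failed component must not stop the request
            errors.append(exc)
    raise AllBackendsFailed(errors)


def make_backend(name, rng, failure_rate):
    """Build a hypothetical backend that fails randomly at the given rate."""
    def backend(request):
        if rng.random() < failure_rate:
            raise ConnectionError(f"{name} is down")
        return f"{name} handled {request}"
    return backend


if __name__ == "__main__":
    rng = random.Random(42)
    primary = make_backend("primary", rng, failure_rate=1.0)  # always fails
    replica = make_backend("replica", rng, failure_rate=0.0)  # never fails
    print(call_with_failover([primary, replica], "GET /health"))
```

Setting a non-zero failure rate on every backend in a test environment is a cheap way to check that the failover path is actually exercised rather than assumed.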
● Every operation must be scripted and tested
● One click operations
● Verification tools are a must!
● Data collecting and reporting tools
● Tools to shorten the pipeline from Dev -> Prod
● Enforce standards
● Documentation has to be a part of tooling
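One possible shape for a scripted, one-click operation with built-in verification is sketched below. The `Step` structure, the `run_operation` helper and the in-memory `state` dictionary are all hypothetical names chosen for this example; a real operation would call deployment tooling instead.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One scripted step: how to apply it, verify it, and undo it."""
    name: str
    apply: Callable[[], None]
    verify: Callable[[], bool]
    rollback: Callable[[], None]


def run_operation(steps: List[Step]) -> bool:
    """Run steps in order; on a failed verification, undo everything applied."""
    applied: List[Step] = []
    for step in steps:
        step.apply()
        applied.append(step)
        if not step.verify():
            print(f"verification failed at '{step.name}', rolling back")
            for done in reversed(applied):
                done.rollback()
            return False
        print(f"ok: {step.name}")
    return True


# Hypothetical in-memory "system" used only to demonstrate the flow.
state = {"version": 1}


def deploy_v2() -> None:
    state["version"] = 2


def undo_deploy() -> None:
    state["version"] = 1


if __name__ == "__main__":
    steps = [Step("deploy v2", deploy_v2, lambda: state["version"] == 2, undo_deploy)]
    print("success" if run_operation(steps) else "rolled back")
```

Keeping the verify and rollback functions next to the apply function also means the script doubles as documentation of what the operation is supposed to achieve.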
● Are you happy with conventional tools?
● “Alert if 1m_load_avg > 5” is not enough
● Analytics is a part of monitoring
● Usage predictions and trend analysis
● Correlating incidents with logs is very useful
● Simulate user activities
● Be your own Xavier!
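As an illustration of usage prediction and trend analysis, the sketch below fits a straight line to daily disk-usage samples and estimates how many days remain before a limit is reached. The sample data, the function names and the choice of a simple least-squares line are assumptions made for this example; the slides do not prescribe a method.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(days, used_pct, limit=100.0):
    """Estimate days from the last sample until usage reaches the limit.

    Returns None when usage is flat or shrinking, since no crossing is
    predicted in that case.
    """
    slope, intercept = linear_fit(days, used_pct)
    if slope <= 0:
        return None
    return (limit - intercept) / slope - days[-1]


if __name__ == "__main__":
    # Hypothetical samples: disk usage in percent, one reading per day.
    days = [0, 1, 2, 3, 4]
    used = [50.0, 52.0, 54.0, 56.0, 58.0]
    remaining = days_until_limit(days, used)
    print(f"estimated days until the disk is full: {remaining:.1f}")
```

An alert on the predicted time to exhaustion fires well before a fixed threshold would, which is the kind of signal a static load-average rule cannot give.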
● How frequently do you back up?
● Alerts for backups
● Verification is a MUST
● Practice DR plan frequently
● Align the DR plan with the deployment plan
● Documentation!
Source: http://blogger.srvnetwork.com/wp-content/uploads/2010/10/disaster_recovery_plan1.jpg
Source: http://www.accountanttown.com/site/wp-content/uploads/2010/08/sticky_note_backup_small.gif
Source: http://jenniferbrogee.files.wordpress.com/2011/03/backupyourcomputer1.jpg
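A minimal sketch of automated backup verification follows, assuming the backup is a single file and that its SHA-256 checksum was recorded when it was taken. The function names and the 24-hour freshness window are hypothetical. A check like this only shows that the file is present, recent and unaltered; a full verification would also restore it, as the DR practice point suggests.

```python
import hashlib
import os
import time


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 of a file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path, expected_sha256, max_age_hours=24.0):
    """Return a list of problems found; an empty list means the checks passed."""
    if not os.path.exists(path):
        return ["backup file is missing"]
    problems = []
    if os.path.getsize(path) == 0:
        problems.append("backup file is empty")
    age_hours = (time.time() - os.path.getmtime(path)) / 3600
    if age_hours > max_age_hours:
        problems.append(f"backup is {age_hours:.1f} hours old")
    if sha256_of(path) != expected_sha256:
        problems.append("checksum mismatch")
    return problems
```

Feeding a non-empty result into the alerting system covers the "alerts for backups" point: a missing or stale backup is reported instead of being discovered during a recovery.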
● Always sound human
● “Our web-monkeys can’t find the page you are looking for”
● Downtimes or failures can be turned into opportunities
● Be honest
● Users are always curious about what’s going on
● Separate communication channel
Source: http://www.transparentuptime.com/2010/06/video-of-my-talk-upside-of-downtime-at.html
● Staging setup to run in parallel
● Verification process after every operation
● Change log and maintenance log
● Use of configuration management
● Manage the complete application lifecycle (ALM)
● Knowledge sharing
● Culture
Source: http://www.cartoonstock.com/newscartoons/cartoonists/rmo/lowres/business-commerce-best_practice-business_venture-business_model-business_practice-bankrupt-rmon2464l.jpg
@gnuchami