Getting more 9s from your Cloud operations


description

Managing a highly available, highly reliable cloud infrastructure has always been challenging. Beyond the technology itself, the key areas to focus on to get more 9s from your cloud operations are proper architecture, the right tool for the right task, discipline, intelligent monitoring, and effective communication. This slide deck covers:
- Mitigating risks with a fail-proof architecture
- A Swiss Army knife for Devops
- Next-generation monitoring
- Effective communication
- Best practices and know-how
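To make "more 9s" concrete, here is a minimal sketch of the arithmetic behind it: each extra nine cuts the yearly downtime budget by a factor of ten. The function name and the 365-day year are assumptions made for illustration, not something the deck specifies.

```python
# Sketch: how much downtime per year each "number of nines" allows.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    availability = 100 * (1 - 10 ** -n)
    budget = allowed_downtime_minutes(n)
    print(f"{n} nines ({availability:.3f}%): {budget:.1f} minutes/year")
```

Three nines leave a budget of about 525.6 minutes (under nine hours) a year, and five nines leave a little over five minutes, which is why the rest of the deck is about removing manual, slow, or unverified steps.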

Transcript of Getting more 9s from your Cloud operations

Page 1: Getting more 9s from your Cloud operations
Page 2: Getting more 9s from your Cloud operations

● Fail-proof architecture

● Devops tools and utilities

● Monitoring - next level

● Backups and Disaster recovery

● Communication

● Best practices

Page 3: Getting more 9s from your Cloud operations

● Group similar components

● Load distribution is important

● Network level isolation for each group or cluster

● Failover plan for every component

● Someone has to take care of failures

● Design for failures

● Unleash the chaos monkey

“Everything fails all the time” -- Werner Vogels (CTO, Amazon)
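The "failover plan for every component" and "unleash the chaos monkey" points can be sketched together in a few lines. This is only a toy simulation under stated assumptions: `Node`, `call_with_failover`, and `unleash_chaos` are hypothetical names, and the "servers" are in-memory objects rather than real instances.

```python
import random


class Node:
    """Hypothetical stand-in for one server in a group of similar components."""

    def __init__(self, name: str):
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def call_with_failover(group: list[Node], request: str) -> str:
    """Try each node in turn; fail only if every node in the group is down."""
    errors = []
    for node in group:
        try:
            return node.handle(request)
        except ConnectionError as exc:
            errors.append(str(exc))  # remember the failure, try the next node
    raise RuntimeError(f"all nodes failed: {errors}")


def unleash_chaos(group: list[Node], rng: random.Random) -> Node:
    """Deliberately take down one randomly chosen node, chaos-monkey style."""
    victim = rng.choice(group)
    victim.healthy = False
    return victim


web_group = [Node("web-1"), Node("web-2"), Node("web-3")]
victim = unleash_chaos(web_group, random.Random())
print(f"chaos took down {victim.name}")
print(call_with_failover(web_group, "GET /"))  # still served by a survivor
```

The point of the exercise is that the request still succeeds after a random failure; if it does not, the failover plan for that group is missing or untested.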

Page 4: Getting more 9s from your Cloud operations

Source: http://dev.mysql.com/doc/refman/5.0/en/ha-overview.html

Page 5: Getting more 9s from your Cloud operations

● Every operation must be scripted and tested

● One click operations

● Verification tools are a must!

● Data collecting and reporting tools

● Tools to shorten the pipeline from Dev -> Prod

● Enforce standards

● Documentation has to be a part of tooling
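One way to read "every operation must be scripted" together with "verification tools are a must" is a small wrapper that never reports success until a check has passed. The sketch below assumes a hypothetical `run_operation` helper and uses an in-memory dictionary in place of a real system.

```python
from typing import Callable


def run_operation(
    name: str,
    apply: Callable[[], None],
    verify: Callable[[], bool],
    rollback: Callable[[], None],
    log: list[str],
) -> bool:
    """Apply a scripted step, verify it, and roll back if verification fails."""
    log.append(f"{name}: applying")
    apply()
    if verify():
        log.append(f"{name}: verified")
        return True
    log.append(f"{name}: verification failed, rolling back")
    rollback()
    return False


# Toy example: the "system" is a dict standing in for a real configuration.
state = {"max_connections": 100}
log: list[str] = []
ok = run_operation(
    "raise max_connections",
    apply=lambda: state.update(max_connections=200),
    verify=lambda: state["max_connections"] == 200,
    rollback=lambda: state.update(max_connections=100),
    log=log,
)
print(ok, state, log)
```

Because the log is written by the tool itself, the record of what was done comes for free, which is one reading of "documentation has to be a part of tooling".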

Page 6: Getting more 9s from your Cloud operations

● Are you happy with conventional tools?

● “Alert if 1m_load_avg > 5” is not enough

● Analytics is a part of monitoring

● Usage predictions and trend analysis

● Correlating incidents with logs is very useful

● Simulate user activities

● Be your own Xavier!
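As an illustration of why a fixed threshold is not enough, the sketch below fits a straight line to recent usage samples and estimates how long until a limit is crossed. It is a deliberately simple least-squares fit; the function names and the sample numbers are made up for the example, and real usage data is rarely this linear.

```python
from typing import Optional


def linear_fit(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until(limit: float, hours: list[float], usage: list[float]) -> Optional[float]:
    """Hours from the last sample until the fitted trend reaches `limit`."""
    slope, intercept = linear_fit(hours, usage)
    if slope <= 0:
        return None  # usage is flat or shrinking; no crossing predicted
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - hours[-1])


# Made-up samples: disk usage in percent, measured once an hour.
sample_hours = [0.0, 1.0, 2.0, 3.0, 4.0]
disk_percent = [50.0, 52.0, 54.0, 56.0, 58.0]
print(hours_until(90.0, sample_hours, disk_percent))
```

A threshold alert on these samples would stay silent at 58%, while the trend says the disk reaches 90% in well under a day, early enough to act before users notice.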

Page 7: Getting more 9s from your Cloud operations

● How frequently do you back up?

● Alerts for backups

● Verification is a MUST

● Practice DR plan frequently

● Align the DR plan with the deployment plan

● Documentation!

Source: http://blogger.srvnetwork.com/wp-content/uploads/2010/10/disaster_recovery_plan1.jpg
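The "alerts for backups" and "verification is a MUST" points can be sketched as a check that reports a backup as a problem when it is missing, too old, or does not match the checksum recorded when it was taken. The names `sha256_of` and `check_backup` are hypothetical, and a checksum match is only a first step: it does not replace an actual restore test.

```python
import hashlib
import tempfile
import time
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(
    path: Path,
    expected_sha256: str,
    max_age_hours: float,
    now: Optional[float] = None,
) -> list[str]:
    """Return a list of problems; an empty list means the backup looks usable."""
    if not path.exists():
        return ["backup file is missing"]
    problems = []
    now = time.time() if now is None else now
    age_hours = (now - path.stat().st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"backup is {age_hours:.1f} h old (limit {max_age_hours} h)")
    if sha256_of(path) != expected_sha256:
        problems.append("checksum mismatch")
    return problems


# Demo with a throwaway file standing in for a real backup.
with tempfile.TemporaryDirectory() as tmp:
    backup = Path(tmp) / "db.dump"
    backup.write_bytes(b"pretend this is a database dump")
    recorded = sha256_of(backup)  # normally stored when the backup is taken
    print(check_backup(backup, recorded, max_age_hours=24))  # fresh and intact
    print(check_backup(backup, "0" * 64, max_age_hours=24))  # wrong checksum
```

Any non-empty result is what should raise the alert; silence from the backup job is not evidence that a backup exists.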

Page 8: Getting more 9s from your Cloud operations

Source: http://www.accountanttown.com/site/wp-content/uploads/2010/08/sticky_note_backup_small.gif

Source: http://jenniferbrogee.files.wordpress.com/2011/03/backupyourcomputer1.jpg

Page 9: Getting more 9s from your Cloud operations

● Always sound human

● “Our web-monkeys can’t find the page you are looking for”

● Downtimes or failures can be turned into opportunities

● Be honest

● Users are always curious about what’s going on

● Separate communication channel

Page 10: Getting more 9s from your Cloud operations

Source: http://www.transparentuptime.com/2010/06/video-of-my-talk-upside-of-downtime-at.html

Page 11: Getting more 9s from your Cloud operations

Source: http://www.transparentuptime.com/2010/06/video-of-my-talk-upside-of-downtime-at.html

Page 12: Getting more 9s from your Cloud operations

● Run a staging setup in parallel

● Verification process after every operation

● Change log and maintenance log

● Use of configuration management

● Manage the complete application lifecycle (ALM)

● Knowledge sharing

● Culture

Source: http://www.cartoonstock.com/newscartoons/cartoonists/rmo/lowres/business-commerce-best_practice-business_venture-business_model-business_practice-bankrupt-rmon2464l.jpg
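A change log is easier to keep when operations write it themselves. The sketch below appends one JSON line per change; the field names and the helpers `change_entry` and `append_change` are assumptions chosen for illustration, not a format the deck prescribes.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def change_entry(
    operator: str,
    target: str,
    action: str,
    verified: bool,
    when: Optional[datetime] = None,
) -> str:
    """Build one change-log record as a single JSON line."""
    when = when or datetime.now(timezone.utc)
    record = {
        "time": when.isoformat(),
        "operator": operator,
        "target": target,
        "action": action,
        "verified": verified,  # result of the post-operation verification
    }
    return json.dumps(record, sort_keys=True)


def append_change(log_path: str, entry: str) -> None:
    """Append the record to the log file; existing entries are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(entry + "\n")


print(change_entry("alice", "db-1", "upgrade to new minor version", True))
```

One line per change keeps the log easy to search and to correlate with incidents later, and the `verified` field records whether the verification step actually ran and passed.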

Page 13: Getting more 9s from your Cloud operations

[email protected] | @gnuchami