Failure Happens - Reliability and how to run large websites.
![Page 1: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/1.jpg)
Failure Happens
F***, the f*****g thing is f****d
What broke and what we learned
![Page 2: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/2.jpg)
Redundancy
Redundancy, in general terms, refers to the quality or state of being redundant, that is: exceeding what is necessary or normal; or duplication. This can have a negative connotation, especially in rhetoric: superfluous or repetitive; or a positive implication, especially in engineering: serving as a duplicate for preventing failure of an entire system.
![Page 3: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/3.jpg)
Jesse Robbins Artur Bergman
![Page 4: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/4.jpg)
Artur Bergman Jesse Robbins
![Page 5: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/5.jpg)
• Jesse
  – Runs ops for Etelos
  – Firefighter/EMT
  – Emergency Manager
• Katrina
  – Experiences running large websites
  – Had the best title ever “Master of Disaster”
• Artur
  – Runs ops & engineering for Wikia
  – Experiences of running large websites, enterprise (boring) and stock exchanges
  – Core Perl developer, long development background
• Both of us
  – Write for O’Reilly Radar
  – Speak at conferences
  – Annoy our peers and coworkers
  – Agree on nearly everything
![Page 6: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/6.jpg)
Redundant
![Page 7: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/7.jpg)
Jesse is sick
• Thankfully, we have high availability
  – Hence this talk
• Jesse has a 98% availability
• I am more honest, probably more like 90% excluding the time I sleep
• Our combined availability is 99.84%
• His war stories will be missing
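The combined figure comes from treating the two speakers as redundant components: the talk fails only if both are unavailable at the same time. A minimal sketch, assuming the two availabilities are independent (the function name `parallel_availability` is illustrative and not from the talk):

```python
def parallel_availability(*availabilities):
    """Availability of redundant components, assuming independent failures."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)  # probability that every component is down at once
    return 1.0 - all_down

# Figures quoted on the slide: 98% and "probably more like 90%".
combined = parallel_availability(0.98, 0.90)
print(f"{combined:.2%}")
```

With exactly 98% and 90% this works out to 99.8%. The 99.84% quoted on the slide would correspond to a second speaker at about 92%, which fits the "probably more like" wording.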
![Page 8: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/8.jpg)
June 23-24, 2008
Jesse & Steve
![Page 9: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/9.jpg)
364.96 Main
• San Francisco data center
• Hosts a lot of Web 2.0 companies
• Power outage
• 24 July 2007
  – A day I am sure a lot of people remember fondly
![Page 10: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/10.jpg)
![Page 11: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/11.jpg)
![Page 12: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/12.jpg)
Mistakes
• Generator 3 took down 1 and 4
  – 200% more outage than needed
• But really?
  – Not 365 Main’s fault
![Page 13: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/13.jpg)
Failure happens
• A single datacenter is the problem
  – Since they all fail at some point
• Recovery procedures after failure
  – Power was gone ~45 minutes
  – Most services took hours to come back
  – Some unnamed ones more than 12 hours
• Communication
  – All DNS servers in the same datacenter!
![Page 14: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/14.jpg)
![Page 15: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/15.jpg)
Radar article
• Disaster recovery plans exist on a different continuum, affecting not just operations but also your entire organisation's response to disasters.
• An earthquake is a question of when, not if. Are the startups ready for this? How long will we expect them to be gone? Several of the world's largest websites went down. None of them were ready for a datacenter outage. None of them had backup datacenters or failover that worked.
• None even had a coherent strategy for communicating the situation to the rest of the world.
![Page 16: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/16.jpg)
Futility of MTBF
• Mean time between failures
  – Vendors quote you this all the time
• Irrelevant!
• Failure is inevitable
• 365 Main probably had an excellent aggregated MTBF
  – But when something fails, the mean time to the next failure is hardly going to make you feel better
![Page 17: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/17.jpg)
MTTR
• Mean time to recovery
• Drastically reduced severity of the power outage even without hot standby
• No one cares if you fail once a minute
  – If you recover in 50 ms
• If you are down 1 minute a week, you are still going to hit 4 nines (99.99%)
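The two claims above can be checked with a few lines of arithmetic. This is a rough sketch, assuming availability is simply one minus the fraction of time spent down; the helper name `availability` is made up for illustration:

```python
def availability(downtime_seconds, period_seconds):
    """Fraction of the period during which the service was up."""
    return 1.0 - downtime_seconds / period_seconds

WEEK_SECONDS = 7 * 24 * 60 * 60

# Failing once a minute but recovering in 50 ms each time.
fail_every_minute = availability(0.050, 60)

# Being down for one minute every week.
one_minute_a_week = availability(60, WEEK_SECONDS)

print(f"fail once a minute, recover in 50 ms: {fail_every_minute:.4%}")
print(f"down one minute a week:               {one_minute_a_week:.4%}")
```

The first case lands a little above three nines and the second just above four nines, which is why recovery time matters more here than how often things break.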
![Page 18: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/18.jpg)
Nines (roughly)
• 99% 5000 Minutes / Year 3.5 Days
![Page 19: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/19.jpg)
Nines (roughly)
• 99% 5000 Min / Year (3.5 days)
• 99.9% 500 Min / Year (8 hours)
![Page 20: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/20.jpg)
Nines (roughly)
• 99% 5000 Min / Year (3.5 days)
• 99.9% 500 Min / Year (8 hours)
• 99.99% 50 Min / Year
![Page 21: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/21.jpg)
Nines (roughly)
• 99% 5000 Min / Year (3.5 days)
• 99.9% 500 Min / Year (8 hours)
• 99.99% 50 Min / Year
• 99.999% 5 Min / Year
![Page 22: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/22.jpg)
Nines (roughly)
• 99% 5000 Min / Year (3.5 days)
• 99.9% 500 Min / Year (8 hours)
• 99.99% 50 Min / Year
• 99.999% 5 Min / Year
• 99.9999% 30 Seconds / Year
![Page 23: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/23.jpg)
Nines (roughly)
• 99% 5000 Min / Year (3.5 days)
• 99.9% 500 Min / Year (8 hours)
• 99.99% 50 Min / Year
• 99.999% 5 Min / Year
• 99.9999% 30 Seconds / Year
• 99.99999% 3 Seconds / Year
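The rounded figures in the table can be reproduced directly. A small sketch, assuming a 365-day year; the function name is illustrative:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability_percent):
    """Minutes of allowed downtime per year at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for nines in ("99", "99.9", "99.99", "99.999", "99.9999", "99.99999"):
    minutes = downtime_minutes_per_year(float(nines))
    print(f"{nines}%: {minutes:.2f} min/year ({minutes * 60:.0f} seconds)")
```

The exact values are slightly higher than the slide's round numbers (99% is 5,256 minutes, for example), which is what the "roughly" in the title covers.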
![Page 24: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/24.jpg)
Irrelevance of the nines
• Blizzard
  – $520 million in profit last year
• World of Warcraft
  – 10 million players
• 98-99%
  – By design
![Page 25: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/25.jpg)
Train your users
• Scheduled Downtime each week
• Very little redundancy
• Server failure
  – Up to 10 minutes of data loss
• Been like this from the beginning
![Page 26: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/26.jpg)
“We pay them money, so we have to accept the downtime.”
![Page 27: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/27.jpg)
Reliability
• Don’t aim too high unless
  – Banks
  – Space shuttles
  – Lung/heart machines
• The higher you aim
  – Increases complexity (exponentially)
  – The harder you fail
![Page 28: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/28.jpg)
![Page 29: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/29.jpg)
Complexity killed the cat
![Page 30: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/30.jpg)
| Downtime | URL | Site |
|---|---|---|
| 5m | 360.yahoo.com | Yahoo! 360 |
| 10m | www.livejournal.com | LiveJournal |
| 25m | www.myspace.com | MySpace |
| 45m | www.xanga.com | Xanga |
| 1h 10m | www.last.fm | Last.fm |
| 1h 10m | www.orkut.com | Orkut |
| 1h 35m | www.facebook.com | Facebook |
| 2h 5m | www.classmates.com | Classmates.com |
| 4h 0m | www.linkedin.com | LinkedIn |
| 2h 55m | www.reunion.com | Reunion.com |
| 5h 5m | www.hi5.com | hi5 |
| 6h 0m | www.friendster.com | Friendster |
| 7h 25m | spaces.live.com | Windows Live Spaces |
| 12h 28m | www.bebo.com | Bebo |

Jan-Feb 2008 - Source pingdom.com
![Page 31: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/31.jpg)
| Downtime | URL | Site |
|---|---|---|
| 5m | 360.yahoo.com | Yahoo! 360 |
| 10m | www.livejournal.com | LiveJournal |
| 25m | www.myspace.com | MySpace |
| 45m | www.xanga.com | Xanga |
| 1h 10m | www.last.fm | Last.fm |
| 1h 10m | www.orkut.com | Orkut |
| 1h 35m | www.facebook.com | Facebook |
| 2h 5m | www.classmates.com | Classmates.com |
| 4h 0m | www.linkedin.com | LinkedIn |
| 2h 55m | www.reunion.com | Reunion.com |
| 5h 5m | www.hi5.com | hi5 |
| 6h 0m | www.friendster.com | Friendster |
| 7h 25m | spaces.live.com | Windows Live Spaces |
| 12h 28m | www.bebo.com | Bebo |

Jan-Feb 2008 - Source pingdom.com
$800 MM
![Page 32: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/32.jpg)
Measurement
• How do you measure uptime?
• Ping doesn’t work
• Connect
• Your view is limited from your monitoring stations
• Network problems outside your control
  – Hello Cogent
![Page 33: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/33.jpg)
Measurement
• Look at the traffic
  – The data is there
  – HTML delivery time
  – Image delivery time
  – TCP packet loss
  – Use an image call to collect end user performance metrics
• Calculate expected traffic rates
  – Benchmark against that (bandwidth curves should be smooth!)
  – I always watch the bandwidth
• Wikipedia method
  – How many people complain on IRC?
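Benchmarking against expected traffic can be as simple as comparing each sample with a baseline and flagging large shortfalls. A minimal sketch, assuming you already have an expected curve (for example from the same weekday in previous weeks); the function name, the 25% tolerance and the sample numbers are all made up for illustration:

```python
def find_dips(observed, expected, tolerance=0.25):
    """Return sample indexes where traffic is more than `tolerance` below baseline."""
    dips = []
    for i, (seen, baseline) in enumerate(zip(observed, expected)):
        if baseline > 0 and seen < baseline * (1 - tolerance):
            dips.append(i)
    return dips

# Hypothetical bandwidth samples in Mbit/s, one per five minutes.
expected = [100, 120, 140, 150, 150, 140]
observed = [98, 123, 60, 20, 148, 139]

print(find_dips(observed, expected))  # prints [2, 3]
```

A sudden hole in an otherwise smooth curve shows up as a run of flagged samples, regardless of whether your monitoring stations could still reach the site.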
![Page 34: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/34.jpg)
Outage?
![Page 35: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/35.jpg)
Outage!
![Page 36: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/36.jpg)
YouTube vs BGP vs Pakistan
• BGP runs your internet
  – Protocol for routers to share routing data
  – How to get from me to somewhere else
• Each organization has an AS number
• Each router keeps track of the number of AS numbers to the destination over different routes
• Chooses the shortest one
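The "shortest one" rule can be sketched as picking the advertised path that crosses the fewest autonomous systems. This models only the path-length step described on the slide; real BGP implementations weigh other attributes (such as local preference) as well. The AS numbers below are hypothetical values from the private-use range:

```python
def shortest_as_path(paths):
    """Pick the advertised path that crosses the fewest autonomous systems."""
    return min(paths, key=len)

# Three hypothetical routes to the same destination AS (64520).
paths_to_destination = [
    [64512, 64513, 64520],
    [64514, 64520],
    [64515, 64516, 64517, 64520],
]

print(shortest_as_path(paths_to_destination))  # prints [64514, 64520]
```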
![Page 37: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/37.jpg)
Anycast / Multihoming
• BGP allows you to tell multiple ISPs that you are capable of handling a network
• Traffic will flow the “shortest” path
• If a link goes down, that router-router BGP session goes away and the route is then withdrawn through the system
• “BGP Convergence”
  – Don’t ask what it really means
![Page 38: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/38.jpg)
Networks and prefixes
• Each netblock is subclassed and has a prefix.
• People mostly know /24 which is 256 addresses
• /23 is twice that
• /8 is a vast quantity
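Prefix sizes follow from the number of bits left for hosts: a /N leaves 32 - N bits, so 2^(32 - N) addresses. A quick check using Python's standard `ipaddress` module; the 10.0.0.0 base address is just a convenient private range chosen for the example:

```python
import ipaddress

def addresses_in(prefix_length):
    """Number of IPv4 addresses covered by a prefix of the given length."""
    return ipaddress.ip_network(f"10.0.0.0/{prefix_length}").num_addresses

for length in (24, 23, 22, 8):
    print(f"/{length}: {addresses_in(length):,} addresses")
```

A /24 spans 256 addresses, each step shorter doubles that, and a /8 covers 16,777,216.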
![Page 39: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/39.jpg)
IP Conservation vs Routing table conservation
• We are running out of IPs
• Our routing table is growing fast
• To limit the growth of the routing table, routers will usually block any routes more specific than /24
• YouTube was being a good citizen and broadcasting one /22 instead of four /24
![Page 40: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/40.jpg)
Pakistan Telecom
• Government orders ban of YouTube
• PT achieves this by broadcasting a BGP route for one of YouTube’s IP ranges using a /24 prefix
  – Sadly, they did this to the entire world
• Routers choose the most specific route first, so /24 wins over /22
• All of YouTube’s traffic went to Pakistan
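Why the /24 wins can be shown with a toy longest-prefix match. This is a sketch of the selection rule only, not of a real router; the prefixes and origin labels are hypothetical private-range stand-ins, not the addresses involved in the actual incident:

```python
import ipaddress

def best_route(routes, destination):
    """Return the (network, origin) pair with the longest prefix covering destination."""
    dest = ipaddress.ip_address(destination)
    candidates = []
    for prefix, origin in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            candidates.append((network, origin))
    if not candidates:
        return None
    # Most specific route first: the longest prefix length wins.
    return max(candidates, key=lambda pair: pair[0].prefixlen)

routes = [
    ("10.10.0.0/22", "legitimate"),  # the owner's aggregate announcement
    ("10.10.1.0/24", "hijacker"),    # a more specific announcement inside it
]

print(best_route(routes, "10.10.1.7")[1])  # prints hijacker
print(best_route(routes, "10.10.2.7")[1])  # prints legitimate
```

Only addresses inside the more specific /24 are diverted; the rest of the /22 still reaches its owner.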
![Page 41: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/41.jpg)
Try reaching for 4 nines
• A BGP error anywhere can quickly bring you down
• Thank the souls running the large ISPs’ core networking.
  – They are the reason it works
• The only way to solve this is to be a bad citizen and spam the table with more routes. But even that doesn’t fully protect you from local outages
![Page 42: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/42.jpg)
June 23-24, 2008
Jesse & Steve
![Page 43: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/43.jpg)
Value of reliability (operations and performance)
• Bad reliability is a waste of R&D
• Why develop if you can’t deliver?
• Operations is always treated as the stepchild of Engineering
• But with no reliability, no company
• Fixed amount of time + faster site = more page views
![Page 44: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/44.jpg)
Speed / Reliability
• Important
• Direct correlation between speed and user interaction
• Brand name relies on reliability
![Page 45: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/45.jpg)
![Page 46: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/46.jpg)
[Chart: requests/sec and response time]
![Page 47: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/47.jpg)
[Chart: requests/sec and response time]
![Page 48: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/48.jpg)
Nothing matters
• This entire conference!
• Any cool features!
• Unless it works
![Page 49: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/49.jpg)
Cost benefit
• Cost of delivery
• Revenue earned
• Increased cost for more complexity
![Page 50: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/50.jpg)
Metrics you need
• Cost per page view
• Cost per specific feature/page
• This is key: what you should prioritize and what you should do depend on these numbers
• How else can you value it?
• Don’t always go for cheap; sometimes it is better to buy time using money, sometimes not.
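Both metrics are plain division once the costs are attributed. A rough sketch with entirely hypothetical numbers and feature names, just to show how the comparison might be laid out:

```python
def cost_per_page_view(monthly_cost, monthly_page_views):
    """Cost of serving a single page view over the measured period."""
    return monthly_cost / monthly_page_views

# Hypothetical monthly figures: (cost in dollars, page views).
features = {
    "search": (4_000.0, 2_000_000),
    "image upload": (1_500.0, 150_000),
}

for name, (cost, views) in features.items():
    print(f"{name}: ${cost_per_page_view(cost, views):.4f} per page view")
```

In this made-up example the cheaper feature in absolute terms is five times more expensive per view, which is the kind of result that changes what gets prioritized.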
![Page 51: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/51.jpg)
Operational Engineers
• Ops stepchild of development?
  – Ops is staffed with failed developers
• Fire them
• Hire good ones
• Who are passionate to learn and explore the entire stack
![Page 52: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/52.jpg)
My story
• Software developer
• Interested in ops
• I always get transferred to ops
  – Fixing the same problems every time
• (Save me, go to Velocity and learn!)
• I bring engineering to ops, and a way to look at the entire system
![Page 53: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/53.jpg)
![Page 54: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/54.jpg)
Pyromaniac
Paranoid
![Page 55: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/55.jpg)
![Page 56: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/56.jpg)
Backups / High Availability
• Don’t confuse them
• Backups protect your data
• High Availability keeps your site running
• MySQL replication is a valid HA solution
• But it won’t help you with
  – DROP TABLE;
![Page 57: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/57.jpg)
Debugging
• 9 Rules of debugging
• http://www.debuggingrules.com/Poster_download.html
  – Yes the font is horrible
![Page 58: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/58.jpg)
Rule 1: Understand the system
• Complexity Kills
• No excuse
• If you write it, you must know it
• If you run it, you must know it
• If you buy it, you must know it
![Page 59: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/59.jpg)
Rule 3: Quit thinking and look
• “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
![Page 60: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/60.jpg)
Rule 3: Quit thinking and look
• What do you look at?
• The importance of monitoring
• Monitoring
• Monitoring
• Monitoring
![Page 61: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/61.jpg)
My my, confusing term
• Monitoring
• Alerting
• Trending
![Page 62: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/62.jpg)
Alerting
• Acts on monitoring data
• Severe alerts
  – Active
  – Needs action
• Passive alerts
  – Things that need to be done but not right now
• DO NOT OVER ALERT
• DO NOT CRY WOLF
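One common way to avoid crying wolf is to page only when a check has failed several times in a row, so a single blip stays a passive alert. This is a generic sketch of that idea, not a description of any particular alerting setup; the function name and the threshold of three are assumptions:

```python
def should_page(check_results, consecutive_failures=3):
    """Page only when the most recent `consecutive_failures` checks all failed.

    `check_results` is a list of booleans, oldest first, where True means
    the check passed.
    """
    if len(check_results) < consecutive_failures:
        return False
    return not any(check_results[-consecutive_failures:])

print(should_page([True, False, False, False]))  # prints True
print(should_page([False, True, False, False]))  # prints False
```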
![Page 63: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/63.jpg)
Wikia alerting strategy
• When the site is slow
• Or down
• We send emails and do phone calls
• Europe and US West coast
• Looking to hire in East Asia
• No night time
![Page 64: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/64.jpg)
Trending
• Long term
• Capacity planning
![Page 65: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/65.jpg)
Ganglia
• We love ganglia
• Automatically graphs everything you want - just works
• Large scale clusters
• Multicast
• Zero config
• RRD
![Page 66: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/66.jpg)
http://ganglia.wikimedia.org/
• 270 hosts
• 880 CPU
• 2 clusters
• 1.2 TB of Memory
![Page 67: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/67.jpg)
http://ganglia.wikimedia.org
![Page 68: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/68.jpg)
Custom Ganglia Gmetrics
• Write your own
gmetric --name='Oldest query' --type=int32 --units='sec' --dmax=65 \
  --value=`echo 'show processlist' | mysql -uroot -ppass | grep -v Sleep | grep -v 'system user' | head -2 | tail -1 | cut -f 6`
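The pipeline above reports the Time column of the first row that is neither sleeping nor a system user. If you want the longest-running query instead, the same tab-separated `show processlist` output can be parsed and the maximum taken. A sketch under that assumption; the function name and the sample rows are made up for illustration:

```python
def oldest_query_seconds(processlist_text):
    """Return the largest Time value among active, non-system rows."""
    rows = processlist_text.strip().splitlines()[1:]  # skip the header row
    times = []
    for row in rows:
        fields = row.split("\t")
        if len(fields) < 6:
            continue  # ignore malformed lines
        user, command, seconds = fields[1], fields[4], fields[5]
        if command == "Sleep" or user == "system user":
            continue
        times.append(int(seconds))
    return max(times, default=0)

# Made-up sample in the column order Id, User, Host, db, Command, Time, State, Info.
sample = "\n".join([
    "Id\tUser\tHost\tdb\tCommand\tTime\tState\tInfo",
    "1\tsystem user\t\tNULL\tConnect\t9000\tWaiting for master\tNULL",
    "7\tapp\tweb1:5000\twiki\tSleep\t120\t\tNULL",
    "8\tapp\tweb2:5001\twiki\tQuery\t3\tSending data\tSELECT 1",
    "9\tapp\tweb3:5002\twiki\tQuery\t42\tSending data\tSELECT 2",
])

print(oldest_query_seconds(sample))  # prints 42
```

The resulting number could then be handed to gmetric as the value, in the same way the shell version does.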
![Page 69: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/69.jpg)
Something is wrong
• Don’t worry, data warehouse
![Page 70: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/70.jpg)
Problem found
• If it is critical, start a phone conversation
• Use IRC to communicate technical data
• One person liaises with non-technical staff
• One person specifically in command
• Sleep scheduling (audit log important)
![Page 71: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/71.jpg)
Post crisis
• Root cause analysis
  – Just find out what went wrong
  – And how to avoid it
  – Or fix it faster next time if you can’t
• Keep track of your uptime
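Keeping track of uptime can start with nothing more than a log of outage start and end times. A minimal sketch, assuming outages do not overlap and all fall inside the reporting period; the function name and the dates are hypothetical:

```python
from datetime import datetime, timedelta

def uptime_percent(outages, period_start, period_end):
    """Percentage of the period not covered by the logged outages."""
    total = (period_end - period_start).total_seconds()
    down = sum((end - start).total_seconds() for start, end in outages)
    return 100.0 * (1 - down / total)

# Hypothetical 30-day period with a 45-minute and a 15-minute outage.
period_start = datetime(2008, 1, 1)
period_end = period_start + timedelta(days=30)
outages = [
    (datetime(2008, 1, 5, 3, 0), datetime(2008, 1, 5, 3, 45)),
    (datetime(2008, 1, 20, 14, 0), datetime(2008, 1, 20, 14, 15)),
]

print(f"{uptime_percent(outages, period_start, period_end):.3f}%")
```

One hour down in thirty days is already below three nines, which makes a useful reality check against any target you have set.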
![Page 72: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/72.jpg)
Automation
• All machines are created equal
• Seriously
• If you manually make changes
• You are wrong
  – Unless you know what you are doing
![Page 73: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/73.jpg)
Best practices
• Version control
• Gold images
• Centralised authentication
• Time Sync (NTP)
• Central logging
• (All of this applies for virtual machines too!)
![Page 74: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/74.jpg)
Puppet
• New hip kid on the block
• Written in Ruby
• Better support?
• Much nicer syntax
• Easier to extend
![Page 75: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/75.jpg)
tcpdump / wireshark
• If you suspect the network
• Don’t just suspect
• LOOK AT IT
• tcpdump / wireshark will tell you
  – If your packets are lost, delayed or corrupted
  – Your windowing is wrong
![Page 76: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/76.jpg)
Puppet
• Automated machine configuration
• Automation is key
• Our Motd states
“If you change anything locally, I will hunt down and kill you”
![Page 77: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/77.jpg)
Rule 4: Divide and Conquer
• Look at the problems in turn
• Split between people
• Go in the order you suspect is the most likely
![Page 78: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/78.jpg)
Rule 5: Change one thing at a time
• I cannot stress this enough
• IF YOU DO NOT THEN YOU HAVE FAILED TO IDENTIFY THE PROBLEM
![Page 79: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/79.jpg)
Rule 6: Keep an audit trail
• You might be making things worse
• Good for the root cause analysis
• Have your shell log all commands
  – Good practice anyway
• Version control
![Page 80: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/80.jpg)
Rule 9: If you didn’t fix it, it ain’t fixed
• You must do something to fix a problem
• Or it will bite you again
• And again
• And again
• They don’t just appear and disappear
• Except BGP route convergence :)
![Page 81: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/81.jpg)
![Page 82: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/82.jpg)
Good Book!
![Page 83: Failure Happens - Reliability and how to run large websites.](https://reader033.fdocuments.in/reader033/viewer/2022052910/559c1a3f1a28ab2c598b478f/html5/thumbnails/83.jpg)
“multiple and unexpected interactions of failures are inevitable”
- Charles Perrow