Capacity Management Presentation

Planning and managing capacity for a fast-growing website is a balancing act between buying too little, too late and buying too much, too soon. Your capacity planning process should be *adaptive* and *adjustable*, and should include more than just system statistics. Measurement, architecture, and economics are all equally important to keeping your site performing well.

Transcript of Capacity Management Presentation

  • 1. Capacity Management
    • for Web Operations

John Allspaw, Operations Engineering

2. the book I'm writing

3. ???

4.

  • Rules of Thumb
  • Planning/Forecasting
  • Stupid Capacity Tricks

(with some Flickr statistics sprinkled in)

5. Things that can cause downtime

  • bugs (disguised as capacity problems)
  • edge cases (disguised as capacity problems)
  • security incidents
  • real capacity problems*

* (should be the last thing you need to worry about)

6. Capacity != Performance

  • Forget about performance for right now
  • Measure what you have right NOW
  • Don't count on it getting any better

7. Thank You HPC Industry!

  • Automated Stuff
  • Scalable Metric Collection/Display

A lot of great deployment and management tricks come from them and have been adopted by web ops.

8. Good Measurement Tools

  • record and store
  • metrics in/out
  • custom metrics
  • easily compare
  • lightweight-ish

9. Clouds need planning too

  • Makes deployment and procurement easy and quick
  • But clouds are still resources with costs and limits, just like your own stuff
  • Black boxes: you may need to pay even more attention than before

10. Metrics

  • System Statistics

11. Metrics

  • Application Level

    • photos processed per minute
    • average processing time per photo
    • apache requests
    • concurrent busy apache procs

12. Metrics

  • App-level meets system-level

Here, total CPU = ~1.12 * # busy apache procs (YMMV).

13. 2400 photos per minute being uploaded right NOW (Tuesday afternoon)

14. Ceilings: the maximum amount of work your resources will allow before degradation or failure

15. Forget Benchmarking

16. Find your ceilings (graph annotations: what you have left; The End)

17. Use real live production data to find ceilings. Production: it's like a lab, but bigger!

18. Like: database ceilings. Replication lag: bad!

19. Ceilings: waiting on disk. Too much sustained disk I/O wait (>40%) creates slave lag*

* for us, YMMV

20. 35,000 photo requests per second on a Tuesday peak

21. Safety Factors

22. Safety Factors: Ceiling * Factor of Safety = UR LIMITZ

23. Safety Factors: webserver!

24. Safety Factors: safe ceiling @ 85% CPU. 85% total CPU = ~76 busy apache procs (graph annotation: what you have left)

25. Safety Factors: Yahoo Front Page link to Chinese New Year photos (graph: photo requests/second, 8% spike)

26. Forecasting

27. Forecasting: fictional example, webservers

28. Forecasting: fictional example, 15 webservers, 1 week (graph annotation: peak of the week)

29. Forecasting: ...bigger sample, 6 weeks... isolate the peaks...

30. Forecasting: ...add a trendline with some decent correlation... (graph annotations: now; not too shabby)

31. Forecasting: 15 servers @ 76 busy apache proc limit = 1140 total procs (graph annotations: when is this?; this will tell you when it is; ceiling; what you have left)

32. Forecasting: (1140 - 726) / 42.751 = 9.68 (week #10, duh)

33. Forecasting Automation

  • Writing Excel macros is boring
  • All we want is days remaining, so all we need is the curve-fit
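The safety-factor and trendline arithmetic from the webserver example can be reproduced in a few lines of Python. This is only an illustrative sketch, not the tooling used at Flickr. The function names are invented for this example. The 1.12 CPU-per-busy-proc ratio, the 85% safe ceiling, the 15 servers, and the figures 726 and 42.751 come from the slides; reading 726 as the trendline intercept and 42.751 as its slope (in busy procs per week) is an assumption based on the form of the calculation, as is the idea that growth stays linear.

```python
def safe_procs_per_server(safe_cpu_pct=85.0, cpu_pct_per_proc=1.12):
    """Busy apache procs a single server can carry at the safe CPU ceiling."""
    return round(safe_cpu_pct / cpu_pct_per_proc)


def cluster_ceiling(servers, procs_per_server):
    """Total busy procs the whole pool can carry at the safe ceiling."""
    return servers * procs_per_server


def fit_trendline(peaks):
    """Least-squares line through (week, peak) points; weeks are numbered from 1.

    Returns (slope, intercept).
    """
    n = len(peaks)
    if n < 2:
        raise ValueError("need at least two weekly peaks to fit a line")
    weeks = range(1, n + 1)
    mean_x = sum(weeks) / n
    mean_y = sum(peaks) / n
    sxx = sum((x - mean_x) ** 2 for x in weeks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, peaks))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def week_ceiling_is_hit(ceiling, slope, intercept):
    """Week number at which slope * week + intercept reaches the ceiling."""
    if slope <= 0:
        raise ValueError("peaks are not growing; the trendline never reaches the ceiling")
    return (ceiling - intercept) / slope


if __name__ == "__main__":
    per_server = safe_procs_per_server()        # 85 / 1.12, rounded
    ceiling = cluster_ceiling(15, per_server)   # 15 webservers
    week = week_ceiling_is_hit(ceiling, slope=42.751, intercept=726)
    print(per_server, ceiling, round(week, 2))
```

With real data, `fit_trendline` would be fed the isolated weekly peaks, and its slope and intercept passed to `week_ceiling_is_hit` in place of the hard-coded slide figures.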

Forecasting Automation: use http://fityk.sf.net to automate the curve-fit.

34. Forecasting: fictional example, storage consumption

35. Forecasting Automation: actual Flickr storage consumption from early 2005, in GB (ceiling is fictional) (graph annotation: this will tell you when this is)

36. Forecasting Automation: cmd line script output

    [jallspaw:~]$ cfityk ./
    1> # Fityk script. Fityk version: 0.8.2
    2> @0 < '/home/jallspaw/storage-consumption.xy'
    15 points. No explicit std. dev. Set as sqrt(y)
    3> guess Quadratic
    New function %_1 was created.
    4> fit
    Initial values: lambda=0.001 WSSR=464.564
    #1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
    #2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
    #3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
    #4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
    Fit converged.
    Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
    5> info formula in @0
    # storage-consumption
    14147.4+146.657*x+0.786854*x^2
    6> quit
    bye...

37. Forecasting Automation (SAME)

    fityk gave: y = 0.786854*x^2 + 146.657*x + 14147.4 (R^2 = 99.84)
    Excel gave: y = 0.7675*x^2 + 146.96*x + 14147.3 (R^2 = 99.84)

38. Capacity Health

  • 12,629 nagios checks
  • 1314 hosts
  • 6 datacenters
  • 4 photo farms
  • farm = 2 DCs (east/west)
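Going back to the storage forecast: once the curve-fit has produced a quadratic, finding "when" is a matter of solving it for the ceiling. The sketch below does that with the coefficients fityk reported. It is an illustration under stated assumptions: the function names are invented, the 20,000 GB ceiling is a made-up value (the slides only say the ceiling is fictional), and x is in whatever time unit the input data file used.

```python
import math


def storage_at(x, a, b, c):
    """Evaluate the fitted curve a*x^2 + b*x + c."""
    return a * x * x + b * x + c


def x_at_ceiling(a, b, c, ceiling):
    """Smallest x >= 0 at which a*x^2 + b*x + c equals the ceiling, or None."""
    k = c - ceiling
    if a == 0:
        if b == 0:
            return None              # flat line never reaches a different level
        roots = [-k / b]
    else:
        disc = b * b - 4 * a * k
        if disc < 0:
            return None              # curve never touches the ceiling
        r = math.sqrt(disc)
        roots = [(-b - r) / (2 * a), (-b + r) / (2 * a)]
    future = [x for x in roots if x >= 0]
    return min(future) if future else None


if __name__ == "__main__":
    # Coefficients as reported by fityk on the slide.
    A, B, C = 0.786854, 146.657, 14147.4
    HYPOTHETICAL_CEILING_GB = 20000.0   # assumption, not from the slides
    print(x_at_ceiling(A, B, C, HYPOTHETICAL_CEILING_GB))
```

Subtracting the x value of the most recent data point from the result gives the time remaining, which is the number the dashboard needs.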

39. High and Low Water Marks: per server, squid requests per second (graph annotations: alert if higher; alert if lower)

40. A good dashboard looks something like... (yes, fictional numbers)

    type       # boxes   limit/box   ceiling units   limit (total)   current (peak)   %peak    est. days left
    www        20        80          busy procs      1600            1000             62.50%   36
    shard db   20        40          I/O wait        800             220              27.50%   120
    squid      18        950         req/sec         17,100          11,400           66.67%   48

41. Diagonal Scaling

  • Image processing machines
  • Replace Dell PE860s with HP DL140G3s
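The numbers quoted for this hardware swap in the image-processing example of this deck (23 Dell PE860s at ~45 images/min each drawing 3008.4 Watts, versus 8 HP DL140 G3s at ~140 images/min each drawing 1036.8 Watts) can be compared with a small Python sketch. The `Fleet` class and its field names are invented for illustration, and the per-box rates are the approximate peak figures from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fleet:
    name: str
    boxes: int
    images_per_min_per_box: float   # approximate peak rate per box
    total_watts: float
    rack_units: int

    @property
    def throughput(self):
        """Total images processed per minute across the fleet."""
        return self.boxes * self.images_per_min_per_box

    @property
    def watts_per_box(self):
        return self.total_watts / self.boxes

    @property
    def images_per_min_per_watt(self):
        return self.throughput / self.total_watts


def reduction_pct(before, after):
    """Percentage by which 'after' is smaller than 'before'."""
    return 100.0 * (before - after) / before


OLD = Fleet("Dell PE860", boxes=23, images_per_min_per_box=45,
            total_watts=3008.4, rack_units=23)
NEW = Fleet("HP DL140 G3", boxes=8, images_per_min_per_box=140,
            total_watts=1036.8, rack_units=8)

if __name__ == "__main__":
    for fleet in (OLD, NEW):
        print(fleet.name, fleet.throughput, round(fleet.watts_per_box, 1),
              round(fleet.images_per_min_per_watt, 3))
    print("power saved %:", round(reduction_pct(OLD.total_watts, NEW.total_watts), 1))
    print("rack space saved %:", round(reduction_pct(OLD.rack_units, NEW.rack_units), 1))
```

The per-box power draw is nearly identical in both fleets, so the savings come from each box doing roughly three times the work.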

Diagonal scaling: vertically scaling your already horizontal nodes.

42. Diagonal Scaling example: image processing. 4 cores to 8 cores (about the same CPU usage per box)

43. Diagonal Scaling example: image processing throughput. ~45 images/min @ peak to ~140 images/min @ peak (same CPU usage, but ~3x more work). Processing means making 4 sizes from originals.

44. Diagonal Scaling example: image processing

                 went from:        to:
    boxes        23 Dell PE860s    8 HP DL140 G3s
    throughput   1035 photos/min   1120 photos/min (75% faster, even)
    power        3008.4 Watts      1036.8 Watts
    rack space   23U               8U !!!

45. 3.52 terabytes will be consumed today (on a Tuesday)

46. 2nd Order Effects (beware the wandering bottleneck): running hot, so add more

47. 2nd Order Effects (beware the wandering bottleneck): running great now, so more traffic! Now these run hot

48. Stupid Capacity Tricks

49. Stupid Capacity Tricks: quick and dirty management (DSH)

    [root@netmon101 ~]# cat group.of.servers
    www100
    www118
    dbcontacts3
    admin1
    admin2

50. Stupid Capacity Tricks: quick and dirty management

    [root@netmon101 ~]# dsh -N group.of.servers
    dsh> date
    executing 'date'
    www100: Mon Jun 23 14:14:53 UTC 2008
    www118: Mon Jun 23 14:14:53 UTC 2008
    dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
    admin1: Mon Jun 23 14:14:53 UTC 2008
    admin2: Mon Jun 23 14:14:53 UTC 2008
    dsh>

51. Stupid Capacity Tricks: Turn Stuff OFF

  • Disable heavy-ish features of the site (on/off switches)
    • We have 195 different things to disable in case of emergency.
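One way to picture the on/off switches above is a small registry that request handlers consult before doing expensive work. The sketch below is a hypothetical illustration in Python, not a description of how Flickr implements its switches; the class, method, and feature names are all invented for the example.

```python
class FeatureSwitches:
    """In-memory registry of features that can be turned off in an emergency."""

    def __init__(self, features):
        # Every known feature starts enabled.
        self._enabled = {name: True for name in features}

    def _require(self, name):
        if name not in self._enabled:
            raise KeyError(f"unknown feature: {name}")

    def disable(self, name):
        self._require(name)
        self._enabled[name] = False

    def enable(self, name):
        self._require(name)
        self._enabled[name] = True

    def is_enabled(self, name):
        self._require(name)
        return self._enabled[name]

    def disabled(self):
        """Names of everything currently switched off, sorted for display."""
        return sorted(n for n, on in self._enabled.items() if not on)


if __name__ == "__main__":
    # Hypothetical feature names, chosen only for this example.
    switches = FeatureSwitches(["photo_uploads", "video_uploads", "search"])
    switches.disable("video_uploads")
    if not switches.is_enabled("video_uploads"):
        print("video uploads are temporarily unavailable")
    print(switches.disabled())
```

In production the switch state would need to live somewhere every web server can read it quickly, which this in-memory sketch does not attempt.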

52. Stupid Capacity Tricks: Turn Stuff OFF

  • uploads (photo)
  • uploads (video)
  • uploads by email
  • various API things
  • various mobile things
  • various search things
  • etc., etc.

53. Stupid Capacity Tricks: Outages Happen

  • Host your outage/status/blog page in more than one datacenter.
  • Tell your users WTF is going on, they'll appreciate it.

54. Stupid Capacity Tricks: Hit the Pause Button

  • Bake the dynamic into static
  • Some Y! properties have a big red button to instantly bake (and un-bake) at will
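The "bake" idea can be sketched as a wrapper that serves a stored snapshot of a page while baked, and renders dynamically otherwise. This is a hypothetical Python illustration of the concept only; it says nothing about how the Y! properties actually implement their button, and every name in it is invented.

```python
class BakeablePage:
    """Wraps a dynamic render function with a bake/un-bake switch."""

    def __init__(self, render):
        self._render = render     # callable that builds the page dynamically
        self._snapshot = None
        self._baked = False

    def bake(self):
        """Render once and serve that static copy until un-baked."""
        self._snapshot = self._render()
        self._baked = True

    def unbake(self):
        """Go back to rendering dynamically on every request."""
        self._snapshot = None
        self._baked = False

    def serve(self):
        return self._snapshot if self._baked else self._render()


if __name__ == "__main__":
    hits = {"count": 0}

    def front_page():
        # Stand-in for an expensive dynamic page build.
        hits["count"] += 1
        return f"<html>render #{hits['count']}</html>"

    page = BakeablePage(front_page)
    page.bake()
    for _ in range(3):
        page.serve()              # all served from the snapshot
    print(hits["count"])
```

While baked, the backend cost per request drops to returning a stored string, at the price of serving stale content until the page is un-baked.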

55. thanks

56. We're Hiring! Come see me!

57. questions?