Capacity Management Presentation

Capacity Management for Web Operations John Allspaw Operations Engineering

description

Planning and managing capacity for a fast-growing website can be a balancing act between buying too little/late, and too much/soon. Your process of capacity planning should be *adaptive*, *adjustable*, and include more than just system statistics. Measurement, architecture, and economics are all equally important to having your site perform.

Transcript of Capacity Management Presentation

Page 1: Capacity Management Presentation

Capacity Management

for Web Operations

John Allspaw, Operations Engineering

Page 2: Capacity Management Presentation

the book I’m writing

Page 3: Capacity Management Presentation

???

Page 4: Capacity Management Presentation

Rules of Thumb

Planning/Forecasting

Stupid Capacity Tricks

(with some Flickr statistics sprinkled in)

Page 5: Capacity Management Presentation

bugs (disguised as capacity problems)

edge cases (disguised as capacity problems)

security incidents

real capacity problems*

* (should be the last thing you need to worry about)

Things that can cause downtime

Page 6: Capacity Management Presentation

Capacity != Performance

Forget about performance for right now

Measure what you have right NOW

Don’t count on it getting any better

Page 7: Capacity Management Presentation

Thank You HPC Industry!

Automated Stuff

Scalable Metric Collection/Display

a lot of great deployment and management tricks come from them, adopted by web ops

Page 8: Capacity Management Presentation

Good Measurement Tools

record and store
metrics in/out
custom metrics
easily compare
lightweight-ish

Page 9: Capacity Management Presentation

Clouds need planning too

Makes deployment and procurement easy and quick

But clouds are still resources with costs and limits, just like your own stuff

Black-boxes: you may need to pay even more attention than before

Page 10: Capacity Management Presentation

Metrics

System Statistics

Page 11: Capacity Management Presentation

Metrics: “Application” Level

(photos processed per minute)

(average processing time per photo)

(apache requests)

(concurrent busy apache procs)

Page 12: Capacity Management Presentation

Metrics: App-level meets system-level

here, total CPU = ~1.12 * # busy apache procs (ymmv)
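A ratio like the one on this slide can be estimated from paired samples of the two metrics. The following is a minimal sketch, assuming a least-squares fit of a line through the origin; the sample values and the name `fit_ratio` are hypothetical, and only the ~1.12 figure comes from the slide.

```python
def fit_ratio(busy_procs, cpu_pct):
    """Least-squares slope k for cpu_pct = k * busy_procs (line through the origin)."""
    num = sum(x * y for x, y in zip(busy_procs, cpu_pct))
    den = sum(x * x for x in busy_procs)
    return num / den

# Hypothetical paired samples (busy apache procs, total CPU %); not Flickr data.
sample_procs = [10, 25, 40, 60]
sample_cpu = [11.2, 28.0, 44.8, 67.2]

k = fit_ratio(sample_procs, sample_cpu)
print(f"total CPU % ~= {k:.2f} * busy apache procs")
```

Real samples will scatter around the line, so it is worth checking the fit visually before relying on the ratio.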

Page 13: Capacity Management Presentation

2400

photos per minute being uploaded right NOW (Tuesday afternoon)

Page 14: Capacity Management Presentation

Ceilings

the maximum amount of “work” your resources will allow before degradation or failure

Page 15: Capacity Management Presentation

Forget Benchmarking

Page 16: Capacity Management Presentation

Find your ceilings

The End

what you have left

Page 17: Capacity Management Presentation

Use real live production data

to find ceilings

Production: “it’s like a lab, but bigger!”

Page 18: Capacity Management Presentation

Like: database ceilings

replication lag: bad!

Page 19: Capacity Management Presentation

Ceilings

waiting on disk too much

sustained disk I/O wait >40% creates slave lag*

* for us, YMMV
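The word "sustained" matters here: a single spike is not the same as a ceiling. A minimal sketch of one way to detect a sustained breach, assuming evenly spaced samples; the function name, the sample values, and the run lengths are hypothetical, and only the 40% threshold comes from the slide.

```python
def sustained_over(samples, threshold, min_consecutive):
    """True if samples contain a run of at least min_consecutive values above threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False

# Hypothetical disk I/O wait samples, in percent.
iowait_pct = [12, 45, 38, 41, 44, 47, 43, 30]

print(sustained_over(iowait_pct, 40, 4))  # longest run above 40% is 4 samples
print(sustained_over(iowait_pct, 40, 5))
```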

Page 20: Capacity Management Presentation

35,000 photo requests per second on a Tuesday peak

Page 21: Capacity Management Presentation

Safety Factors

Page 22: Capacity Management Presentation

Safety Factors

Ceiling * Factor of Safety = UR LIMITZ

Page 23: Capacity Management Presentation

Safety Factors

webserver!

Page 24: Capacity Management Presentation

what you have left

“safe” ceiling

@85% CPU

Safety Factors

85% total CPU = ~76 busy apache procs
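The arithmetic behind this slide can be sketched as follows, assuming the ~1.12 CPU-per-busy-proc ratio from the earlier metrics slide still holds. The function and constant names are illustrative only.

```python
CPU_PER_BUSY_PROC = 1.12  # total CPU % per busy apache proc, from the metrics slide (ymmv)
SAFE_CPU_PCT = 85.0       # the "safe" ceiling on this slide

def safe_busy_procs(cpu_limit_pct, cpu_per_proc):
    """Convert a CPU limit into the equivalent number of busy apache procs."""
    return cpu_limit_pct / cpu_per_proc

limit = safe_busy_procs(SAFE_CPU_PCT, CPU_PER_BUSY_PROC)
print(round(limit))  # 85 / 1.12 is about 75.9, so ~76 busy procs
```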

Page 25: Capacity Management Presentation

Safety Factors

Yahoo Front Page link to Chinese New Year photos

(photo requests/second)

(8% spike)

Page 26: Capacity Management Presentation

Forecasting

Page 27: Capacity Management Presentation

Forecasting

Fictional example: webservers

Page 28: Capacity Management Presentation

Forecasting

Fictional example: 15 webservers. 1 week.

peak of the week

Page 29: Capacity Management Presentation

...bigger sample, 6 weeks....isolate the peaks...

Forecasting

Page 30: Capacity Management Presentation

...“Add a Trendline” with some decent correlation...

Forecasting

not too shabby

now

Page 31: Capacity Management Presentation

Forecasting

15 servers @76 busy apache proc limit = 1140 total procs

when is this?

this will tell you when it is

ceiling

what you have left

Page 32: Capacity Management Presentation

Forecasting

(week #10, duh)

(1140-726) / 42.751 = 9.68
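The calculation on this slide can be sketched as below, reading 726 as the current weekly peak of busy procs and 42.751 as the trendline slope per week. That reading is implied by the slides but not stated, so treat it as an assumption.

```python
def weeks_until_limit(limit, current_peak, growth_per_week):
    """Weeks until a linear trend starting at current_peak reaches limit."""
    if growth_per_week <= 0:
        raise ValueError("no growth: the limit is never reached on this trend")
    return (limit - current_peak) / growth_per_week

servers = 15
per_server_limit = 76                      # busy apache procs per server
total_limit = servers * per_server_limit   # 1140 total procs

weeks = weeks_until_limit(total_limit, 726, 42.751)
print(f"{weeks:.2f} weeks remaining")
```

Rounding up gives week 10 as the point where more capacity has to be in place.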

Page 33: Capacity Management Presentation

Writing excel macros is boring

All we want is “days remaining”, so all we need is the curve-fit

Forecasting Automation

Use http://fityk.sf.net to automate the curve-fit

Page 34: Capacity Management Presentation

Forecasting

Fictional example: storage consumption

Page 35: Capacity Management Presentation

Forecasting Automation

actual flickr storage consumption from early 2005, in GB

(ceiling is fictional)

this will tell you when this is

Page 36: Capacity Management Presentation

Forecasting Automation: cmd line script

output

[jallspaw:~]$ cfityk ./fit-storage.fit
1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values: lambda=0.001 WSSR=464.564
#1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
#2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
#3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
#4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...

Page 37: Capacity Management Presentation

Forecasting Automation

(SAME)

fityk gave:

y = 0.786854x^2 + 146.657x + 14147.4

(R^2 = 99.84)

Excel gave:

y = 0.7675x^2 + 146.96x + 14147.3

(R^2 = 99.84)
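Once the curve-fit is available, finding when storage hits the ceiling means solving the quadratic for x. A minimal sketch using the fityk coefficients above; the ceiling value is hypothetical (the slides call the ceiling fictional and give no number), and x is in whatever units the input data file uses.

```python
import math

def x_at_ceiling(a, b, c, ceiling):
    """Non-negative x where a*x^2 + b*x + c reaches ceiling (assumes a > 0)."""
    disc = b * b - 4 * a * (c - ceiling)
    if disc < 0:
        return None  # the curve never reaches the ceiling
    root = (-b + math.sqrt(disc)) / (2 * a)
    return root if root >= 0 else None

# Coefficients from the fityk output; the ceiling is a made-up value for illustration.
a, b, c = 0.786854, 146.657, 14147.4
HYPOTHETICAL_CEILING_GB = 20000.0

x = x_at_ceiling(a, b, c, HYPOTHETICAL_CEILING_GB)
print(f"ceiling reached at x = {x:.1f}")
```

Subtracting the current x from this value gives the "days remaining" style number the earlier slide asks for.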

Page 38: Capacity Management Presentation

Capacity Health

12,629 nagios checks

1314 hosts

6 datacenters

4 photo “farms”

farm = 2 DCs (east/west)

Page 39: Capacity Management Presentation

High and Low Water Marks

alert if higher

alert if lower

Per server, squid requests per second
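A high and low water mark check can be sketched as a simple band test. The thresholds and function name below are hypothetical. One common reason to alert on the low side is that a server receiving far less traffic than its peers may have dropped out of rotation, though the slide itself does not say this.

```python
def check_water_marks(value, low, high):
    """Classify a metric sample against low and high water marks."""
    if value > high:
        return "HIGH"
    if value < low:
        return "LOW"
    return "OK"

# Hypothetical per-server squid requests/sec thresholds.
LOW_MARK, HIGH_MARK = 100, 900

for sample in (40, 500, 1200):
    print(sample, check_water_marks(sample, LOW_MARK, HIGH_MARK))
```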

Page 40: Capacity Management Presentation

A good dashboard looks something like...

type       #    limit/box   ceiling units   limit (total)   current (peak)   % peak    est. days left
www        20   80          busy procs      1600            1000             62.50%    36
shard db   20   40          I/O wait        800             220              27.50%    120
squid      18   950         req/sec         17,100          11,400           66.67%    48

(yes, fictional numbers)
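The derived columns of such a dashboard are simple arithmetic on the per-box limit and the current peak. A minimal sketch using the fictional numbers from the slide; the function name and tuple layout are illustrative, and the "est. days left" column is omitted because it needs a growth trend the slide does not give.

```python
def summarize(count, limit_per_box, current_peak):
    """Return (total limit, current peak as a percentage of that limit)."""
    total = count * limit_per_box
    return total, 100.0 * current_peak / total

# (type, number of boxes, limit per box, ceiling units, current peak)
rows = [
    ("www", 20, 80, "busy procs", 1000),
    ("shard db", 20, 40, "I/O wait", 220),
    ("squid", 18, 950, "req/sec", 11400),
]

for name, count, per_box, units, peak in rows:
    total, pct = summarize(count, per_box, peak)
    print(f"{name:9s} limit={total:>6} {units:10s} peak={peak:>6} ({pct:.2f}%)")
```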

Page 41: Capacity Management Presentation

Diagonal Scaling

Image processing machines

Replace Dell PE860s with HP DL140G3s

vertically scaling your already horizontal nodes

Page 42: Capacity Management Presentation

Diagonal Scaling (example: image processing)

4 cores

8 cores

(about the same CPU “usage” per box)

Page 43: Capacity Management Presentation

~45 images/min @ peak

~140 images/min @ peak

(same CPU usage, but ~3x more work)

“processing” means making 4 sizes from originals

Diagonal Scaling (example: image processing)

throughput

Page 44: Capacity Management Presentation

Diagonal Scaling (example: image processing)

              went from:        to:
servers       23 Dell PE860s    8 HP DL140 G3s
throughput    1035 photos/min   1120 photos/min (75% faster, even)
rack space    23U               8U
power         3008.4 Watts      1036.8 Watts

!!!
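The before and after figures on this slide can be compared per box and per unit of work. A minimal sketch using only the slide's numbers; the dictionary keys and function names are illustrative.

```python
# Fleet figures from the slide: 23 Dell PE860s replaced by 8 HP DL140 G3s.
old = {"boxes": 23, "photos_per_min": 1035, "watts": 3008.4}
new = {"boxes": 8, "photos_per_min": 1120, "watts": 1036.8}

def per_box(fleet):
    """Photos processed per minute by each box."""
    return fleet["photos_per_min"] / fleet["boxes"]

def watts_per_photo_min(fleet):
    """Power drawn per unit of throughput (photos per minute)."""
    return fleet["watts"] / fleet["photos_per_min"]

for label, fleet in (("old", old), ("new", new)):
    print(f"{label}: {per_box(fleet):.0f} photos/min per box, "
          f"{watts_per_photo_min(fleet):.2f} W per photo/min")
```

The per-box figures match the ~45 and ~140 images/min quoted on the previous slide.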

Page 45: Capacity Management Presentation

3.52

terabytes will be consumed today (on a Tuesday)

Page 46: Capacity Management Presentation

2nd Order Effects (beware the wandering bottleneck)

running hot, so add more

Page 47: Capacity Management Presentation

2nd Order Effects (beware the wandering bottleneck)

running great now, so more traffic!

now these run hot

Page 48: Capacity Management Presentation

Stupid Capacity Tricks

Page 49: Capacity Management Presentation

Stupid Capacity Tricks: quick and dirty management

DSH: http://freshmeat.net/projects/dsh

[root@netmon101 ~]# cat group.of.servers

www100

www118

dbcontacts3

admin1

admin2

Page 50: Capacity Management Presentation

Stupid Capacity Tricks: quick and dirty management

[root@netmon101 ~]# dsh -N group.of.servers

dsh> date
executing 'date'
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
dsh>

Page 51: Capacity Management Presentation

Stupid Capacity Tricks: Turn Stuff OFF

Disable heavy-ish features of the site (on/off switches)

We have 195 different things to disable in case of emergency.

Page 52: Capacity Management Presentation

Stupid Capacity Tricks: Turn Stuff OFF

uploads (photo)

uploads (video)

uploads by email

various API things

various mobile things

various search things

etc., etc.
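On/off switches like these can be modeled as a small registry that application code consults before doing expensive work. This is a minimal sketch only, not Flickr's implementation; the class name, method names, and switch names are all hypothetical.

```python
class FeatureSwitches:
    """In-memory registry of on/off switches; every feature starts enabled."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}

    def disable(self, name):
        if name not in self._enabled:
            raise KeyError(name)
        self._enabled[name] = False

    def enable(self, name):
        if name not in self._enabled:
            raise KeyError(name)
        self._enabled[name] = True

    def is_enabled(self, name):
        return self._enabled[name]

# Hypothetical switch names, loosely following the list on the slide.
switches = FeatureSwitches(["uploads_photo", "uploads_video", "uploads_email"])
switches.disable("uploads_video")  # emergency: shed the heaviest feature first

if not switches.is_enabled("uploads_video"):
    print("video uploads are temporarily disabled")
```

In production the state would need to live somewhere every webserver can read quickly, such as a shared config file or cache, so that flipping a switch takes effect across the whole site.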

Page 53: Capacity Management Presentation

Host your outage/status/blog page in more than one datacenter.

Tell your users WTF is going on, they’ll appreciate it.

Stupid Capacity Tricks: Outages Happen

Page 54: Capacity Management Presentation

Stupid Capacity Tricks: Hit the Pause Button

Bake the dynamic into static

Some Y! properties have a big red button to instantly bake (and un-bake) at will

Page 55: Capacity Management Presentation

thanks

http://flickr.com/photos/bondidwhat/402089763/
http://flickr.com/photos/74876632@N00/2394833962/
http://flickr.com/photos/42311564@N00/220394633/
http://flickr.com/photos/unloveable/2422483859/
http://flickr.com/photos/absolutwade/149702085/
http://flickr.com/photos/krawiec/521836276/
http://flickr.com/photos/eschipul/1560875648/
http://flickr.com/photos/library_of_congress/2179060841/
http://flickr.com/photos/jekkyl/511187885/
http://flickr.com/photos/ab8wn/368021672/
http://flickr.com/photos/jaxxon/165559708/
http://flickr.com/photos/sparktography/75499095/

Page 56: Capacity Management Presentation

We’re Hiring! flickr.com/jobs

Come see me!

Page 57: Capacity Management Presentation

questions?