Capacity Management Presentation

Capacity Management

for Web Operations

John Allspaw, Operations Engineering

the book I’m writing

Rules of Thumb

Planning/Forecasting

Stupid Capacity Tricks

(with some Flickr statistics sprinkled in)

bugs (disguised as capacity problems)

edge cases (disguised as capacity problems)

security incidents

real capacity problems*

* (should be the last thing you need to worry about)

Things that can cause downtime

Capacity != Performance

Forget about performance for right now

Measure what you have right NOW

Don’t count on it getting any better

Thank You HPC Industry!

Automated Stuff

Scalable Metric Collection/Display

a lot of great deployment and management tricks come from them, adopted by web ops

Good Measurement Tools

record and store
metrics in/out
custom metrics
easily compare
lightweight-ish

Clouds need planning too

Makes deployment and procurement easy and quick

But clouds are still resources with costs and limits, just like your own stuff

Black-boxes: you may need to pay even more attention than before

Metrics

System Statistics

Metrics: “Application” Level

(photos processed per minute)

(average processing time per photo)

(apache requests)

(concurrent busy apache procs)

Metrics: App-level meets system-level

here, total CPU = ~1.12 * # busy apache procs (ymmv)
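A minimal sketch of how a ratio like this one could be estimated: a least-squares fit through the origin of total CPU against busy Apache procs. The sample pairs below are made up for illustration and are not Flickr data; the function name is hypothetical.

```python
def fit_ratio(busy_procs, cpu_percent):
    """Least-squares slope k for cpu = k * busy_procs (line through the origin)."""
    num = sum(p * c for p, c in zip(busy_procs, cpu_percent))
    den = sum(p * p for p in busy_procs)
    return num / den

# Hypothetical (busy procs, total CPU %) samples taken at the same timestamps.
procs = [10, 25, 40, 60, 70]
cpu = [11.2, 28.0, 44.8, 67.2, 78.4]

k = fit_ratio(procs, cpu)
print(f"total CPU ~= {k:.2f} * busy apache procs")
```

Your own ratio will differ (ymmv); the point is that it comes from measured production pairs, not from a benchmark.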

2400

photos per minute being uploaded right NOW (Tuesday afternoon)

Ceilings

the maximum amount of “work” your resources will allow before degradation or failure

Forget Benchmarking

Find your ceilings

The End

what you have left

Use real live production data

to find ceilings

Production: “it’s like a lab, but bigger!”

Like: database ceilings

replication lag: bad!

Ceilings

waiting on disk too much

sustained disk I/O wait >40% creates slave lag*

* for us, YMMV
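One way to turn "sustained" into something checkable, as a rough sketch: flag a database when I/O wait stays above the threshold for several consecutive samples. The 40% threshold is from the slide; the run length of 5 samples and the sample values are assumptions.

```python
def sustained_above(samples, threshold=40.0, min_run=5):
    """True if `samples` has at least `min_run` consecutive values above `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_run:
            return True
    return False

# Hypothetical I/O wait percentages, one per collection interval.
iowait = [12, 38, 45, 47, 52, 49, 44, 30]
print(sustained_above(iowait))
```

A single spike above 40% resets quickly and is ignored; only a run of high samples counts as hitting the ceiling.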

35,000 photo requests per second on a Tuesday peak

Safety Factors

Safety Factors

Ceiling * Factor of Safety = UR LIMITZ

Safety Factors

webserver!

what you have left

“safe” ceiling

@85% CPU

Safety Factors

85% total CPU = ~76 busy apache procs
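The arithmetic behind that number, as a small sketch. The 1.12 ratio and the 85% safe ceiling come from the slides; treating 100% total CPU as the hard ceiling is an assumption.

```python
CPU_PER_BUSY_PROC = 1.12   # from the slides: total CPU ~= 1.12 * busy apache procs
CEILING_CPU = 100.0        # assumption: the hard ceiling is 100% total CPU
SAFETY_FACTOR = 0.85       # the "safe" ceiling at 85% CPU

# Ceiling * Factor of Safety = limit, then convert CPU back into busy procs.
safe_cpu = CEILING_CPU * SAFETY_FACTOR
safe_busy_procs = round(safe_cpu / CPU_PER_BUSY_PROC)
print(safe_cpu, safe_busy_procs)
```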

Safety Factors: Yahoo Front Page link to Chinese New Year photos

(photo requests/second)

(8% spike)

Forecasting

Forecasting

Fictional Example: webservers

Forecasting

Fictional example: 15 webservers. 1 week.

peak of the week

...bigger sample, 6 weeks....isolate the peaks...

Forecasting

...”Add a Trendline” with some decent correlation...

Forecasting

not too shabby

now

Forecasting

15 servers @76 busy apache proc limit = 1140 total procs

when is this?

this will tell you when it is

ceiling

what you have left

Forecasting

(week #10, duh)

(1140-726) / 42.751 = 9.68
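The same trendline math without a spreadsheet, as a rough sketch: fit a straight line to the weekly peaks, then divide the remaining headroom by the slope. The weekly peak values are hypothetical; the last call reuses the slide's own slope (42.751) and current peak (726).

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def weeks_remaining(limit, current, slope):
    """Weeks until `current` reaches `limit` at `slope` units per week."""
    if slope <= 0:
        return float("inf")   # flat or shrinking: no crossing predicted
    return (limit - current) / slope

# Hypothetical weekly peaks of total busy procs (not the slide's data).
weeks = [1, 2, 3, 4, 5, 6]
peaks = [510, 555, 600, 640, 685, 726]
slope, intercept = linear_fit(weeks, peaks)
limit = 15 * 76   # 15 servers at 76 busy procs each
print(round(weeks_remaining(limit, peaks[-1], slope), 2))

# The slide's own numbers.
print(round(weeks_remaining(1140, 726, 42.751), 2))
```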

Writing excel macros is boring

All we want is “days remaining”, so all we need is the curve-fit

Forecasting Automation

Use http://fityk.sf.net to automate the curve-fit


Forecasting

Fictional Example: storage consumption


actual flickr storage consumption from early 2005, in GB

(ceiling is fictional)

this will tell you when this is

Forecasting Automation: cmd line script output

[jallspaw:~]$ cfityk ./fit-storage.fit
1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values: lambda=0.001 WSSR=464.564
#1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
#2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
#3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
#4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...


(SAME)

fityk gave:

y = 0.786854x^2 + 146.657x + 14147.4

(R^2 = 99.84)

Excel gave:

y = 0.7675x^2 + 146.96x + 14147.3

(R^2 = 99.84)
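Once the curve-fit exists, finding "when" is just solving the quadratic for the ceiling. A minimal sketch using the coefficients fityk reported; the ceiling value is hypothetical (the slides call their ceiling fictional and give no number), and x is in whatever time unit the data file uses.

```python
import math

# Coefficients from the fityk fit: y = a*x^2 + b*x + c (y in GB).
a, b, c = 0.786854, 146.657, 14147.4

def x_at_ceiling(ceiling_gb):
    """Smallest non-negative x where the fitted curve reaches ceiling_gb, or None."""
    disc = b * b - 4 * a * (c - ceiling_gb)
    if disc < 0:
        return None
    roots = [(-b - math.sqrt(disc)) / (2 * a), (-b + math.sqrt(disc)) / (2 * a)]
    candidates = [r for r in roots if r >= 0]
    return min(candidates) if candidates else None

CEILING_GB = 20000   # hypothetical ceiling, not taken from the slides
x = x_at_ceiling(CEILING_GB)
print(f"curve reaches {CEILING_GB} GB at x = {x:.1f}")
```

Subtracting the x of the most recent data point from this value gives the "time remaining" figure a dashboard would show.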

Capacity Health

12,629 nagios checks

1314 hosts

6 datacenters

4 photo “farms”

farm = 2 DCs (east/west)

High and Low Water Marks

alert if higher

alert if lower

Per server, squid requests per second
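A sketch of a water-mark check for a per-server metric. The thresholds, host names, and request rates are hypothetical; the idea from the slide is only that a value that is too low is as suspicious as one that is too high.

```python
def check_water_marks(value, low, high):
    """Return 'high', 'low', or 'ok' for a metric against its water marks."""
    if value > high:
        return "high"
    if value < low:
        return "low"
    return "ok"

# Hypothetical per-server squid requests/sec and thresholds.
LOW, HIGH = 200, 950
for host, rps in {"squid1": 640, "squid2": 35, "squid3": 1020}.items():
    print(host, check_water_marks(rps, LOW, HIGH))
```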

A good dashboard looks something like...

type      #   limit/box  ceiling units  limit (total)  current (peak)  % peak  est. days left
www       20  80         busy procs     1,600          1,000           62.50%  36
shard db  20  40         I/O wait       800            220             27.50%  120
squid     18  950        req/sec        17,100         11,400          66.67%  48

(yes, fictional numbers)
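The derived dashboard columns are simple arithmetic on the per-box limits. A sketch using the same fictional numbers; the "est. days left" column is not computed here because it depends on the separate growth forecast.

```python
# (type, boxes, limit per box, ceiling units, current peak), from the fictional dashboard.
rows = [
    ("www", 20, 80, "busy procs", 1000),
    ("shard db", 20, 40, "I/O wait", 220),
    ("squid", 18, 950, "req/sec", 11400),
]

def summarize(boxes, limit_per_box, current_peak):
    """Return (total limit, percent of the limit used at peak)."""
    total = boxes * limit_per_box
    return total, round(100.0 * current_peak / total, 2)

for name, boxes, per_box, units, peak in rows:
    total, pct = summarize(boxes, per_box, peak)
    print(f"{name:<9} {total:>6} {units:<10} {peak:>6} {pct:6.2f}%")
```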

Diagonal Scaling

Image processing machines

Replace Dell PE860s with HP DL140G3s

vertically scaling your already horizontal nodes

Diagonal Scaling example: image processing

4 cores

8 cores

(about the same CPU “usage” per box)

~45 images/min @ peak

~140 images/min @ peak

(same CPU usage, but ~3x more work)

“processing” means making 4 sizes from originals


            went from:       to:
servers     23 Dell PE860s   8 HP DL140 G3s
throughput  1035 photos/min  1120 photos/min
power       3008.4 Watts     1036.8 Watts
rack        23U              8U

(75% faster, even)

!!!
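To compare the two fleets on equal terms, normalize throughput by box, by watt, and by rack unit. A small sketch using only the figures on the slide; the dictionary keys and function name are illustrative.

```python
# Fleet figures from the slide.
before = {"boxes": 23, "photos_per_min": 1035, "watts": 3008.4, "rack_units": 23}
after = {"boxes": 8, "photos_per_min": 1120, "watts": 1036.8, "rack_units": 8}

def efficiency(fleet):
    """Photos per minute per box, per watt, and per rack unit."""
    return {
        "per_box": fleet["photos_per_min"] / fleet["boxes"],
        "per_watt": fleet["photos_per_min"] / fleet["watts"],
        "per_rack_unit": fleet["photos_per_min"] / fleet["rack_units"],
    }

for label, fleet in (("before", before), ("after", after)):
    print(label, {k: round(v, 2) for k, v in efficiency(fleet).items()})
```

The per-box figures line up with the ~45 and ~140 images/min peaks quoted for the 4-core and 8-core machines.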

3.52

terabytes will be consumed today (on a Tuesday)

2nd Order Effects (beware the wandering bottleneck)

running hot, so add more

2nd Order Effects (beware the wandering bottleneck)

running great now, so more traffic!

now these

run hot

Stupid Capacity Tricks

Stupid Capacity Tricks: quick and dirty management

DSH: http://freshmeat.net/projects/dsh

[root@netmon101 ~]# cat group.of.servers
www100
www118
dbcontacts3
admin1
admin2

http://dsh.sf.net/

Stupid Capacity Tricks: quick and dirty management

[root@netmon101 ~]# dsh -N group.of.servers
dsh> date
executing 'date'
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
dsh>
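If dsh is not available, the same quick-and-dirty idea can be approximated in a few lines. This is a rough serial stand-in, not how dsh itself works: it assumes passwordless ssh to every host, and the function names and the `runner` parameter are hypothetical.

```python
import subprocess

def read_group(path):
    """Host names from a group file: one per line, blank lines ignored."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]

def run_everywhere(hosts, command, runner=subprocess.run):
    """Run `command` on each host over ssh and return {host: stdout}."""
    results = {}
    for host in hosts:
        proc = runner(["ssh", host, command], capture_output=True, text=True)
        results[host] = proc.stdout.strip()
    return results

# Usage (needs real hosts and ssh keys):
#   for host, out in run_everywhere(read_group("group.of.servers"), "date").items():
#       print(f"{host}: {out}")
```

Running `date` this way makes the kind of drift visible in the output above easy to spot: one host reporting PDT while the rest report UTC.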

Stupid Capacity Tricks: Turn Stuff OFF

Disable heavy-ish features of the site (on/off switches)

We have 195 different things to disable in case of emergency.

Stupid Capacity Tricks: Turn Stuff OFF

uploads (photo)

uploads (video)

uploads by email

various API things

various mobile things

various search things

etc., etc.
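A minimal sketch of what an on/off switch can look like in application code. The flag names and handler are hypothetical; the slides only say that such switches exist, not how they are implemented.

```python
# Hypothetical flag names; in practice these would live in shared config.
FLAGS = {
    "uploads_photo": True,
    "uploads_video": True,
    "uploads_by_email": True,
}

def is_enabled(flag):
    """Unknown flags count as disabled, so a typo fails closed."""
    return FLAGS.get(flag, False)

def disable(flag):
    """Flip a feature off during an emergency."""
    FLAGS[flag] = False

def handle_video_upload(payload):
    """Refuse work cheaply when the feature is switched off."""
    if not is_enabled("uploads_video"):
        return "503 video uploads are temporarily disabled"
    return f"200 accepted {len(payload)} bytes"

disable("uploads_video")
print(handle_video_upload(b"..."))
```

The check sits at the very top of the handler so a disabled feature costs almost nothing per request.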

Stupid Capacity Tricks: Outages Happen

Host your outage/status/blog page in more than one datacenter.

Tell your users WTF is going on, they’ll appreciate it.

Stupid Capacity Tricks: Hit the Pause Button

Bake the dynamic into static

Some Y! properties have a big red button to instantly bake (and un-bake) at will

thanks

http://flickr.com/photos/bondidwhat/402089763/

http://flickr.com/photos/74876632@N00/2394833962/

http://flickr.com/photos/42311564@N00/220394633/

http://flickr.com/photos/unloveable/2422483859/

http://flickr.com/photos/absolutwade/149702085/

http://flickr.com/photos/krawiec/521836276/

http://flickr.com/photos/eschipul/1560875648/

http://flickr.com/photos/library_of_congress/2179060841/

http://flickr.com/photos/jekkyl/511187885/

http://flickr.com/photos/ab8wn/368021672/

http://flickr.com/photos/jaxxon/165559708/

http://flickr.com/photos/sparktography/75499095/

We’re Hiring! flickr.com/jobs

Come see me!

questions?
