Scaling the guardian

Scaling the Guardian Michael Brunton-Spall (@bruntonspall) [email protected]

description

How does the Guardian website scale? With millions of page views per month, we need to think about scaling to an extreme level. But being Agile, we did it as we went.

Transcript of Scaling the guardian

Page 1: Scaling the guardian

Scaling the Guardian

Michael Brunton-Spall (@bruntonspall)

[email protected]

Page 2: Scaling the guardian

The Guardian - Some Figures

ABCe Audited (Dec 2009)
Unique Users - 36.9m per month, 1.8m per day
Page Impressions - 259m per month, 9.2m per day

Log file analysis
37m requests per day, 1.1bn requests per month - not including images / static files

Page 3: Scaling the guardian

Initial Architecture

Page 4: Scaling the guardian

Scaling Problems

The in-memory cache is an order of magnitude too small at 500 MB

Page 5: Scaling the guardian

Even Worse!

Cache is local to the app server
Adding an app server makes the problem worse

Page 6: Scaling the guardian

Our Solution

Memcached!
Or, more accurately, a distributed cache

Page 7: Scaling the guardian

Our Solution

Page 8: Scaling the guardian

Phase 1

Memcached object cache
Massive reduction in number of DB calls

No significant drop in DB load

Page 9: Scaling the guardian

Phase 2

Memcached query cache
Massive reduction in DB load
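The query cache in Phase 2 can be pictured as a cache-aside lookup keyed on the query text and its parameters. This is only a minimal sketch under assumptions: `FakeCache` is an in-process stand-in for a memcached client, and `query_key`, `cached_query`, `run_query` and the SQL string are hypothetical names invented for illustration, not the Guardian's code.

```python
import hashlib
import time


class FakeCache:
    """In-process stand-in for a memcached client (hypothetical)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.time() >= expires_at:
            del self._data[key]  # entry has expired, treat as a miss
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = time.time() + ttl if ttl else None
        self._data[key] = (value, expires_at)


def query_key(sql, params):
    """Build a cache key from the query text and its parameters."""
    raw = repr((sql, tuple(params))).encode("utf-8")
    return "query:" + hashlib.sha1(raw).hexdigest()


def cached_query(cache, run_query, sql, params, ttl=60):
    """Return cached rows when present; otherwise run the query and store the rows."""
    key = query_key(sql, params)
    rows = cache.get(key)
    if rows is None:
        rows = run_query(sql, params)
        cache.set(key, rows, ttl)
    return rows


db_calls = []


def run_query(sql, params):
    # Hypothetical database call; it records each invocation so the saving is visible.
    db_calls.append((sql, params))
    return [("article", params[0])]


cache = FakeCache()
sql = "SELECT * FROM article WHERE id = %s"
first = cached_query(cache, run_query, sql, (42,))
second = cached_query(cache, run_query, sql, (42,))  # served from the cache
```

In this sketch the second lookup never reaches `run_query`, which is the effect the slide describes as a reduction in DB load.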

Page 10: Scaling the guardian

Phase 3

Page 11: Scaling the guardian

Phase 3

Memcached pages
More reduction in app server load
Must handle customisation outside of cache
Memcached for pages is a filter
Page customisation is a higher filter
Time-based decache only
Decache only on direct page edit
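One way to read the filter arrangement above: an inner filter serves whole rendered pages from the cache, and an outer filter applies per-user customisation to whatever the inner one returns. The sketch below assumes an in-process `PageCache` in place of memcached and a `{{user}}` placeholder for customisation; every name in it is hypothetical.

```python
import time


class PageCache:
    """In-process stand-in for memcached holding rendered pages (hypothetical)."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.pages = {}

    def get(self, path):
        entry = self.pages.get(path)
        if entry and self.clock() < entry[1]:
            return entry[0]
        return None

    def put(self, path, html):
        self.pages[path] = (html, self.clock() + self.ttl)

    def evict(self, path):
        # Explicit decache, e.g. when an editor changes this page directly.
        self.pages.pop(path, None)


def page_cache_filter(cache, render):
    """Inner filter: serve the uncustomised page from cache, rendering on a miss."""
    def handle(path):
        html = cache.get(path)
        if html is None:
            html = render(path)
            cache.put(path, html)
        return html
    return handle


def customise_filter(inner):
    """Outer filter: apply per-user customisation after the cache, never inside it."""
    def handle(path, user):
        return inner(path).replace("{{user}}", user)
    return handle


renders = []


def render(path):
    # Hypothetical expensive render on the app server.
    renders.append(path)
    return "<h1>%s</h1><p>Hello {{user}}</p>" % path


now = [0.0]  # fake clock so that expiry can be shown without sleeping
cache = PageCache(ttl_seconds=60, clock=lambda: now[0])
site = customise_filter(page_cache_filter(cache, render))

alice = site("/politics", "alice")
bob = site("/politics", "bob")          # same cached page, different customisation
renders_after_two_users = len(renders)

now[0] = 61.0
site("/politics", "alice")              # time-based expiry forces a re-render
cache.evict("/politics")
site("/politics", "alice")              # direct edit forces a re-render
```

Because customisation happens outside the cache, one cached copy of a page can serve every user.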

Page 12: Scaling the guardian

Getting a Scaling Solution

The problem isn't technical
It's all about the process
Agile doesn't scale well!

Onsite customer doesn't care about scaling
Dedicated 10% team to look at "platform" issues
Still Agile, Customer is Operations Team & Architects (backend and frontend)

Page 13: Scaling the guardian

Scaling small apps rapidly

On Thursday 15th April 2010 there was a historic UK event - a televised national debate.

Page 14: Scaling the guardian

Poll Charts

Always sounds simple:

"Let people viewing the page vote at anytime whether they like or dislike what the party leader is saying. Oh, and let's show it with a real-time graph"

Bad words here:
anytime
real-time graph

Page 15: Scaling the guardian

Our coverage looked like this...

Page 16: Scaling the guardian

The poll itself

Page 17: Scaling the guardian

The poll itself

Python
Google App Engine
An in-house, in-platform cache

Page 18: Scaling the guardian

The Naive Implementation

    class IncrLibDemRequest:
        def get(self):
            Poll.get().libdems += 1

Why?
Google App Engine has transaction locks; simultaneous threads can't atomically increment a counter (duh)
If you wrap it in a txn, all threads are serialised.

You just turned Google's massively parallel data center into a very expensive file-backed db
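To see why the unprotected increment is unsafe, it helps to replay two overlapping requests by hand. The sketch below is a deterministic, single-threaded illustration with a plain Python object standing in for the stored poll; `Poll` here and `interleaved_increments` are hypothetical and do not use the App Engine datastore API.

```python
class Poll:
    """Plain object standing in for the stored poll entity (hypothetical)."""

    def __init__(self):
        self.libdems = 0


def interleaved_increments(poll):
    """Replay two requests whose read and write steps overlap."""
    seen_by_a = poll.libdems       # request A reads 0
    seen_by_b = poll.libdems       # request B reads 0 before A has written
    poll.libdems = seen_by_a + 1   # A writes 1
    poll.libdems = seen_by_b + 1   # B also writes 1, overwriting A's vote
    return poll.libdems


poll = Poll()
total = interleaved_increments(poll)  # two votes cast, only one counted
```

A transaction prevents the lost vote, but only by making every request wait its turn on the same entity, which is the serialisation the slide complains about.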

Page 19: Scaling the guardian

Our Implementation (Phase 1)

Sharded counters are the way to go
Follow the article at code.google.com/appengine on sharded counters
Gives parallel counters
But beware....
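The idea behind a sharded counter is to spread writes over several independent counters and add them up on read, so concurrent writers rarely touch the same record. The following is an in-memory sketch of that idea only, not the code from the App Engine article: `ShardedCounter`, its shard key format and the default of 20 shards are all assumptions, and a dict stands in for the datastore.

```python
import random


class ShardedCounter:
    """In-memory illustration of a sharded counter (hypothetical names throughout)."""

    def __init__(self, name, num_shards=20, rng=None):
        self.name = name
        self.num_shards = num_shards
        self.rng = rng or random.Random()
        self.shards = {}  # shard key -> count; stands in for datastore entities

    def _shard_key(self, index):
        return "%s-shard-%d" % (self.name, index)

    def increment(self, amount=1):
        # Pick a shard at random so that concurrent writers rarely collide.
        # In a real datastore this read-modify-write would run in a transaction
        # scoped to the single shard entity.
        key = self._shard_key(self.rng.randrange(self.num_shards))
        self.shards[key] = self.shards.get(key, 0) + amount

    def total(self):
        # Reading the counter means summing every shard.
        return sum(self.shards.values())


counter = ShardedCounter("libdems", num_shards=20, rng=random.Random(1))
for _ in range(1000):
    counter.increment()
```

The trade-off is that reads become more expensive as the shard count grows, while too few shards brings the contention back.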

Page 20: Scaling the guardian

Our Results and Numbers

Page 21: Scaling the guardian

Our Results and Numbers

Page 22: Scaling the guardian

Some interesting notes

Average of around 100-120 req/s
Peaked at 400 req/s
Total of nearly 1,000,000 requests
Surprisingly little cheating

Only 2000 requests

But...

Page 23: Scaling the guardian

Request Duration

Between 1 second and 8 seconds!
Causes

Thread contention
Not enough shards

Page 24: Scaling the guardian

Our Implementation (2)

Increase shards by a factor of 10?
Completely removes transaction failures
Each request still takes 200ms
The cost is the datastore write

Replace datastore with memcache?
Different architecture

vote does a memcache atomic increment/decrement
results are read from memcache
cronjob 1/min reads from memcache and writes to datastore

requests now take 20ms
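The memcache-based architecture above can be sketched as three small functions: one to vote, one to read results, and one for the once-a-minute job. This is an illustration under assumptions, not the production code: `FakeMemcache` is an in-process stand-in whose `incr` is made atomic with a lock, a dict stands in for the datastore, and the key format and party list are invented. Note that the stand-in lets totals go negative, which a real memcached counter may not.

```python
import threading


class FakeMemcache:
    """In-process stand-in for memcache with an atomic increment (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def incr(self, key, delta=1):
        with self._lock:  # the lock makes the read-modify-write atomic
            self._values[key] = self._values.get(key, 0) + delta
            return self._values[key]

    def get(self, key):
        with self._lock:
            return self._values.get(key, 0)


memcache = FakeMemcache()
datastore = {}  # stands in for the durable store


def vote(party, like):
    """Request handler: one atomic increment or decrement, no datastore write."""
    return memcache.incr("votes:" + party, 1 if like else -1)


def results(parties):
    """Results handler: read the current totals straight from the cache."""
    return {party: memcache.get("votes:" + party) for party in parties}


def flush_to_datastore(parties):
    """Cron job body (run about once a minute): copy the totals to durable storage."""
    for party in parties:
        datastore[party] = memcache.get("votes:" + party)


PARTIES = ["labour", "conservatives", "libdems"]

# Fifty concurrent "like" votes for one party, plus one "dislike" for another.
threads = [threading.Thread(target=vote, args=("libdems", True)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
vote("labour", False)

flush_to_datastore(PARTIES)
```

The design choice is to take the slow datastore write off the request path entirely, at the price of losing up to a minute of votes if the cache is flushed before the job runs.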

Page 25: Scaling the guardian

The Results?

Page 26: Scaling the guardian

The Results?

Page 27: Scaling the guardian

Some notes

Total of around 2,727,000 requests
Average of around 454 req/s
Peaked at 750 req/s

Page 28: Scaling the guardian

Requests per Second

But...

Page 29: Scaling the guardian

Request Duration

Average 1.2s at first
Live deploy fixed to 300ms

Page 30: Scaling the guardian

Any Questions?

Michael Brunton-Spall (@bruntonspall)

[email protected]