Post on 24-May-2015
Scaling the Guardian
Michael Brunton-Spall (@bruntonspall)
michael.brunton-spall@guardian.co.uk
The Guardian - Some Figures
ABCe Audited (Dec 2009)
Unique Users - 36.9m per month, 1.8m per day
Page Impressions - 259m per month, 9.2m per day
Log file analysis
37m requests per day, 1.1bn requests per month - not including images / static files
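As a rough sanity check on the log figures, the daily request count can be turned into an average request rate. This is only back-of-envelope arithmetic on the numbers quoted above; the 30-day month is an assumption, and real traffic peaks well above the daily average.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figure quoted from the log file analysis (excludes images / static files).
requests_per_day = 37000000

# Average rate if traffic were spread evenly over the day (it is not).
average_rps = requests_per_day / float(SECONDS_PER_DAY)

# Assumption: a 30-day month, to compare against the quoted monthly figure.
requests_per_month = requests_per_day * 30

print("average requests per second: %.0f" % average_rps)
print("requests per 30-day month: %.2f bn" % (requests_per_month / 1e9))
```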
Initial Architecture
Scaling Problems
The in-memory cache is an order of magnitude too small at 500 MB
Even Worse!
Cache is local to appserver
Adding an App Server makes the problem worse
Our Solution
Memcached!
or more accurately, a distributed cache
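A minimal sketch of what "distributed" buys here: each key lives on exactly one cache node, so every app server sees the same entry instead of keeping its own copy. The node names and the simple modulo hashing are assumptions for illustration; in-process dicts stand in for the memcached servers, and real memcached clients typically do this key-to-server mapping on the client side, often with consistent hashing.

```python
import hashlib


def node_for_key(key, nodes):
    """Pick the cache node responsible for a key (simple modulo hashing)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


class DistributedCache(object):
    """Toy distributed cache: one dict per node stands in for a memcached server."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        self.stores = dict((name, {}) for name in self.node_names)

    def set(self, key, value):
        # The key is stored on exactly one node, chosen by its hash.
        self.stores[node_for_key(key, self.node_names)][key] = value

    def get(self, key):
        # Any app server asking for the key is routed to the same node.
        return self.stores[node_for_key(key, self.node_names)].get(key)


# Hypothetical node names; any app server would share this same pool.
cache = DistributedCache(["cache-a", "cache-b", "cache-c"])
cache.set("article:123", "rendered article body")
print(cache.get("article:123"))
```

Adding an app server now adds a client of the shared pool rather than another cold local cache.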
Our Solution
Phase 1
Memcache object cache
Massive reduction in number of DB calls
No significant drop in DB Load
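A sketch of an object cache in the cache-aside style: look the object up by ID in the cache, and only call the database on a miss. The class and the `load_from_db` callback are hypothetical names, and a dict stands in for memcached. The call counter shows where the reduction in DB calls comes from.

```python
class ObjectCache(object):
    """Cache-aside lookup of objects by ID; a dict stands in for memcached."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # hypothetical loader callback
        self._cache = {}
        self.db_calls = 0

    def get(self, object_id):
        key = "object:%s" % object_id
        if key in self._cache:
            return self._cache[key]        # hit: no database call
        self.db_calls += 1                 # miss: one database call
        value = self._load_from_db(object_id)
        self._cache[key] = value
        return value

    def invalidate(self, object_id):
        # Call when the object is edited so the next read reloads it.
        self._cache.pop("object:%s" % object_id, None)


articles = ObjectCache(lambda object_id: {"id": object_id})
articles.get(1)
articles.get(1)
articles.get(2)
print(articles.db_calls)  # two distinct IDs were loaded, so two DB calls
```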
Phase 2
Memcached query cache
Massive reduction in DB Load
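A sketch of a query cache, assuming results are keyed on the query text plus its parameters and expire after a fixed time. The key scheme, the 60-second TTL and the `run_query` callback are all assumptions for illustration, and a dict stands in for memcached.

```python
import hashlib
import json
import time


def query_cache_key(sql, params):
    """Derive a stable cache key from the query text and its parameters."""
    payload = json.dumps([sql, list(params)])
    return "query:" + hashlib.sha1(payload.encode("utf-8")).hexdigest()


class QueryCache(object):
    """Caches whole result sets for a fixed time."""

    def __init__(self, run_query, ttl_seconds=60, clock=time.time):
        self._run_query = run_query    # hypothetical function that hits the DB
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}             # key -> (expires_at, rows)

    def fetch(self, sql, params=()):
        key = query_cache_key(sql, params)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]            # still fresh: the query is not run
        rows = self._run_query(sql, params)
        self._entries[key] = (now + self._ttl, rows)
        return rows
```

The same query with different parameters gets a different key, so results for one section never leak into another.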
Phase 3
Memcached pages
More reduction in Appserver load
Must handle customisation outside of cache
Memcached for pages is a filter
Page customisation is a higher filter
Time based decache only
Decache only on direct page edit
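The two filters above can be sketched as two wrappers: an inner one that caches the uncustomised page for a fixed time, and an outer one that applies per-user customisation to whatever the inner one returns. The class names, the placeholder token and the TTL are assumptions for illustration, and a dict stands in for memcached.

```python
import time


class PageCacheFilter(object):
    """Inner filter: caches the uncustomised page for a fixed time."""

    def __init__(self, render_page, ttl_seconds=60, clock=time.time):
        self._render_page = render_page  # hypothetical app-server render call
        self._ttl = ttl_seconds
        self._clock = clock
        self._pages = {}                 # path -> (expires_at, html)

    def handle(self, path):
        now = self._clock()
        entry = self._pages.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]
        html = self._render_page(path)
        self._pages[path] = (now + self._ttl, html)
        return html

    def decache(self, path):
        # Explicit removal, intended for a direct edit of this page.
        self._pages.pop(path, None)


class CustomisationFilter(object):
    """Outer filter: per-user changes are applied after the cache."""

    PLACEHOLDER = "<!--user-->"          # hypothetical marker in the cached page

    def __init__(self, inner):
        self._inner = inner

    def handle(self, path, user_name=None):
        html = self._inner.handle(path)
        if user_name:
            greeting = "Hello, %s" % user_name
        else:
            greeting = "Sign in"
        return html.replace(self.PLACEHOLDER, greeting)
```

Because customisation happens outside the cache, one cached copy of the page serves every user.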
Getting a Scaling Solution
The problem isn't technical
It's all about the process
Agile doesn't scale well!
Onsite customer doesn't care about scaling
Dedicated 10% team to look at "platform" issues
Still Agile, Customer is Operations Team & Architects (backend and frontend)
Scaling small apps rapidly
On Thursday 15th April 2010 there was a historic UK event - a televised national debate.
Poll Charts
Always sounds simple:
"Let people viewing the page vote at anytime whether they like or dislike what the party leader is saying. Oh, and let's show it with a real time graph"
Bad words here
anytime
real-time graph
Our coverage looked like this...
The poll itself
Python
Google App Engine
An in-house, in-platform cache
The Naive Implementation
class IncrLibDemRequest:
    def get(self):
        Poll.get().libdems += 1
Why?
Google App Engine has transaction locks; simultaneous threads can't atomically increment a counter (duh)
If you wrap in a txn, all threads are serialised.
You just turned Google's massively parallel data center into a very expensive file-backed db
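A small simulation of both failure modes, with no App Engine involved. The first function replays the read-modify-write interleaving by hand to show a lost vote; the second uses a plain `threading.Lock` as a stand-in for a transaction, which gives the right total but lets only one increment run at a time. The `Poll` class here is a hypothetical in-memory object, not the datastore model.

```python
import threading


class Poll(object):
    """Hypothetical in-memory stand-in for the poll entity."""

    def __init__(self):
        self.libdems = 0


def interleaved_increments(poll):
    """Two requests each read the counter before either writes it back."""
    seen_by_a = poll.libdems       # request A reads 0
    seen_by_b = poll.libdems       # request B reads 0
    poll.libdems = seen_by_a + 1   # A writes 1
    poll.libdems = seen_by_b + 1   # B also writes 1: A's vote is lost


def serialised_increments(poll, lock, n_threads, per_thread):
    """Correct total, but every increment waits for the single lock."""
    def worker():
        for _ in range(per_thread):
            with lock:
                poll.libdems += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


lost = Poll()
interleaved_increments(lost)
print(lost.libdems)  # two votes were cast, one was recorded

safe = Poll()
serialised_increments(safe, threading.Lock(), n_threads=4, per_thread=250)
print(safe.libdems)
```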
Our Implementation (Phase 1)
Sharded counters are the way to go
Follow the article at code.google.com/appengine on sharded counters
Gives parallel counters
But beware....
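The idea behind a sharded counter, sketched in memory: spread writes over several independent shards picked at random, and sum the shards when reading. This is not the App Engine article's code and uses no datastore API; the class name, the key format and the shard count of 20 are assumptions for illustration.

```python
import random


class ShardedCounter(object):
    """In-memory illustration: writes spread over shards, reads sum them."""

    def __init__(self, name, num_shards=20, rng=None):
        self.name = name
        self.num_shards = num_shards
        self._rng = rng or random.Random()
        self._shards = {}   # shard key -> count; stand-in for shard entities

    def increment(self, amount=1):
        # Each write touches one randomly chosen shard, so concurrent
        # writers rarely contend for the same one.
        index = self._rng.randrange(self.num_shards)
        key = "%s-%d" % (self.name, index)
        self._shards[key] = self._shards.get(key, 0) + amount

    def total(self):
        # Reading costs one lookup per shard.
        return sum(self._shards.values())


counter = ShardedCounter("libdems", num_shards=20, rng=random.Random(0))
for _ in range(1000):
    counter.increment()
print(counter.total())
```

More shards mean less write contention but a more expensive read, which is the trade-off behind the shard-count tuning later in the talk.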
Our Results and Numbers
Some interesting notes
Average of around 100-120 req/s
Peaked at 400 req/s
Total of nearly 1,000,000 requests
Surprisingly little cheating
Only 2000 requests
But...
Request Duration
Between 1 sec and 8 seconds!
Causes
Thread contention
Not enough shards
Our Implementation (2)
Increase shards by factor of 10?
Eliminates transaction failures
Each request still takes 200 ms
The cost is the datastore write
Replace datastore with memcache?
Different architecture
vote does memcache atomic increment/decrement
results get from memcache
cronjob 1/min reads from memcache and writes to datastore
requests now take 20 ms
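A sketch of the memcache-backed architecture, under stated assumptions: `FakeMemcache` is an in-process stand-in whose `incr` is only loosely modelled on a memcache client, the "datastore" is a plain dict, and the one-counter-per-leader-and-choice key scheme is a guess at how votes might be keyed. The point is the shape: votes only touch the cache, and a periodic job copies totals to durable storage.

```python
import threading


class FakeMemcache(object):
    """In-process stand-in for memcache with an atomic increment."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def incr(self, key, delta=1, initial_value=0):
        with self._lock:
            self._values[key] = self._values.get(key, initial_value) + delta
            return self._values[key]

    def get(self, key):
        with self._lock:
            return self._values.get(key)


def vote(cache, leader, liked):
    """Request handler body: one atomic increment, no datastore write."""
    key = "%s:%s" % (leader, "like" if liked else "dislike")
    return cache.incr(key)


def results(cache, keys):
    """Graph data is read straight from the cache."""
    return dict((key, cache.get(key) or 0) for key in keys)


def flush_to_datastore(cache, datastore, keys):
    """Cron job body (once a minute): copy current totals to durable storage."""
    for key in keys:
        datastore[key] = cache.get(key) or 0


cache = FakeMemcache()
datastore = {}
keys = ["libdems:like", "libdems:dislike"]
vote(cache, "libdems", True)
vote(cache, "libdems", True)
vote(cache, "libdems", False)
flush_to_datastore(cache, datastore, keys)
print(datastore)
```

The trade-off is durability: a cache is not persistent storage, so votes cast since the last flush can be lost if an entry is evicted.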
The Results?
Some notes
Total of around 2,727,000 requests
Average of around 454 req/s
Peaked at 750 req/s
Requests per Second
But...
Request Duration
Average 1.2s at first
Live deploy fixed to 300ms
Any Questions?
Michael Brunton-Spall (@bruntonspall)
michael.brunton-spall@guardian.co.uk