250M pageviews a month A Case Study of a High-traffic Site Mike Whitaker, CricInfo.com.
250M page views a month
A Case Study of a High-traffic Site
Mike Whitaker, CricInfo.com
Who Are CricInfo.com?
• Started in 1993
• Probably the largest single-sport website on the ‘Net
• Provide cricket scores and news
• Global audience
CricInfo.com: history
• Originally an IRC ‘bot
• Gopher server
• Single web server at OGI (SunOS)
• Migrated to Linux
• ‘Borrowed’ bandwidth
What’s cricket, anyway?
You have two sides, one out in the field and one in. Each man that's in the side that's in goes out, and when he's out he comes in and the next man goes in until he's out. When they are all out, the side that's out comes in and the side that’s been in goes out and tries to get those coming in, out. Sometimes you get men still in and not out.
When a man goes out to go in, the men who are out try to get him out, and when he is out he goes in and the next man in goes out and goes in. There are two men called umpires who stay out all the time and they decide when the men who are in are out. When both sides have been in and all the men have been out, and both sides have been out twice after all the men have been in, including those who are not out, that is the end of the game!
But seriously…
• International sport followed worldwide (even in the USA)
• Lends itself to web coverage
• One event every minute or so
• Events easily describable in a sentence or two
• Matches can last several days
Audience – Country of Origin
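[Pie chart. Legend: India, Australia, UK, Pakistan, New Zealand, USA, Sri Lanka, South Africa, Bangladesh, Rest of World. Slice values in extraction order (not matched to legend): 30%, 24%, 18%, 8%, 3%, 3%, 8%, 1%, 2%, 3%]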
Audience – Country of Residence
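[Pie chart. Legend: Australia, UK, India, USA, Pakistan, New Zealand, Canada, UAE, South Africa, Rest of World. Slice values in extraction order (not matched to legend): 4%, 3%, 2%, 2%, 2%, 11%, 11%, 19%, 19%, 27%]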
US Audience - Country of Origin
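[Pie chart. Legend: India, Pakistan, USA, UK, Australia, Sri Lanka, Guyana, Bangladesh, South Africa, Rest of World. Slice values in extraction order (not matched to legend): 1%, 5%, 2%, 1%, 2%, 3%, 5%, 8%, 15%, 58%]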
Site Traffic
• Steadily increasing over past decade
• Sept 2002: already past 1 billion year-to-date
• Not uniform throughout year
[Chart: pageviews (M), axis 0 to 1600]
Site Traffic – Typical Month
• Atypical traffic pattern
• Dependent on international cricket schedule
• Reasonably predictable
[Chart: Aug 2002. Traffic in millions (0.00 to 14.00) by day of month (1 to 31)]
Site traffic – ‘quiet’ day
• No international games
• General browsing
• No major peaks and troughs
• Much more like a ‘typical’ Web site
[Chart: Aug 20 2002. Traffic in millions (0 to 1.4) by hour (1 to 23)]
Site traffic – match day
• Sustained continuous access during match(es)
• Traffic can vary a lot depending on state of play
[Chart: Aug 23 2002. Traffic in millions (0 to 1.4) by hour (1 to 23)]
Server behavior under load
• Linear under low load
• Becomes less linear as load tends to max
• Reverses under excessive load
• Factors affecting server load?
[Chart: Actual (0 to 175) against Potential (0 to 150). Series: Actual, Theoretical]
Why Apache/Debian?
• Tested/popular solution
• Free
• Hardware availability
• Price/performance ratio
• Fate
• Server configuration
‘Cluster’ setup
• All servers identical – why?
• Single servers in some locations
• Ease of installation
• Easier support
• Crude (but effective) load-balancing via RR DNS
• ‘sub-clusters’ for users from given countries
Mirroring
• rsync
• 12,000-120,000 changes a day
• 70-18,000 changes an hour
• Approx 500,000 mirrorable files out of 2.5M total
• Issues
• Memory usage
• Performance over bad links
Server Configuration
• Dual PII/PIII
• 1GB RAM
• Twin hard drives
• Debian 2.2 (potato)
• Apache 1.3.x
Apache Configuration
• Classic ‘dual’ httpd’s
• ‘static’ server handles content
• ‘dynamic’ server (+ mod_perl) handles CGI’s, adverts
• Users connect to the static server, dynamic content requests handled via ProxyPass
ProxyPass
• mod_proxy
ProxyPass /perl/ http://localhost:8989/perl/
• Doesn’t pass everything we need
• Solution: customized mod_proxy_add_forward
• PerlPostReadRequestHandler
• Inspects ALL X-Forwarded-For headers
• Takes the ‘oldest’ IP address that isn’t an RFC1918 address
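The address selection described on this slide can be sketched as follows. This is a minimal illustration in Python rather than the site's actual mod_perl handler. The function names and the fallback to the connecting peer's address are assumptions, as is reading ‘oldest’ as the leftmost entry, which is where the original client normally appears.

```python
import ipaddress

# The three private ranges defined by RFC 1918.
RFC1918_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

def is_rfc1918(addr):
    """True if addr falls inside one of the RFC 1918 private ranges."""
    return any(addr in net for net in RFC1918_NETS)

def client_ip(xff_headers, peer_ip):
    """Return the oldest public address found in the X-Forwarded-For headers.

    xff_headers: list of raw header values, in the order they were received.
    peer_ip: address of the connecting socket, used as a fallback (assumption).
    """
    for header in xff_headers:
        for token in header.split(","):
            try:
                addr = ipaddress.ip_address(token.strip())
            except ValueError:
                continue  # skip junk such as "unknown"
            if not is_rfc1918(addr):
                return str(addr)
    return peer_ip
```

For example, `client_ip(["10.1.2.3, 203.0.113.7, 198.51.100.2"], "127.0.0.1")` skips the private address and returns `203.0.113.7`.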
Why we need the ‘right’ IP
• Routing user to right ‘sub-cluster’
• Performance
• Bandwidth costs
• Partner agreements
• Advert targeting - AdLib
• Target by user’s country
• All done using CQ
CQ
• Maps IP blocks to country code
• Perl module calling C code
• List of start of block + country code
• Binary chop algorithm
• Rewrite to use radix tree (Net::Patricia)?
• Continuously updated
• cq.cgi sets cookie with country code
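The ‘list of start of block + country code’ plus binary chop idea can be sketched in a few lines. This is an illustration in Python, not CQ's Perl/C code. The table contents are invented sample data using documentation address ranges and placeholder codes, and `country_for` is a hypothetical name. Each block is assumed to run until the next start address, so gaps need their own ‘unknown’ entry.

```python
import bisect
import ipaddress

# (start of block, country code), sorted by start address.
# Invented sample data: "??" marks unassigned space.
BLOCKS = [
    ("0.0.0.0", "??"),
    ("192.0.2.0", "AA"),
    ("192.0.3.0", "??"),
    ("198.51.100.0", "BB"),
    ("198.51.101.0", "??"),
    ("203.0.113.0", "CC"),
    ("203.0.114.0", "??"),
]
STARTS = [int(ipaddress.IPv4Address(start)) for start, _ in BLOCKS]
CODES = [code for _, code in BLOCKS]

def country_for(ip):
    """Binary-search for the last block whose start is <= ip."""
    key = int(ipaddress.IPv4Address(ip))
    idx = bisect.bisect_right(STARTS, key) - 1
    return CODES[idx] if idx >= 0 else None
```

Each lookup costs O(log n) comparisons over the sorted start addresses. A radix tree, as the slide suggests, would instead walk the address bits.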
AdLib
• Internally developed advert server
• Developed and tuned over 3 years
• 5000 lines of OO Perl
• Handles rich media ads
• Handles ads served on commercial server networks
• Performance matches commercial servers
Why write our own ad server?
• Cost
• Control
• We manage our own ad inventory
• Can tailor HTML to suit our site
• Performance
• Ads served locally
AdLib architecture
• Central management server
• Stores advert campaign configs
• Distributes content to web servers
• Collects logs from web servers
• mod_perl handlers on web servers
• Ads inserted via SSI
<!--#include virtual="/adlib/insert.cgi" -->
AdLib – advert insertion
• Ads divided into categories
• Specify category to choose from for given spot on page
• Fixed size (overridable)
• ‘placement’ specifier for some insertions
• HTML generation
• <IFRAME SRC=…>, fallback to…
• <SCRIPT SRC=…>, fallback to…
• <IMG SRC=…> plus cookie
AdLib: advert insertion 2
• Target by:
• Page viewed
• Country of user – uses CQ
• Time of day
• (‘plug in’ scoring functions)
• Select from candidates based on:
• Progress of campaign
• Weighting
• Override (only certain ads can serve on this page/set of pages)
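One way the targeting and selection steps could fit together is sketched below. Everything here is an assumption made for illustration: the dictionary fields (`weight`, `served`, `booked`, `countries`, `hours`, `page_prefix`, `override_pages`), the scoring functions, and the way campaign progress scales the weight. AdLib's real data model and scoring are not shown in the slides.

```python
import random

# Hypothetical 'plug in' scoring functions. Each returns a multiplier;
# 0.0 means the ad must not serve for this request.
def score_country(ad, req):
    countries = ad.get("countries")
    return 1.0 if not countries or req["country"] in countries else 0.0

def score_hour(ad, req):
    hours = ad.get("hours")
    return 1.0 if not hours or req["hour"] in hours else 0.0

def score_page(ad, req):
    return 1.0 if req["page"].startswith(ad.get("page_prefix", "")) else 0.0

SCORERS = [score_country, score_hour, score_page]

def remaining_fraction(ad):
    """Share of the booked impressions still to deliver (1.0 = not started)."""
    return max(0.0, 1.0 - ad["served"] / ad["booked"])

def choose_ad(ads, req, rng=random):
    """Pick one ad at random, weighted by targeting, weight and progress."""
    # Override: if any ad is pinned to this page, only pinned ads compete.
    pinned = [ad for ad in ads if req["page"] in ad.get("override_pages", ())]
    candidates = pinned or ads
    weights = []
    for ad in candidates:
        w = ad["weight"] * remaining_fraction(ad)
        for scorer in SCORERS:
            w *= scorer(ad, req)
        weights.append(w)
    if not any(weights):
        return None  # nothing eligible for this request
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Keeping the scorers in a plain list means a new targeting rule is one more function appended to `SCORERS`, which is roughly what ‘plug in’ suggests.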
AdLib: serving creatives
• PerlHandler
• URL contains UID
• Unpicks UID to find campaign and creative
• Logs impression (not insertion)
• Redirects to creative
• Cache-busting
• Redirect response is not cacheable
• Creative image is cacheable
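The log-then-redirect pattern above can be sketched as a plain function. The UID layout (`campaign-creative`), the lookup table and the list used as a log are all hypothetical, and the real handler is a mod_perl PerlHandler rather than Python. The point is that every request hits the redirect, because the redirect is marked uncacheable, while the image it points at stays cacheable.

```python
def serve_creative(uid, creatives, log):
    """Return (status, headers) for a creative request.

    uid: assumed to look like "<campaign>-<creative>" (invented format).
    creatives: dict mapping (campaign, creative) to the image URL.
    log: list that receives one impression record per successful request.
    """
    campaign_id, _, creative_id = uid.partition("-")
    url = creatives.get((campaign_id, creative_id))
    if url is None:
        return 404, {}
    # Count the impression when the creative is fetched, not when inserted.
    log.append(("impression", campaign_id, creative_id))
    headers = {
        "Location": url,
        # Keep browsers and proxies from caching the redirect itself.
        "Cache-Control": "no-cache, no-store",
        "Pragma": "no-cache",
        "Expires": "0",
    }
    return 302, headers
```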
AdLib: serving creatives 2
• Rich media ads
• HTML can contain substitutions
• 1x1 transparent images (logging)
• Text-only ads
• “Can we serve this kind of ad?”
AdLib: tracking clickthroughs
• Simple PerlHandler
• URL contains same UID as creative
• Simply logs and redirects
• Problem: Flash/Java ads with embedded URLs
AdLib: load limiting
• check_servers ‘heartbeat’ script
• Monitors server load
• Switches RewriteMap file depending on load
• Ad insertions that need to be load limited use /adlib/noload.cgi
AdLib: load limiting 2
From static server config:
RewriteMap adlib txt:/etc/cricinfo/rewrite.map
RewriteLock /etc/cricinfo/rewrite.map.lock
RewriteRule ^/adlib/noload\.cgi(.*)$ /${adlib:noload|noload}/insert.cgi$1$2
RewriteRule ^/dummy-adlib/ /no-ad.html
ProxyPass /adlib http://localhost:8989/adlib
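A heartbeat script along the lines of check_servers might look like the sketch below. It rests on a guess about the map contents, which the slides do not show: the key `noload` is assumed to map to `adlib` under normal load (so the request is proxied to the dynamic server) and to `dummy-adlib` under heavy load (so the second RewriteRule serves /no-ad.html). The function names and threshold are also hypothetical.

```python
import os
import tempfile

def map_line(load, threshold):
    """Map entry for the 'noload' key, chosen from the current load."""
    target = "dummy-adlib" if load > threshold else "adlib"
    return f"noload {target}\n"

def write_map(path, load, threshold):
    """Replace the RewriteMap file atomically so Apache never reads half a file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(map_line(load, threshold))
    os.replace(tmp_path, path)  # atomic rename on the same filesystem
```

A cron job or loop could then call `write_map("/etc/cricinfo/rewrite.map", os.getloadavg()[0], threshold)` with a threshold tuned per server. Apache's `txt:` maps are re-read when the file's modification time changes, so no restart should be needed.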
User registration
• MySQL registration DB server
• Authentication
• Version 1
• mod_auth_mysql + mod_auth_cookie_mysql
• Version 2
• mod_auth_mda
mod_auth_mda
• User is issued ticket
• MD5 hashed function of username, IP, auth realm
• Needs to renew after set time period
• Need to prevent sharing of accounts
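The ticket scheme can be illustrated with a short sketch. This is not mod_auth_mda's actual format: the server-side secret, the `expires:digest` layout and the function names are assumptions added so the example is complete. Binding the client IP into the hash is what makes a ticket copied to another machine fail, which fits the account-sharing concern.

```python
import hashlib
import hmac
import time

SECRET = b"change-me"  # hypothetical server-side secret, not from the slides

def _digest(username, ip, realm, expires):
    data = "|".join([username, ip, realm, str(expires)]).encode()
    return hashlib.md5(SECRET + data).hexdigest()

def issue_ticket(username, ip, realm, lifetime=3600, now=None):
    """Return 'expires:digest'; the ticket must be renewed after `lifetime` seconds."""
    now = int(time.time()) if now is None else int(now)
    expires = now + lifetime
    return f"{expires}:{_digest(username, ip, realm, expires)}"

def check_ticket(ticket, username, ip, realm, now=None):
    """True only if the ticket is unexpired and matches this user, IP and realm."""
    now = int(time.time()) if now is None else int(now)
    expires_text, _, digest = ticket.partition(":")
    if not expires_text.isdigit():
        return False
    expires = int(expires_text)
    if now >= expires:
        return False
    expected = _digest(username, ip, realm, expires)
    return hmac.compare_digest(digest.encode(), expected.encode())
```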