250M pageviews a month A Case Study of a High-traffic Site Mike Whitaker, CricInfo.com.

Who Are CricInfo.com?

• Started in 1993
• Probably the largest single-sport website on the ‘Net
• Provide cricket scores and news
• Global audience

CricInfo.com: history

• Originally an IRC ‘bot
• Gopher server
• Single web server at OGI (SunOS)
• Migrated to Linux
• ‘Borrowed’ bandwidth

What’s cricket, anyway?

You have two sides, one out in the field and one in. Each man that's in the side that's in goes out, and when he's out he comes in and the next man goes in until he's out. When they are all out, the side that's out comes in and the side that’s been in goes out and tries to get those coming in, out. Sometimes you get men still in and not out.

When a man goes out to go in, the men who are out try to get him out, and when he is out he goes in and the next man in goes out and goes in. There are two men called umpires who stay out all the time and they decide when the men who are in are out. When both sides have been in and all the men have been out, and both sides have been out twice after all the men have been in, including those who are not out, that is the end of the game!

But seriously…

• International sport followed worldwide (even in the USA)

• Lends itself to web coverage
  • One event every minute or so
  • Events easily describable in a sentence or two
  • Matches can last several days

Audience – Country of Origin

[Pie chart: audience by country of origin. Legend: India, Australia, UK, Pakistan, New Zealand, USA, Sri Lanka, South Africa, Bangladesh, Rest of World. Slice labels: 30%, 24%, 18%, 8%, 3%, 3%, 8%, 1%, 2%, 3%.]

Audience – Country of Residence

[Pie chart: audience by country of residence. Legend: Australia, UK, India, USA, Pakistan, New Zealand, Canada, UAE, South Africa, Rest Of World. Slice labels: 4%, 3%, 2%, 2%, 2%, 11%, 11%, 19%, 19%, 27%.]

US Audience - Country of Origin

[Pie chart: US audience by country of origin. Legend: India, Pakistan, USA, UK, Australia, Sri Lanka, Guyana, Bangladesh, South Africa, Rest of World. Slice labels: 1%, 5%, 2%, 1%, 2%, 3%, 5%, 8%, 15%, 58%.]

Site Traffic

• Steadily increasing over past decade

• Sept 2002: already past 1 billion year-to-date

• Not uniform throughout year

[Chart: pageviews (M); y-axis 0 to 1600]

Site Traffic – Typical Month

• Atypical traffic pattern

• Dependent on international cricket schedule

• Reasonably predictable

[Chart: Aug 2002; x-axis 1 to 31, y-axis in millions, 0.00 to 14.00]

Site traffic – ‘quiet’ day

• No international games

• General browsing

• No major peaks and troughs

• Much more like a ‘typical’ Web site

[Chart: Aug 20 2002; x-axis 1 to 23, y-axis in millions, 0 to 1.4]

Site traffic – match day

• Sustained continuous access during match(es)

• Traffic can vary a lot depending on state of play

[Chart: Aug 23 2002; x-axis 1 to 23, y-axis in millions, 0 to 1.4]

Server behavior under load

• Linear under low load

• Becomes less linear as load tends to max

• Reverses under excessive load

• Factors affecting server load?

[Chart: ‘Actual’ and ‘Theoretical’ curves; x-axis ‘Potential’, 0 to 150; y-axis ‘Actual’, 0 to 175]

Why Apache/Debian?

• Tested/popular solution
• Free
• Hardware availability
• Price/performance ratio
• Fate
• Server configuration

‘Cluster’ setup

• All servers identical – why?
  • Single servers in some locations
  • Ease of installation
  • Easier support
• Crude (but effective) load-balancing via RR DNS
• ‘sub-clusters’ for users from given countries
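
The round-robin DNS balancing mentioned on this slide can be illustrated with a small simulation. This is only a sketch of the general technique, not CricInfo's DNS setup: the class name `RoundRobinZone` and the addresses (taken from the documentation range 192.0.2.0/24) are assumptions made for the example.

```python
from collections import deque


class RoundRobinZone:
    """Toy model of a DNS name with several A records served round-robin."""

    def __init__(self, addresses):
        # Hypothetical server addresses; a real zone would hold the cluster's IPs.
        self._records = deque(addresses)

    def resolve(self):
        """Return all records, then rotate so the next query starts one later."""
        answer = list(self._records)
        self._records.rotate(-1)
        return answer


zone = RoundRobinZone(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
# Most clients use the first address in the answer, so load spreads evenly
# across servers without the DNS server knowing anything about their state.
first_choices = [zone.resolve()[0] for _ in range(6)]
```

The balancing is "crude" in the sense the slide suggests: rotation ignores server health and load, and resolver caching can skew the distribution.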

Mirroring

• rsync
• 12,000-120,000 changes a day
• 70-18,000 changes an hour
• Approx 500,000 mirrorable files out of 2.5M total
• Issues
  • Memory usage
  • Performance over bad links

Server Configuration

• Dual PII/PIII
• 1GB RAM
• Twin hard drives
• Debian 2.2 (potato)
• Apache 1.3.x

Apache Configuration

• Classic ‘dual’ httpd’s
  • ‘static’ server handles content
  • ‘dynamic’ server (+ mod_perl) handles CGI’s, adverts
• Users connect to the static server, dynamic content requests handled via ProxyPass

ProxyPass

• mod_proxy
    ProxyPass /perl/ http://localhost:8989/perl/
• Doesn’t pass everything we need
• Solution: customized mod_proxy_add_forward
• PerlPostReadRequestHandler
  • Inspects ALL X-Forwarded-For headers
  • Takes IP address off ‘oldest’ that isn’t an RFC1918 address
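
The address selection described on this slide can be sketched as follows. This is an illustrative Python version, not the site's PerlPostReadRequestHandler. The function name `client_ip`, the treatment of the leftmost entry as the ‘oldest’, and the fallback to the connecting peer address are all assumptions.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(addr):
    """True if addr is an IPv4 address inside one of the RFC 1918 ranges."""
    return addr.version == 4 and any(addr in net for net in RFC1918_NETWORKS)


def client_ip(forwarded_for_values, peer_ip):
    """Pick the oldest non-RFC1918 address from all X-Forwarded-For values.

    forwarded_for_values: every X-Forwarded-For header value on the request,
    in the order received. Each proxy appends to the right, so the leftmost
    entry is assumed to be the oldest.
    peer_ip: address of the connecting peer, used when nothing usable is found.
    """
    for value in forwarded_for_values:
        for token in value.split(","):
            try:
                addr = ipaddress.ip_address(token.strip())
            except ValueError:
                continue  # skip junk such as "unknown"
            if not is_rfc1918(addr):
                return str(addr)
    return peer_ip
```

For example, `client_ip(["10.0.0.5, 203.0.113.7", "198.51.100.2"], "127.0.0.1")` skips the private 10.0.0.5 and returns the first public address in the chain.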

Why we need the ‘right’ IP

• Routing user to right ‘sub-cluster’
  • Performance
  • Bandwidth costs
  • Partner agreements
• Advert targeting - AdLib
  • Target by user’s country
• All done using CQ

CQ

• Maps IP blocks to country code
• Perl module calling C code
  • List of start of block + country code
  • Binary chop algorithm
  • Rewrite to use radix tree (Net::Patricia)?
• Continuously updated
• cq.cgi sets cookie with country code
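
A binary chop over a sorted list of block starts, as this slide describes, might look like the sketch below. It is written in Python for illustration only; CQ itself is a Perl module calling C. The table contents are invented sample data using documentation address ranges, and the "??" code for unmapped space is an assumption.

```python
import ipaddress


def ip_to_int(text):
    """Convert dotted-quad IPv4 text to an integer for ordering."""
    return int(ipaddress.IPv4Address(text))


# Hypothetical sample table: (start of block, country code), sorted by start.
# Each block runs until the next start; "??" marks unmapped space (assumption).
SAMPLE_BLOCKS = [
    ("0.0.0.0", "??"),
    ("192.0.2.0", "IN"),
    ("192.0.3.0", "??"),
    ("198.51.100.0", "AU"),
    ("198.51.101.0", "??"),
    ("203.0.113.0", "GB"),
    ("203.0.114.0", "??"),
]
STARTS = [ip_to_int(start) for start, _ in SAMPLE_BLOCKS]
CODES = [code for _, code in SAMPLE_BLOCKS]


def country_for(ip_text):
    """Binary chop for the last block whose start is <= the address."""
    target = ip_to_int(ip_text)
    lo, hi = 0, len(STARTS) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if STARTS[mid] <= target:
            lo = mid
        else:
            hi = mid - 1
    return CODES[lo]
```

Because the table starts at 0.0.0.0, every address falls in some block, so the search needs no separate "not found" case. Each lookup takes a number of steps logarithmic in the table size.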

AdLib

• Internally developed advert server
• Developed and tuned over 3 years
• 5000 lines of OO Perl
• Handles rich media ads
• Handles ads served on commercial server networks
• Performance matches commercial servers

Why write our own ad server?

• Cost
• Control
  • We manage our own ad inventory
  • Can tailor HTML to suit our site
• Performance
  • Ads served locally

AdLib architecture

• Central management server
  • Stores advert campaign configs
  • Distributes content to web servers
  • Collects logs from web servers
• mod_perl handlers on web servers
  • Ads inserted via SSI
    <!--#include virtual="/adlib/insert.cgi" -->

AdLib – advert insertion

• Ads divided into categories
  • Specify category to choose from for given spot on page
  • Fixed size (overridable)
  • ‘placement’ specifier for some insertions
• HTML generation
  • <IFRAME SRC=…>, fallback to…
  • <SCRIPT SRC=…>, fallback to…
  • <IMG SRC=…> plus cookie

AdLib: advert insertion 2

• Target by:
  • Page viewed
  • Country of user – uses CQ
  • Time of day
  • (‘plug in’ scoring functions)
• Select from candidates based on:
  • Progress of campaign
  • Weighting
  • Override (only certain ads can serve on this page/set of pages)
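
One way to combine plug-in scoring functions with weighted selection is sketched below. It is an illustration of the idea on this slide, not AdLib's algorithm. The `Ad` fields, the two scorers, the multiplicative combination of scores, and the reading of ‘override’ as "if any ad is pinned to this page, only pinned ads may serve" are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    weight: float = 1.0
    countries: frozenset = frozenset()       # empty set means any country
    booked: int = 1000                       # impressions sold for the campaign
    served: int = 0                          # impressions delivered so far
    override_pages: frozenset = frozenset()  # pages this ad is pinned to


def country_score(ad, request):
    """1.0 if the ad may run in the user's country, else 0.0."""
    return 1.0 if not ad.countries or request["country"] in ad.countries else 0.0


def progress_score(ad, request):
    """Favour campaigns with more of their booking still to deliver."""
    return max(0.0, 1.0 - ad.served / ad.booked)


# 'Plug in' scorers: add or remove functions here to change targeting.
SCORERS = [country_score, progress_score]


def choose_ad(ads, request, rng=random):
    """Pick one ad at random, weighted by weight times all scorer results."""
    pinned = [ad for ad in ads if request["page"] in ad.override_pages]
    scored = []
    for ad in pinned or ads:
        score = ad.weight
        for scorer in SCORERS:
            score *= scorer(ad, request)
        if score > 0:
            scored.append((ad, score))
    if not scored:
        return None
    candidates, weights = zip(*scored)
    return rng.choices(candidates, weights=weights, k=1)[0]
```

A scorer that returns 0.0 acts as a hard filter, while fractional scores only shift the odds, so targeting rules and campaign pacing can share one mechanism.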

AdLib: serving creatives

• PerlHandler
  • URL contains UID
  • Unpicks UID to find campaign and creative
  • Logs impression (not insertion)
  • Redirects to creative
• Cache-busting
  • Redirect response is not cacheable
  • Creative image is cacheable
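
The log-then-redirect flow can be sketched as a plain function returning a status and headers. This Python version is illustrative only, not the mod_perl handler. The UID layout (`campaign-creative`), the creative path, and the exact no-cache headers are assumptions, since the slides do not give them.

```python
IMPRESSION_LOG = []  # stand-in for the real impression log


def serve_creative(uid):
    """Return (status, headers) for a creative request carrying uid."""
    try:
        # Hypothetical UID layout: "<campaign>-<creative>".
        campaign, creative = uid.split("-", 1)
    except ValueError:
        return 404, {}

    # The impression is logged when the creative is actually requested,
    # not when the ad HTML was inserted into the page.
    IMPRESSION_LOG.append((campaign, creative))

    headers = {
        # Hypothetical location of the cacheable creative image.
        "Location": f"/creatives/{campaign}/{creative}.gif",
        # The redirect itself must not be cached, so every view is counted.
        "Cache-Control": "no-cache, no-store",
        "Pragma": "no-cache",
        "Expires": "0",
    }
    return 302, headers
```

Keeping the redirect uncacheable while letting the image be cached means every view reaches the server once for counting, but the image bytes are transferred only when the browser does not already have them.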

AdLib: serving creatives 2

• Rich media ads
  • HTML can contain substitutions
  • 1x1 transparent images (logging)
• Text-only ads
• “Can we serve this kind of ad?”

AdLib: tracking clickthroughs

• Simple PerlHandler
  • URL contains same UID as creative
  • Simply logs and redirects
• Problem: Flash/Java ads with embedded URLs

AdLib: load limiting

• check_servers ‘heartbeat’ script
  • Monitors server load
  • Switches RewriteMap file depending on load
• Ad insertions that need to be load limited use /adlib/noload.cgi

AdLib: load limiting 2

From static server config:

RewriteMap adlib txt:/etc/cricinfo/rewrite.map
RewriteLock /etc/cricinfo/rewrite.map.lock
RewriteRule ^/adlib/noload\.cgi(.*)$ /${adlib:noload|noload}/insert.cgi$1$2
RewriteRule ^/dummy-adlib/ /no-ad.html
ProxyPass /adlib http://localhost:8989/adlib
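
A heartbeat script that switches the map file on load could work along the lines below. This is a hedged Python sketch, not the actual check_servers script. The threshold, the function names, and the map values `adlib` (normal) and `dummy-adlib` (overloaded) are assumptions inferred from the rewrite rules on this slide.

```python
import os
import tempfile


def current_load():
    """One-minute load average (Unix only)."""
    return os.getloadavg()[0]


def map_contents(load, threshold=8.0):
    """Text for a 'txt:' RewriteMap with one key, 'noload'.

    The threshold and both map values are assumptions made for this sketch.
    """
    target = "dummy-adlib" if load >= threshold else "adlib"
    return f"noload {target}\n"


def switch_map(path, load, threshold=8.0):
    """Write the map atomically so Apache never reads a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(map_contents(load, threshold))
    os.replace(tmp_path, path)  # atomic rename within one filesystem
```

Under this reading, a busy server rewrites `/adlib/noload.cgi` to `/dummy-adlib/insert.cgi`, which the next rule turns into the static `/no-ad.html`, so load-limited ad slots stop reaching the dynamic server. A periodic job could call `switch_map("/etc/cricinfo/rewrite.map", current_load())`.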

User registration

• MySQL registration DB server
• Authentication
  • Version 1
    • mod_auth_mysql + mod_auth_cookie_mysql
  • Version 2
    • mod_auth_mda

mod_auth_mda

• User is issued ticket
  • MD5 hashed function of username, IP, auth realm
  • Needs to renew after set time period
• Need to prevent sharing of accounts
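
A ticket of the kind described here could be built as in the sketch below. It is an assumption-heavy illustration, not mod_auth_mda's actual format: the server secret, the inclusion of the expiry time in the hash, the `expires:digest` layout, and the function names are all invented for the example. Only the use of MD5 over username, IP and auth realm comes from the slide.

```python
import hashlib
import hmac
import time

SECRET = b"hypothetical-server-secret"  # assumption: a secret known only to servers


def _digest(username, ip, realm, expires):
    """MD5 over the secret plus username, IP, realm and expiry time."""
    material = "|".join([username, ip, realm, str(expires)]).encode("utf-8")
    return hashlib.md5(SECRET + material).hexdigest()


def issue_ticket(username, ip, realm, lifetime=3600, now=None):
    """Return 'expires:digest'; the user must renew after lifetime seconds."""
    issued = int(now if now is not None else time.time())
    expires = issued + lifetime
    return f"{expires}:{_digest(username, ip, realm, expires)}"


def check_ticket(ticket, username, ip, realm, now=None):
    """True only if the ticket is unexpired and matches this user, IP and realm."""
    try:
        expires_text, digest = ticket.split(":", 1)
        expires = int(expires_text)
    except ValueError:
        return False
    current = now if now is not None else time.time()
    if current >= expires:
        return False
    return hmac.compare_digest(digest, _digest(username, ip, realm, expires))
```

Binding the ticket to the client IP means a copied ticket fails from another address, which is one way to discourage account sharing. The expiry forces periodic renewal against the registration database.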

Future Plans

• Better mirroring + change control
• LVS-based load balancing
• ‘specialist’ servers
• Database-backed content
• AdLib
  • Rewrite in C (version 5.0)
  • XML config and reporting
  • Improved inter-server communication
  • Possible Open Source release