Web Caching

67
Web Caching Elliot Jaffe Presentation for The Seminar on Database and Internet Hebrew University, Fall 2002

description

Web Caching. Elliot Jaffe Presentation for The Seminar on Database and Internet Hebrew University, Fall 2002. Agenda. Caching: Why, Where, How, What Some empirical data: Zipf’s Law Content Delivery Networks Bibliography. Why cache?. Number of unique pages: 800M < X < 2.2B - PowerPoint PPT Presentation

Transcript of Web Caching

Page 1: Web Caching

Web Caching

Elliot JaffePresentation for The Seminar on Database and

InternetHebrew University, Fall 2002

Page 2: Web Caching

Agenda

Caching: Why, Where, How, What

Some empirical data: Zipf’s Law

Content Delivery Networks

Bibliography

Page 3: Web Caching

Why cache?

Number of unique pages: 800M < X < 2.2B

Number of unique web sites: 8,500,000

static pages: %30 - %40

pages revisited: %80

expected hit-rate: %24 - %32

Page 4: Web Caching

Why cache?

Bandwidth

Latency

Performance = Response Time

Server Load

Failure Redundancy

Page 5: Web Caching

Reverse

ProxyReverse

ProxyReverse

Proxy

Intranet

Where

Browser

Local ISP

cacheL4 Switch

Data Center

ISPcdn

cache

cache

Content

ServerContent

ServerContent

ServerContent

Server

Reverse

Proxy

Browsercache

Browsercache

cdn

Page 6: Web Caching

Hot-potato routing

Get traffic off of your network as soon as possible

Bounces traffic around the internet

Increases chance of dropped packet

Increases latencyYou are here

Destination

Page 7: Web Caching

How: Types of Caches

Simple Proxy

Transparent Proxy

Reverse Proxy

Adaptive Caching

Push Caching

Active Caching

Streaming Caches

Page 8: Web Caching

How: Simple Proxy

Harvest/Squid

Provide web content for a fixed user base

Standalone operation

May be transparent

Commodity product/technology

Easy to get 90% correct

Page 9: Web Caching

How: Transparent Proxy

No client configuration

Violates end-to-end paradigm

Client thinks it is talking directly to server

Server thinks it is talking to cache

Implemented as

Pass-through unit

L4 switch

Page 10: Web Caching

How: Reverse Proxy

Designed to offload duties from one or more

specific servers

Data size is limited to size of static content on

the server

Challenge is fast, disk-less operation

Cache consistency is easy

Single point of failure

Page 11: Web Caching

How: Adaptive Caching

ISP Level caching

Cooperating multiple distributed caches

Operate as a cache-mesh based on content

demand

Multicast for group membership (GCS)

Content Routing Protocol sends request to

the appropriate cache within the mesh

Page 12: Web Caching

How: Push Caching

Send the data out proactively

Content Delivery Networks

Paid for by data providers

More on this later!

Page 13: Web Caching

How: Active Caching

Use an applet inside of the cache to

customize dynamic pages on the fly

How do you identify dynamic pages?

Where does the custom data come from?

Who is going to pay for this service?

Page 14: Web Caching

How: Streaming Caches

What about streaming content

Movies

Audio

Proprietary streaming protocols

Challenge is to maintain Quality of content

and service

Who pays for this?

Page 15: Web Caching

What: Content and Protocols

Mostly Static Content HTML XML GIF AVI EXE Etc.

Page 16: Web Caching

What: Content and Protocols

HTTP 1.0 Basic protocol Send Request based on fix number of verbs

GET HEAD POST

Receive response, meta-data, content

Page 17: Web Caching

What: Content and Protocols

HTTP Request

Request = Simple-Request | Full-Request

Simple-Request = "GET" SP Request-URI CRLF

Full-Request = Request-Line ; * ( General-Header ;

| Request-Header ;| Entity-Header ) ;

CRLF[ Entity-Body ]

Page 18: Web Caching

What: Content and Protocols

Example: GET /pub/www/index.html HTTP/1.0

Response:HTTP/1.1 200 OKServer: Microsoft-IIS/5.0Date: Sat, 19 Oct 2002 05:46:53 GMTExpires: Sun, 20 Oct 2002 16:00:00 GMTContent-Length: 2291Content-Type: text/htmlCache-control: private

Page 19: Web Caching

What: Content and Protocols

Example “if-modified-since”:GET /pub/www/index.html HTTP/1.0If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT

Response:HTTP/1.1 200 OKServer: Microsoft-IIS/5.0Date: Thu, 13 Jul 2000 05:46:53 GMTExpires: Sun, 20 Oct 2002 16:00:00 GMTContent-Length: 2291Content-Type: text/htmlCache-control: private

Page 20: Web Caching

What: Content and Protocols

Example “if-modified-since”:

GET /pub/www/index.html HTTP/1.0

If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT

Response:

HTTP/1.1 304 Not Modified

Page 21: Web Caching

Basic caching algorithm

Pages may be

Fresh: up-to-date

Expired: current date > expiration date

Stale: “old”

Page 22: Web Caching

Basic caching algorithm - #2

If (page is in the cache)

if ( page is expired or stale )

Get from server - if-modified-since

If not modified, Get from cache

Get from cache

Else

Get from Server

Page 23: Web Caching

Basic caching algorithm - #3

If cache has space

Store the file

Else

1. Delete expired from cache

2. Delete stale from cache

3. Delete LRU from cache

4. Delete largest/smallest from cache?

Page 24: Web Caching

Agenda

Caching: Why, Where, How, What

Some empirical data: Zipf’s Law

Content Delivery Networks

Bibliography

Page 25: Web Caching

Zipf’s law

Zipf’s law: The frequency of an event P as a function of rank i is a power law function:

Pi = Ω / iα where α ≤ 1

Page 26: Web Caching

Zipf’s law

Observed to be true for

Frequency of written words in English texts

Population of cities Income of a company as a function of

rank

Page 27: Web Caching

Zipf’s law and web access

For a given server, page access by rank follows Zipf’s law

Web requests from a fixed population of users follows Zipf’s law 0.64 < α < 0.83

Page 28: Web Caching

Observations

Top %1 of all documents account for %20 - %35 of proxy requests

Top %10 account for %45 - %55 of requests

It takes %25 to %40 of all documents to account for %70 of requests

It takes %70 to %80 of all documents to account for %90 of requests

Page 29: Web Caching

Observations

Page 30: Web Caching

Observations

For an infinite sized cache, the hit-ratio for a web-proxy grows in a log-like fashion as a function of the client population of the proxy and the number of requests seen by the proxy.

Page 31: Web Caching

Observations

The hit-ratio of a web cache grows in a log-like fashion as a function of the cache size.

Page 32: Web Caching

Observations

Locality of Reference

The probability that a document will be referenced k requests after it was last referenced is roughly proportional to 1/k.

Page 33: Web Caching

Observations - NOT

There is very little correlation between access frequency and document size

There is no correlation between access frequency and the change rate of a document

No single web server contributes to most of the popular pages

Page 34: Web Caching

Zipf’s Law and Caching

Discussion

How does this help in cache design?

Are there any business implications?

Page 35: Web Caching

Agenda

Caching: Why, Where, How, What

Some empirical data: Zipf’s Law

Content Delivery Networks

Bibliography

Page 36: Web Caching

CDN

“Traditional” CDN

Dirty Secrets

P2P content delivery systems

Page 37: Web Caching

Reverse

ProxyReverse

ProxyReverse

Proxy

Intranet

Why use a CDN?

Browser

Local ISP

cacheL4 Switch

Data Center

ISPcdn

cache

cache

Content

ServerContent

ServerContent

ServerContent

Server

Reverse

Proxy

Browsercache

Browsercache

cdn

Page 38: Web Caching

What is CDN?

Content Deliver Networks = PUSH

PUSH = Prefetch

Page 39: Web Caching

CDN Mechanisms

DNS redirection Complete Partial

URL rewrite

Page 40: Web Caching

A DNS-redirecting CDN

DNSredirector

Client

HTTPserver

HTTPserver

HTTPserver

A

B

C

example.com ?

B

GET http://example.com/foo

http://example.com/foo

Network Model

Network Model

Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt

Originalserver

Page 41: Web Caching

CDN DNS Full Redirection

(Semi)automatic mechanism to replicate original site on CDN servers

Replace original DNS entry with enhanced DNS server that uses knowledge of network and server load to direct clients to appropriate CDN server

TTL on DNS entries are very short

Adero, NetCaching, IntelliDNS

Page 42: Web Caching

CDN DNS Partial Redirection

Statically modify selected URL’s within pages to point to CDN service

Replicate selected objects to CDN service

Redirect clients of selected URL’s using enhanced DNS server that uses knowledge of network and server load

Akamai, Digital Island, MirrorImage, SolidSpeed, Speedera

Page 43: Web Caching

CDN rewrite

Modify pages at the origin server on the fly

Change embedded URL’s based on up-to-date knowledge of the network and CDN server loads

Does not require additional DNS lookups

Fasttide, Clearway

Page 44: Web Caching

Measuring a CDN’s performance

Two papers

K.L.Johnson,J.F.Carr,M.S.Day,and M.F.Kaashoek,”The measured performance of content distribution networks,”in Proceedings of the 5th International Web Caching Workshop and Content Delivery Workshop,(Lisbon,Portugal),May 2000.

B. Krishnamurthy,C. Wills,Y. Zhang, “On the Use and Performance of Content Distribution Networks” in ACM SIGCOMM INTERNET MEASUREMENT WORKSHOP 2001.

Page 45: Web Caching

The measured performance of content distribution networks

Client Actions

R: Resolve domain name F: Fetch content Ordinary client use of CDN: RF Instead of doing (RF)+ we do R+ then F+

This allows us to compare the server chosen to some other servers that could have been chosen, over a large number of fetches.

Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt

Page 46: Web Caching

The measured performance of content distribution networks

Procedure

R+: Collect a set of servers by repeated DNS queries to a variety of name servers over a number of hours

F+: Fetch a particular piece of content from each member of the set, measuring latency

Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt

Page 47: Web Caching

The measured performance of content distribution networks

Important Details

Interleaved fetches Fetch1 at server1, fetch1 at server2, etc. Not fetch1 at server1, fetch2 at server1, etc.

Unmeasured fetch before measured fetch Avoids cache misses

Measure only HTTP fetch latency CDN not penalized for cost of DNS resolution

Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt

Page 48: Web Caching

The measured performance of content distribution networks: Looking at these graphs

Note: log plot of latency Gray line: cumulative

distribution at one server Red line: cumulative

distribution at all servers Blue line: cumulative

distribution at CDN

Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt

Page 49: Web Caching

The measured performance of content distribution networks

Cumulative Distribution

Right way to look at this data Want to understand frequency and magnitude

of bad choices Consistent = vertical Fast = to the left

Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt

Page 50: Web Caching

The measured performance of content distribution networks

Results

Akamai does a better job than Digital Island

Neither does a particularly good job of selecting the optimal server

Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt

Page 51: Web Caching

The measured performance of content distribution networks What’s wrong with this study?

Focus is on choice of server

Cost of DNS is explicitly excluded

How does this relate to client performance?

Page 52: Web Caching

Measuring a CDN’s performance

Two papers

K.L.Johnson,J.F.Carr,M.S.Day,and M.F.Kaashoek,”The measured performance of content distribution networks,”in Proceedings of the 5th International Web Caching Workshop and Content Delivery Workshop,(Lisbon,Portugal),May 2000.

B. Krishnamurthy,C. Wills,Y. Zhang, “On the Use and Performance of Content Distribution Networks” in ACM SIGCOMM INTERNET MEASUREMENT WORKSHOP 2001.

Page 53: Web Caching

On the Use and Performance of Content Distribution Networks

Focus is on client perceived performance

Build canonical web page with images from CDN server

Page 54: Web Caching

On the Use and Performance of Content Distribution Networks

If each CDN serves different content, then how did they create comparable pages?

Size matters!

Select images of (almost) identical sizes from each of the CDN services

Page 55: Web Caching

On the Use and Performance of Content Distribution Networks Step 1:

For services using only DNS redirection, get an IP address from the DNS server

For services using rewriting, get the page and extract the CDN content server from the page

Amortize DNS lookup time over all images in this page

Page 56: Web Caching

On the Use and Performance of Content Distribution Networks Step 2:

Download all the images from the IP address of the identified server

Throw this data away The purpose is to make sure that there are no

cache misses

Page 57: Web Caching

On the Use and Performance of Content Distribution Networks Step 3:

Download all the images from the IP address of the identified server just like a browser would (4 in parallel)

Repeat every 30 minutes over a period of 24 hours with a 10 minute jitter

Page 58: Web Caching

On the Use and Performance of Content Distribution Networks

Results

Page 59: Web Caching

On the Use and Performance of Content Distribution Networks

Four Conclusions

1. Forcing a DNS lookup in the critical path of resource retrieval, does not generally result in better server choices

2. The download time from a previously selected server is often better than from the download time from the newly selected server

3. CDN servers are generally not loaded so frequent DNS lookup is not helpful

4. It makes sense for CDNs to increase the DNS TTL given to a client unless the servers are known to be loaded

Page 60: Web Caching

On the Use and Performance of Content Distribution Networks Is this a better study?

More detailed results

Relates to observed performance

A good marketing white paper

What did we learn?

Page 61: Web Caching

Dirty Secrets of the CDN world

CDNs are tremendously underutilized

CDNs are over-architected

The value of a CDN is its remote presence in the ISP. Not in its ability to load balance

Remember the ISP Interconnect?

Page 62: Web Caching

P2P content delivery systems

PUSH content to the leaf nodes

Server other leaf nodes from the edges

Kontiki

Content Manager

clientclient

1

3

2

Page 63: Web Caching

P2P CDN

Four Challenges

1. Aggregate input streams

2. Deal with unstable peers

3. Manage Malicious peers

4. Who really pays for this?

Page 64: Web Caching

P2P Caching?

Discussion:

Is this a good idea? What are the issues? Where is the payback?

Page 65: Web Caching

Agenda

Caching: Why, Where, How, What

Some empirical data: Zipf’s Law

Content Delivery Networks

Bibliography

Page 66: Web Caching

Bibliography Gray, Shenoy, Rules of Thumb in Data Engineering 1999, Revised March 2000.

Microsoft Research MS-TR-99-100 Berners-Lee, Fielding, Frystyk, Hypertext Transfer Protocol -- HTTP/1.0, IETF

RFC 1945, http://www.w3.org/Protocols/rfc1945/rfc1945 Fielding, Gettys, Mogul, Frystyk, Masinter, Leach, Berners-Lee, Hypertext

Transfer Protocol -- HTTP/1.1, ftp://ftp.isi.edu/in-notes/rfc2616.txt Greg Barish and Katia Obraczka. World Wide Web Caching: Trends and

Techniques. IEEE Communications, May 2000. http://www.isi.edu/people/katia/cache-survey.pdf.

Breslau, Cao, Fan, Phillips, Shenker, Web Caching and Zipf-like distributions: Evidence and Implications, IEEE Infocom 1999

K.L.Johnson,J.F.Carr,M.S.Day,and M.F.Kaashoek,”The measured performance of content distribution networks,”in Proceedings of the 5th International Web Caching Workshop and Content Delivery Workshop,(Lisbon,Portugal),May 2000. www.terena.nl/conf/wcw/Proceedings/S4/S4-1.pdf

B. Krishnamurthy,C. Wills,Y. Zhang, “On the Use and Performance of Content Distribution Networks” in ACM SIGCOMM INTERNET MEASUREMENT WORKSHOP 2001. http://www.icir.org/vern/imw-2001/imw2001-papers/10.pdf

Page 67: Web Caching

Bibliography

“Zipf Distribution of Web Site Popularity”, http://www.useit.com/alertbox/zipf.html

S. Gribble, E. Brewer, “System Design Issues for Internet Middleware Services: Deductions from a Large Client Trace”, Proceedings of the USENIX Symposium on Internet Technologies and Systems Monterey,California,December 1997

“The Internet is a little bit broken”, http://www.internap.com/about/theproblem.html

“Reliable Internet Connectivity with BGP, Chapter 7, Influencing Entrance Selection”, http://www.bgpbook.com/archpolicyenter.html

A. Cockburn, B. McKenzie, “What do Web Users Do? An Empirical Analysis of Web Use”, http://www.cosc.cantebury.ac.nz/~andy/papers/ijhcsAnalysis.pdf