Web Caching
Elliot Jaffe
Presentation for The Seminar on Database and Internet
Hebrew University, Fall 2002
Agenda
Caching: Why, Where, How, What
Some empirical data: Zipf’s Law
Content Delivery Networks
Bibliography
Why cache?
Number of unique pages: 800M < X < 2.2B
Number of unique web sites: 8,500,000
static pages: 30% - 40%
pages revisited: 80%
expected hit-rate: 24% - 32%
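The expected hit-rate range appears to be the product of the two figures above it. This is a reading of the slide rather than something it states, and it assumes that only static pages are cacheable and that revisits are spread evenly over static and dynamic pages:

```latex
\text{hit rate} \approx \text{static fraction} \times \text{revisit fraction}
\qquad
0.30 \times 0.80 = 0.24, \qquad 0.40 \times 0.80 = 0.32
```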
Why cache?
Bandwidth
Latency
Performance = Response Time
Server Load
Failure Redundancy
Where
[Diagram: where caches sit along the request path. Labels: Browser cache, Intranet, Local ISP cache, L4 Switch, ISP, CDN, Data Center, Reverse Proxy, Content Server]
Hot-potato routing
Get traffic off of your network as soon as possible
Bounces traffic around the internet
Increases chance of dropped packet
Increases latency
[Diagram: route from "You are here" to "Destination"]
How: Types of Caches
Simple Proxy
Transparent Proxy
Reverse Proxy
Adaptive Caching
Push Caching
Active Caching
Streaming Caches
How: Simple Proxy
Harvest/Squid
Provide web content for a fixed user base
Standalone operation
May be transparent
Commodity product/technology
Easy to get 90% correct
How: Transparent Proxy
No client configuration
Violates end-to-end paradigm
Client thinks it is talking directly to server
Server thinks it is talking to cache
Implemented as
Pass-through unit
L4 switch
How: Reverse Proxy
Designed to offload duties from one or more
specific servers
Data size is limited to size of static content on
the server
Challenge is fast, disk-less operation
Cache consistency is easy
Single point of failure
How: Adaptive Caching
ISP Level caching
Cooperating multiple distributed caches
Operate as a cache-mesh based on content
demand
Multicast for group membership (GCS)
Content Routing Protocol sends request to
the appropriate cache within the mesh
How: Push Caching
Send the data out proactively
Content Delivery Networks
Paid for by data providers
How: Active Caching
Use an applet inside of the cache to
customize dynamic pages on the fly
How do you identify dynamic pages?
Where does the custom data come from?
Who is going to pay for this service?
How: Streaming Caches
What about streaming content
Movies
Audio
Proprietary streaming protocols
Challenge is to maintain Quality of content
and service
Who pays for this?
What: Content and Protocols
Mostly Static Content
HTML
XML
GIF
AVI
EXE
Etc.
What: Content and Protocols
HTTP 1.0
Basic protocol
Send request based on a fixed number of verbs: GET, HEAD, POST
Receive response, meta-data, content
What: Content and Protocols
HTTP Request
Request        = Simple-Request | Full-Request
Simple-Request = "GET" SP Request-URI CRLF
Full-Request   = Request-Line
                 *( General-Header
                  | Request-Header
                  | Entity-Header )
                 CRLF
                 [ Entity-Body ]
What: Content and Protocols
Example:
GET /pub/www/index.html HTTP/1.0
Response:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Sat, 19 Oct 2002 05:46:53 GMT
Expires: Sun, 20 Oct 2002 16:00:00 GMT
Content-Length: 2291
Content-Type: text/html
Cache-control: private
What: Content and Protocols
Example “if-modified-since”:
GET /pub/www/index.html HTTP/1.0
If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
Response:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Thu, 13 Jul 2000 05:46:53 GMT
Expires: Sun, 20 Oct 2002 16:00:00 GMT
Content-Length: 2291
Content-Type: text/html
Cache-control: private
What: Content and Protocols
Example “if-modified-since”:
GET /pub/www/index.html HTTP/1.0
If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
Response:
HTTP/1.1 304 Not Modified
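The two "if-modified-since" examples differ only in whether the server's copy changed after the date the client sent. A minimal sketch of that decision, assuming both dates arrive as HTTP-date strings; the function name conditional_get_status is hypothetical and not part of any server:

```python
from email.utils import parsedate_to_datetime


def conditional_get_status(if_modified_since: str, last_modified: str) -> int:
    """Return the status an origin server would send for a conditional GET.

    Both arguments are HTTP-date strings such as
    'Sat, 19 Oct 2002 19:43:31 GMT'.
    """
    client_copy = parsedate_to_datetime(if_modified_since)
    server_copy = parsedate_to_datetime(last_modified)
    # 304: the resource has not changed since the client's date, no body sent.
    # 200: the resource changed after the client's date, full body follows.
    return 304 if server_copy <= client_copy else 200
```

A cache that receives 304 keeps serving its stored copy; on 200 it replaces the stored copy with the new body.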
Basic caching algorithm
Pages may be
Fresh: up-to-date
Expired: current date > expiration date
Stale: “old”
Basic caching algorithm - #2
If (page is in the cache)
    If (page is expired or stale)
        Get from server with if-modified-since
        If not modified, get from cache
        Otherwise, use the new copy from the server
    Else
        Get from cache
Else
    Get from server
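The lookup logic above can be sketched in Python. This is an illustration under stated assumptions, not the algorithm of any particular proxy: the Entry class, the get_page function, and the fetch callback signature are all hypothetical, and "stale" is folded into "expired" because the slides do not define it precisely.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class Entry:
    """One cached page (hypothetical structure for this sketch)."""

    def __init__(self, body: bytes, expires: float, last_modified: float):
        self.body = body
        self.expires = expires              # absolute expiry, seconds since epoch
        self.last_modified = last_modified  # origin's modification time


# Hypothetical origin interface: fetch(url, if_modified_since) returns
# (status, body, expires, last_modified), where status is 200 or 304.
Fetch = Callable[[str, Optional[float]], Tuple[int, bytes, float, float]]


def get_page(cache: Dict[str, Entry], url: str, fetch: Fetch,
             now: Optional[float] = None) -> bytes:
    now = time.time() if now is None else now
    entry = cache.get(url)
    if entry is not None:
        if now <= entry.expires:
            return entry.body               # fresh: get from cache
        # Expired: revalidate with a conditional request.
        status, body, expires, last_modified = fetch(url, entry.last_modified)
        if status == 304:
            entry.expires = expires         # not modified: keep the cached body
            return entry.body
    else:
        # Not in the cache: unconditional request to the server.
        status, body, expires, last_modified = fetch(url, None)
    cache[url] = Entry(body, expires, last_modified)
    return body
```

Passing now explicitly keeps the function easy to exercise with a fake origin; a real proxy would also have to honor headers such as Cache-control, which this sketch ignores.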
Basic caching algorithm - #3
If cache has space
    Store the file
Else
    1. Delete expired from cache
    2. Delete stale from cache
    3. Delete LRU from cache
    4. Delete largest/smallest from cache?
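A minimal sketch of the replacement order above, assuming a byte-bounded cache. The class SimpleCache and its methods are hypothetical. Only steps 1 (expired) and 3 (LRU) are implemented; step 2 is left out because the slides do not define "stale" precisely, and step 4 is left out because the slide itself poses it as an open question.

```python
from collections import OrderedDict


class SimpleCache:
    """Byte-bounded cache: evicts expired entries first, then least recently used."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0
        self.items = OrderedDict()  # url -> (body, expires), least recent first

    def get(self, url: str) -> bytes:
        body, _ = self.items[url]   # raises KeyError on a miss
        self.items.move_to_end(url)  # mark as most recently used
        return body

    def _remove(self, url: str) -> None:
        body, _ = self.items.pop(url)
        self.used -= len(body)

    def put(self, url: str, body: bytes, expires: float, now: float) -> bool:
        if url in self.items:
            self._remove(url)
        if len(body) > self.capacity:
            return False            # object can never fit
        if self.used + len(body) > self.capacity:
            # Step 1: delete expired entries.
            expired = [k for k, (_, exp) in self.items.items() if exp < now]
            for key in expired:
                self._remove(key)
        # Step 3: delete least recently used entries until the object fits.
        while self.used + len(body) > self.capacity:
            self._remove(next(iter(self.items)))
        self.items[url] = (body, expires)
        self.used += len(body)
        return True
```

Note that an expired entry is evicted even if it was used recently, which is the ordering the slide gives.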
Zipf’s law
Zipf’s law: The frequency of an event P as a function of rank i is a power law function:
$P_i = \Omega / i^{\alpha}$, where $\alpha \le 1$
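For a finite collection of documents, the constant Ω is fixed by requiring the probabilities to sum to one. The symbol N (the number of documents) is introduced here for illustration and does not appear on the slide:

```latex
\sum_{i=1}^{N} P_i = 1
\quad\Longrightarrow\quad
\Omega = \left( \sum_{i=1}^{N} \frac{1}{i^{\alpha}} \right)^{-1}
```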
Zipf’s law
Observed to be true for
Frequency of written words in English texts
Population of cities
Income of a company as a function of rank
Zipf’s law and web access
For a given server, page access by rank follows Zipf’s law
Web requests from a fixed population of users follow Zipf’s law with 0.64 < α < 0.83
Observations
Top 1% of all documents account for 20% - 35% of proxy requests
Top 10% account for 45% - 55% of requests
It takes 25% to 40% of all documents to account for 70% of requests
It takes 70% to 80% of all documents to account for 90% of requests
Observations
For an infinite sized cache, the hit-ratio for a web-proxy grows in a log-like fashion as a function of the client population of the proxy and the number of requests seen by the proxy.
Observations
The hit-ratio of a web cache grows in a log-like fashion as a function of the cache size.
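The diminishing return from a larger cache can be illustrated with a small model. This is a sketch under strong assumptions that the slides do not make: requests are independent draws from a Zipf-like distribution, all documents have equal size, and the cache always holds exactly the most popular documents. The function names are hypothetical.

```python
def zipf_probabilities(n_docs: int, alpha: float) -> list:
    """Request probability for ranks 1..n_docs under P_i proportional to 1 / i**alpha."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_docs + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def ideal_hit_ratio(n_docs: int, alpha: float, cache_slots: int) -> float:
    """Hit ratio of a cache that always holds the cache_slots most popular documents."""
    probs = zipf_probabilities(n_docs, alpha)
    return sum(probs[:cache_slots])


if __name__ == "__main__":
    # alpha = 0.75 is an arbitrary value inside the 0.64 to 0.83 range quoted earlier.
    for slots in (10, 100, 1_000, 10_000):
        print(slots, round(ideal_hit_ratio(100_000, 0.75, slots), 3))
```

Each tenfold increase in cache size adds hits, but every additional slot contributes less than the one before it, which matches the qualitative shape described above.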
Observations
Locality of Reference
The probability that a document will be referenced k requests after it was last referenced is roughly proportional to 1/k.
Observations - NOT
There is very little correlation between access frequency and document size
There is no correlation between access frequency and the change rate of a document
No single web server contributes to most of the popular pages
Zipf’s Law and Caching
Discussion
How does this help in cache design?
Are there any business implications?
CDN
“Traditional” CDN
Dirty Secrets
P2P content delivery systems
Why use a CDN?
[Diagram: the same request path as on the "Where" slide. Labels: Browser cache, Intranet, Local ISP cache, L4 Switch, ISP, CDN, Data Center, Reverse Proxy, Content Server]
What is CDN?
Content Delivery Networks = PUSH
PUSH = Prefetch
CDN Mechanisms
DNS redirection: complete or partial
URL rewrite
A DNS-redirecting CDN
[Diagram: the client asks the DNS redirector to resolve example.com; of HTTP servers A, B and C, the redirector answers B; the client then sends GET http://example.com/foo to B]
Network Model
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
[Diagram label: Original server]
CDN DNS Full Redirection
(Semi)automatic mechanism to replicate original site on CDN servers
Replace original DNS entry with enhanced DNS server that uses knowledge of network and server load to direct clients to appropriate CDN server
TTLs on DNS entries are very short
Adero, NetCaching, IntelliDNS
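A minimal sketch of the server-selection step in full DNS redirection, assuming the enhanced DNS server knows each CDN server's region and current load. The function names, the cost formula, the 20-second TTL, and the documentation-range IP addresses are all assumptions for illustration, not how any of the named products work.

```python
from typing import Dict, List


def pick_server(client_region: str, servers: List[Dict]) -> str:
    """Choose a CDN server IP for a client.

    Each server is a dict with 'ip', 'region' and 'load' (0.0 to 1.0).
    Hypothetical cost: a fixed penalty for a different region, plus load.
    """
    def cost(server: Dict) -> float:
        distance_penalty = 0.0 if server["region"] == client_region else 1.0
        return distance_penalty + server["load"]

    return min(servers, key=cost)["ip"]


def dns_answer(name: str, client_region: str, servers: List[Dict],
               ttl_seconds: int = 20) -> Dict:
    """Build a DNS-style answer; the short TTL forces clients to re-resolve often."""
    return {"name": name, "A": pick_server(client_region, servers),
            "ttl": ttl_seconds}
```

With this cost function a nearby server wins unless it is fully loaded and a remote one is idle; a real redirector would use measured network distance rather than a region label.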
CDN DNS Partial Redirection
Statically modify selected URLs within pages to point to CDN service
Replicate selected objects to CDN service
Redirect clients of selected URLs using enhanced DNS server that uses knowledge of network and server load
Akamai, Digital Island, MirrorImage, SolidSpeed, Speedera
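The static modification step of partial redirection can be sketched as a rewrite of embedded image URLs. This is an illustration only: the function name, the regular expression, the CDN host cdn.example.net, and the URL layout that embeds the origin host in the path are assumptions, not the scheme of any listed vendor.

```python
import re

# Matches the src attribute of <img> tags when it is a site-relative path
# (starts with a single "/"). Other URLs are left untouched.
IMG_SRC = re.compile(r'(<img\b[^>]*\bsrc=")(/(?!/)[^"]*)(")', re.IGNORECASE)


def rewrite_image_urls(html: str, cdn_host: str, origin_host: str) -> str:
    """Point site-relative image URLs at a CDN host, leaving the page itself alone."""
    def repl(match: re.Match) -> str:
        prefix, path, suffix = match.groups()
        return f"{prefix}http://{cdn_host}/{origin_host}{path}{suffix}"

    return IMG_SRC.sub(repl, html)
```

For example, with cdn_host "cdn.example.net" and origin_host "www.example.com", an image at /images/logo.gif would be served from http://cdn.example.net/www.example.com/images/logo.gif, while links to HTML pages keep pointing at the origin server.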
CDN rewrite
Modify pages at the origin server on the fly
Change embedded URLs based on up-to-date knowledge of the network and CDN server loads
Does not require additional DNS lookups
Fasttide, Clearway
Measuring a CDN’s performance
Two papers
K. L. Johnson, J. F. Carr, M. S. Day, and M. F. Kaashoek, “The measured performance of content distribution networks,” in Proceedings of the 5th International Web Caching Workshop and Content Delivery Workshop (Lisbon, Portugal), May 2000.
B. Krishnamurthy, C. Wills, Y. Zhang, “On the Use and Performance of Content Distribution Networks,” in ACM SIGCOMM Internet Measurement Workshop 2001.
The measured performance of content distribution networks
Client Actions
R: Resolve domain name
F: Fetch content
Ordinary client use of CDN: RF
Instead of doing (RF)+ we do R+ then F+
This allows us to compare the server chosen to some other servers that could have been chosen, over a large number of fetches.
The measured performance of content distribution networks
Procedure
R+: Collect a set of servers by repeated DNS queries to a variety of name servers over a number of hours
F+: Fetch a particular piece of content from each member of the set, measuring latency
The measured performance of content distribution networks
Important Details
Interleaved fetches: fetch1 at server1, fetch1 at server2, etc.; not fetch1 at server1, fetch2 at server1, etc.
Unmeasured fetch before measured fetch: avoids cache misses
Measure only HTTP fetch latency: CDN not penalized for cost of DNS resolution
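The F+ phase with these details can be sketched as follows. This is not the authors' measurement harness: the function name and the fetch and clock parameters are hypothetical, the server list is assumed to have been collected already by the R+ phase, and the unmeasured fetch is placed before every measured fetch as one reading of the slide.

```python
import time
from typing import Callable, Dict, List


def measure_interleaved(servers: List[str], rounds: int,
                        fetch: Callable[[str], object],
                        clock: Callable[[], float] = time.perf_counter
                        ) -> Dict[str, List[float]]:
    """Measure HTTP fetch latency per server, visiting servers in turn each round."""
    latencies: Dict[str, List[float]] = {server: [] for server in servers}
    for _ in range(rounds):
        for server in servers:      # interleave: one fetch per server per round
            fetch(server)           # unmeasured fetch warms the server's cache
            start = clock()
            fetch(server)           # measured fetch: DNS time is not included
            latencies[server].append(clock() - start)
    return latencies
```

Interleaving means that a transient network problem affects one sample at every server rather than a run of samples at a single server.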
The measured performance of content distribution networks: Looking at these graphs
Note: log plot of latency
Gray line: cumulative distribution at one server
Red line: cumulative distribution at all servers
Blue line: cumulative distribution at CDN
The measured performance of content distribution networks
Cumulative Distribution
Right way to look at this data
Want to understand frequency and magnitude of bad choices
Consistent = vertical
Fast = to the left
The measured performance of content distribution networks
Results
Akamai does a better job than Digital Island
Neither does a particularly good job of selecting the optimal server
The measured performance of content distribution networks
What’s wrong with this study?
Focus is on choice of server
Cost of DNS is explicitly excluded
How does this relate to client performance?
On the Use and Performance of Content Distribution Networks
Focus is on client perceived performance
Build canonical web page with images from CDN server
On the Use and Performance of Content Distribution Networks
If each CDN serves different content, then how did they create comparable pages?
Size matters!
Select images of (almost) identical sizes from each of the CDN services
On the Use and Performance of Content Distribution Networks Step 1:
For services using only DNS redirection, get an IP address from the DNS server
For services using rewriting, get the page and extract the CDN content server from the page
Amortize DNS lookup time over all images in this page
On the Use and Performance of Content Distribution Networks Step 2:
Download all the images from the IP address of the identified server
Throw this data away
The purpose is to make sure that there are no cache misses
On the Use and Performance of Content Distribution Networks Step 3:
Download all the images from the IP address of the identified server just like a browser would (4 in parallel)
Repeat every 30 minutes over a period of 24 hours with a 10 minute jitter
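Step 3 can be sketched as a sampling schedule plus a parallel download. The slide does not say how the jitter is applied, so this sketch assumes a uniform offset of up to 10 minutes either side of each 30-minute mark. The function names and the download callback are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def sample_times(period_min: float = 30.0, jitter_min: float = 10.0,
                 duration_min: float = 24 * 60.0,
                 rng: Optional[random.Random] = None) -> List[float]:
    """Start times in minutes: one per period, each shifted by a random jitter."""
    if rng is None:
        rng = random.Random()
    times = []
    nominal = 0.0
    while nominal < duration_min:
        offset = rng.uniform(-jitter_min, jitter_min)
        times.append(max(0.0, nominal + offset))  # never start before time zero
        nominal += period_min
    return times


def download_all(urls: List[str], download: Callable[[str], object],
                 parallel: int = 4) -> list:
    """Fetch all images with at most `parallel` requests in flight, like a browser."""
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(download, urls))
```

The jitter keeps the samples from lining up with anything periodic on the servers or the network, such as jobs that run on the hour.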
On the Use and Performance of Content Distribution Networks
Results
On the Use and Performance of Content Distribution Networks
Four Conclusions
1. Forcing a DNS lookup in the critical path of resource retrieval does not generally result in better server choices
2. The download time from a previously selected server is often better than the download time from the newly selected server
3. CDN servers are generally not loaded, so frequent DNS lookup is not helpful
4. It makes sense for CDNs to increase the DNS TTL given to a client unless the servers are known to be loaded
On the Use and Performance of Content Distribution Networks
Is this a better study?
More detailed results
Relates to observed performance
A good marketing white paper
What did we learn?
Dirty Secrets of the CDN world
CDNs are tremendously underutilized
CDNs are over-architected
The value of a CDN is its remote presence in the ISP, not its ability to load balance
Remember the ISP Interconnect?
P2P content delivery systems
PUSH content to the leaf nodes
Serve other leaf nodes from the edges
Kontiki
[Diagram: a Content Manager and two clients, with steps numbered 1 to 3]
P2P CDN
Four Challenges
1. Aggregate input streams
2. Deal with unstable peers
3. Manage Malicious peers
4. Who really pays for this?
P2P Caching?
Discussion:
Is this a good idea? What are the issues? Where is the payback?
Bibliography
Gray, Shenoy, Rules of Thumb in Data Engineering, 1999, Revised March 2000. Microsoft Research MS-TR-99-100
Berners-Lee, Fielding, Frystyk, Hypertext Transfer Protocol -- HTTP/1.0, IETF RFC 1945, http://www.w3.org/Protocols/rfc1945/rfc1945
Fielding, Gettys, Mogul, Frystyk, Masinter, Leach, Berners-Lee, Hypertext Transfer Protocol -- HTTP/1.1, ftp://ftp.isi.edu/in-notes/rfc2616.txt
Greg Barish and Katia Obraczka, World Wide Web Caching: Trends and Techniques, IEEE Communications, May 2000. http://www.isi.edu/people/katia/cache-survey.pdf
Breslau, Cao, Fan, Phillips, Shenker, Web Caching and Zipf-like distributions: Evidence and Implications, IEEE Infocom 1999
K. L. Johnson, J. F. Carr, M. S. Day, and M. F. Kaashoek, “The measured performance of content distribution networks,” in Proceedings of the 5th International Web Caching Workshop and Content Delivery Workshop (Lisbon, Portugal), May 2000. www.terena.nl/conf/wcw/Proceedings/S4/S4-1.pdf
B. Krishnamurthy, C. Wills, Y. Zhang, “On the Use and Performance of Content Distribution Networks,” in ACM SIGCOMM Internet Measurement Workshop 2001. http://www.icir.org/vern/imw-2001/imw2001-papers/10.pdf
Bibliography
“Zipf Distribution of Web Site Popularity”, http://www.useit.com/alertbox/zipf.html
S. Gribble, E. Brewer, “System Design Issues for Internet Middleware Services: Deductions from a Large Client Trace”, Proceedings of the USENIX Symposium on Internet Technologies and Systems, Monterey, California, December 1997
“The Internet is a little bit broken”, http://www.internap.com/about/theproblem.html
“Reliable Internet Connectivity with BGP, Chapter 7, Influencing Entrance Selection”, http://www.bgpbook.com/archpolicyenter.html
A. Cockburn, B. McKenzie, “What do Web Users Do? An Empirical Analysis of Web Use”, http://www.cosc.cantebury.ac.nz/~andy/papers/ijhcsAnalysis.pdf