
High Performance P2P Web Caching
Erik Garrison, Jared Friedman
CS264 Presentation, May 2, 2006


SETI@Home
• Basic idea: people donate computer time to look for aliens
• Delivered more than 9 million CPU-years
• Guinness Book of World Records: largest computation ever
• Many other successful projects (BOINC, Google Compute)
• The point: many people are willing to donate computer resources for a good cause


Wikipedia
• About 200 servers required to keep the site live
• Hosting & hardware costs over $1M per year
• All revenue from donations
• Hard to make ends meet
• Other not-for-profit websites are in a similar situation


Help Wikipedia@Home
• What if people could donate idle computer resources to help host not-for-profit websites?
• They probably would!
• This is the goal of our project


Prior Work
• This doesn't exist yet, but some things are similar
• Content Distribution Networks (Akamai): distributed web hosting for big companies
• CoralCDN/CoDeeN: P2P web caching, like our idea, but a very different design
• Both have some problems


Akamai, the opportunity
• Internet traffic is 'bursty': expensive to build infrastructure to handle flash crowds
• International audience, local servers: sites run slowly in other countries


Akamai, how it works
• Akamai put >10,000 servers around the globe
• Companies subscribe as Akamai clients
• Client content (mostly images and other media) is cached on Akamai's servers
• Tricks with DNS make viewers download content from nearby Akamai servers
• Result: website runs fast everywhere, no worries about flash crowds
• But VERY expensive!


CoralCDN
• P2P web caching
• Probably the closest system to our goal
• Currently in late-stage testing on PlanetLab
• Uses an overlay and a 'distributed sloppy hash table'
• Very easy to use: just append '.nyud.net' to a URL and Coral handles it (e.g., http://example.com/page becomes http://example.com.nyud.net/page)
• Unfortunately ...


Coral: Problems
• Currently very slow; this might improve in later versions, or it might be due to the overlay structure
• Security: volunteer nodes can respond with fake data
• Any site can use Coral to help reduce load: just append .nyud.net to their internal links
• Decentralization makes optimization hard (more on this later)


Our Design Goals
• Fast: Akamai-level performance
• Secure: pages served are always genuine
• Fast updates possible
• Must greatly reduce demands on the main site, but this cannot compromise the first 3


Our Design
• Node/supernode structure
• Take advantage of extremely heterogeneous performance characteristics
• Custom DNS server redirects incoming requests to a nearby supernode
• Supernode forwards the request to a nearby ordinary node
• Node replies to the user


Our Design
1. User goes to wikipedia.org
2. DNS server resolves wikipedia.org to a supernode
3. Supernode forwards the request to an ordinary node that has the requested document
4. Node retrieves the document and sends it to the user (sketched in code below)
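A minimal sketch of this two-hop flow, written in Python purely for illustration (the presentation contains no code). The Node and Supernode classes, the region-based "nearby" test, and the random forwarding policy are all assumptions made for the example, not details from the deck.

```python
import random

class Node:
    """A volunteer machine that caches documents for the site."""
    def __init__(self, name, region):
        self.name = name
        self.region = region
        self.documents = set()      # URLs currently cached on this node

class Supernode:
    """A well-connected volunteer that routes requests to ordinary nodes."""
    def __init__(self, name, region, nodes):
        self.name = name
        self.region = region
        self.nodes = nodes

    def handle(self, url):
        # Forward to a node that already holds the requested document.
        holders = [n for n in self.nodes if url in n.documents]
        if holders:
            node = random.choice(holders)   # placeholder selection policy
            return f"{node.name} serves {url}"
        return f"miss: fetch {url} from the origin server"

def dns_resolve(user_region, supernodes):
    """Custom DNS step: map the site's hostname to a close supernode."""
    # Toy distance model: same region counts as 'close'.
    return min(supernodes, key=lambda s: 0 if s.region == user_region else 1)

# Example: a European user is routed to the European supernode.
node = Node("node-a", "eu")
node.documents.add("http://wikipedia.org/wiki/P2P")
supernode = Supernode("super-eu", "eu", [node])
chosen = dns_resolve("eu", [supernode])
print(chosen.handle("http://wikipedia.org/wiki/P2P"))
```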


Performance
• Requests are answered in only 2 hops
• DNS server resolves to a geographically close supernode
• Supernode avoids sending requests to slow or overloaded nodes
• All parts of a page (e.g., HTML and images) should be served by a single node (see the selection sketch below)
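One hypothetical policy a supernode might apply to satisfy these goals. The PeerNode record, the load threshold, and the latency field are invented for this sketch; the presentation does not specify how nodes are scored.

```python
from dataclasses import dataclass, field

@dataclass
class PeerNode:
    name: str
    latency_ms: float    # measured latency from the supernode
    load: float          # fraction of serving capacity in use
    documents: set = field(default_factory=set)

def pick_node(nodes, page_urls, max_load=0.8):
    """Pick one node to serve an entire page (HTML plus images):
    it must hold every object of the page, must not be overloaded,
    and among the remaining candidates we take the lowest latency."""
    wanted = set(page_urls)
    candidates = [n for n in nodes
                  if wanted <= n.documents and n.load < max_load]
    if not candidates:
        return None   # fall back to the origin, or replicate the page first
    return min(candidates, key=lambda n: n.latency_ms)

# Example: the overloaded node is skipped even though it is faster.
fast_busy = PeerNode("busy", 10.0, 0.95, {"/a.html", "/a.png"})
ok = PeerNode("ok", 40.0, 0.30, {"/a.html", "/a.png"})
print(pick_node([fast_busy, ok], ["/a.html", "/a.png"]).name)   # "ok"
```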


Security
• Have to check nodes' accuracy
• First line of defense: encrypt local content
• May delay attacks, but won't stop them


Security
• More serious defense: let users check the volunteer nodes!
• Add a JavaScript wrapper to the website that requests the pages using AJAX
• With some probability, the AJAX script will compute the MD5 of the page it got and send it to a trusted central node
• Central node kicks out nodes that frequently return invalid MD5 sums (see the sketch below)
• Offload processing not just to nodes, but to users, with zero install
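A sketch of the central node's side of this check, in Python for illustration. The report format, the VerificationServer class, and the three-strikes eviction threshold are assumptions for the example, not parameters given in the presentation.

```python
import hashlib
from collections import Counter

class VerificationServer:
    """Central node that receives spot-check MD5 reports from users'
    AJAX wrappers and evicts volunteer nodes that serve fake pages."""
    def __init__(self, known_pages, max_bad_reports=3):
        # known_pages maps URL -> genuine page bytes held by the main site.
        self.good_md5 = {url: hashlib.md5(body).hexdigest()
                         for url, body in known_pages.items()}
        self.bad_reports = Counter()
        self.max_bad_reports = max_bad_reports
        self.banned = set()

    def report(self, node_id, url, reported_md5):
        """Handle one probabilistic spot-check sent by a user."""
        if reported_md5 != self.good_md5.get(url):
            self.bad_reports[node_id] += 1
            if self.bad_reports[node_id] >= self.max_bad_reports:
                self.banned.add(node_id)   # kick the node out of the system

# Example: a node that keeps serving a tampered page gets banned.
server = VerificationServer({"/wiki/P2P": b"genuine content"})
for _ in range(3):
    server.report("node-x", "/wiki/P2P",
                  hashlib.md5(b"fake content").hexdigest())
print(server.banned)   # {'node-x'}
```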


A Tricky Part
• Supernodes get requests and have to decide which node should answer which request
• Have to load-balance nodes: no overloading
• Popular documents should be replicated across many nodes
• But we don't want to replicate unpopular documents much: conserve storage space
• Lots of conflicting goals!


On the plus side...
• Unlike Coral & CoDeeN, supernodes know a lot of nodes (maybe 100-1000?)
• They can track the performance characteristics of each node
• Make object placement decisions from a central point
• Lots of opportunity to make really intelligent decisions: better use of resources, higher total system capacity, faster response times


Object Placement Problem
• This kind of problem is known as an object placement problem: which nodes do we put which files on?
• Also related to the request routing problem: given the files currently on the nodes, which node do we send this particular request to?
• These problems are basically unsolved for our scenario
• Analytical solutions have been done for very simplified, somewhat different cases
• We suspect a useful analytic solution is impossible here (a toy heuristic is sketched below)
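To make the placement problem concrete, here is one simple popularity-proportional heuristic, in Python. This is emphatically not the heuristic from the presentation (which is never specified); the replica-count formula, the slot budget, and the minimum-replica floor are all invented for the example.

```python
import math

def replica_counts(request_counts, total_slots, min_replicas=1):
    """Assign each document a replica count proportional to its share of
    requests, with a floor of min_replicas so every document stays
    available. May slightly exceed the slot budget; a real policy
    would rebalance against actual node capacities."""
    total_requests = sum(request_counts.values())
    replicas = {}
    for url, count in request_counts.items():
        share = count / total_requests
        replicas[url] = max(min_replicas, math.floor(share * total_slots))
    return replicas

# Example: the hot page gets many copies, cold pages get one each.
print(replica_counts({"/Main_Page": 900, "/Obscure": 5, "/Rare": 5},
                     total_slots=20))
# {'/Main_Page': 19, '/Obscure': 1, '/Rare': 1}
```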


Simulation
• Too hard to solve analytically, so do a simulation
• Goal is to explore different object placement algorithms under realistic scenarios
• Also want to model the performance of the whole system:
  - What cache hit ratios can we get?
  - How does the number/quality of peers affect cache hit ratios?
  - How is user latency affected?
• Built a pretty involved simulation in Erlang (a toy version of the measurement is sketched below)
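The authors' simulation was built in Erlang; the toy Python skeleton below only illustrates the shape of such a measurement. The Zipf-like request distribution, the cache size, and the random-eviction placement policy are assumptions for the sketch, not the deck's actual model.

```python
import random

def simulate(num_docs=1000, cache_slots=100, requests=10_000, zipf_s=1.0):
    """Measure the cache hit ratio of a single node's document cache
    under a Zipf-like request stream with a naive placement policy."""
    # Zipf-like popularity: document i is requested with weight 1/(i+1)^s.
    weights = [1.0 / (i + 1) ** zipf_s for i in range(num_docs)]
    cached = set(random.sample(range(num_docs), cache_slots))
    hits = 0
    for _ in range(requests):
        doc = random.choices(range(num_docs), weights=weights)[0]
        if doc in cached:
            hits += 1
        else:
            # Miss: fetch from the central server and cache the document,
            # evicting a random resident one (placeholder policy).
            cached.remove(random.choice(list(cached)))
            cached.add(doc)
    return hits / requests

print(f"cache hit ratio: {simulate():.2%}")
```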


Simulation Results
• So far, encouraging!
• Main results use a heuristic object placement algorithm
• Can load-balance without creating hotspots up to about 90% of theoretical capacity
• Documents are rarely requested more than once from the central server: close to the theoretical optimum


Next Steps
• Add more detail to the simulation: node churn, a better Internet topology
• Explore update strategies
• Obviously, an actual implementation would be nice, but not likely to happen this week
• What do you think?