
1

The Mystery of Cooperative Web Caching

2

Web caching: a process implemented by a caching proxy to improve the efficiency of the Web. It reduces the delay in retrieving a document from the Internet by decreasing the number of requests that have to be directed to the Internet.

3

Cooperative Web Caching: a set of web caches located in different places in the Internet that cooperate with each other to improve the performance of the system.

4

• The main entities in cooperative web caching are:

Proxy, router, group of proxies and routers

Entity requirements:

Proxy: the proxy must act as a proxy cache.

Router: the router must implement interior gateway and exterior gateway protocols.

Group of proxies and routers: the main requirement is inter-cache communication.

The mystery of cooperative web caching is the inter-cache communication technique.

5

The inter-cache communication techniques

Many protocols have been proposed for inter-cache communication in cooperative web caching.

ICP (Internet Cache Protocol), proposed by Duane Wessels and K. Claffy in 1997

Cache Digest, proposed by Alex Rousskov and Duane Wessels in 1998

Summary Cache, proposed by Pei Cao in 1998

HTCP (Hyper Text Caching Protocol), proposed by P. Vixie and D. Wessels in 2000

CARP (Cache Array Routing Protocol), proposed by Vinod Valloppillil and Keith W. Ross in 1998

6

1. Internet Cache Protocol

ICP is a message-format protocol in which each cache collects information about the existence of a particular web object in the caches of its neighbours by sending an ICP_QUERY message.

The message is composed of a fixed 20-octet header followed by a variable-size payload.

7

Opcode field: 8 bits; an integer that indicates the type of the message: query, hit, miss, denied.

Version field: indicates the number of the ICP version in use.

Message length = header length + payload length, at most 16 Kbytes.

Payload: contains the URL of the requested document, on which the payload length depends.

Header layout (32-bit rows, bit positions 0, 8, 16, 32):

Opcode (bits 0-7) | Version (bits 8-15) | Message length (bits 16-31)

Request Number

Option

Option Data

Sender Address

Payload

The message format
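The header layout described above can be sketched with Python's `struct` module. This is a minimal illustration, not a full ICP implementation: the field widths are assumed from RFC 2186, the function names `build_message` and `parse_message` are hypothetical, and the payload is simplified to a null-terminated URL (a real ICP query also carries a requester address before the URL).

```python
import struct

# Assumed layout: opcode (1 byte), version (1 byte), message length (2 bytes),
# then request number, options, option data and sender address (4 bytes each).
ICP_HEADER = struct.Struct("!BBHIIII")  # network byte order, 20 octets
ICP_OP_QUERY = 1
MAX_ICP_MESSAGE = 16 * 1024  # maximum message length stated on the slide


def build_message(opcode, request_number, url, version=2, sender=0):
    """Pack a 20-octet header followed by the URL as payload (simplified)."""
    payload = url.encode("ascii") + b"\x00"
    length = ICP_HEADER.size + len(payload)
    if length > MAX_ICP_MESSAGE:
        raise ValueError("ICP message would exceed 16 Kbytes")
    header = ICP_HEADER.pack(opcode, version, length, request_number, 0, 0, sender)
    return header + payload


def parse_message(data):
    """Split a message into its header fields and the URL in the payload."""
    opcode, version, length, request_number, options, option_data, sender = (
        ICP_HEADER.unpack_from(data)
    )
    url = data[ICP_HEADER.size:length].rstrip(b"\x00").decode("ascii")
    return {
        "opcode": opcode,
        "version": version,
        "length": length,
        "request_number": request_number,
        "url": url,
    }
```

Because the header is binary and of fixed size, a receiver can read the message length from the first four octets before touching the payload.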

8

Message Specification

A cache sends an ICP_QUERY (Opcode = 1) to all its neighbours to collect information about a particular document.

A cache that receives the query extracts the URL of the document from the payload and sends an ICP response message (Opcode = 2 for a hit, 3 for a miss).

The cache that generated the query collects all the responses and selects the best one, to which it sends an HTTP request to retrieve the document.

There are two kinds of hit-response message:

ICP_OP_HIT

ICP_OP_HIT_OBJ
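The query and response exchange can be sketched as follows. This is a simplified in-memory model, assuming each neighbour's cache is represented by a plain Python set of URLs; the helper names `answer_query` and `query_neighbours` are hypothetical, and no network traffic is involved.

```python
ICP_OP_QUERY = 1
ICP_OP_HIT = 2
ICP_OP_MISS = 3


def answer_query(opcode, url, local_cache):
    """Return the response opcode a neighbour would send for a query."""
    if opcode != ICP_OP_QUERY:
        raise ValueError("only query messages are answered in this sketch")
    return ICP_OP_HIT if url in local_cache else ICP_OP_MISS


def query_neighbours(url, neighbours):
    """Ask every neighbour and return the names of those that replied HIT.

    `neighbours` maps a neighbour name to the set of URLs it caches.
    """
    return [
        name
        for name, cache in neighbours.items()
        if answer_query(ICP_OP_QUERY, url, cache) == ICP_OP_HIT
    ]
```

The querying cache would then pick one of the returned neighbours and fetch the document from it over HTTP.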

9

Peer Selection:

The best peer from which to retrieve the document can be chosen by selection algorithms based on the following parameters:

RTT measurement: measures the congestion between two nodes.

It varies with time.

Hop count: a constant measure.
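One possible selection rule based on these two parameters is sketched below. The policy (lowest RTT first, hop count as a tie-breaker) is an assumption chosen for illustration, and the `PeerReply` type and `select_peer` function are hypothetical names, not part of any protocol.

```python
from typing import Iterable, NamedTuple, Optional


class PeerReply(NamedTuple):
    peer: str      # name of the neighbour that answered
    hit: bool      # True if it answered with a hit
    rtt_ms: float  # measured round-trip time, varies over time
    hops: int      # hop count to the neighbour, constant


def select_peer(replies: Iterable[PeerReply]) -> Optional[str]:
    """Pick the hit with the lowest RTT, breaking ties by hop count."""
    hits = [r for r in replies if r.hit]
    if not hits:
        return None  # no neighbour has the document
    best = min(hits, key=lambda r: (r.rtt_ms, r.hops))
    return best.peer
```

Since RTT changes with congestion, the same set of neighbours can yield a different choice from one request to the next, whereas a rule based only on hop count would always pick the same peer.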

10

Comparison between the ICP message format and the HTTP message

Message length

ICP: fixed header size of 20 octets, in binary representation.

HTTP: header size is arbitrary, in ASCII text.

Functions

ICP:

- Is an object location protocol. Is a simple query-response message.

- Doesn't support features useful for caching.

- Uses UDP as transport protocol: unreliable.

HTTP:

- More generic message; contains information about the other caches.

- Supports features for caching, such as If-Modified-Since and Pragma.

- Uses TCP as transport protocol: reliable.

11

2. Cache Digest

Cache Digest provides a mechanism for communication among web caches.

The digest contains a list of the URLs of the documents stored in the cache.

Digest construction: the URLs of the documents stored in the cache are indexed in the digest by keys (sets of bits) stored in a Bloom filter.

The keys are extracted from the URL by a number of hash functions that determine which bits must be turned on and which must be turned off.

A bit is turned on if its state changes from 0 to 1.

A bit is turned off if its state changes from 1 to 0.

12

Bloom filter:

A hash-coding method proposed by Burton H. Bloom in 1970.

It is based on the idea of reducing the hash area size by allowing a small number of tests to be falsely identified, without increasing the reject time.

Reject time: the time needed to determine that an element does not belong to the set of elements stored in the hash area.

The hash area is organised into N cells with N distinct keys 0…N-1; the documents are encoded in these N bits.

Initially all the cells are empty, that is, all the bits are set to 0. To insert an element, a set of hash addresses a1…ad is generated and the bits at all of these addresses are set to 1.

To search for an element, a set of hash addresses is generated in the same way. If the bits at all of these addresses are 1, the element is accepted; if any of them is 0, the element is rejected.
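The insert and search procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the d hash addresses are derived from SHA-256 with a per-function prefix (a convenient choice for the sketch, not the method on the slides), one byte is used per cell for readability, and the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits            # N cells, addressed 0 .. N-1
        self.n_hashes = n_hashes        # d hash addresses per element
        self.bits = bytearray(n_bits)   # initially all cells are 0

    def _addresses(self, element):
        """Generate the hash addresses a1 .. ad for an element."""
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{element}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def insert(self, element):
        """Set the bit at every hash address to 1."""
        for a in self._addresses(element):
            self.bits[a] = 1

    def accepts(self, element):
        """Accept only if every hash address holds a 1; any 0 rejects."""
        return all(self.bits[a] for a in self._addresses(element))
```

An inserted element is always accepted, so the filter never produces a wrong rejection; an element that was never inserted can still be accepted if other insertions happened to set all of its addresses.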

13

The calculation of the public keys

The URL is transformed by MD5 into a public key (128 bits), which is composed of two parts: a numeric part (1-7 bits) and a second part that represents the transformation of the URL.

The hash function then assigns to each key an index extracted from the URL by performing the following computation:

1. Splitting the 128 bits into N parts.

2. Finding the index for each part by calculating the value of that part modulo the digest size, where

the digest size = (the number of bits per entry + the public keys) × cache capacity.

3. Combining the indices of each part to compose the index of the corresponding public key.
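The three steps above can be sketched as follows. This is an illustration under assumptions: the key is taken to be the MD5 hash of the URL alone (the numeric part mentioned above is left out), the digest size is passed in as a number of bits rather than computed, and `key_indices` is a hypothetical name.

```python
import hashlib


def key_indices(url, digest_size_bits, n_parts=4):
    """Split the 128-bit MD5 key into n_parts and reduce each modulo the digest size."""
    key = hashlib.md5(url.encode("utf-8")).digest()  # 16 bytes = 128 bits
    if len(key) % n_parts != 0:
        raise ValueError("n_parts must divide the 16-byte key evenly")
    part_len = len(key) // n_parts
    # Step 1: split the key; step 2: take each part modulo the digest size;
    # step 3: return the indices together as the index set of this key.
    return tuple(
        int.from_bytes(key[i:i + part_len], "big") % digest_size_bits
        for i in range(0, len(key), part_len)
    )
```

Each returned index identifies one bit of the digest to set when the URL is added, or to test when the URL is looked up.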

14

Digest Accuracy:

The calculation of the public keys allows some possibility of error. There are two kinds of errors:

1. False miss

2. False hit
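For the false-hit case, a rough estimate can be computed with the standard Bloom filter approximation p ≈ (1 - e^(-kn/m))^k, where m is the number of bits, n the number of stored elements and k the number of hash functions. This formula is general Bloom filter theory rather than something stated on the slides, and it covers only false hits caused by bit collisions, not errors caused by an out-of-date digest. The function name is hypothetical.

```python
import math


def false_hit_probability(n_bits, n_elements, n_hashes):
    """Approximate probability that a non-member has all its bits set."""
    fill = 1.0 - math.exp(-n_hashes * n_elements / n_bits)  # fraction of bits set
    return fill ** n_hashes
```

For a fixed digest size, the estimate grows as more documents are stored, which is why the digest size has to be chosen in proportion to the cache capacity.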

15

Digest Requirement:

- The digest is a large data structure: 200MB-2MB is needed to store all the URLs of the documents stored in the cache.

- Two copies of the digest are needed: one stored on disk and the other in memory for fast updates.

How does it work?

- Each cache exchanges its own digest with its neighbours.

- The cache digest message is composed of a fixed 128-byte header in binary representation, which contains the digest specifications, followed by the entire digest.

- When a miss occurs in the local cache, the cache searches the digests of its neighbours.

- After a local miss, the cache sends an HTTP request to retrieve the document from the appropriate location.
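The lookup order described above can be sketched as a small decision function. This is a simplified model: the digests are represented by plain Python sets standing in for Bloom filters, falling back to the origin server when no digest matches is an assumption, and `choose_source` is a hypothetical name.

```python
def choose_source(url, local_cache, neighbour_digests):
    """Decide where to fetch a document from.

    `local_cache` is the set of URLs held locally; `neighbour_digests`
    maps a neighbour name to the digest received from that neighbour.
    """
    if url in local_cache:
        return "local"
    # Local miss: consult the digests previously received from neighbours.
    for name, digest in neighbour_digests.items():
        if url in digest:
            return name  # send the HTTP request to this neighbour
    return "origin"      # no digest matched; go to the origin server
```

No message is exchanged at lookup time: the only traffic before the HTTP request is the earlier transfer of the digests themselves.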

16

Conclusion

Cache Digest eliminates the ICP query-response messages used to collect information about the requested document, but it requires a lot of memory to store the digest, and it transfers a large quantity of information over the network, in proportion to the size of the digest.

17

3. Summary Cache

It was proposed by Pei Cao and a group of students to reduce the internal traffic created by ICP queries.

Each proxy keeps a summary of the URLs of the documents stored in each participating proxy.

It scales well, because it can employ a large number of proxies to reduce web traffic.

Two main factors influence scalability:

1. Update delay

2. Memory requirement

18

Update delay: the summary is updated periodically, or once a given threshold of documents is not reflected in the summary.

Memory requirement: depends on the way the summary is represented. The summary can be represented in the following ways:

Exact directory: requires a lot of memory; for 100 proxies with 8GB of cache and 1 million documents with an average URL length of 50 bytes, the space needed to represent the summary is 2MB.

Server name: reduces the summary size but increases the possibility of error.

Bloom filter: proposed by Pei Cao to reduce the memory requirement of the summary. The documents are stored in the filter in the same way as in Cache Digest, with a difference in the calculation of the index, where the hash function performs the following computation:

1. The 128 bits are divided into four 32-bit words, and from each word an index is extracted by taking it modulo the summary size.

2. Each proxy maintains a counter C(l) for each location l.
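The two steps above can be sketched as a counting filter. This is an illustration under assumptions: the 128 bits are taken to be the MD5 hash of the URL, the counter C(l) is assumed to be incremented when a document is added and decremented when it is removed (the usual counting Bloom filter behaviour), and the class and method names are hypothetical.

```python
import hashlib
import struct


class CountingSummary:
    def __init__(self, size):
        self.size = size
        self.counters = [0] * size  # C(l) for each location l

    def _locations(self, url):
        """Divide the 128-bit MD5 value into four 32-bit words, each modulo the size."""
        words = struct.unpack("!4I", hashlib.md5(url.encode("utf-8")).digest())
        return [w % self.size for w in words]

    def add(self, url):
        for l in self._locations(url):
            self.counters[l] += 1

    def remove(self, url):
        for l in self._locations(url):
            if self.counters[l] > 0:
                self.counters[l] -= 1

    def might_contain(self, url):
        """True if every location for this URL has a non-zero counter."""
        return all(self.counters[l] > 0 for l in self._locations(url))
```

The counters let a proxy remove a document without clearing a location that another document still relies on; only whether each counter is zero or non-zero would need to be sent to the other proxies.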

19

There are three kinds of errors:

False hit

False miss

Remote stale hit

20

Comparison between the summary representation methods and ICP

Number of messages

ICP: generates more messages than Bloom filter and exact directory.

Server names: generates the same quantity of messages as ICP.

Message size

Bloom filter: 32 bytes for header + 4 bytes for updates.

ICP: 20 bytes for header + 50 bytes for average URL.

Server names: 20 bytes for header + 16 bytes for change.

Exact directory: 20 bytes for header + 16 bytes for change.

22

4. Protocols comparison

Comparison of the three previous protocols in terms of network traffic

Number of exchanged messages

ICP: generates more traffic. The number of messages exchanged depends on the number of neighbours.

Cache Digest: generates fewer messages than ICP.

Summary Cache: reduces the number of exchanged messages by 25-60% compared with ICP.

Length of messages

ICP: fixed header size of 20 octets + 50 bytes average URL length. No control over the amount of information transferred. The traffic generated depends on the number of requests.

Cache Digest: fixed header of 128 bytes + the digest size. A large quantity of information is transferred. The information can be controlled. The traffic generated depends on the digest size.

Summary Cache: 20 bytes for header + 16 bytes for updating.

Transport protocol

ICP: UDP.

Cache Digest: no requirement for a specific transport protocol.

Summary Cache: uses UDP to exchange updates.

Reliability

ICP: unreliable.

Cache Digest: uses TCP for reliability.

Summary Cache: …
