Http Caching Proxy Server
Uploaded by thanga-raj
HTTP CACHING PROXY SERVER
Synopsis
The objective of this project is to implement a caching mechanism in a web server in order to minimize page seek latency, thereby enabling faster page download from the server to the client.
What is a web cache?
In their simplest form, web caches store temporary copies of web objects. They are
designed primarily to improve the accessibility and availability of this type of data to
end users. Caching is not an alternative to increased connectivity, but instead
optimises the usage of available bandwidth.
How will a cache benefit us?
Caching minimises the number of times an identical web object is transferred from its
host server by retaining copies of requested objects in a repository or cache. Requests
for previously cached objects result in the cached copy of the object being returned to
the user from the local repository rather than from the host server. This results in little
or no extra network traffic over the external link and increases the speed of delivery.
Caches are limited by the amount of disk space – when a cache is full, older objects are
removed and replaced with newer content. Some systems may implement 'persistency'
measures, however, to preserve certain types of content at the discretion of the
administrator.
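The behaviour described above can be illustrated with a minimal sketch. It assumes an in-memory store measured in bytes and evicts the oldest stored object when space runs out; the class name `SimpleWebCache` and the `fetch_from_origin` helper are hypothetical and stand in for a real cache and a real HTTP request to the host server.

```python
from collections import OrderedDict


class SimpleWebCache:
    """Bounded cache that evicts the oldest stored object when space runs out."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.objects = OrderedDict()  # URL -> body, oldest entry first

    def get(self, url, fetch_from_origin):
        """Return (body, served_from_cache) for the requested URL."""
        if url in self.objects:
            return self.objects[url], True
        body = fetch_from_origin(url)  # only a miss uses the external link
        self._store(url, body)
        return body, False

    def _store(self, url, body):
        if len(body) > self.capacity_bytes:
            return  # too large to cache at all
        while self.used_bytes + len(body) > self.capacity_bytes:
            _, evicted = self.objects.popitem(last=False)  # drop the oldest
            self.used_bytes -= len(evicted)
        self.objects[url] = body
        self.used_bytes += len(body)


origin_requests = []


def fetch_from_origin(url):
    """Hypothetical stand-in for a real request to the host server."""
    origin_requests.append(url)
    return b"x" * 40


cache = SimpleWebCache(capacity_bytes=100)
first = cache.get("/a.html", fetch_from_origin)   # miss: goes to the origin
second = cache.get("/a.html", fetch_from_origin)  # hit: served locally
cache.get("/b.html", fetch_from_origin)
cache.get("/c.html", fetch_from_origin)           # cache full: /a.html is evicted
```

Only the first request for each object reaches the origin server; the repeated request for `/a.html` causes no external traffic.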
Where are web caches used?
Caches may be installed in different locations on networks for a variety of reasons:
• Local caches are the most common type; they sit on the edge of the LAN just
before the Internet connection. All outbound web requests are directed through
them in an effort to fulfil web requests locally before passing traffic over the
Internet connection.
• ISP caches are used on the networks of most Internet Service Providers
(ISPs). They provide customers with improved performance and conserve
bandwidth on their own external connections to the Internet.
• Reverse caches are used to reduce the workload of content providers' web
servers. The cache is positioned between the web server and its Internet
connection, so that when a remote user requests a web page, the request must
first pass through the cache before reaching the web server. If the cache has a
stored copy of the requested item, it delivers it directly rather than passing the
request through to the web server.
What are the advantages of caching?
• Fast performance on cached content: if content is already in the cache it is
returned more quickly, even for multiple users wanting to access the same
content.
• Improved user perception and productivity: quicker delivery of content
means less waiting time and increased user satisfaction with the performance of
the system.
• Less bandwidth used: if content is cached locally on the LAN, web requests do
not consume Internet connection bandwidth.
• User monitoring and logging: if a cache manages all web requests
(behaving in some ways like a proxy), a centralised log can be kept of all user
access. Care must be taken that any information held is in accordance with
appropriate privacy regulations and the institution's policy.
• Caching benefits both the single end user and the content providers:
ISPs and other users of the same infrastructure all benefit greatly from the
reduction in bandwidth usage.
How does a cache differ from a proxy?
A cache server is not the same as a proxy server. Cache servers have a proxy function
with regard to requests for certain content from the World Wide Web. When a client
passes all their requests for web objects via a cache, this cache is effectively acting as a
proxy server. Caching is a common function of proxy servers. Proxy servers perform a
number of other functions, too, mainly centred on security and administrative control.
Broadly speaking, a proxy server sits between a number of clients and the Internet. Any
requests made to the Internet from a LAN computer are forwarded to the proxy server
which will then make the requests itself.
The key differences between a proxy and a cache are:
1. A proxy server will handle more requests than just those for web content.
2. A proxy server does not by default cache any data that passes through it.
3. There are certain security benefits, because proxy servers hide the
other computers on the network from the Internet, which makes it difficult for
individual machines to be targeted for attack.
4. The requirement for 'public' IP addresses is also removed, so that any number of
computers can share one public address that is configured on the proxy rather
than each computer needing a unique IP address.
The response time of a WWW service often plays an important role in its success
or demise. From a user's perspective, the response time is the time elapsed from
when a request is initiated at a client to the time that the response is fully loaded
by the client. This paper presents a framework for accurately measuring the
client-perceived response time in a WWW service. Our framework provides
feedback to the service provider and eliminates the uncertainties that are
common in existing methods. This feedback can be used to determine whether
performance expectations are met, and whether additional resources (e.g. more
powerful server or better network connection) are needed. The framework can
also be used when a consolidator provides Web hosting service, in which case
the framework provides quantitative measures to verify the consolidator's
compliance with a specified Service Level Agreement. Our approach assumes the
existing infrastructure of the Internet with its current technologies and
protocols. No modification is necessary to existing browsers or servers, and we
accommodate intermediate proxies that cache documents. The only requirement
is to instrument the documents to be measured, which can be done automatically
using a tool we provide.
The number of servers and the amount of information available on the World Wide Web
have been growing exponentially in the last five years. The use of the World Wide Web as an
information retrieval mechanism has also become popular. As a consequence, popular
Web servers have been receiving an increasing number of requests. Some servers
receive up to 100 million requests daily which results in more than one request per
millisecond on average. Thus, in order for a Web server to be able to respond at such a
rate, it should reduce the overhead of handling the requests to a minimum.
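The request rate quoted above can be checked with a short calculation. The per-request time budget in the second step assumes, purely for illustration, a single server handling requests one at a time.

```python
requests_per_day = 100_000_000
milliseconds_per_day = 24 * 60 * 60 * 1000  # 86,400,000 ms in a day

# Average arrival rate implied by 100 million requests per day.
requests_per_ms = requests_per_day / milliseconds_per_day

# Time available per request if they were handled strictly one after another.
time_budget_us = 1000 / requests_per_ms  # microseconds per request

print(f"{requests_per_ms:.2f} requests per millisecond")
print(f"{time_budget_us:.0f} microseconds per request")
```

This gives about 1.16 requests per millisecond, or roughly 864 microseconds per request, which is far less than a typical magnetic disk access takes.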
Currently, the greatest fraction of server latency for document requests (excluding the
execution of CGI-scripts) comes from disk accesses. When there is a request for a
document at a Web server, the server makes one or more file system calls to open, read
and close the requested file. These file system calls result in disk accesses, and when
the file is not on the local disk, file transfers through a network.
Hence, it is interesting to cache the files in the main memory so as to reduce access to
the local and remote disks. Indeed, RAM is much faster (by several orders of magnitude)
than magnetic disks. Such an idea has already been used in some software (for example,
the Harvest httpd accelerator). Such a caching mechanism has been called main memory caching
or document caching. In this project we shall refer to it as server caching. Server caching
might appear to have less impact on the quality of Web applications than the client
caching (or proxy caching), which aims at reducing network delay by caching remote
files. This indeed seems to be true in traditional networks where the retrieval time of a
document is dominated by transfer time due to the low-bandwidth interconnections.
However, even in such a situation, a significant portion of requests of a Web server may
be from local users at academic institutions or large companies. These clients are
typically connected to the server through high bandwidth LANs (e.g. FDDI or ATM) so
that the retrieval time is likely to be dominated by the server's latency. In the near
future, with the deployment of ATM WANs, the information retrieval time is also likely
to be dominated by the latency at the server.
While client caching is characterized by relatively low hit rates (varying from 20% to
40%), server caching can achieve very high hit rates because of a
particularity of Web traffic: a small number of documents on a Web server
accounts for most client requests. Analysis of request traces from
several Web servers has shown that even a small amount of main memory (512 Kbytes) can hold a
significant portion (60%) of the documents requested.
In the existing system there is no dedicated cache at the server; most caches are
maintained at the proxies themselves. Moreover, where a server cache is present, the
existing system uses the Least Recently Used (LRU) page replacement algorithm, whose
throughput is low compared with our caching policy.
SYSTEM DESIGN
FLOW CHART FOR PROPOSED SYSTEM
1. Configure the web server.
2. Handle the client request and identify it.
3. Process the request to get the exact page required by the client.
4. Check for the page in the cache.
5. If the page is present in the cache, increment its hit counter and get the time stamp of the page in the cache. Is the time stamp at the cache <= the time stamp of the page at the server?
   Yes: fetch the page from the server and cache it for future use.
   No: fetch the page from the cache.
6. If the page is not found in the cache, check for the page in the server.
   Page not present: flash "Page not found".
   Page present: fetch the page from the server and continue with step 7.
7. Is the cache full?
   No: place the page from the server in the cache for future use.
   Yes: get the hit counts of all pages visited and select the least hit page. Is there more than one such page?
      Yes: get the sizes of all these pages, select the page that is largest in size, and replace it with the page from the server.
      No: place the fetched page from the server in the cache in place of the least hit page.
8. Dispatch the requested page to the client.
9. Calculate the total turnaround time.
10. Calculate the cache penalty.
11. Listen for the next request.
IMPLEMENTATION DETAILS
WHAT HAS BEEN DONE:
Tomcat 4.1 is the Apache Jakarta project's web server, which implements the Servlet 2.3 and
JavaServer Pages 1.2 specifications. It provides a platform for developing and deploying web
applications and web services. It is used as the web server in our project and is responsible
for handling all transactions between the client and the server. The cache that we
designed inside the server can also be viewed as a middleman between the client and
the server. Its role is comparable to that of the high-speed cache in a memory system.
Client Request Processing:
In this phase we configure the servlet to handle client requests. Tasks such as
sending pages from the server to the client using input and output streams are implemented.
Pages can also contain images. The next step determines the file size dynamically, which
gives us the file size details, an important criterion for the later phases. After
that, the mapping of URLs is implemented along with URL navigation. The
page name is obtained from the client request. The content type of the response is set in the
response header and the objects for the input and output streams are created. The page
is first located and then fetched either from the cache or from the server (explained in
the next phase) and dispatched to the client.
Implementing the Processing Logic:
Processing logic is the main phase of the project, where the caching algorithm is
implemented. In the processing logic the following tasks are performed. First, the time
stamp of the page in the cache (main memory) is checked against that of its original copy on the
server (secondary memory). If the page is not found in the cache, it is fetched
from the server's document root, if it exists there, and is stored in the cache. If the cache
lacks space for the new page, a page replacement policy selects a cached page to evict so
that the new page can be accommodated within the same cache space.
Even if the page is found in the cache, the following steps are executed before
delivering the page to the client, to ensure cache consistency:
• Check the time stamp
• Increment the hit count
The page is then written to a stream and sent to the client.
If the page is absent from the cache, we check for the page on the server itself. When
the search is successful, the page is dispatched to the client and a copy is put in the
cache for future use. If the search is unsuccessful, a "Page not found" message is
sent to the client.
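The processing logic described above can be sketched as follows. This is a simplified illustration, not the project's servlet code: the names `CachedPage` and `handle_request` are hypothetical, a dictionary stands in for the server's document root, and page replacement for a full cache is left out. The sketch also assumes that a cached copy is stale only when the server copy is strictly newer.

```python
from dataclasses import dataclass


@dataclass
class CachedPage:
    content: bytes
    modified: float  # modification time of the server copy when it was cached
    hits: int = 0


def handle_request(name, cache, server_pages):
    """Return (status, body, source) for one page request.

    cache maps page name -> CachedPage; server_pages maps
    page name -> (content, modified) and stands in for the document root.
    """
    on_server = server_pages.get(name)
    entry = cache.get(name)

    if entry is not None and on_server is not None:
        content, modified = on_server
        if entry.modified < modified:   # assumption: stale only if server is newer
            entry.content, entry.modified = content, modified
            source = "server"
        else:
            source = "cache"
        entry.hits += 1                 # count the cache lookup as a hit
        return 200, entry.content, source

    if on_server is None:
        cache.pop(name, None)           # drop any copy of a page that no longer exists
        return 404, b"Page not found", "none"

    content, modified = on_server       # miss: fetch, cache, and dispatch
    cache[name] = CachedPage(content, modified)
    return 200, content, "server"


server_pages = {"Index.html": (b"<html>v1</html>", 100.0)}
cache = {}
r1 = handle_request("Index.html", cache, server_pages)    # miss
r2 = handle_request("Index.html", cache, server_pages)    # hit
server_pages["Index.html"] = (b"<html>v2</html>", 200.0)  # page edited on server
r3 = handle_request("Index.html", cache, server_pages)    # stale copy refreshed
r4 = handle_request("Missing.html", cache, server_pages)  # not on the server
```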
CACHE DESIGN:
In their simplest form, web caches store temporary copies of web objects. They
are designed primarily to improve the accessibility and availability of this type of data to
end users. Caching is not an alternative to increased connectivity, but instead
optimizes the usage of available bandwidth. After the initial access/download, users can
access a single locally stored copy of the content rather than repeatedly requesting the
same content from the origin server.
Here we have constructed a cache of predefined size to hold web pages that fit within
this size. The cache is volatile: it exists only as long as the server
runs. When the server goes down and is started again, a new, empty cache is created, and
it is populated again based on user requests.
The cache is divided into two parts, the key and the value. The
key acts as the index to the value part. The diagram below illustrates the cache.
Key (Page name)    Value (Page contents)
Index.html         0100000011110101110101 ...
Page2.html         1011100101110101010010 ...
Home.html          1010111101010001010111 ...
The binary data of the page is packed into a user-defined object, which also contains
the following:
• Size of the page
• Time when the file was last modified
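A minimal sketch of such a key/value cache is shown below. The names `PageEntry` and `load_entry` are hypothetical, and a temporary directory stands in for the server's document root so that the example is self-contained.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class PageEntry:
    content: bytes        # binary data of the page
    size: int             # size of the page in bytes
    last_modified: float  # modification time of the file on the server


def load_entry(path):
    """Read a page from disk and wrap it with its size and time stamp."""
    with open(path, "rb") as f:
        content = f.read()
    return PageEntry(content, len(content), os.path.getmtime(path))


# The cache itself: page name (key) -> PageEntry (value).
cache = {}

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "Index.html")
    with open(path, "wb") as f:
        f.write(b"<html>Hello</html>")
    cache["Index.html"] = load_entry(path)
    mtime_on_disk = os.path.getmtime(path)
```

Storing the size with each entry avoids recomputing it during replacement, and the stored modification time is what a later consistency check compares against the file on the server.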
CACHING POLICY:
In this system we present a cache for static pages, and we apply the Least
Frequently Used (LFU) page replacement algorithm as our caching policy. Every page
requested by a user passes through the cache. The cache first checks whether the page is
present. If the page is found, its time stamp is compared with that of the
same page on the server, and depending on the result the page is fetched either from
the server or from the cache, so that the more recent copy is always served. This check
detects whether the page contents have been modified on the server; if so, there is no point in
fetching the page from the cache, as it holds a stale copy. In this way,
whatever page the user gets from the cache is the same copy that is present on the server.
Because the cache size is fixed, we also need a replacement policy
to keep the most useful pages in the cache. We use the LFU algorithm for page
replacement. It suits this environment because a cache already tracks hits and
misses, and LFU bases its eviction decisions on hit counts. The
proposed system has shown a consistently low cache penalty and a consistently low number
of misses. We have also measured the difference in time between serving a page from the
cache and serving the same page from the server.
LFU Algorithm
• Algorithm: Least Frequently Used
• The least frequently used documents are removed first.
• Advantages: simplicity. When the goal is to reduce latency so that client requests
are served quickly, LFU is a suitable choice.
LFU algorithm: when the free space in the cache is smaller than S, the size of the incoming
document, repeat the following until the free cache space is at least S: remove the least
frequently used document.
LFU algorithm
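LFU(req_file RF, size_of_RF) {
    // a document larger than the whole cache can never be cached
    if (size_of_RF > total_cache_size)
        return 'document not cached' status
    // evict until the requested file fits
    while (free_space < size_of_RF) {
        locate the document with the smallest use count
        remove that document from the cache
        update free_space
    }
    // the loop has ended, so there is now enough room
    add RF to cache, update free_space
    return 'document cached' status
}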
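A runnable version of the same idea might look like the sketch below. It is an illustration under stated assumptions, not the project's implementation: the class name `LFUCache` is hypothetical, sizes are measured as the byte length of the content, use counts start at zero, and ties between equally used documents are broken arbitrarily.

```python
class LFUCache:
    """Size-bounded cache that evicts the least frequently used document."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.free_space = capacity
        self.docs = {}  # name -> (content, use_count)

    def get(self, name):
        """Return the cached content and count the use, or None on a miss."""
        if name not in self.docs:
            return None
        content, count = self.docs[name]
        self.docs[name] = (content, count + 1)
        return content

    def put(self, name, content):
        """Cache a document, evicting LFU documents until it fits."""
        size = len(content)
        if size > self.capacity:
            return False  # can never fit, so do not empty the cache for it
        if name in self.docs:  # replacing an existing copy frees its space
            self.free_space += len(self.docs.pop(name)[0])
        while self.free_space < size:
            # Locate the document with the smallest use count and remove it.
            victim = min(self.docs, key=lambda n: self.docs[n][1])
            self.free_space += len(self.docs.pop(victim)[0])
        self.docs[name] = (content, 0)
        self.free_space -= size
        return True


cache = LFUCache(capacity=10)
cache.put("a", b"aaaa")
cache.put("b", b"bbbb")
cache.get("a")
cache.get("a")
cache.get("b")                      # use counts: a = 2, b = 1
cached = cache.put("c", b"cccc")    # needs 4 bytes, only 2 free: b is evicted
too_big = cache.put("huge", b"x" * 11)
```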
MONITORING HITS:
KEY (PAGE NAME)    VALUE (HITS)
Index.html         22
Page2.html         10
Home.html          37
Results.html       500
Admit.html         61
News.html          7
The hit counter decides the fate of a page in the cache. The more hits a page receives,
the longer it is likely to stay in the cache: a page with more hits is less likely to be
evicted than a page with fewer hits.
Whenever the cache is full and a new page is to be placed in it, the following steps
are taken.
The hit counts of all the pages are obtained. The page with the lowest hit count is selected
and replaced with the incoming page. If two or more pages share the lowest hit count,
the largest of them is replaced.
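The selection rule can be sketched in a few lines, using the hit counts from the table above. The function name `select_victim` is hypothetical, and the page sizes are invented for illustration because the document does not list them.

```python
def select_victim(hits, sizes):
    """Pick the page to evict: fewest hits, and the largest page on a tie."""
    return min(hits, key=lambda name: (hits[name], -sizes[name]))


hits = {"Index.html": 22, "Page2.html": 10, "Home.html": 37,
        "Results.html": 500, "Admit.html": 61, "News.html": 7}
# Page sizes in bytes are hypothetical; the document does not give them.
sizes = {"Index.html": 2048, "Page2.html": 4096, "Home.html": 1024,
         "Results.html": 8192, "Admit.html": 512, "News.html": 3072}

victim = select_victim(hits, sizes)

# Tie case: two pages share the lowest hit count, so the larger one is evicted.
tied_hits = {"Page2.html": 7, "News.html": 7, "Home.html": 37}
tied_victim = select_victim(tied_hits, sizes)
```

With the table's hit counts, `News.html` (7 hits) is selected. In the tie case, `Page2.html` is selected because it is the larger of the two least-hit pages.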
SCREEN SHOTS:
[Screenshot: INPUT]
[Screenshot: OUTPUT]
[Screenshot: PAGE FETCHED FROM SERVER]
[Screenshot: PAGE FETCHED FROM CACHE]
[Screenshot: CACHE STATISTICS]