Machine learning in Web proxy caching

Machine Learning Approach in Web Proxy Cache Replacement

Sivaraj Nimishan

2011/CSC/016

Supervisor: Sriskandarajah Shriparen

Web Proxy Caching
• Web proxy caching is a solution for improving the performance of Web-based systems.

Cache Replacement
• In proxy cache replacement, the proxy cache must decide effectively which objects are worth caching and which should be replaced by other objects.

• LRU: the least recently used objects are removed first.
• LFU: the least frequently used objects are removed first.
• LFU-DA: a dynamic aging factor is incorporated into LFU.
• GDSF: size, cost of fetching, and a dynamic aging factor are integrated with frequency.
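The eviction rules listed above can be illustrated with a minimal Python sketch. It is not Squid's implementation: the class name LRUCache, the capacity counted in objects, and the helper gdsf_priority are all assumptions made for illustration. The GDSF key uses the commonly cited form, aging factor plus frequency times cost divided by size, so small and popular objects receive higher keys and are evicted later.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: evicts the least recently used object when full."""

    def __init__(self, capacity):
        self.capacity = capacity      # maximum number of objects (assumed unit)
        self.store = OrderedDict()    # oldest entry first, newest last

    def get(self, url):
        if url not in self.store:
            return None               # cache miss
        self.store.move_to_end(url)   # mark as most recently used
        return self.store[url]

    def put(self, url, body):
        if url in self.store:
            self.store.move_to_end(url)
        self.store[url] = body
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # drop the least recently used


def gdsf_priority(aging, frequency, cost, size):
    """GDSF key: the object with the lowest key is evicted first."""
    return aging + frequency * cost / size


cache = LRUCache(capacity=2)
cache.put("/a", "A")
cache.put("/b", "B")
cache.get("/a")          # "/a" becomes the most recently used
cache.put("/c", "C")     # cache is full, so "/b" is evicted
```

With equal frequency and cost, gdsf_priority returns a larger key for the smaller object, which matches the idea of keeping smaller popular objects in cache.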

Squid
• LRU: the LRU policy keeps recently referenced objects.
• heap GDSF: the heap GDSF policy optimizes object hit rate by keeping smaller popular objects in cache.
• heap LFUDA: the heap LFUDA policy keeps popular objects in cache regardless of their size.
• heap LRU: the LRU policy implemented using a heap.

Squid log format
• timestamp
• response time
• client address
• status codes
• size
• request method
• URL
• client identity
• hierarchy code
• content type
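As an illustration of the ten fields above, the following Python sketch splits one access.log line into named values. It assumes the fields are whitespace-separated and appear in the order listed; the names LogEntry and parse_line are hypothetical. The first seven sample values come from the example entry in the Preprocessing part of this deck, while the last three (client identity, hierarchy code, content type) are placeholder values added only to complete the line.

```python
from collections import namedtuple

LogEntry = namedtuple(
    "LogEntry",
    "timestamp response_time client result_code status size method url "
    "identity hierarchy content_type",
)


def parse_line(line):
    """Split one whitespace-separated access.log line into named fields."""
    parts = line.split()
    if len(parts) != 10:
        raise ValueError("expected 10 fields, got %d" % len(parts))
    ts, elapsed, client, code_status, size, method, url, ident, hier, ctype = parts
    # The status field combines the Squid result code and the HTTP status.
    result_code, _, status = code_status.partition("/")
    return LogEntry(float(ts), int(elapsed), client, result_code, int(status),
                    int(size), method, url, ident, hier, ctype)


# Last three fields are hypothetical fillers, not taken from the deck.
sample = ("1335442301 379 127.0.0.1 TCP_MISS/200 1609 GET "
          "http://www.opencalais.com/robots.txt - DIRECT/192.0.2.1 text/plain")
entry = parse_line(sample)
```

Real log lines usually carry fractional seconds in the timestamp; float() accepts both forms.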

Machine Learning
• Support Vector Machine
• Decision tree

Data collection
Billion Triples Challenge 2012 Dataset
• The dataset was crawled during May/June 2012. Several seed sets were collected from multiple sources.
• Datahub: a data ecosystem for individuals, teams and people.
• DBpedia: a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web.
• Freebase: a community-curated database of well-known people, places, and things.
• Rest: the seed set for the Rest crawl contained all other URIs involved in a relation in the DBpedia
• Timbl: the Timbl crawl consisted of Tim Berners-Lee's Friend of a Friend (FOAF) project (2 files).

Preprocessing
Data Set   Size       From                         To
Datahub    136.8 MB   [Thu Apr 26 20:07:13 2012]   [Fri Apr 27 16:20:16 2012]
DBpedia    170.3 MB   [Tue May 1 07:46:29 2012]    [Fri Apr 27 21:19:02 2012]
Freebase   123.6 MB   [Fri Apr 27 07:18:03 2012]   [Mon Apr 30 12:31:49 2012]
Rest       32 MB      [Mon Apr 30 13:34:06 2012]   [Mon Apr 30 18:46:04 2012]
Timbl 1    138.5 MB   [Sat May 5 21:05:02 2012]    [Tue May 8 07:50:56 2012]
Timbl 2    179.5 MB   [Tue May 15 20:29:22 2012]   [Wed May 23 04:53:27 2012]

Data Set   Requests   Cacheable requests   %
Datahub    398547     181850               45.63 %
DBpedia    1382090    537038               38.86 %
Freebase   333956     145010               43.42 %
Rest       71972      18942                26.32 %
Timbl 1    889591     323451               36.36 %
Timbl 2    1675106    680952               40.65 %
Total      4751262    1887243              39.72 %

Preprocessing...

• Successful entries with status code 200 are retained.

Preprocessing...

• SWL: sliding window length of 30 minutes (Romano and ElAarag).
• The target attribute is obtained using a backward-looking sliding window:
    Target attribute = 1, if the object is revisited within the sliding window
    Target attribute = 0, otherwise
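The labelling rule can be sketched in Python under one possible reading of "backward-looking": a request receives target 1 if the same URL was already requested during the preceding 30 minutes, and 0 otherwise. This reading, the function name label_requests, and the sample trace are assumptions made for illustration; the slides do not show the actual labelling code.

```python
SWL = 30 * 60  # sliding window length in seconds (30 minutes)


def label_requests(requests, window=SWL):
    """requests: list of (timestamp, url) pairs sorted by timestamp.

    Returns one 0/1 label per request: 1 if the same URL was requested
    during the preceding `window` seconds, 0 otherwise.
    """
    last_seen = {}   # url -> timestamp of its most recent earlier request
    labels = []
    for ts, url in requests:
        prev = last_seen.get(url)
        labels.append(1 if prev is not None and ts - prev <= window else 0)
        last_seen[url] = ts
    return labels


# Hypothetical trace: timestamps in seconds, URLs are placeholders.
trace = [(0, "/a"), (600, "/b"), (1200, "/a"), (4000, "/a"), (4100, "/b")]
labels = label_requests(trace)
```

Only the most recent earlier request needs to be remembered per URL: if any earlier request falls inside the window, the most recent one does too.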

Attributes    Values
time          1335442301
duration      379
client        127.0.0.1
result_code   TCP_MISS/200
size          1609
method        GET
URL           http://www.opencalais.com/robots.txt

A Perl command used to convert the Unix time stamp to a human-readable time stamp:
    tail access.log | perl -p -e 's/^([0-9]*)/"[".localtime($1)."]"/e'
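For comparison, the same conversion can be sketched in Python. This is an approximate equivalent, not the command used in the project: the function name humanize is hypothetical, and the sketch formats the time in UTC so that the result does not depend on the machine's time zone, whereas the Perl localtime call uses local time.

```python
import re
from datetime import datetime, timezone


def humanize(line):
    """Replace the leading Unix time stamp with "[Day Mon DD HH:MM:SS YYYY]"."""
    def repl(match):
        # UTC is an assumption made here for reproducibility.
        when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
        return "[" + when.strftime("%a %b %d %H:%M:%S %Y") + "]"
    return re.sub(r"^([0-9]+)", repl, line, count=1)


converted = humanize("1335442301 379 127.0.0.1 TCP_MISS/200 1609 GET "
                     "http://www.opencalais.com/robots.txt")
```

Lines that do not start with digits are returned unchanged.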

Preprocessing...

[Figure: preprocessing pipeline. Components shown: access.log, connection.java, Labelinsert.java, InsertMongoDB.java, mongoexport, access.csv]


Methodology

Performance Measure

• Hit ratio is the measure most widely used in evaluating the performance of Web caching.
• Hit ratio is defined as the percentage of requests that can be satisfied by the cache.

Hit Ratio = (Number of hits / Cacheable requests) * 100
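The definition can be checked with a short Python sketch. The function name hit_ratio is hypothetical; the two counts are taken from the Datahub2 row of the first results table in this deck.

```python
def hit_ratio(hits, requests):
    """Return the hit ratio as a percentage of the given request count."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return 100.0 * hits / requests


# Counts from the Datahub2 row of the first results table.
ratio = hit_ratio(45357, 54557)
print(f"{ratio:.4f} %")
```

The value lies between 83.13 and 83.14, consistent with the 83.13 reported for that row if the reported figures are truncated to two decimals.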

Machine Learner
• WSO2 Machine Learner is a product that helps to manage and explore data, build machine learning models by analyzing the data with machine learning algorithms, compare and manage the generated models, and predict using the built models.
• Apache Spark is a fast and general engine for large-scale data processing.
• Easy graphical user interface for human-friendly viewing.
• Access the ML UI from a Web browser using the following URL: https://<ML_HOST>:<ML_PORT>/ml
• To run ML: <PRODUCT_HOME>/bin/wso2server.sh

SVM parameters
• Iterations: 100
• Learning rate: 0.001
• SGD data fraction: 1
• Regularization type: L1
• Regularization parameter: 0.001

Decision Tree parameters
• Max depth: 30
• Max bins: depends on unique features
• Impurity: gini/entropy
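The two impurity options named for the decision tree, gini and entropy, can be illustrated with a small Python sketch. It shows only the textbook definitions applied to the labels in a single tree node; the function names and the sample labels are assumptions, and this is not the code that WSO2 Machine Learner runs.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


# Hypothetical node holding target labels (1 = revisited, 0 = not revisited).
node = [1, 1, 1, 0]
g = gini(node)       # 1 - (0.75^2 + 0.25^2) = 0.375
h = entropy(node)
```

Both measures are zero for a node whose labels all agree and grow as the classes become more evenly mixed, which is why a tree learner prefers splits that reduce them.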

Data set   Total requests   Number of hits   Hit ratio (%)
Datahub2   54557            45357            83.13
Dbpedia    181114           105883           58.46
Freebase   43507            32527            74.76
Rest       5685             4428             77.88
Timbl      97039            42390            43.68
Timbl2     206708           135149           66.15

Data set   Total requests   Number of hits   Hit ratio (%)
Datahub2   54557            25470            46.68
Dbpedia    181114           118418           65.38
Freebase   43507            26359            60.58
Rest       5685             1519             26.71
Timbl      97039            58243            60.02
Timbl2     204288           96822            47.39

Conclusion
Data Set   Requests   Cacheable requests   Hit Ratio (%)
Datahub    398547     181850               83.13
DBpedia    1382090    537038               65.38
Freebase   333956     145010               74.76
Rest       71972      18942                77.88
Timbl 1    889591     323451               60.02
Timbl 2    1675106    680952               66.15

• In this study, SVM and Decision Tree approaches were trained on proxy log files to classify the contents of the Web proxy cache.
• The hit ratio was calculated from the classification decisions made by the trained SVM and the trained Decision Tree.
• The performance of Web caching can be improved using supervised machine learning. Classifiers can be utilized to improve the hit ratio of traditional Web caching policies.

References
• S. Romano and H. ElAarag, "A neural network proxy cache replacement strategy and its implementation in the Squid proxy server", Neural Computing & Applications, Vol. 20, No. 1, (2011), pp. 59-78.
• A. I. Vakali, "LRU-based algorithms for Web Cache Replacement".
• W. Ali, S. Sulaiman, and N. Ahmad, "Performance Improvement of Least-Recently Used Policy in Web Proxy Cache Replacement Using Supervised Machine Learning", Int. J. Advance. Soft Comput. Appl., Vol. 6, No. 1, (2014).
• Introducing Machine Learner, https://docs.wso2.com/display/ML100/Introducing+Machine+Learner
• Squid: Optimising Web Delivery, http://www.squid-cache.org/
