
2009 IEEE International Advance Computing Conference (IACC 2009), Patiala, India, 6-7 March 2009

Page Ranking Algorithms: A Survey

Neelam Duhan, A. K. Sharma, Komal Kumar Bhatia

YMCA Institute of Engineering, Faridabad, India
E-mail: {neelam_duhan, ashokkale, komal_bhatia1}@rediffmail.com

Abstract--Web mining is an active research area in the present scenario. Web mining is defined as the application of data mining techniques on the World Wide Web to find hidden information. This hidden information, i.e. knowledge, could be contained in the content of web pages, in the link structure of the WWW, or in web server logs. Based upon the type of knowledge, web mining is usually divided into three categories: web content mining, web structure mining and web usage mining. An application of web mining can be seen in the case of search engines. Most of the search engines rank their search results in response to users' queries to make the users' search navigation easier. In this paper, a survey of page ranking algorithms has been carried out, and some important algorithms are compared in the context of performance.

Keywords--WWW; Data mining; Web mining; Search engine; Page ranking

I. INTRODUCTION

The WWW is a vast resource of hyperlinked and heterogeneous information including text, image, audio, video and metadata. It is estimated that the WWW has expanded by about 2000% since its evolution and is doubling in size every six to ten months [1]. With the rapid growth of information sources available on the WWW and the growing needs of users, it is becoming difficult to manage the information on the web and satisfy the user needs. Actually, we are drowning in data but starving for knowledge. Therefore, it has become increasingly necessary for users to use some information retrieval techniques to find, extract, filter and order the desired information.

The majority of users use information retrieval tools like search engines to find information on the WWW. Some commonly used search engines are Google, MSN and Yahoo Search. They download, index and store hundreds of millions of web pages and answer tens of millions of queries every day. They act like content aggregators (Fig. 1), as they keep a record of all the information available on the WWW.

Figure 1. Concept of a Search Engine (the search engine acts as a content aggregator between the WWW and the content consumers, i.e. the users).

The most important component of a search engine (see Fig. 2) is the crawler, also called a robot or spider, which traverses the hypertext structure of the web and downloads the web pages. The downloaded pages are routed to an indexing module that parses the web pages and builds the index based upon the keywords present in the pages. The index is generally maintained alphabetically by keyword.

Figure 2. Architecture of a Search Engine (user, query interface, results, indexer, web crawler, web pages and a web mining/ranking component).

When a user fires a query in the form of keywords on the interface of a search engine, it is picked up by the query processor component, which after matching the query keywords with the index returns the URLs of the matching pages to the user. But before presenting the pages to the user, some ranking mechanism (web mining) is applied, either in the back end or in the front end, by most of the search engines to make the user's navigation among the search results easier. Important pages are put at the top, leaving the less important pages at the bottom of the result list. Such a mechanism is used by the popular search engine Google, which uses the PageRank algorithm to rank its result pages.
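To make the indexing and query-matching steps just described concrete, the following is a minimal sketch (not taken from the paper) of a keyword index and query processor in Python; the two sample pages and their URLs are purely hypothetical.

# A minimal sketch of the indexer and query processor described
# above: pages are parsed into keywords, an alphabetically ordered
# inverted index is built, and a query is answered by intersecting
# the posting lists of its keywords. The pages and URLs are
# hypothetical example data.

from collections import defaultdict

pages = {
    "http://example.org/a": "web mining applies data mining to the web",
    "http://example.org/b": "page ranking orders web search results",
}

index = defaultdict(set)          # keyword -> set of URLs
for url, text in pages.items():
    for keyword in text.split():
        index[keyword].add(url)

def answer(query):
    """Return the URLs containing every keyword of the query."""
    posting_lists = [index.get(k, set()) for k in query.split()]
    return set.intersection(*posting_lists) if posting_lists else set()

print(sorted(index))          # index keys, alphabetical like the paper's index
print(answer("web mining"))   # -> {'http://example.org/a'}

A real engine would then pass this unordered answer set to a ranking component such as the algorithms surveyed in Section 3.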

In this paper, a survey of various page ranking algorithms has been done and a comparison is carried out. The paper is structured as follows: Section 2 discusses web mining concepts, categories and technologies; Section 3 provides a detailed overview of some page ranking algorithms; Section 4 discusses the limitations and strengths of each algorithm discussed; finally, Section 5 concludes the paper with a light on future suggestions.

II. WEB MINING

The extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from large databases is called Data Mining. Web mining [2, 3] is the application of data mining techniques to discover and retrieve useful information (knowledge) from WWW documents and services. Web mining can be divided into three categories [2, 3], namely web content mining, web structure mining and web usage mining, as shown in Fig. 3.


Web Content Mining (WCM) means mining the content of web pages. It can be applied to web pages themselves or to the result pages obtained from a search engine. WCM can be viewed from two different perspectives: the Information Retrieval (IR) view and the Database (DB) view. In the IR view, almost all research uses a bag of words to represent unstructured text, while for semi-structured data the HTML structure inside the documents can be used; intelligent web agents can be employed here for web mining purposes. In the DB view, a web site can be transformed to represent a multi-level database, and web mining tries to infer the structure of the web site from this database.

Figure 3. Taxonomy of Web Mining (web mining splits into web content mining, web structure mining and web usage mining; web content mining covers web page content mining and search result mining, each with an agent-based and a database approach; web usage mining covers general access tracking and customized usage tracking).

Web Structure Mining (WSM) tries to discover the link structure of the hyperlinks at the inter-document level, in contrast to WCM, which focuses on the structure within a document. It is used to generate a structural summary of the web in the form of a web graph, where web pages act as nodes and hyperlinks as edges connecting related pages.

Web Usage Mining (WUM) is used to discover user navigation patterns and useful information from the web data present in server logs, which are maintained during the interaction of users while surfing the web. It can be further categorized into finding general access patterns or finding patterns matching specified parameters.

The three categories of web mining described above have their own application areas, including site improvement, business intelligence, web personalization, site modification, usage characterization and classification, ranking of pages etc. Page ranking is generally used by search engines to find the more important pages. Different page ranking algorithms have been reported in the available literature [4, 5, 6, 8, 9, 10]. In the next section, four important page ranking algorithms, PageRank, Weighted PageRank, HITS and Page Content Rank, are discussed giving details of their working.

III. PAGE RANKING ALGORITHMS

The size of the WWW is growing rapidly and, at the same time, the number of queries the search engines can handle has grown incredibly too. With the increasing number of users on the web, the number of queries submitted to the search engines is also increasing exponentially. Therefore, the search engine must be able to process these queries efficiently, and some web mining technique must be employed in order to extract only the relevant documents from the database and provide the intended information to the users.

To present the documents in an ordered manner, page ranking methods are applied, which arrange the documents in order of their relevance, importance and content score and use web mining techniques to order them. Some algorithms rely only on the link structure of the documents, i.e. their popularity scores (web structure mining), whereas others look at the content of the documents (web content mining), while some use a combination of both, i.e. they use the links as well as the content of a document to assign it a rank value. Some of the common page ranking algorithms are discussed below.

A. PageRank Algorithm

Sergey Brin and Larry Page [5, 6] developed a ranking algorithm used by Google, named PageRank (PR) after Larry Page (cofounder of the Google search engine), that uses the link structure of the web to determine the importance of web pages. Google [7] uses PageRank to order its search results so that documents that seem more important move up in the results of a search accordingly. The algorithm states that if a page has important incoming links, then its outgoing links to other pages also become important; it therefore takes backlinks into account and propagates ranking through links. Thus, a page obtains a high rank if the sum of the ranks of its backlinks is high.

The PageRank algorithm considers more than 25 billion web pages on the WWW when assigning rank scores [7]. When a query is given, Google combines precomputed PageRank scores with text matching scores [11] to obtain an overall ranking score for each web page returned in response to the query. Although many factors are considered in determining the overall rank, the PageRank algorithm remains the heart of Google.

A simplified version [5] of PageRank is defined in Eq. 1:

PR(u) = c * Σ_{v ∈ B(u)} PR(v) / N_v    (1)

where u represents a web page, B(u) is the set of pages that point to u, PR(u) and PR(v) are the rank scores of pages u and v respectively, N_v denotes the number of outgoing links of page v, and c is a factor used for normalization.

In PageRank, the rank score of a page (say p) is equally divided among its outgoing links, and the values assigned to the outgoing links of page p are in turn used to calculate the ranks of the pages pointed to by p. An example showing this distribution of page ranks is illustrated in Fig. 4.

Figure 4. Distribution of page ranks (each page splits its rank value equally over its outgoing links).

Later, PageRank was modified based on the observation that not all users follow the direct links on the WWW. The modified version is given in Eq. 2:

PR(u) = (1 - d) + d * Σ_{v ∈ B(u)} PR(v) / N_v    (2)

where d is a damping factor that is usually set to 0.85. d can be thought of as the probability of a user following the direct links, and (1 - d) as the page rank distribution from pages that are not directly linked.

1) Example Illustrating Working of PR

To explain the working of PageRank, consider the example hyperlinked structure shown in Fig. 5, where A, B and C are three web pages.

Figure 5. Example Hyperlinked Structure (three mutually linked pages A, B and C).

Using Eq. 2, the PageRanks of pages A, B and C can be written as:

PR(A) = (1 - d) + d * (PR(B)/2 + PR(C)/1)    (2a)
PR(B) = (1 - d) + d * (PR(A)/2 + PR(C)/1)    (2b)
PR(C) = (1 - d) + d * (PR(B)/2)              (2c)

By solving the above equations with d = 0.5 (say), the page ranks of pages A, B and C become:

PR(A) = 1.2, PR(B) = 1.2, PR(C) = 0.8

2) Iterative Method of PageRank

It is easy to solve the equation system and determine the page rank values for a small set of pages, but the web consists of billions of documents, and finding a solution by inspection is not possible. In the iterative calculation, each page is assigned a starting page rank value of 1, as shown in Table I. These rank values are repeatedly substituted into the page rank equations to find the final values. In general, many iterations may be needed for the page ranks to converge.

TABLE I. ITERATION METHOD OF PAGERANK

Iteration | PR(A) | PR(B) | PR(C)
0         | 1     | 1     | 1
1         | 1.25  | —     | 0.818
2         | 1.21  | 1.2   | 0.8
3         | 1.2   | 1.2   | 0.8
4         | 1.2   | 1.2   | 0.8

It may be noted that in this example PR(A) = PR(B) > PR(C). Experiments have shown that the rank values converge to a reasonable tolerance in a number of iterations that is roughly logarithmic in the number of pages [5, 6].
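The iterative method can be reproduced in a few lines of Python. The sketch below simply re-evaluates Eqs. 2a-2c with d = 0.5, starting from a rank of 1 for every page; it is an illustration, not the paper's implementation, and because it updates all three values simultaneously its intermediate values may differ slightly from Table I, although the fixed point PR(A) = PR(B) = 1.2, PR(C) = 0.8 is the same.

# Iterative PageRank for the three-page example of Fig. 5,
# applying Eqs. 2a-2c with d = 0.5 and a starting rank of 1.
# All three values are updated simultaneously each step.

d = 0.5
pr = {"A": 1.0, "B": 1.0, "C": 1.0}

for step in range(1, 6):
    pr = {
        "A": (1 - d) + d * (pr["B"] / 2 + pr["C"] / 1),  # Eq. 2a
        "B": (1 - d) + d * (pr["A"] / 2 + pr["C"] / 1),  # Eq. 2b
        "C": (1 - d) + d * (pr["B"] / 2),                # Eq. 2c
    }
    print(step, round(pr["A"], 3), round(pr["B"], 3), round(pr["C"], 3))
# converges towards 1.2, 1.2, 0.8 as in Table I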

B. Weighted PageRank Algorithm

Wenpu Xing and Ali Ghorbani [10] proposed an extension to standard PageRank called Weighted PageRank (WPR). It assumes that the more popular a page is, the more linkages other web pages tend to have to it or from it. This algorithm does not divide the rank value of a page evenly among its outgoing linked pages; rather, it assigns larger rank values to more important pages, with each outgoing link getting a value proportional to its popularity or importance. The popularity of a page is measured by its numbers of incoming and outgoing links, and is expressed as weight values assigned to the incoming and outgoing links, denoted W^in(v,u) and W^out(v,u) respectively. W^in(v,u) (given in Eq. 3) is the weight of link (v,u), calculated from the number of incoming links of page u and the number of incoming links of all reference (outgoing linked) pages of page v:

W^in(v,u) = I_u / Σ_{p ∈ R(v)} I_p    (3)

where I_u and I_p represent the number of incoming links of page u and page p, respectively, and R(v) denotes the reference page list of page v. W^out(v,u) (given in Eq. 4) is the weight of link (v,u), calculated from the number of outgoing links of page u and the number of outgoing links of all reference pages of page v:

W^out(v,u) = O_u / Σ_{p ∈ R(v)} O_p    (4)

where O_u and O_p represent the number of outgoing links of page u and page p, respectively. Considering the importance of pages, the original PageRank formula (Eq. 2) is modified as given in Eq. 5:

WPR(u) = (1 - d) + d * Σ_{v ∈ B(u)} WPR(v) * W^in(v,u) * W^out(v,u)    (5)

1) Example Illustrating Working of WPR

To illustrate the working of WPR, refer again to Fig. 5. The WPR equations become:

WPR(A) = (1 - d) + d * (WPR(B) * W^in(B,A) * W^out(B,A) + WPR(C) * W^in(C,A) * W^out(C,A))
WPR(B) = (1 - d) + d * (WPR(A) * W^in(A,B) * W^out(A,B) + WPR(C) * W^in(C,B) * W^out(C,B))
WPR(C) = (1 - d) + d * (WPR(B) * W^in(B,C) * W^out(B,C))

The weights of the incoming and outgoing links can be calculated as:

W^in(B,A) = I_A / (I_A + I_C) = 2/(2+1) = 2/3
W^out(B,A) = O_A / (O_A + O_C) = 1/(1+2) = 1/3

Similarly, the other values after calculation are:

W^in(C,A) = 1/2 and W^out(C,A) = 1/3
W^in(A,B) = 1 and W^out(A,B) = 1
W^in(C,B) = 1/2 and W^out(C,B) = 2/3
W^in(B,C) = 1/3 and W^out(B,C) = 2/3

After substituting d = 0.5 and the weight values calculated above, the page ranks of A, B and C become:

WPR(A) = 0.65, WPR(B) = 0.93, WPR(C) = 0.60

Here WPR(B) > WPR(A) > WPR(C), which shows that the resulting orders of pages obtained by PR (Section A) and WPR are different.
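As a check on these numbers, the following Python sketch (an illustration, not the authors' code) iterates Eq. 5 over the example graph using the link weights just derived; it converges to approximately the values listed above.

# Weighted PageRank on the example of Fig. 5 with d = 0.5.
# Each entry maps a link (source, target) to the product
# W_in(source,target) * W_out(source,target) derived above.

d = 0.5
w = {
    ("B", "A"): (2 / 3) * (1 / 3),
    ("C", "A"): (1 / 2) * (1 / 3),
    ("A", "B"): 1 * 1,
    ("C", "B"): (1 / 2) * (2 / 3),
    ("B", "C"): (1 / 3) * (2 / 3),
}

wpr = {"A": 1.0, "B": 1.0, "C": 1.0}
for _ in range(50):
    # Eq. 5: sum the weighted ranks of the pages linking to u
    wpr = {
        u: (1 - d) + d * sum(wpr[v] * wvu
                             for (v, t), wvu in w.items() if t == u)
        for u in wpr
    }
print({u: round(r, 2) for u, r in wpr.items()})  # {'A': 0.65, 'B': 0.93, 'C': 0.6}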

2) Comparison of WPR and PR

To compare WPR with standard PageRank, the authors categorized the resultant pages of a query into four categories based on their relevancy to the given query:

* Very Relevant pages (VR): pages containing very important information related to the given query.
* Relevant pages (R): pages having relevant but not important information about the given query.
* Weakly Relevant pages (WR): pages which do not have relevant information about the given query but contain the query keywords.
* Irrelevant pages (IR): pages containing neither the query keywords nor relevant information about the query.

Both WPR and standard PR provide sorted lists to users based on the given query. The following rule has been adopted to calculate the relevance score of each page in the list, and it is what differentiates WPR from PR.

Relevancy Rule: the relevancy of a page depends on its category and its position in the result list. The larger the relevancy value of a result list, the better the list is ordered. The relevancy k is given in Eq. 6:

k = Σ_{i ∈ R(p)} (n - i) × W_i    (6)

where i denotes the ith page in the resultant page list R(p), n represents the first n pages chosen from the list, and W_i is the weight of the ith page, given by:

W_i ∈ {v1, v2, v3, v4}

where v1, v2, v3 and v4 are the values assigned to a page if the page is VR, R, WR or IR respectively, chosen such that v1 > v2 > v3 > v4; the values of W_i can be decided through experimental studies. Experiments show that WPR produces larger relevancy values, which indicates that it performs better than PR.
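A small worked illustration of Eq. 6 may help; the category weights v1..v4 below are arbitrary example values, not taken from the paper.

# Relevancy Rule (Eq. 6) on a hypothetical four-page result list.
# A list that places VR/R pages earlier earns a larger score k.

W = {"VR": 4, "R": 3, "WR": 2, "IR": 1}   # hypothetical v1..v4

def relevancy(categories):
    """k = sum over positions i of (n - i) * W_i, as in Eq. 6."""
    n = len(categories)
    return sum((n - i) * W[c] for i, c in enumerate(categories, start=1))

print(relevancy(["VR", "R", "WR", "IR"]))  # well ordered  -> 20
print(relevancy(["IR", "WR", "R", "VR"]))  # badly ordered -> 10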

C. Page Content Rank Algorithm

Jaroslav Pokorny and Jozef Smizansky [4] gave a new page relevance ranking method employing the WCM technique, called Page Content Rank (PCR). The method combines a number of heuristics that seem to be important for analyzing the content of web pages. Here, page importance is determined on the basis of the importance of the terms contained in the page, while the importance of a term is specified with respect to a given query q. PCR uses a neural network as its inner classification structure.

In PCR, for a given query q and a usual search engine, a set Rq of ranked pages results, and these pages are in turn classified according to their importance. A page is represented in a similar way as in the vector model [12], and the frequencies of the terms in the page are used.

1) Working of PCR

(i) Term extraction: an HTML parser extracts terms from each page in Rq. An inverted list [13] is built in this step, which is used in step (iv).

(ii) Parameter calculation: statistical parameters such as Term Frequency (TF) and occurrence positions, as well as linguistic parameters such as the frequency of words in the natural language, are calculated, and synonym classes are identified.

(iii) Term classification: based on the parameter calculations of step (ii), the importance of each term is determined. A neural network is used as the classifier, trained on a training set of terms. Each parameter corresponds to the excitation of one neuron in the input layer, and the importance of a term is given by the excitation of the output neuron when the propagation terminates.

(iv) Relevance calculation: page relevance scores are determined on the basis of the importance of the terms in the page, calculated in step (iii). The new score of a page P is equal to the average importance of the terms in P.

PCR asserts that the importance of a page P is proportional to the importance of all terms in P. The algorithm uses the usual aggregation functions like Sum, Min, Max, Average and Count, and also a function called Sec_moment given in Eq. 7:

Sec_moment(S) = (1/n) * Σ_{i=1}^{n} x_i^2    (7)

where S = {x_i | i = 1..n} and n = |S|. Sec_moment is used in PCR because it increases the influence of extreme values in the result, in contrast to the Average function.

2) Symbols Used in PCR

The PCR algorithm uses the following symbols in the parameter calculations of the next section:

D: the set of all pages indexed by a search engine.
q: a conjunctive boolean query fired by the user.
Q: the set of all terms in query q.
Rq ⊆ D: the set of pages considered relevant by the search engine with respect to q.
Rq,n ⊆ Rq: the set of the n top ranked pages from Rq. If n > |Rq|, then Rq,n = Rq.
TF(P, t): term frequency, i.e. the number of occurrences of term t in page P.
DF(t): document frequency, i.e. the number of pages which contain the term t.
Pos(P, t): the set of positions of term t in page P.
Term(P, i): a function returning the term at the ith position in page P.

3) Parameter Calculations in PCR

The calculation of the importance of a term t, denoted importance(t), is carried out on the basis of 5 + (2 * NEIB) parameters, where NEIB denotes the number of neighbouring terms included in the calculation. The calculation depends on attributes such as the database D, the query q and the number n of pages considered. A classification function classify() with 5 + (2 * NEIB) parameters is used, returning the importance of t. The importance of a term t is considered to be influenced by the parameters described below:

(i) Occurrence frequency: the total number of occurrences of term t in Rq:

freq(t) = Σ_{P ∈ Rq} TF(P, t)    (8)

(ii) Distances of occurrences of t from occurrences of terms in Q: if a term t occurs very often, or close to the terms contained in Q, it can be significant for the given topic. Let QW (Eq. 9) be the set of all occurrence positions of terms from Q in all pages P ∈ Rq,n:

QW = ∪_{t ∈ Q, P ∈ Rq,n} Pos(P, t)    (9)

The distance of a term t from the query terms is the minimum of all distances of t from the query terms, i.e.

dist(t) = min({|r - s| : r ∈ Pos(P, t) and s ∈ QW})    (10)

(iii) Incidence of pages: this value, denoted occur(t), is the ratio of the number of pages containing the term t to the total number of pages. A term having a low DF is not considered important, regardless of a high TF:

occur(t) = DF(t) / |Rq,n|    (11)

(iv) Frequency in the natural language: PCR assumes a database of frequent words and a function F(t) assigning to each of these words an integer value representing its frequency in the given database. The frequency of t in the language is then defined as:

common(t) = F(t)    (12)

Obviously, a term t is considered less important if it belongs to the frequent words of the natural language used.

(v) Term importance: the importance of all terms from Rq,n can be determined provisionally as:

importance(t) = classify(freq(t), dist(t), occur(t), common(t), 0, ..., 0)    (13)

(vi) Synonym classes: a database of synonym classes is used, and for each synonym class S an aggregate importance SC(S) is calculated as shown in Eq. 14:

SC(S) = Sec_moment({importance(t') : t' ∈ S})    (14)

A term becomes important if it is the synonym of an important term. The importance SC(S) is propagated to the term t by another aggregation over all its meanings:

synclass(t) = Sec_moment({SC(S_t') : t' ∈ SENSE(t)})    (15)

where SENSE(t) contains all meanings t' of t.

(vii) Importance of neighbouring terms: the neighbouring terms always affect the importance of a term, i.e. if a term is surrounded by important terms, the term itself becomes important. This is described by 2 * NEIB parameters that aggregate the importance of the terms surrounding the term t. Let RelPosNeib(t, i), given in Eq. 16, be the set of terms which are the ith neighbours of term t over all occurrences of t in all pages P ∈ Rq,n. If i < 0, the left neighbours are obtained, while i > 0 gives the right ones. The predicate Inside(P, n) is satisfied if n is an index into the page P. Then,

RelPosNeib(t, i) = ∪_{P ∈ Rq,n} {Term(P, j + i) : j ∈ Pos(P, t) ∧ Inside(P, j + i)}    (16)

The parameters neib(t, i) for i = -NEIB, -(NEIB-1), ..., -1, 1, ..., NEIB are defined as follows:

neib(t, i) = Sec_moment(RelPosNeib(t, i))

Based on these parameters, the resultant importance of the term t is defined as:

importance(t) = classify(freq(t), dist(t), occur(t), common(t), synclass(t), neib(t, -NEIB), ..., neib(t, NEIB))    (17)

4) Page Classification and Importance Calculation

In PCR, a layered neural network, denoted NET, is used as the classification tool. NET has weights set up from previous experiments. Assuming the network has 5 + (2 * NEIB) neurons in the input layer and one neuron in the output layer, let the calculation of a general neural network NET on the input vector v be denoted NET(v), and let NET[i] be the excitation of the ith neuron in the output layer of NET after the calculation terminates. The classify() function can then be defined as:

classify(p_1, ..., p_{5+(2*NEIB)}) = NET(p_1, ..., p_{5+(2*NEIB)})

The importance of a page P is considered to be an aggregate value of the importance of all terms in P, i.e.

importance(P) = Sec_moment({importance(t) : t ∈ P})    (18)

The importance value of every page (Eq. 18) in Rq,n imparts a new order to the n top ranked pages in the result list of a search engine. This new order represents the pages according to their content scores, in contrast to PR and WPR.
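Reference [4] gives no implementation, but the flavour of the PCR parameter calculations can be sketched as follows; the toy query and pages are hypothetical, and dist() is simplified to measure distances within a page rather than over the global position set QW of Eq. 9.

# A toy illustration of three PCR term parameters over a tokenized
# page set Rq,n: freq(t) as in Eq. 8, dist(t) in the spirit of
# Eq. 10 (simplified to per-page distances), and occur(t) as in
# Eq. 11. The query and pages are made-up examples.

Q = {"mining"}
pages = [
    "web mining ranks pages".split(),
    "data mining finds patterns in data".split(),
]

def positions(page, t):
    return [i for i, term in enumerate(page) if term == t]

def freq(t):                       # Eq. 8: total occurrences of t
    return sum(len(positions(p, t)) for p in pages)

def occur(t):                      # Eq. 11: DF(t) / |Rq,n|
    return sum(1 for p in pages if t in p) / len(pages)

def dist(t):                       # Eq. 10, per page: min gap to a query term
    ds = [abs(r - s)
          for p in pages
          for r in positions(p, t)
          for q in Q
          for s in positions(p, q)]
    return min(ds) if ds else float("inf")

print(freq("data"), occur("data"), dist("data"))  # 2  0.5  1

In the full method these parameters, together with common(t), synclass(t) and the neib(t, i) values, feed the neural classifier NET described above.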

D. HITS Algorithm

Kleinberg [8] developed a WSM based algorithm called Hyperlink-Induced Topic Search (HITS) [9], which assumes that for every query given by the user there is a set of authority pages, which are relevant and popular pages focused on the query, and a set of hub pages, which contain useful links to relevant pages/sites, including links to many authorities (see Fig. 6). HITS assumes that if the author of page p provides a link to page q, then p confers some authority on page q.

Figure 6. Hubs and Authorities (hubs on one side pointing to authorities on the other).

The HITS algorithm considers the WWW as a directed graph G(V, E), where V is a set of vertices representing pages and E is a set of edges corresponding to links; a directed edge (p, q) indicates a link from page p to page q. The search engine may not retrieve all pages relevant to the query, but the initial pages it retrieves are a good starting point. Relying only on these initial pages, however, does not guarantee that the authority and hub pages are retrieved efficiently; to remove this problem, HITS uses a proper method to find the relevant information regarding the user query.

1) Working of HITS

The HITS algorithm works in two major steps.

Step 1 - Sampling Step: in this step, a set of pages relevant to the given query is collected, i.e. a subgraph S of G which is rich in authority pages is retrieved. The algorithm starts with a root set R (say, of 200-300 pages) selected from the result list of a usual search engine. Starting with R, a set S is obtained, keeping in mind that S should be relatively small, rich in relevant pages about the query, and contain most of the strongest authorities. The pages in the root set R must contain links to other authorities, if there are any. HITS expands the root set R into a base set S by using the algorithm shown in Fig. 7.

Input: Root set R; Output: Base set S
1. Let S = R.
2. For each page p in S, do steps 3 to 5:
3.   Let T be the set of all pages S points to.
4.   Let F be the set of all pages that point to S.
5.   Let S = S + T + some or all of F.
6. Delete all links within the same domain name.

Figure 7. Algorithm to determine the Base Set.

The set S may contain approximately 100-3000 pages.
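The base-set expansion of Fig. 7 translates almost directly into code. In the sketch below, the link tables out_links and in_links stand in for a real crawl and are purely illustrative; step 6 (deleting same-domain links) is omitted.

# A runnable sketch of steps 1-5 of Fig. 7. The link maps are
# hypothetical stand-ins for lookups against a real web graph;
# max_parents mirrors the "some or all of F" in step 5.

out_links = {"r1": {"a"}, "r2": {"a", "b"}, "a": {"b"}, "b": set()}
in_links = {"a": {"r1", "r2"}, "b": {"r2", "a"}, "r1": set(), "r2": set()}

def expand(root, max_parents=2):
    """Grow a root set R into a base set S in one pass over R."""
    S = set(root)
    for p in list(S):
        S |= out_links.get(p, set())                            # T
        S |= set(sorted(in_links.get(p, set()))[:max_parents])  # some of F
    return S

print(expand({"r1", "r2"}))  # {'a', 'b', 'r1', 'r2'}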

One simple approach to finding hubs and authorities from the base set S would be to order the pages by the counts of their outgoing and incoming links. This works well in some situations but does not always work well. Before starting the second step, HITS removes all links between pages on the same web site or the same domain (step 6 of the algorithm in Fig. 7), the reasoning being that links between pages on the same site serve navigational purposes and do not confer authority. Furthermore, if many links from one domain point to a single page outside the domain, only a small number of these links are counted instead of all of them.

Step 2 - Iterative Step: Finding Hubs and Authorities. This step finds the hubs and authorities using the output of the sampling step. The algorithm for finding hubs and authorities is shown in Fig. 8.

Input: Base set S; Output: a set of hubs and a set of authorities.
1. Let each page p have a non-negative authority weight x_p and a non-negative hub weight y_p. Pages with relatively large weights x_p will be classified as authorities, and similarly pages with large weights y_p as hubs.
2. Normalize the weights so that the squared sum of each type of weight equals 1.
3. For each page p, update x_p to be the sum of y_q over all pages q linking to p.
4. Update y_p to be the sum of x_q over all pages q linked to by p.
5. Continue with step 2 unless a termination condition has been reached.
6. Output the pages with the largest x_p weights as authorities and those with the largest y_p weights as hubs.

Figure 8. Algorithm to determine Hubs and Authorities.

Hubs and authorities are assigned relative weights: an authority pointed to by several highly scored hubs is considered a strong authority, while a hub that points to several highly scored authorities is considered a popular hub. If B(p) and R(p) denote the sets of referrer and reference pages of page p, respectively, the authority and hub scores are calculated as follows:

x_p = Σ_{q ∈ B(p)} y_q    (19)

y_p = Σ_{q ∈ R(p)} x_q    (20)

Fig. 9 shows how the authority and hub scores are calculated: for the page p shown there, x_p = y_A + y_B + y_C and y_p = x_L + x_M + x_N.

Figure 9. An example of HITS operation.

The page results are ranked according to their hub and authority scores and given to the user.
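The iterative step of Fig. 8 is a power iteration over the link structure. The sketch below (an illustration on a hypothetical four-page graph, not Kleinberg's code) applies Eqs. 19 and 20 with 2-norm normalization until the scores settle.

# HITS iteration on a tiny made-up graph; an edge (p, q) means
# page p links to page q. Eqs. 19 and 20 are applied repeatedly,
# renormalizing so the squared sum of each weight type is 1.

import math

edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("a1", "a2")]
nodes = {n for e in edges for n in e}

auth = {n: 1.0 for n in nodes}
hub = {n: 1.0 for n in nodes}

def normalize(scores):
    norm = math.sqrt(sum(s * s for s in scores.values()))
    return {n: s / norm for n, s in scores.items()}

for _ in range(20):
    # Eq. 19: x_p = sum of y_q over pages q that link to p
    auth = normalize({p: sum(hub[q] for q, r in edges if r == p)
                      for p in nodes})
    # Eq. 20: y_p = sum of x_q over pages q that p links to
    hub = normalize({p: sum(auth[r] for q, r in edges if q == p)
                     for p in nodes})

print(max(auth, key=auth.get), max(hub, key=hub.get))  # top authority and hub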

Researches have shown that while the algorithm works well for most queries, it does not work well for others. There are a number of reasons for this [14]:

* Hubs and authorities: a well defined distinction between hubs and authorities does not exist, since many sites are hubs as well as authorities.
* Topic drift: if there is a tightly connected arrangement of documents on the web, they may accumulate in the results of HITS even though they may not be the most relevant to the user query.
* Automatically generated links: some links are generated automatically and represent no human judgment, but HITS gives them equal importance.
* Non-relevant documents: some queries can return non-relevant documents in the result list, which can lead to erroneous results from HITS, as the root set will not be appropriate.
* Efficiency: the real-time performance of the algorithm is not good.

A number of proposals, such as Probabilistic HITS and Weighted HITS [15, 16, 17], have been made in the literature for modifying HITS.

E. A Comparison Study

A critical look at the available literature highlights several differences in the basic concepts used by the algorithms. HITS, like PageRank and WPR, is an iterative algorithm based on the link structure of the documents on the web; however, it has some major differences: it is executed at query time, not at indexing time; it calculates two scores per document instead of a single score; and it is processed on a small subset of relevant documents, not on all documents. PCR also chooses a subset of the documents in the result list, as HITS does, but applies a classification scheme which is not there in HITS. PR and WPR differentiate themselves from PCR and HITS in that they mainly focus on the hyperlink structure of the pages instead of their contents. A comparison summary of the algorithms discussed in Section 3 is shown in Table II.

TABLE II. COMPARISON OF PAGE RANKING ALGORITHMS

Criterion | PageRank | Weighted PageRank | Page Content Rank | HITS
Main technique used | Web structure mining | Web structure mining | Web content mining | Web structure mining, web content mining
Description | Computes scores at indexing time, not at query time; results are sorted according to the importance of pages. | Computes scores at indexing time with an unequal distribution of score; pages are sorted according to importance. | Computes new scores of the top n pages on the fly; the pages returned are related to the query, i.e. relevant documents are returned. | Computes hub and authority scores of n highly relevant pages on the fly; relevant as well as important pages are returned.
Input parameters | Backlinks | Backlinks, forward links | Content | Backlinks, forward links, content
Working levels | N | 1 | 1 | <N
Complexity | O(log N) | <O(log N) | O(m) | <O(log N) (higher than WPR)
Relevancy | Less | Less (higher than PR) | More | More (less than PCR)
Importance | More | More | Less | Less
Quality of results | Medium | Higher than PR | Approximately equal to WPR | Less than PR
Limitations | Computes scores at indexing time, not on the fly; results are sorted according to the importance of pages. | Relevancy is ignored; the method computes scores at a single level. | Importance of pages is totally ignored. | Topic drift and efficiency problems.

*n: number of pages chosen by the algorithm; N: number of web pages; m: total number of occurrences of the query terms in the n pages.

F. Conclusion

Web mining is used to extract useful information from very large amounts of web data. The usual search engines return a large number of pages in response to users' queries, while the user always wants the best results in a short span of time and does not want to navigate through all the pages to find the required ones. The page ranking algorithms, which are an application of web mining, play a major role in making the user's search navigation easier in the results of a search engine. The PageRank and Weighted PageRank algorithms give importance to links rather than to the content of the pages, the HITS algorithm stresses the content of the web pages as well as the links, while the Page Content Rank algorithm considers only the content of the pages. Depending upon the technique used, the ranking algorithms give a different order to the resultant pages: PageRank and WPR return the important pages at the top of the result list, while the others return the relevant ones at the top. A typical search engine may deploy a particular ranking algorithm depending upon the user needs. As future guidance, algorithms which equally consider the relevancy as well as the importance of a page should be developed so that the quality of search results can be improved.

REFERENCES

[1] Naresh Barsagade, "Web Usage Mining and Pattern Discovery: A Survey Paper", CSE 8331, Dec. 8, 2003.
[2] R. Cooley, B. Mobasher and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web", in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), 1997.
[3] Companion slides for the text by Dr. M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Prentice Hall, 2002.
[4] Jaroslav Pokorny and Jozef Smizansky, "Page Content Rank: An Approach to the Web Content Mining".
[5] L. Page, S. Brin, R. Motwani and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Technical report SIDL-WP-1999-0120, Stanford Digital Libraries, 1999.
[6] C. Ridings and M. Shishigin, "PageRank Uncovered", Technical report, 2002.
[7] "Increased Google index size", http://www.webrankinfo.com/english/seo-news/topic-16388.htm, January 2006.
[8] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", in Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 1998.
[9] C. Ding, X. He, P. Husbands, H. Zha and H. Simon, "Link Analysis: Hubs and Authorities on the World Wide Web", Technical report 47847, 2001.
[10] Wenpu Xing and Ali Ghorbani, "Weighted PageRank Algorithm", in Proceedings of the Second Annual Conference on Communication Networks and Services Research (CNSR'04), IEEE, 2004.
[11] "Our Search: Google Technology", http://www.google.com/technology/index.html.
[12] G. Salton and C. Buckley, "Term-Weighting Approaches in Automatic Text Retrieval", Information Processing and Management, Vol. 24, No. 5, pp. 513-523, 1988.
[13] Zdravko Markov and Daniel T. Larose, "Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage Data", John Wiley & Sons, 2007.
[14] S. Chakrabarti et al., "Mining the Web's Link Structure", Computer, 32(8):60-67, 1999.
[15] D. Cohn and H. Chang, "Learning to Probabilistically Identify Authoritative Documents", in Proceedings of the 17th International Conference on Machine Learning, pp. 167-174, Morgan Kaufmann, San Francisco, CA, 2000.
[16] Saeko Nomura, Satoshi Oyama and Tetsuo Hayamizu, "Analysis and Improvement of HITS Algorithm for Detecting Web Communities".
[17] Longzhuang Li, Yi Shang and Wei Zhang, "Improvement of HITS-based Algorithms on Web Documents", in Proceedings of WWW2002, Honolulu, Hawaii, USA, May 7-11, 2002. ACM 1-58113-449-5/02/0005.
