Google PageRanking Report

download Google PageRanking Report

of 15

Transcript of Google PageRanking Report

  • 8/10/2019 Google PageRanking Report

    1/15

    University Of RUGKT IIIT Nuzvid

    Faculty of Graph Theory and Applications

    Google PageRank Algorithm

    (Project Report)

    Prepared By

    Kalishavali Shaik

    N091390-CSE-I

    Supervised By

    Kalpana Gangwar Madam

    Graph theory and Applications

    Dated: December 10, 2014

  • 8/10/2019 Google PageRanking Report

    2/15 2

    Table of Contents

    1 Cover

    Page..1

    2

    Abstract

    .3

    3

    Abbreviations

    4

    4

    Chapters

    ..4

    4.1

    Introduction4

    4.2 Review of Literature

    .............................................5

    4.3 Summary and Conclusions

    15

    5

    References

    15

  • 8/10/2019 Google PageRanking Report

    3/153

    Abstract

    This paper presents different parallel implementations of Google's PageRank algorithm.

    The purpose is to compare different methods for computing PageRank on large domains

    Of the Web. The iterative algorithms used are the Power method and the Arnoldi method.

    We have implemented these algorithms in a parallel environment and created a basic

    WebCrawler to gather test data. Tests have then been carried out with the different algorithms

    Using various test data.

    The explicitly restarted Arnoldi method was shown to be superior to the normal Arnoldi

    Method as well as the Power method for high values of the dampening factor . Results

    Also show that load balancing our parallel implementation was usually quite ineffective.

    For smaller values of , including 0.85 as Google uses, the Power method is preferable. It

    Is usually somewhat slower, but the memory used is significantly less. For higher values

    Of , if very accurate results are needed, the restarted Arnoldi method is preferable.

    The importance of a Webpage is an inherently subjective matter, which depends on the

    readers interests, knowledge and attitudes. But there is still much that can be said objectively

    about the relative importance of Webpages. This paper describes PageRank, a method for

    rating Webpages objectively and mechanically, effectively measuring the human interest and

    attention devoted to them.

  • 8/10/2019 Google PageRanking Report

    4/154

    Abbreviations

    HITS: Hypertext Induced Topic Search

    DWPR: Distance Weighted Page Ranking

    WWW: World Wide Web

    Contents

    1.

    Introduction

    The World Wide Web creates many new challenges for information retrieval.

    It is very large and heterogeneous. Current estimates are that there areover150million

    webpages with a doubling life of less than one year. More importantly, the web pages are

    extremely diverse, ranging from "What is Joe having for lunch today?" to journals about

    information retrieval. In addition to these major challenges, search engines on the Web

    must also contend with inexperienced users and pages engineered to manipulate search

    engine ranking functions.

    However, unlike "at" document collections, the World Wide Web is

    Hypertext and provides considerable auxiliary information on top of the text of the

    web pages, such as link structure and link text. In this paper, we take advantage of the

    link structure of the Web to produce a global importance" ranking of every web page.

    This ranking, called PageRank, helps search engines and users quickly make sense of the

    vast heterogeneity of the World Wide Web

    Today search engines are becoming best friends of most of the people for

    navigation on Internet. User needs to enter keyword or combination of keywords to

    trigger the search. Search engine searches the web pages available on Internet and returns

    the result as number of ordered pages. This order should depend on relevancy of pages

    and importance of pages.

    Different search engines use different techniques to decide importance of page on

    the web. Alta Vista uses HITS (Hyperlink Induced Topic Search) algorithm to rank pages

  • 8/10/2019 Google PageRanking Report

    5/15 5

    while Google uses PageRank algorithm to rank the pages. Yahoo also uses some

    variation of PageRank algorithm that Google uses. There are many variations of Page

    Ranking algorithm. Like confidence based page ranking, weighted page rank algorithm,

    query sensitive self-adaptable web page ranking algorithm etc. But, this paper covers

    only one algorithm that is based on formula given by Bin and Page, the founders of

    Google.

    2. What is PageRank?

    Most of the people start their web navigation by search engine. Google is the most

    famous search engine used now days. While presenting search results, they should be

    ordered by their relevancy and importance on the web. User cannot go through all the

    pages presented as output of search. Thus all the pages in the collection should be

    weighted and represented in the order of their weights.One of the most important factors that Google uses is PageRank. PageRank is a

    numeric value that represents how important a page is on the web. Off course PageRank

    is not the only factor, which decides importance of page, but still it is one of them.

    PageRank is described by one mathematical formula that seems very difficult at first, but

    actually it is not.

    2.1 Formula:

    The citation graph of the web is main resource for calculation of PageRank. In the

    paper The Anatomy of a Large-Scale Hyper textual Web Search Engine founders of

    Google, Sergey Brin and Lawrence Page defined PageRank as:

    We assume page A has pages T1...Tn which point to it (i.e., are citations). The

    parameter d is a damping factor, which can be set between 0 and 1. We usually set d to

    0.85.. C (A) is defined as the number of links going out of page A. The PageRank

    of a page A is given as follows:

    PR (A) = (1-d) + d (PR (T1)/C (T1) + ... + PR (Tn)/C (Tn))

    It's the original formula that was published when PageRank was being developed,

    and it is probable that Google uses a variation of it but they aren't telling us what it is.

  • 8/10/2019 Google PageRanking Report

    6/15 6

    Thus, when one page links to other page, first page votes some PageRank to the second

    page. PageRank of a page is the addition of one constant (1-d=0.15) and the damped

    value of addition of votes by all pages pointing to it. Value of vote by a particular page

    depends on PageRank and total number of out links of that page. Thus, higher the

    PageRank, higher is the value of vote and higher the number of out links, lower is the

    value of vote.

    So every page distributes 85% of its original PageRank evenly among all pages to

    which it points. E.g. if page A is pointing to four other pages then factor d * PR (A)/4

    will come in PageRank equation of all those four pages. That four factors will add up to d

    * PR (A), i.e. 85% of PageRank of A. (From here onwards Ill refer PageRank as PR)

    Note that when page votes for another page, it doesnt give anything from its own PR.

    Its just voting, only the difference is that weight of vote of a page depends on its own

    PR. It is same as shareholders meeting where weight of vote of shareholder depends on

    the shares held.

    2.2 How to use formula?

    If we look at the equation, its obvious that PR of a page depends on PR of other

    pages, which are pointing to it, which may depend on PR of original page if it has back

    link to that page. e.g. If there are only two pages A and B which points to each other, then

    PR of each page will depend on PR of other, which is yet uncalculated. So where to start

    from? Surprisingly, we can start with any assumed values of PR. And repeat the

    calculation of PR values of all pages iteratively till they become stable.

    Letssee an example: consider there are only two pages A and B, which points to

    each other. We dont know what their PR should be to begin with, so lets take a guess at

    1 and do some calculations.

    PR (A) = (1-d) + d * PR (B)

    = (1-0.85) + 0.85 * 1= 1

    PR (B) = (1-d) + d * PR (A)

    = (1-0.85) + 0.85 * 1

    = 1

  • 8/10/2019 Google PageRanking Report

    7/15 7

    Since they are not changing we can stop here. Now letsdo the calculations with initial

    guess at 0:

    PR (A) = 0.15 + 0.85 * 0

    = 0.15

    PR (B) = 0.15 + 0.85 * 0.15

    = 0.2775

    After second iteration:

    PR (A) = 0.15 + 0.85 * 0.2775

    = 0.385875

    PR (B) = 0.15 + 0.85 + 0.385875

    = 0.47799375

    After third iteration:

    PR (A) = 0.15 + 0.85 * 0.47799375

    = 0.5562946875

    PR (B) = 0.15 + 0.85 * 0.5562946875

    = 0.622850484375

    and so on. They will keep changing till they reach value of 1. If we start with values

    greater than 1, then also values will degrade iteration by iteration and will

    settle down to 1. If we start calculation with different values of PR (A) and PR(B), then

    also we will end up with PR(1) = PR(B) = 1. Here we considered symmetrical linkstructure between A and B, thus with any value of PR (A) and PR (B) to start, we ends up

    with PR(A) = PR(B) = 1.

    If we take asymmetrical structure e.g. only A is pointing to B and B is not pointing to A,

    then we will end up with consistent values of PR (A) and PR (B), whatever the values we

    start with.

  • 8/10/2019 Google PageRanking Report

    8/15 8

    For more practice with different numbers of pages with different configurations of

    citation graphs, visit:

    www.webworkshop.net/pagerank_calculator.php?lnks=2,10,15&iblprs=0.15,0.15,0.15,0.

    15&pgnms=&pgs=2&initpr=1&its=100&type=simple

    3. How is PageRank used?

    Google uses PR as one of the important factor in search process to fine the

    relevancy of page. So while searching, web pages are accessed in the decreasing order of

    PR. One can find PR of its own webpage by using Google toolbar

    (http://toolbar.google.com/). But output of this toolbar is somewhat unexpected. As ournormal PR calculation can yield PR ranging from 0.15 to unlimited. But toolbar gives PR

    of any page in the range 1 to 10. So actual PR values are divided into intervals and are

    represented as one of the value ranging from 1 to 10. But the question is whether these

    intervals are equidistant? Means is the scale linear? No one outside Google knows it. For

    this question, there are different answers from different researchers. According to Ian

    Rogers, intervals cannot be equidistant. Means the scale used by Google must be

    logarithmic. This yields a result that efforts needed to increase PR from 2 to 3 are very

    less as compared to increase PR from 8 to 9. What these efforts are? At the end of this

    paper anyone can answer this question.

    4. Examples and Observations:

    Google does not provide any information about methods to improve PR values of

    page or site. But we can solve different examples and come up with different

    observations. Before starting examples I would like to explain difference between PR of

    page and PR of site. Google assigns PR for each page on the web. But from the site point

    of view, total PR of all pages of site is important. Thus PR of site is nothing but sum of

    PRs of all pages of site.

    Example 1.

    http://www.webworkshop.net/pagerank_calculator.php?lnks=2,10,15&iblprs=0.15,0.15,0.15,0.15&pgnms=&pgs=2&initpr=1&its=100&type=simplehttp://www.webworkshop.net/pagerank_calculator.php?lnks=2,10,15&iblprs=0.15,0.15,0.15,0.15&pgnms=&pgs=2&initpr=1&its=100&type=simplehttp://www.webworkshop.net/pagerank_calculator.php?lnks=2,10,15&iblprs=0.15,0.15,0.15,0.15&pgnms=&pgs=2&initpr=1&its=100&type=simplehttp://toolbar.google.com/http://toolbar.google.com/http://toolbar.google.com/http://toolbar.google.com/http://www.webworkshop.net/pagerank_calculator.php?lnks=2,10,15&iblprs=0.15,0.15,0.15,0.15&pgnms=&pgs=2&initpr=1&its=100&type=simplehttp://www.webworkshop.net/pagerank_calculator.php?lnks=2,10,15&iblprs=0.15,0.15,0.15,0.15&pgnms=&pgs=2&initpr=1&its=100&type=simple
  • 8/10/2019 Google PageRanking Report

    9/15 9

    PR (A) = (1-d) + d (0)

    = 0.15

    PR (B) = (1-d) + d (0)

    = 0.15

    In this example A and B dont have any inbound and outbound link. They come up

    with PR values equal to 0.15. Thus we can conclude that: Minimum value of PR of any

    page is 0.15.

    Example 2.

    PR (A) = (1-d) + d (PR (B) / C (B))

    = 0.15 + 0.85 (1/1)

    = 1

    PR (B) = (1-d) + d (0)

    = 0.15

    Actually this is not the correct result. There are two concepts: orphan pages and

    dangling links. Orphan pages are those, which dont have any inbound link. For a page to

    be indexed by Google it must have at least one page linking to it. If a page is not in the

    Google index, it and its links don't exist as far as the calculations are concerned.

    Therefore, it can't have a PR value and it can't share PR with other pages. Page B is

    orphan page and thus will be eliminated during PR calculation. And thus page A will notreceive any vote from page B.

    Dangling links are links that go to pages that don't have any outbound links. These

    links are dropped for the duration of the calculations. Link from B to A is a dangling link

    and will be eliminated during PR calculation. But here onwards we will consider only

    small part of whole web. Thus even there are no inbound links shown for a particular

    A B

    A B

  • 8/10/2019 Google PageRanking Report

    10/15 10

    page, they are assumed to be present there. Thus we will not consider these two concepts

    anymore in our examples.

    From here onwards Ill not show all calculations. Instead Ill show only final

    settled values of PR after large number of iterations.

    Example 3

    Example 4.

    A1.0

    B1.0

    C1.0

    A1.0

    B1.0

    C1.0

    B0.575

  • 8/10/2019 Google PageRanking Report

    11/15 11

    Fig.1.Analysis of Page Rank [1]

    This is hierarchical structure. In general homepage of website should have maximum PR.

    We need our PR to be concentrated at homepage. Thus we can use hierarchical structure

    and can channel large proportion of PR of site to where we want.

    Example 5:

    Example 6:

    A1.85

    C0.575

    C1.255

    A2.6

    B1.255

    External Site 1

    1.0 External Site 2

    1.215

    External Site1

    1.0

    A1.0

    B0.575

    C0.575

    External Site 2

    0.638

  • 8/10/2019 Google PageRanking Report

    12/15 12

    Fig 2. Analysis of Page Rank [1]

    Compare example 5 with example 6. Page A, B and C are pages of same site. Externalsite 1 is pointing to page A and page C is pointing to page C. In both of these examples,

    PR of external site 1 is assumed to be equal to 1. And then PR values of remaining pages

    are calculated. Can we infer something from these examples? Definitely. There is PR

    leak in example 5, which is avoided in example 6. In example 5 page is C gives its entire

    vote to external site. While in example 6 page C gives part of it to external site and part

    of it to page A. This part of vote from C increases PR value of A and thus increases

    weight of vote of A. In turn PR of B and C increase which, in turn again increases weight

    of vote of C. This is iterative process, which finally results into increased value of PR of

    all A, B and C. This is definitely in the favor of PR of whole site.

    From these examples we can conclude that if any page of site is pointing to

    external page, then we can reduce the PR leak by increasing citation network.

    Can we decrease PR leak by introducing reciprocating links between B and C? Letstry

    this.

    Fig.3.Analysis of Page Rank [1]

    Thats great!!! Average PR of site increased as we expected.

    5. How to increase PageRank?

    External Site 1

    1.0

    A2.146

    B1.549

    C1.720

    External Site 2

    1.215

  • 8/10/2019 Google PageRanking Report

    13/15 13

    There are different ways to increase PR of your site. From examples illustrated

    above we can come up with some ideas to increase PR of web site. Some standard ideas

    are given below:

    5.1 Add spam pages:

    As shown in figure, we can increase total PR of website by adding more pages into

    site. No matter, what are the contents of these pages? Total PR of site increases, but we

    cant increase average PR of site (i.e. PR per page). Upper limit on average PR of site is

    1.

    Fig.4.Analysis of Page Rank [1]

    5.2 Join forums:

    A331.0

    B281.6

    Spam 1

    0.39

    Spam 2

    0.39

    Spam

    1000

    0.39

  • 8/10/2019 Google PageRanking Report

    14/15 14

    Forums are a great way to achieve links to your website. In most forums you are

    allowed to have a signature and in your signature you can put a link to your website. But

    another important note to look on is making sure the forum is somewhat related to your

    website. You will still get credit if it's not, but if it's related to your website then you will

    be accomplishing two tasks at once. From this people will come to know about your site

    and this will help to increase popularity of your site.

    5.3 Submit to search engine directories.

    Search engine directories are a good way to get a free link to your website. They

    also increase your chances at being listed higher on popular search engines like Google,

    and others. Most search engine directories allow you to submit to their website for free.

    This will allow you to increase your web presence by being listed on another search

    engine, and it will also be a free link. Remember the more links you have the higher your

    PR will be.

    5.4 Reciprocating Links:

    You can search for sites related to same topic as that of your website. If PR of that

    site is higher than your site, then you can pay that site and can create reciprocating links

    among two. There are link-farming sites, which exchanges links with other sites. Actually

    this is illegal. Google is banning the sites participating in link farming. Thus be aware oflink farming!!

    5.5 Contents:

    Last and the most important way to increase PR of your site is to keep solid

    contents on your sites. Other sites will automatically link your site if you are having good

    contents. Really there is no substitute for good contents!!

    Conclusion

  • 8/10/2019 Google PageRanking Report

    15/15

    Even though formula for calculating PageRank seems to be difficult, it is easy to

    understand. But when a simple calculation is applied hundreds of times over the results

    can seem complicated. And we cannot predict the result of these iterations. Surely, more

    practice can yield more observations.

    PageRank is important factor considered in Google ranking, but it is only one ofthe important factors considered. E.g. now a days Google is paying a lot of attention to

    the links anchor text while deciding relevancy of target page.

    But as PageRank is also one of the important factor, one should be well aware of

    PageRank while designing the website.

    References

    Fig. Analysis of Page Rank [1]--- Wikipedia, http://pr.efactory.de/

    Bing Liu Web Data Mining Springer International Edition

    IEEE Conference Paper Research on PageRank and Hyperlink Induced Topic

    Search in Web Structure Mining

    Website : Google, Wikipedia, http://pr.efactory.de/

    www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html