Understanding Google’s PageRank™ 1. Review: The Search Engine 2.


Understanding Google’s PageRank™


Review: The Search Engine


Goals and assumptions

Even after the Boolean and content-index operations, the query module still returns excessively large result sets.

We still don’t know which pages should be ranked highest.

Assume the pages with the most in-bound links are the best; a link is a vote.
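As a rough illustration of the "a link is a vote" assumption, the sketch below simply counts in-bound links. The `out_links` dictionary is an assumption: it encodes the six-page example graph whose link matrix appears later in this deck. The function name `count_votes` is hypothetical.

```python
from collections import Counter

# Assumed example: out-links of the six-page graph used later in this deck.
out_links = {
    1: [2, 3],
    2: [],          # page 2 links to nothing
    3: [1, 2, 5],
    4: [5, 6],
    5: [4, 6],
    6: [4],
}

def count_votes(out_links):
    """Count in-bound links (votes) for every page."""
    votes = Counter({page: 0 for page in out_links})
    for targets in out_links.values():
        votes.update(targets)   # each out-link is one vote for its target
    return dict(votes)

votes = count_votes(out_links)
for page, n in sorted(votes.items()):
    print(f"P{page}: {n} vote(s)")
```

Raw counts leave pages 2, 4, 5 and 6 tied at two votes each, a tie that plain counting cannot break.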


An Elegant Formula

$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

Google’s (Brin & Page) PageRank™ equation.

US Patent #6285999, filed 1998, granted 2001

Solving this formula has been called the world’s largest matrix computation.


$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

Derived from a formula B&P worked out in graduate school (itself derived from the traditional bibliometrics research literature).

$$r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|}$$

Essential characteristic: high-ranking pages associate with high-ranking pages.


$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

Must be applied to a set of linked pages, or a graph. To do this we analyze the graph to see its out-links and back-links.

Therefore. . .

$$r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|}$$

$r(P_i)$: the rank of a given page $P_i$

$B_{P_i}$: the set of pages that link back to $P_i$

$r(P_j)$: the rank of a back-linking page $P_j$

$|P_j|$: the number of out-links on page $P_j$
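To make the summation concrete, here is a minimal sketch that evaluates it for a single page. The graph, the starting ranks of 1/6, and the helper names `back_links` and `rank_of` are all assumptions made for illustration; the graph is read off the link matrix shown later in the deck.

```python
from fractions import Fraction

# Assumed example graph: page -> pages it links to.
out_links = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}

def back_links(page, out_links):
    """B_{P_i}: the pages that link to `page`."""
    return [src for src, targets in out_links.items() if page in targets]

def rank_of(page, ranks, out_links):
    """r(P_i) = sum over P_j in B_{P_i} of r(P_j) / |P_j|."""
    return sum(
        (ranks[src] / len(out_links[src]) for src in back_links(page, out_links)),
        Fraction(0),
    )

# Assumed starting ranks: every page holds 1/6.
ranks = {page: Fraction(1, 6) for page in out_links}

# Page 2 is linked from page 1 (2 out-links) and page 3 (3 out-links).
r_p2 = rank_of(2, ranks, out_links)
print(r_p2)
```

Each back-linking page splits its own rank evenly across its out-links, so a vote from a page with few out-links counts for more.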


$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

A site diagram like this:

[Figure: site diagram of six pages, numbered 1 to 6]


$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

becomes a directed graph like this:

[Figure: directed graph with six nodes, numbered 1 to 6]


But there’s a problem

Nothing’s ranked! The formula needs the rank of every back-linking page, but no page has a rank yet.

$$r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|}$$

$r(P_i)$: the rank of a given page $P_i$

$B_{P_i}$: the set of pages that link back to $P_i$

$r(P_j)$: the rank of a back-linking page $P_j$

$|P_j|$: the number of out-links on page $P_j$

[Figure: directed graph with six nodes, numbered 1 to 6]


The solution. . . sort of

Start by assuming all the ranks are equal. In this example each page is just 1 of 6, so the initial rank is 1/6.

Then keep feeding the ranks back through the formula until you get a ranking.

This produces a ranking, but you have to calculate the ranks one page at a time. That’s slow.


Directed graph iterative node values

Page   r0    r1     r2      Rank at iteration 2
P1     1/6   1/18   1/36    5
P2     1/6   5/36   1/18    4
P3     1/6   1/12   1/36    5
P4     1/6   1/4    17/72   1
P5     1/6   5/36   11/72   3
P6     1/6   1/6    14/72   2
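The table can be reproduced with a short page-by-page iteration. This is a sketch, not the deck's own code: the `out_links` dictionary is an assumption read off the link matrix shown later, and `iterate` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumed example graph: page -> pages it links to.
out_links = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}

def iterate(ranks, out_links):
    """One pass: every page hands an equal share of its rank to each out-link."""
    new_ranks = {page: Fraction(0) for page in out_links}
    for src, targets in out_links.items():
        for dst in targets:
            new_ranks[dst] += ranks[src] / len(targets)
    return new_ranks

r0 = {page: Fraction(1, 6) for page in out_links}
r1 = iterate(r0, out_links)
r2 = iterate(r1, out_links)

for page in out_links:
    print(f"P{page}  {r0[page]}  {r1[page]}  {r2[page]}")
```

`Fraction` prints values in lowest terms, so the table's 14/72 shows up as 7/36.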


CMS matrix

This can’t go on forever

In the interest of speed and efficiency, we need to know whether the ranks converge. That is, do the iterations settle into clear rankings, or could we keep doing this indefinitely and never reach a decisive ranking?

To determine this, the formula must be transformed using a binary adjacency transformation and Markov chain theory.
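Before any transformation, one informal way to probe the convergence question is to iterate and watch two numbers: how much the ranks change per pass, and whether they still sum to 1. The sketch below does that with floats. The graph is the assumed six-page example, and the tolerance and iteration cap are arbitrary choices.

```python
# Assumed example graph: page -> pages it links to.
out_links = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}

def iterate(ranks, out_links):
    """One pass of the summation formula over every page."""
    new_ranks = {page: 0.0 for page in out_links}
    for src, targets in out_links.items():
        for dst in targets:
            new_ranks[dst] += ranks[src] / len(targets)
    return new_ranks

ranks = {page: 1 / 6 for page in out_links}
history = []                       # (iteration, total change, sum of ranks)
for k in range(1, 101):            # iteration cap: an arbitrary assumption
    new_ranks = iterate(ranks, out_links)
    change = sum(abs(new_ranks[p] - ranks[p]) for p in out_links)
    history.append((k, change, sum(new_ranks.values())))
    ranks = new_ranks
    if change < 1e-8:              # tolerance: an arbitrary assumption
        break

for k, change, total in history[:3]:
    print(f"iteration {k}: change={change:.4f}, sum of ranks={total:.4f}")
```

On this graph the ranks sum to 5/6 after the first pass rather than 1: page 2 has no out-links, so the rank it holds is never passed on.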


Convert the iterative calculation to a matrix calculation using a binary adjacency transformation for a 1 × n matrix.

      P1    P2    P3    P4    P5    P6
P1    0     1/2   1/2   0     0     0
P2    0     0     0     0     0     0
P3    1/3   1/3   0     0     1/3   0
P4    0     0     0     0     1/2   1/2
P5    0     0     0     1/2   0     1/2
P6    0     0     0     1     0     0
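One way to read this step: start from a binary adjacency matrix (1 where a link exists, 0 elsewhere), divide each row by its number of out-links, and then a whole iteration becomes a single vector-matrix product. The sketch below assumes that reading; `row_normalize` and `vec_mat` are hypothetical helper names.

```python
from fractions import Fraction

# Binary adjacency matrix: A[i][j] = 1 if page i+1 links to page j+1.
A = [
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 0],
]

def row_normalize(A):
    """Divide each row by its out-link count; all-zero rows stay zero."""
    H = []
    for row in A:
        out_degree = sum(row)
        H.append([Fraction(x, out_degree) if out_degree else Fraction(0) for x in row])
    return H

def vec_mat(v, M):
    """Row vector times matrix: (v M)_j = sum_i v_i * M[i][j]."""
    n = len(M)
    return [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]

H = row_normalize(A)
r0 = [Fraction(1, 6)] * 6
r1 = vec_mat(r0, H)      # one full iteration as a single product
print([str(x) for x in r1])
```

The product gives the same values as the r1 column of the iteration table, computed for all pages at once.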


Now you can treat a row as a vector, or set of values.

[The same 6 × 6 matrix, shown alongside the vector π]


This is a sparse matrix. That’s good. We store only the non-zero elements and represent the entire matrix as H.

[The same 6 × 6 matrix]
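A minimal sketch of what storing only the non-zero elements can look like, using a dictionary of dictionaries. This layout is an assumption chosen for readability, not a claim about how any production system stores H.

```python
from fractions import Fraction

# Sparse form of H: only non-zero entries, keyed by row page then column page.
# Page 2 has no out-links, so it has no row at all.
H_sparse = {
    1: {2: Fraction(1, 2), 3: Fraction(1, 2)},
    3: {1: Fraction(1, 3), 2: Fraction(1, 3), 5: Fraction(1, 3)},
    4: {5: Fraction(1, 2), 6: Fraction(1, 2)},
    5: {4: Fraction(1, 2), 6: Fraction(1, 2)},
    6: {4: Fraction(1)},
}

def sparse_vec_mat(v, H_sparse):
    """Compute v H, touching only the stored non-zero entries."""
    result = {page: Fraction(0) for page in v}
    for i, row in H_sparse.items():
        for j, h_ij in row.items():
            result[j] += v[i] * h_ij
    return result

stored = sum(len(row) for row in H_sparse.values())
v = {page: Fraction(1, 6) for page in range(1, 7)}
next_v = sparse_vec_mat(v, H_sparse)
print(f"stored entries: {stored} of 36")
```

The work per iteration scales with the number of links rather than with the square of the number of pages, which is why sparsity is good news.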


$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

So now this:

$$r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|}$$

has become this:

$$\pi^T = \pi^T H$$

through the transformation and reduction of the form (power method transform, eigenvector computation, stochasticity adjustment).

We only need a couple more adjustments.


$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

Sometimes, people teleport to a page. They just enter the URL and go. And just as easily, they can teleport out. To account for this, B&P added two adjustments:

S accounts for people who reach a dead end and jump to another page within a site.

$\alpha$ is a weighted probability that someone will leave.

S is a matrix of probable page destinations.
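The slide does not spell out how S is built. One common construction, assumed here, is to take H and replace every all-zero row (a dead-end page) with a uniform row of 1/n, so that a visitor stuck on a dead end jumps to any page with equal probability. `make_stochastic` is a hypothetical helper name.

```python
from fractions import Fraction

half, third = Fraction(1, 2), Fraction(1, 3)

# H from the deck; row 2 (page 2, a dead end) is all zeros.
H = [
    [0, half, half, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [third, third, 0, 0, third, 0],
    [0, 0, 0, 0, half, half],
    [0, 0, 0, half, 0, half],
    [0, 0, 0, 1, 0, 0],
]

def make_stochastic(H):
    """Replace each all-zero row with a uniform row so every row sums to 1."""
    n = len(H)
    uniform = [Fraction(1, n)] * n
    return [list(row) if any(row) else list(uniform) for row in H]

S = make_stochastic(H)
for row in S:
    print([str(x) for x in row], "sum =", sum(row))
```

After this adjustment every row sums to 1, so no rank is lost when an iteration passes through a dead-end page.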


$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

What about people who jump out to a completely new destination? To account for this, B&P added the final adjustments:

$1-\alpha$ is the inverted weighted probability that someone will leave and go to a completely new site.

E is a random teleportation matrix of probable page destinations.
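Putting the pieces together, the sketch below forms αS + (1 − α)E and repeatedly multiplies a rank vector by it until the values stop changing. Several things here are assumptions rather than facts from the slides: E is taken to be the uniform matrix with every entry 1/n, α is set to 0.85 (a commonly quoted value), S uses a uniform row for the dead-end page 2, and the tolerance and iteration cap are arbitrary.

```python
def google_matrix(S, alpha):
    """G = alpha * S + (1 - alpha) * E, with E assumed uniform (every entry 1/n)."""
    n = len(S)
    return [[alpha * S[i][j] + (1 - alpha) / n for j in range(n)] for i in range(n)]

def power_method(G, tol=1e-10, max_iter=1000):
    """Repeat pi <- pi G from a uniform start until the change drops below tol."""
    n = len(G)
    pi = [1 / n] * n
    for _ in range(max_iter):
        new_pi = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    return pi

# S for the six-page example: H with page 2's zero row replaced by 1/6 (assumed).
S = [
    [0, 1 / 2, 1 / 2, 0, 0, 0],
    [1 / 6] * 6,
    [1 / 3, 1 / 3, 0, 0, 1 / 3, 0],
    [0, 0, 0, 0, 1 / 2, 1 / 2],
    [0, 0, 0, 1 / 2, 0, 1 / 2],
    [0, 0, 0, 1, 0, 0],
]

alpha = 0.85   # assumed value, not taken from the slides
pi = power_method(google_matrix(S, alpha))
ranking = sorted(range(1, 7), key=lambda p: pi[p - 1], reverse=True)

for page in ranking:
    print(f"P{page}: {pi[page - 1]:.4f}")
```

With these assumed values page 4 comes out on top and the ranks sum to 1, so the iteration settles on a decisive ranking.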


Summary

$\pi^T = \pi^T(\alpha S + (1-\alpha)E)$

A page’s rank is equal to the summed and transformed ranks of the referring pages, tempered by the ability (a weighted probability) to teleport within a site, plus the inverted probability to teleport out of a site, multiplied by the probability matrix of teleporting to a popular site.
