The Mathematics Behind Google's PageRank
Transcript of The Mathematics Behind Google's PageRank
![Page 1: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/1.jpg)
The MathematicsBehind Google’s PageRank
Ilse Ipsen
Department of Mathematics
North Carolina State University
Raleigh, USA
Joint work with Rebecca Wills
Man – p.1
![Page 2: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/2.jpg)
Two Factors
Determine where Google displays a web page onthe Search Engine Results Page:
1. PageRank (links)
A page has high PageRankif many pages with high PageRank link to it
2. Hypertext Analysis (page contents)
Text, fonts, subdivisions, location of words,contents of neighbouring pages
Man – p.2
![Page 3: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/3.jpg)
PageRank
An objective measure of the citation importance ofa web page [Brin & Page 1998]
• Assigns a rank to every web page• Influences the order in which Google displays
search results• Based on link structure of the web graph• Does not depend on contents of web pages• Does not depend on query
Man – p.3
![Page 4: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/4.jpg)
More PageRank More Visitors
Man – p.4
![Page 5: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/5.jpg)
PageRank
. . . continues to provide the basis for all of our websearch tools http://www.google.com/technology/
• “Links are the currency of the web”• Exchanging & buying of links• BO (backlink obsession)• Search engine optimization
Man – p.5
![Page 6: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/6.jpg)
Overview
• Mathematical Model of Internet• Computation of PageRank• Is the Ranking Correct?• Floating Point Arithmetic Issues
Man – p.6
![Page 7: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/7.jpg)
Mathematical Model of Internet
1. Represent internet as graph
2. Represent graph as stochastic matrix
3. Make stochastic matrix more convenient=⇒ Google matrix
4. dominant eigenvector of Google matrix=⇒ PageRank
Man – p.7
![Page 8: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/8.jpg)
The Internet as a Graph
Link from one web page to another web page
Web graph:
Web pages = nodesLinks = edges
Man – p.8
![Page 9: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/9.jpg)
The Internet as a Graph
Man – p.9
![Page 10: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/10.jpg)
The Web Graph as a Matrix
3
2
55
4
1
S =
0 12
0 12
0
0 0 13
13
13
0 0 0 1 0
0 0 0 0 1
1 0 0 0 0
Links = nonzero elements in matrix
Man – p.10
![Page 11: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/11.jpg)
Properties of Matrix S
• Row i of S: Links from page i to other pages• Column i of S: Links into page i
• S is a stochastic matrix:All elements in [0, 1]Elements in each row sum to 1
• Dominant left eigenvector:
ωT S = ωT ω ≥ 0 ‖ω‖1 = 1
• ωi is probability of visiting page i
• But: ω not uniqueMan – p.11
![Page 12: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/12.jpg)
Google Matrix
Convex combination
G = αS + (1 − α)11vT
︸ ︷︷ ︸
rank 1
• Stochastic matrix S
• Damping factor 0 ≤ α < 1e.g. α = .85
• Column vector of all ones 11• Personalization vector v ≥ 0 ‖v‖1 = 1
Models teleportation
Man – p.12
![Page 13: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/13.jpg)
PageRank
G = αS + (1 − α)11vT
• G is stochastic, with eigenvalues:
1 > α|λ2(S)| ≥ α|λ3(S)| ≥ . . .
• Unique dominant left eigenvector:
πT G = πT π ≥ 0 ‖π‖1 = 1
• πi is PageRank of web page i
[Haveliwala & Kamvar 2003, Eldén 2003,
Serra-Capizzano 2005] Man – p.13
![Page 14: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/14.jpg)
How Google Ranks Web Pages
• Model:Internet → web graph → stochastic matrix G
• Computation:PageRank π is eigenvector of G
πi is PageRank of page i
• Display:If πi > πk thenpage i may∗ be displayed before page k
∗ depending on hypertext analysis
Man – p.14
![Page 15: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/15.jpg)
Facts
• The anatomy of a large-scale hypertextual websearch engine [Brin & Page 1998]
• US patent for PageRank granted in 2001• Google indexes 10’s of billions of web pages
(1 billion = 109)• Google serves ≥ 200 million queries per day• Each query processed by ≥ 1000 machines• All search engines combined process more
than 500 million queries per day
[Desikan, 26 October 2006] Man – p.15
![Page 16: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/16.jpg)
Computation of PageRank
The world’s largest matrix computation[Moler 2002]
• Eigenvector• Matrix dimension is 10’s of billions• The matrix changes often
250,000 new domain names every day• Fortunately: Matrix is sparse
Man – p.16
![Page 17: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/17.jpg)
Power Method
Want: π such that πT G = πT
Power method:
Pick an initial guess x(0)
Repeat[x(k+1)]T := [x(k)]T G
until “termination criterion satisfied”
Each iteration is a matrix vector multiply
Man – p.17
![Page 18: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/18.jpg)
Matrix Vector Multiply
xT G = xT[αS + (1 − α)11vT
]
Cost: # non-zero elements in S
A power method iteration is cheapMan – p.18
![Page 19: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/19.jpg)
Error Reduction in 1 Iteration
πT G = πT G = αS + (1 − α)11vT
[x(k+1) − π]T = [x(k)]T G − πT G
= α[x(k) − π]T S
Error: ‖x(k+1) − π‖1︸ ︷︷ ︸
iteration k+1
≤ α ‖x(k) − π‖1︸ ︷︷ ︸
iteration k
Man – p.19
![Page 20: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/20.jpg)
Error in Power Method
πT G = πT G = αS + (1 − α)11vT
Error after k iterations:
‖x(k) − π‖1 ≤ αk ‖x(0) − π‖1︸ ︷︷ ︸
≤2
[Bianchini, Gori & Scarselli 2003]
Error bound does not depend on matrix dimension
Man – p.20
![Page 21: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/21.jpg)
Advantages of Power Method
• Simple implementation (few decisions)• Cheap iterations (sparse matvec)• Minimal storage (a few vectors)• Robust convergence behaviour• Convergence rate independent of matrix
dimension• Numerically reliable and accurate
(no subtractions, no overflow)
But: can be slowMan – p.21
![Page 22: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/22.jpg)
PageRank Computation
• Power methodPage, Brin, Motwani & Winograd 1999Bianchini, Gori & Scarselli 2003
• Acceleration of power methodKamvar, Haveliwala, Manning & Golub 2003Haveliwala, Kamvar, Klein, Manning & Golub 2003Brezinski & Redivo-Zaglia 2004, 2006Brezinski, Redivo-Zaglia & Serra-Capizzano 2005
• Aggregation/DisaggregationLangville & Meyer 2002, 2003, 2006Ipsen & Kirkland 2006
Man – p.22
![Page 23: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/23.jpg)
PageRank Computation
• Methods that adapt to web graphBroder, Lempel, Maghoul & Pedersen 2004 Kamvar,Haveliwala & Golub 2004Haveliwala, Kamvar, Manning & Golub 2003Lee, Golub & Zenios 2003Lu, Zhang, Xi, Chen, Liu, Lyu & Ma 2004Ipsen & Selee 2006
• Krylov methodsGolub & Greif 2004Del Corso, Gullí, Romani 2006
Man – p.23
![Page 24: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/24.jpg)
PageRank Computation
• Schwarz & asynchronous methodsBru, Pedroche & Szyld 2005Kollias, Gallopoulos & Szyld 2006
• Linear system solutionArasu, Novak, Tomkins & Tomlin 2002Arasu, Novak & Tomkins 2003Bianchini, Gori & Scarselli 2003Gleich, Zukov & Berkin 2004Del Corso, Gullí & Romani 2004Langville & Meyer 2006
Man – p.24
![Page 25: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/25.jpg)
PageRank Computation
• Surveys of numerical methods:Langville & Meyer 2004Berkhin 2005Langville & Meyer 2006 (book)
Man – p.25
![Page 26: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/26.jpg)
Is the Ranking Correct?
πT =(.23 .24 .26 .27
)
• xT =(.27 .26 .24 .23
)
‖x − π‖∞ = .04
Small error, but incorrect ranking
• yT =(0 .001 .002 .997
)
‖y − π‖∞ = .727
Large error, but correct ranking
Man – p.26
![Page 27: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/27.jpg)
What is Important?
Numerical value ↔ ordinal rank
ordinal rank:position of an element in an ordered list
Very little research on ordinal rankingRank-stability, rank-similarity
[Lempel & Moran, 2005][Borodin, Roberts, Rosenthal & Tsaparas 2005]
Man – p.27
![Page 28: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/28.jpg)
Ordinal Ranking
Largest element gets Orank 1
πT =(.23 .24 .26 .27
)
Orank(π4) = 1, Orank(π1) = 4
• xT =(.27 .26 .24 .23
)
Orank(x1) = 1 6= Orank(π1) = 4
• yT =(0 .001 .002 .997
)
Orank(y4) = 1 = Orank(π4)
Man – p.28
![Page 29: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/29.jpg)
Problems with Ordinal Ranking
When done with power method:• Popular termination criteria
do not guarantee correct ranking• Additional iterations can destroy ranking• Rank convergence depends on:
α, v, initial guess, matrix dimension,structure of web graph
• Even if successive iterates have the sameranking, their ranking may not be correct
[Wills & Ipsen 2007]Man – p.29
![Page 30: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/30.jpg)
Ordinal Ranking Criterion
Given:
Approximation x, x ≥ 0Error bound β ≥ ‖x − π‖1
Criterion: xi > xj + β =⇒ πi > πj
Why?
(xi − πi) − (xj − πj) ≤ ‖x − π‖1 ≤ β
xi − (xj + β) ≤ πi − πj
0 < xi − (xj + β) =⇒ 0 < πi − πj
[Kirkland 2006]Man – p.30
![Page 31: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/31.jpg)
Applicability of Criterion
20 40 60 80 100 120 140 160 1800
10
20
30
40
50
60
70
80
90
100
Iteration Number
Perc
enta
ge o
f E
lem
ents
B1 FP Bound AppliesElement Matches Successor
Man – p.31
![Page 32: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/32.jpg)
Properties of Ranking Criterion
• Applies to any approximation,provided error bound is available
• Requires well-separated elements• Tends to identify ranks of larger elements• Determines partial ranking• Top-k, bucket and exact ranking• Easy to use with power method
Man – p.32
![Page 33: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/33.jpg)
Top-k Ranking
Given:
Approximation x to PageRank πPermutation P so that x̃ = Px with
x̃1 ≥ . . . ≥ x̃n
Write: π̃ = Pπ
Suppose x̃k > x̃k+1 + β
Man – p.33
![Page 34: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/34.jpg)
Top-k Ranking
x̃1 Orank(π̃1) ≤ k... ...
x̃k Orank(π̃k) ≤ k
↑β
↓x̃k+1 Orank(π̃k+1) ≥ k + 1
... ...x̃n Orank(π̃n) ≥ k + 1
Man – p.34
![Page 35: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/35.jpg)
Exact Ranking
Given:
Approximation x to PageRank πPermutation P so that x̃ = Px with
x̃1 ≥ . . . ≥ x̃n
Write: π̃ = Pπ
If x̃k−1 > x̃k + β and x̃k > x̃k+1 + β then
Orank(π̃k) = k
Man – p.35
![Page 36: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/36.jpg)
Exact Ranking
x̃1...
x̃k−1
l β Orank(π̃k) ≥ k
x̃k
l β Orank(π̃k) ≤ k
x̃k+1...
x̃n
Man – p.36
![Page 37: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/37.jpg)
Bucket Ranking
Given:
Approximation x to PageRank πPermutation P so that x̃ = Px with
x̃1 ≥ . . . ≥ x̃n
Write: π̃ = Pπ
Suppose x̃k−i > x̃k + β and x̃k > x̃k+j + β
π̃k is in bucket of width i + j − 1
Man – p.37
![Page 38: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/38.jpg)
Bucket Ranking
x̃1...
x̃k−i
l β Orank(π̃k)≥ k − i + 1
x̃k
l β Orank(π̃k)≤ k + j − 1
x̃k+j...
x̃n
Man – p.38
![Page 39: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/39.jpg)
Experiments
n # buckets 1. bucket last bucket9,914 4,307 1 7%
3,148,440 34,911 1 95%
n exact rank exact top 100 lowest rank9,914 32% 79 9,215
3,148,440 0.76% 100 151,794
Man – p.39
![Page 40: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/40.jpg)
Buckets for Small Matrix
0 500 1000 1500 2000 2500 3000 3500 40000
20
40
60
80
100
120
140
160
180
200
Bucket Number
Num
ber
in E
ach
Buc
ket
Man – p.40
![Page 41: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/41.jpg)
Power Method Ranking
Simple error bound:
‖x(k) − π‖1 ≤ 2αk︸︷︷︸
β
Simple ranking criterion:
If x(k)i > x
(k)j + 2αk then πi > πj
But: 2αk is too pessimistic (not tight enough)
Man – p.41
![Page 42: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/42.jpg)
Power Method Ranking
Tighter error bound:
‖x(k) − π‖1 ≤α
1 − α‖x(k) − x(k−1)‖1
︸ ︷︷ ︸
β
More effective ranking criterion:
If x(k)i > x
(k)j + β then πi > πj
Man – p.42
![Page 43: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/43.jpg)
Floating Point Ranking
If x(k)i > x
(k)j + β then πi > πj
β =α
1 − α‖x(k) − x(k−1)‖1 + e
e is round off error from single matvec
IEEE double precision floating point arithmetic:
e ≈ cm10−16
m ≈ max # links into any web page
Man – p.43
![Page 44: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/44.jpg)
Expensive Implementation
To prevent accumulation of round off
• Explicit normalization of iteratesx(k+1) = x(k+1)/‖x(k+1)‖1
• Compute norms, inner products, matvecswith compensated summation
• Limited by round off error from single matvec
• Analysis for matrix dimensions n < 1014
in IEEE arithmetic (ǫ ≈ 10−16)
Man – p.44
![Page 45: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/45.jpg)
Difficulties for XLARGE Problems
• Catastrophic cancellation when computing
β =α
1 − α‖x(k) − x(k−1)‖1 + e
• Bound β dominated by round off e
• Compensated summation insufficient to reducehigher order round off O(nǫ2)
• Doubly compensated summation tooexpensive: O(n log n) flops
Man – p.45
![Page 46: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/46.jpg)
Possible Remedies
• Lump dangling nodes [Ipsen & Selee 2006]Web pages w/o outlinks:pdf & image files, protected pages, web frontierUp to 50%-80% of all web pages
• Remove unreferenced web pages• Use faster converging method
Then 1 power method iteration for ranking• Relative ranking criteria?
Man – p.46
![Page 47: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/47.jpg)
Summary
• Google orders web pages according to:PageRank and hypertext analysis
• PageRank = left eigenvector of G
G = αS + (1 − α)11vT
• Power method: simple, robust, accurate• Convergence rate depends on α
but not on matrix dimension• Criterion for ordinal ranking• Round off serious for XLarge problems
Man – p.47
![Page 48: The Mathematics Behind Google's PageRank](https://reader033.fdocuments.in/reader033/viewer/2022042707/58a2ee7a1a28ab722c8b9512/html5/thumbnails/48.jpg)
User-Friendly Resources
• Rebecca Wills:Google’s PageRank: The Math Behind theSearch EngineMathematical Intelligencer, 2006
• Amy Langville & Carl Meyer:Google’s PageRank and BeyondThe Science of Search Engine RankingsPrinceton University Press, 2006
• Amy Langville & Carl Meyer:Broadcast of On-Air Interview, November 2006Carl Meyer’s web page
Man – p.48