Lecture 4
CS492 Special Topics in Computer Science
Distributed Algorithms and Systems
“The PageRank Citation Ranking: Bringing Order to the Web”
L. Page, S. Brin, R. Motwani, T. Winograd, 1998
Fall 2008 CS492
Origin of “Google”
  Googol: 10^100
Motivation behind
  Human maintained indices such as Yahoo!
  Explosive growth
[Figure: growth of the Web; series: Hostnames, Active; source: http://news.netcraft.com]
Design Goals of Google
  Improved search quality
    In 1997, only 1 of the top 4 search engines could find itself (return its own page for a query on its own name)
    High precision in finding relevant documents was necessary
  Academic search engine research
    Search engine technology went commercial: a black art
    To build systems that a good number of people could use
    To build an architecture to support novel research on large-scale Web data
Weakness of Existing Approaches
  Calculate similarities
    Based on flat, vector-space model of each page
    Prone to cheating (Web spamming or search engine persuasion)
Basic Idea of PageRank
  Exploit the topological structure of hypertextual systems
Simple Example
[Figure: link graph over three pages A, B, and C, with rank values 0.2, 0.4, and 0.4]
Related Work
  Academic citation analysis
    Similarities
      Graph structure: paper = node, web page = node; citation = link, URL = link
      “Node” authority independent of “node” content
    Differences
      Uniform unit of information (paper) vs. great variability in quality, usage, citations, and length
      Equal link weight vs. variable importance
        A backlink from Yahoo! vs. from a friend
Which Page Should Be Ranked Higher?
[Figure: two pages, A and B; label: John Doe]
Simple Expression
  $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$
    $R(u)$: page rank of $u$
    $B_u$: set of pages pointing at $u$
    $N_v$: out-degree of $v$
  Question: role of $c$?
  Answer: it keeps the total rank of all web pages constant
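A minimal sketch of one update step of the simple expression may make the role of c concrete: each page splits its rank evenly over its out-links, and c is then chosen so that the total rank stays unchanged. The function name `simple_rank_step` and the three-page edge set below are assumptions made for illustration; the slides do not give the edges of their example graph.

```python
def simple_rank_step(out_links, rank):
    """Apply R(u) = c * sum over v in B_u of R(v) / N_v once.

    out_links maps each page to the list of pages it links to;
    every linked page must also appear as a key.
    """
    new_rank = {u: 0.0 for u in out_links}
    for v, targets in out_links.items():
        for u in targets:
            # Page v passes an equal share of its rank along each out-link.
            new_rank[u] += rank[v] / len(targets)
    # Choose c so that the total rank is the same before and after the step.
    c = sum(rank.values()) / sum(new_rank.values())
    return {u: c * r for u, r in new_rank.items()}


# Hypothetical three-page graph (assumed, not taken from the slides).
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = {"A": 0.4, "B": 0.2, "C": 0.4}
next_rank = simple_rank_step(out_links, rank)
```

For this assumed graph the vector (0.4, 0.2, 0.4) is a fixed point, so `next_rank` equals `rank` up to rounding.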
Dangling links
  Pages without outgoing pointers
  Example: pages not yet downloaded
  Do not affect the calculation much
  Remove them, calculate ranks, and add them back
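The removal step can be sketched as follows. This is only an illustration under stated assumptions: the function name `remove_dangling` is hypothetical, the graph is a dictionary of out-link lists, and the sketch covers only the "remove them" part, not the later step of adding the pages back. Removing one dangling page can leave another page without out-links, so the loop repeats until none remain.

```python
def remove_dangling(out_links):
    """Repeatedly drop pages with no outgoing links.

    Returns the pruned graph and the removed pages in removal order.
    """
    graph = {u: set(vs) for u, vs in out_links.items()}
    removed = []
    while True:
        dangling = [u for u, vs in graph.items() if not vs]
        if not dangling:
            return graph, removed
        for u in dangling:
            del graph[u]
            removed.append(u)
        # Links that pointed at removed pages disappear as well.
        for vs in graph.values():
            vs.difference_update(dangling)


# Hypothetical graph: D has no out-links (for example, not yet downloaded).
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"], "D": []}
pruned, removed = remove_dangling(out_links)
```

Ranks would then be computed on `pruned`, with the pages in `removed` reattached afterwards.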
Loop
[Figure: pages A, B, and C linked in a loop]
  Question: ranks of A, B, and C?
  Answer: infinite! (rank sink)
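A small experiment can illustrate the rank sink. The sketch below assumes a hypothetical five-page graph: pages S and T link to each other, T also links into the loop A, B, C, and nothing links back out of the loop. Total rank is held at 1 here, so the sink shows up as the loop absorbing essentially all of it rather than as literally unbounded values.

```python
def step(out_links, rank):
    """One undamped update: each page splits its rank over its out-links."""
    new_rank = {u: 0.0 for u in out_links}
    for v, targets in out_links.items():
        for u in targets:
            new_rank[u] += rank[v] / len(targets)
    return new_rank


# Hypothetical graph: S <-> T, T -> A, and the loop A -> B -> C -> A.
out_links = {"S": ["T"], "T": ["S", "A"], "A": ["B"], "B": ["C"], "C": ["A"]}
rank = {u: 1.0 / len(out_links) for u in out_links}
for _ in range(100):
    rank = step(out_links, rank)

# Share of the total rank trapped inside the loop.
loop_share = rank["A"] + rank["B"] + rank["C"]
```

After 100 steps `loop_share` is very close to 1, while S and T are left with almost nothing.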
Basic Algorithm
  $R(u) = (1 - d) + d \sum_{v \in B_u} \frac{R(v)}{N_v}$
    $R(u)$: page rank of $u$
    $B_u$: set of pages pointing at $u$
    $N_v$: out-degree of $v$
    $d$: damping factor
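As a rough sketch, the damped computation can be iterated until the ranks stop changing. The code assumes the update takes the form given in [BP98], R(u) = (1 - d) + d * sum over v in B_u of R(v) / N_v. The function name `damped_pagerank`, the value d = 0.85, the tolerance, and the three-page graph are all illustrative assumptions, not values from the slides.

```python
def damped_pagerank(out_links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate R(u) = (1 - d) + d * sum_{v in B_u} R(v) / N_v to a fixed point."""
    rank = {u: 1.0 for u in out_links}
    for _ in range(max_iter):
        # Every page starts each round with the constant (1 - d) term.
        new_rank = {u: 1.0 - d for u in out_links}
        for v, targets in out_links.items():
            for u in targets:
                new_rank[u] += d * rank[v] / len(targets)
        change = sum(abs(new_rank[u] - rank[u]) for u in out_links)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical three-page graph with no dangling pages.
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = damped_pagerank(out_links)
```

With this form of the update and no dangling pages, the ranks sum to the number of pages rather than to 1.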
Matrix Representation
[Equation: matrix form of the rank computation; symbols not preserved in this transcript]
  Question: Where to start?
Iterative Algorithm
[Equation: iterative update; symbols not preserved in this transcript]
  Question: Will it converge?
Example
[LM04]
Turn the Problem into a Markov Process
[LM04]
Evenly Split Rank of Dangling Links
[LM04]
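One way to read "evenly split" is sketched below: the row of a dangling page in the transition matrix is replaced by a uniform row, so its rank is spread equally over all pages and every row sums to 1. The function name `row_stochastic`, the list-of-lists matrix layout, and the example graph are assumptions for illustration.

```python
def row_stochastic(out_links):
    """Build a row-stochastic transition matrix from a link graph.

    Row i holds the probabilities of moving from page i to each page.
    A page with no out-links gets a uniform row.
    """
    pages = sorted(out_links)
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    matrix = []
    for p in pages:
        targets = out_links[p]
        if targets:
            row = [0.0] * n
            for t in targets:
                row[index[t]] += 1.0 / len(targets)
        else:
            # Dangling page: split its rank evenly over all n pages.
            row = [1.0 / n] * n
        matrix.append(row)
    return pages, matrix


# Hypothetical graph in which B is a dangling page.
out_links = {"A": ["B", "C"], "B": [], "C": ["A"]}
pages, P = row_stochastic(out_links)
```

Here the row for B becomes (1/3, 1/3, 1/3), while the rows for A and C keep their original link structure.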
Final Solution
  Eigenvector of P = steady state rank
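The steady state can be approximated by repeatedly multiplying a start vector by P (power iteration). The sketch below assumes P is row-stochastic and that the underlying chain is irreducible and aperiodic, so the iteration settles on a single vector. The function name `stationary_vector` and the 3x3 matrix are hypothetical.

```python
def stationary_vector(P, tol=1e-12, max_iter=10000):
    """Approximate pi with pi = pi * P, starting from the uniform vector."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        # Row vector times matrix: nxt[j] = sum_i pi[i] * P[i][j].
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical row-stochastic matrix over pages A, B, C.
P = [
    [0.0, 0.5, 0.5],
    [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0],
    [1.0, 0.0, 0.0],
]
pi = stationary_vector(P)
```

For this matrix the iteration approaches (0.4, 0.3, 0.3), which satisfies pi = pi * P and sums to 1.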
Spam Rank
[BGS05]
Questions
  Where to start?
    Find a nondegenerate start vector
  What if two pages point to each other and to no one else, and a third page points to one of them?
    Role of the damping factor: it guarantees there is no rank sink
References
[BP98] Sergey Brin, Lawrence Page, “The anatomy of a large-scale hypertextual Web search engine,” Computer Networks and ISDN Systems, Vol. 30, 1998.
[BGS05] Monica Bianchini, Marco Gori, Franco Scarselli, “Inside PageRank,” ACM Transactions on Internet Technology, Vol. 5, No. 1, Feb. 2005.
[LM04] Amy N. Langville, Carl Meyer, “Deeper inside PageRank,” Internet Mathematics, Vol. 1, No. 3, 2004.
[K99] Jon Kleinberg, “Authoritative sources in a hyperlinked environment,” Journal of the ACM, Vol. 46, No. 5, 1999.