I/O-Efficient Techniques for Computing Pagerank

I/O-Efficient Techniques for Computing Pagerank Yen-Yu Chen Qingqing Gan Torsten Suel Polytechnic University, Brooklyn NY


Page 1: I/O-Efficient Techniques for Computing Pagerank

I/O-Efficient Techniques for Computing Pagerank

Yen-Yu Chen Qingqing Gan Torsten Suel

Polytechnic University, Brooklyn NY

Page 2: I/O-Efficient Techniques for Computing Pagerank

Web Graph

• URL as a node

• Hyperlink as a directed edge

• The graph structure represents the World Wide Web

Page 3: I/O-Efficient Techniques for Computing Pagerank

Page Rank

• Random surfer model
  – A person who surfs the web by randomly clicking links on visited pages.

• The PageRank of a page is proportional to the frequency with which a random surfer would visit it.

Figure values: R1 = 0.286, R2 = 0.286, R3 = 0.143, R4 = 0.143, R5 = 0.143
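The random surfer model can be sketched as a simple power iteration. The graph below is a hypothetical 3-node example, not the 5-node graph from the slide figure, and the function name `basic_pagerank` is an assumption made for illustration.

```python
def basic_pagerank(out_links, iterations=100):
    """Repeat r(p) = sum over links q -> p of r(q) / d(q), starting from a uniform vector."""
    nodes = sorted(out_links)
    rank = {p: 1.0 / len(nodes) for p in nodes}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in nodes}
        for q in nodes:
            share = rank[q] / len(out_links[q])   # rank of q split evenly over its out-links
            for p in out_links[q]:
                nxt[p] += share
        rank = nxt
    return rank


# Hypothetical graph: every node has at least one out-link, so no rank is lost.
graph = {1: [2, 3], 2: [3], 3: [1]}
ranks = basic_pagerank(graph)
```

For this small graph the iteration settles at roughly 0.4, 0.2 and 0.4 for nodes 1, 2 and 3, which matches the fraction of time a random surfer would spend on each node.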

Page 4: I/O-Efficient Techniques for Computing Pagerank

Practical PageRank

• Two problems:
  – Rank leak
  – Rank sink

• Pruning

• Add back edges

• Random jump

Figure values with d = 0.8: R1 = 0.154, R2 = 0.142, R3 = 0.101, R4 = 0.313, R5 = 0.290
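A minimal sketch of the random jump remedy, assuming that the slide's d = 0.8 is the damping parameter (called `alpha` here) and that the total initial rank is 1. The graph is hypothetical and contains a rank sink (nodes 4 and 5 only link to each other); it has no leak nodes, so pruning or back edges are not needed for it.

```python
def pagerank_random_jump(out_links, alpha=0.8, iterations=200):
    """PageRank where a (1 - alpha) share of the rank is spread uniformly over all pages."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}                 # total initial rank is 1
    for _ in range(iterations):
        nxt = {p: (1.0 - alpha) / n for p in nodes}    # random-jump share
        for q in nodes:
            share = alpha * rank[q] / len(out_links[q])
            for p in out_links[q]:
                nxt[p] += share
        rank = nxt
    return rank


# Hypothetical graph: nodes 4 and 5 form a sink that would absorb all rank without the jump.
graph = {1: [2], 2: [1, 3], 3: [4], 4: [5], 5: [4]}
ranks = pagerank_random_jump(graph)
```

With the jump, every page keeps at least (1 - alpha) / n of the rank, while the sink nodes still end up with the largest values.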

Page 5: I/O-Efficient Techniques for Computing Pagerank

Topic Sensitive PageRank

• Modified Random Jump

• Only jump to certain pages which are related to a specific topic

• ODP-biasing

Topic T
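The modified random jump can be sketched as follows. The set `special` stands for the pages related to the topic. Dividing the jump share by the number of special pages is a normalization chosen for this sketch so that the total rank stays at 1; it is an assumption, not something stated on the slide.

```python
def topic_sensitive_pagerank(out_links, special, alpha=0.8, iterations=200):
    """PageRank where the random jump only lands on pages in `special`."""
    nodes = sorted(out_links)
    # Jump distribution: uniform over the special pages, zero elsewhere (assumed normalization).
    jump = {p: (1.0 / len(special) if p in special else 0.0) for p in nodes}
    rank = dict(jump)
    for _ in range(iterations):
        nxt = {p: (1.0 - alpha) * jump[p] for p in nodes}
        for q in nodes:
            share = alpha * rank[q] / len(out_links[q])
            for p in out_links[q]:
                nxt[p] += share
        rank = nxt
    return rank


# Hypothetical graph and topic: only page 1 is treated as related to the topic.
graph = {1: [2], 2: [1, 3], 3: [4], 4: [5], 5: [4]}
biased = topic_sensitive_pagerank(graph, special={1})
uniform = topic_sensitive_pagerank(graph, special=set(graph))
```

Passing all pages as `special` reduces the sketch to the ordinary random jump, which makes the effect of the bias on page 1 easy to compare.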

Page 6: I/O-Efficient Techniques for Computing Pagerank

Challenge

• 3.5 billion pages on the web

• 49 billion hyperlinks between them

• Storing one 4-byte PageRank value per page requires 14 GB, which is hard to fit in memory

• Goal: compute the PageRank values in an I/O-efficient way
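The memory estimate follows directly from the figures on this slide. The short calculation below reproduces it; the variable names are chosen for illustration.

```python
PAGES = 3_500_000_000        # pages on the web (from the slide)
LINKS = 49_000_000_000       # hyperlinks (from the slide)
BYTES_PER_RANK = 4           # one 32-bit PageRank value per page

rank_vector_bytes = PAGES * BYTES_PER_RANK
rank_vector_gb = rank_vector_bytes / 10**9      # 14.0 (decimal gigabytes)
avg_links_per_page = LINKS / PAGES              # 14.0
```

The rank vector alone takes 14 GB, before any space for the link structure is counted.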

Page 7: I/O-Efficient Techniques for Computing Pagerank

I/O Efficient Algorithms

• Naïve Algorithm

• Haveliwala’s Algorithm

• Our contribution:

– Sort-Merge Algorithm

– Split-Accumulate Algorithm

Page 8: I/O-Efficient Techniques for Computing Pagerank

Related Work

Page 9: I/O-Efficient Techniques for Computing Pagerank

Naïve Algorithm

• Two vectors of 32-bit floating-point numbers.

• The source vector is on disk.

• The destination vector is in memory.

$$C_{naive} = |V| + |V'| + |L| = 2 \cdot |V| + |L|$$
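One iteration of the naive scheme might look like the sketch below. The file layout (a flat file of native 32-bit floats), the in-memory list standing in for the link file, and the use of the random-jump update with `alpha = 0.8` are all assumptions made for illustration.

```python
import os
import struct
import tempfile
from array import array


def naive_iteration(src_path, dst_path, link_records, n, alpha=0.8):
    """Stream the source vector from disk while the destination vector stays in memory."""
    dest = array("f", [(1.0 - alpha) / n] * n)          # 32-bit floats, all in memory
    with open(src_path, "rb") as src:
        for targets in link_records:                    # record k holds the out-links of node k
            (rank,) = struct.unpack("f", src.read(4))   # next 4-byte source value
            for p in targets:
                dest[p] += alpha * rank / len(targets)
    with open(dst_path, "wb") as out:
        dest.tofile(out)                                # becomes the next source vector
    return dest


# Hypothetical 5-node graph; links[k] lists the destinations of node k.
links = [[1], [0, 2], [3], [4], [3]]
n = len(links)
with tempfile.TemporaryDirectory() as workdir:
    paths = [os.path.join(workdir, name) for name in ("rank_a.bin", "rank_b.bin")]
    with open(paths[0], "wb") as f:
        array("f", [1.0 / n] * n).tofile(f)             # uniform starting vector
    for it in range(100):
        ranks = naive_iteration(paths[it % 2], paths[(it + 1) % 2], links, n)
```

Each iteration reads the source vector and the link structure once and writes the destination vector once, which is where the 2·|V| + |L| cost comes from. The scheme only works while the whole destination vector fits in memory.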

Page 10: I/O-Efficient Techniques for Computing Pagerank

Haveliwala’s Algorithm

• Partition the destination vector into d blocks V'_i that each fit into main memory.

• Partition the link file into d files L_i, each containing only the links pointing to nodes in V'_i.

$$C_h = d \cdot |V| + |V'| + \sum_{0 \le i < d} |L_i| = (d+1) \cdot |V| + (1+\varepsilon) \cdot |L|$$
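The block partitioning can be sketched in memory as follows. Lists stand in for the disk files, and the record layout (source, out-degree, destinations in the block) and the `alpha = 0.8` random-jump update are assumptions made for this sketch.

```python
def partition_links(out_links, n, d):
    """Split the link structure into d lists L_i by destination block."""
    block = -(-n // d)                                   # ceil(n / d) nodes per block
    files = [[] for _ in range(d)]
    for q in range(n):
        targets = out_links[q]
        for i in range(d):
            hits = [p for p in targets if p // block == i]
            if hits:
                # The source id and out-degree are repeated in every L_i that q
                # links into, so the L_i together are larger than L.
                files[i].append((q, len(targets), hits))
    return block, files


def blocked_iteration(source, files, block, alpha=0.8):
    """One iteration: for each block, scan all sources and only the links in L_i."""
    n = len(source)
    result = []
    for i, link_file in enumerate(files):
        lo = i * block
        dest = [(1.0 - alpha) / n] * min(block, n - lo)  # only block V'_i is in memory
        for q, out_degree, hits in link_file:
            for p in hits:
                dest[p - lo] += alpha * source[q] / out_degree
        result.extend(dest)                              # stands in for writing V'_i to disk
    return result


# Hypothetical 5-node graph, processed with d = 2 blocks and, for comparison, d = 1.
graph = {0: [1], 1: [0, 2], 2: [3], 3: [4], 4: [3]}
n = len(graph)
block2, files2 = partition_links(graph, n, 2)
block1, files1 = partition_links(graph, n, 1)
ranks = [1.0 / n] * n
ranks_single = [1.0 / n] * n
for _ in range(200):
    ranks = blocked_iteration(ranks, files2, block2)
    ranks_single = blocked_iteration(ranks_single, files1, block1)
```

With d = 1 the sketch reduces to the naive scheme, and both runs produce the same ranks. The price of blocking is that the source vector is scanned once per block.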

Page 11: I/O-Efficient Techniques for Computing Pagerank
Page 12: I/O-Efficient Techniques for Computing Pagerank

Sort-Merge Algorithm

• The link file is identical to the one in the naïve algorithm.

• For each link, create a packet that contains the line number of the destination and the amount of rank value to be transmitted to that destination.

• 8-byte packet: 4-byte ID + 4-byte floating-point number

Page 13: I/O-Efficient Techniques for Computing Pagerank

Sort-Merge Algorithm (continued)

• Route packets by sorting them by destination and combining the ranks into the destination node.

• |P| is the total size of the generated packets, which need to be written out and read back in once.

$$C_{sort\text{-}merge} = |V| + |V'| + |L| + 2 \cdot |P|$$
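The packet routing can be sketched with sorted runs that are merged by destination. In-memory lists stand in for the packet files, and `run_size`, the tuple layout of a packet and the `alpha = 0.8` random-jump update are assumptions made for illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def make_packets(source, out_links, alpha=0.8):
    """Yield one (destination, rank share) packet per link."""
    for q, targets in out_links.items():
        share = alpha * source[q] / len(targets)
        for p in targets:
            yield (p, share)


def sort_merge_iteration(source, out_links, run_size=4, alpha=0.8):
    n = len(source)
    packets = list(make_packets(source, out_links, alpha))
    # Sort runs that would fit in memory, then merge the sorted runs by destination.
    runs = [sorted(packets[i:i + run_size]) for i in range(0, len(packets), run_size)]
    dest = [(1.0 - alpha) / n] * n
    for p, group in groupby(heapq.merge(*runs), key=itemgetter(0)):
        dest[p] += sum(share for _, share in group)      # combine packets for the same node
    return dest


# Hypothetical 5-node graph.
graph = {0: [1], 1: [0, 2], 2: [3], 3: [4], 4: [3]}
ranks = [1.0 / len(graph)] * len(graph)
for _ in range(200):
    ranks = sort_merge_iteration(ranks, graph)
```

Because the merged stream arrives ordered by destination, the destination vector can be written sequentially. A real implementation would keep the runs on disk rather than in a Python list.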

Page 14: I/O-Efficient Techniques for Computing Pagerank

Split-Accumulate Algorithm

• Split the source vector into d blocks V_i, such that the 4-byte rank values of all nodes in a block fit into memory.

• Link file L_i contains information on all links with source node in block V_i.

• It is like the reverse of L_i in Haveliwala's algorithm, but the out-degree information is moved to separate files.

Page 15: I/O-Efficient Techniques for Computing Pagerank
Page 16: I/O-Efficient Techniques for Computing Pagerank

Split-Accumulate Algorithm (continued)

• File O_i is a vector of 2-byte integers, storing the out-degree of each element in the source vector.

• File P_i is defined as containing all packets of rank values with destination in block V_i, in arbitrary order.

Page 17: I/O-Efficient Techniques for Computing Pagerank

Split-Accumulate Algorithm (continued)

• For each iteration i:
  – Initialize block V_i in memory.
  – Accumulate phase:
    • Scan P_i, the packets with destinations in V_i, and add the rank value in each packet to the appropriate entry in V_i.
  – Scan O_i and divide each rank value in V_i by its out-degree.

Page 18: I/O-Efficient Techniques for Computing Pagerank

Split-Accumulate Algorithm (continued)

  – Split phase:
    • Read L_i. For each record in L_i, consisting of several sources in V_i and a destination in V_j, write one packet with this destination node and the total amount of rank to be transmitted to it from these sources into output file P'_j (which will become file P_j in the next iteration).

Combining packets is simpler and more efficient. No in-memory sorting of packets is needed.
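The accumulate and split phases can be sketched as below. Lists stand in for the files L_i, O_i, P_i and P'_j. The placement of the random-jump term (with `alpha = 0.8`) in the accumulate phase and the way the first packets are created from a uniform vector are assumptions made for this sketch. Every node is assumed to have at least one out-link.

```python
def build_files(out_links, n, d):
    """Build L_i (destination -> sources in block i) and O_i (out-degrees) for d blocks."""
    block = -(-n // d)                                   # ceil(n / d) nodes per block
    link_files = [dict() for _ in range(d)]
    out_degrees = [[] for _ in range(d)]
    for q in range(n):
        i = q // block
        out_degrees[i].append(len(out_links[q]))
        for p in out_links[q]:
            link_files[i].setdefault(p, []).append(q)
    return block, [sorted(f.items()) for f in link_files], out_degrees


def split_accumulate_iteration(packets_in, link_files, out_degrees, block, n, alpha=0.8):
    d = len(link_files)
    packets_out = [[] for _ in range(d)]                 # files P'_j for the next iteration
    ranks = []
    for i in range(d):
        lo = i * block
        v = [0.0] * min(block, n - lo)                   # block V_i in memory
        for dest, amount in packets_in[i]:               # accumulate phase: scan P_i
            v[dest - lo] += amount
        v = [(1.0 - alpha) / n + alpha * x for x in v]   # assumed random-jump update
        ranks.extend(v)
        share = [x / deg for x, deg in zip(v, out_degrees[i])]   # scan O_i
        for dest, sources in link_files[i]:              # split phase: scan L_i
            total = sum(share[q - lo] for q in sources)
            packets_out[dest // block].append((dest, total))
    return ranks, packets_out


# Hypothetical 5-node graph split into d = 2 blocks.
graph = {0: [1], 1: [0, 2], 2: [3], 3: [4], 4: [3]}
n, d = len(graph), 2
block, link_files, out_degrees = build_files(graph, n, d)
packets = [[] for _ in range(d)]
for q, targets in graph.items():                         # first packets from a uniform vector
    for p in targets:
        packets[p // block].append((p, (1.0 / n) / len(targets)))
for _ in range(200):
    ranks, packets = split_accumulate_iteration(packets, link_files, out_degrees, block, n)
```

Packets are appended to the bucket of their destination block in whatever order they are produced, and the accumulate phase adds them into a table indexed by node, so no sorting step appears anywhere.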

Page 19: I/O-Efficient Techniques for Computing Pagerank

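Split-Accumulate Algorithm (continued)

• In a nutshell, it splits packets into different buckets by destination and then directly accumulates the rank values using a table.

$$
\begin{aligned}
C_{split} &= \sum_{0 \le i < d} \left( |O_i| + |L_i| + 2 \cdot |P_i| \right) \\
&= 0.5 \cdot |V| + (1+\varepsilon') \cdot |L| + 2 \cdot |P| \\
&= (1+\varepsilon) \cdot |L| + 2 \cdot |P|
\end{aligned}
$$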
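To compare the per-iteration I/O costs of the four algorithms, their cost formulas can be evaluated side by side, taking |V'| = |V|. The sizes and ε values below are placeholders chosen only for illustration; they are not measurements from the experiments.

```python
def io_costs(V, L, P, d, eps, eps_split):
    """Per-iteration I/O volume (in the unit of V, L, P) from the four cost formulas."""
    return {
        "naive": 2 * V + L,                              # requires V' to fit in memory
        "haveliwala": (d + 1) * V + (1 + eps) * L,
        "sort_merge": 2 * V + L + 2 * P,
        "split_accumulate": (1 + eps_split) * L + 2 * P,
    }


# Placeholder sizes (for example in MB) and overhead factors, all hypothetical.
costs = io_costs(V=1000, L=10000, P=4000, d=8, eps=0.2, eps_split=0.1)
for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name:>16}: {cost:.0f}")
```

Only Haveliwala's cost grows with the number of blocks d, so it is the one that degrades as memory shrinks relative to the rank vector.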

Page 20: I/O-Efficient Techniques for Computing Pagerank

Experimental Setup

• Sun Blade 100 (500 MHz UltraSPARC IIe) running Solaris 8 with a 100 GB, 7200 RPM hard disk.

• Various physical memory configurations: 128 MB, 256 MB, 512 MB, 1 GB, 2 GB

• 32 MB and 64 MB settings simulated on the 128 MB configuration.

Page 21: I/O-Efficient Techniques for Computing Pagerank

Results for Real Data

• 120 M web pages crawled

• 327 M URLs and 1.33 Billion links parsed out.

• After pruning:
  – 44.8 M nodes
  – 653 M edges
  – 15.3 edges/node

Page 22: I/O-Efficient Techniques for Computing Pagerank

Results for Real Data (continued)

• No pruning.

• Add back edges for nodes that have out-degree 0.

• 327 M nodes

• 1.96 billion edges

Page 23: I/O-Efficient Techniques for Computing Pagerank

Results for Scaled Data

Page 24: I/O-Efficient Techniques for Computing Pagerank

Results for Topic-Sensitive PR

[Bar chart. Groups: 10 Topics (512M), 20 Topics (512M), 10 Topics (256M), 20 Topics (256M). Series: Naive, Haveliwala's, Split-Accumulate. Vertical axis from 0 to 4000; axis label not recoverable.]

Page 25: I/O-Efficient Techniques for Computing Pagerank

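Page Rank

• Basic:

$$r(p) = \sum_{q \to p} \frac{r(q)}{d(q)}$$

• Random Jump:

$$r^{(i)}(p) = (1-\alpha) \cdot \frac{R^{(0)}}{n} + \alpha \cdot \sum_{q \to p} \frac{r^{(i-1)}(q)}{d(q)}$$

• Topic-Sensitive:

$$
r^{(i)}(p) =
\begin{cases}
(1-\alpha) \cdot \dfrac{R^{(0)}}{n} + \alpha \cdot \displaystyle\sum_{q \to p} \dfrac{r^{(i-1)}(q)}{d(q)} & \text{if } p \text{ is special} \\[2ex]
\alpha \cdot \displaystyle\sum_{q \to p} \dfrac{r^{(i-1)}(q)}{d(q)} & \text{otherwise}
\end{cases}
$$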