BBM: Bayesian Browsing Model from Petabyte-scale Data
Chao Liu, MSR-Redmond
Fan Guo, Carnegie Mellon University
Christos Faloutsos, Carnegie Mellon University
Massive Log Streams
• Search log
– 10+ terabytes each day (keeps increasing!)
– Involves billions of distinct (query, url)'s
• Questions
– Can we infer user-perceived relevance for each (query, url) pair?
– How many passes of the data are needed? Is one enough?
– Can the inference be parallel?
• Our answer: Yes, Yes, and Yes!
BBM: Bayesian Browsing Model
[Figure: graphical model of one search instance — for a query with results URL1..URL4, each position has an examine-snippet variable $E_i$, a relevance variable $S_i$, and an observed clickthrough $C_i$]
Notations
• For a given query
– Top-$M$ positions, usually $M = 10$; $M(M+1)/2$ combinations of $(r, d)$'s
– $n$ search instances
– $N$ documents impressed in total: $(d_1, d_2, \ldots, d_N)$
• Document relevance: $R = (R_1, R_2, \ldots, R_N)$
• ClickThroughs: $C^k = (C^k_1, C^k_2, \ldots, C^k_M)$, $k = 1, 2, \ldots, n$
• Positional relevance: $S = (S_1, S_2, \ldots, S_M)$
Putting things together
• Posterior with $C^{1:n}$
• Re-organize by the $R_j$'s; the posterior depends on two counts per document:
– how many times $d_j$ was clicked
– how many times $d_j$ was not clicked when shown at position $(r + d)$ while the preceding click was at position $r$
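The two statistics above can be tallied in a single pass over the click logs. A minimal sketch (assumed data layout and helper name, not the authors' code): for each impression list, track the position of the most recent click and bucket every skip by its (preceding-click position r, distance d) pair.

```python
# Sketch: tally the two per-document statistics BBM needs --
# click counts, and skip counts keyed by (preceding click r, distance d).
from collections import defaultdict

def count_statistics(instances):
    """instances: list of (docs, clicks); docs[i] is the document
    shown at position i+1, clicks[i] is 0/1."""
    clicked = defaultdict(int)            # doc -> number of clicks
    skipped = defaultdict(int)            # (doc, r, d) -> skip count
    for docs, clicks in instances:
        last_click = 0                    # r = 0 before any click
        for pos, (doc, c) in enumerate(zip(docs, clicks), start=1):
            if c:
                clicked[doc] += 1
                last_click = pos
            else:
                # not clicked at position r + d, preceding click at r
                skipped[(doc, last_click, pos - last_click)] += 1
    return clicked, skipped

clicked, skipped = count_statistics([
    (["u1", "u2", "u3"], [0, 1, 0]),
    (["u1", "u2", "u3"], [1, 0, 0]),
])
print(clicked["u2"])              # 1: u2 clicked once
print(skipped[("u3", 2, 1)])      # 1: u3 skipped at distance 1 after a click at 2
```

Because each instance is processed independently and the counts are additive, this is exactly the single-pass, parallelizable computation the slides claim.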
What $p(R \mid C^{1:n})$ Tells Us
• Exact inference with the joint posterior in closed form
• The joint posterior factorizes, hence the $R_j$'s are mutually independent
• At most $M(M+1)/2 + 1$ numbers fully characterize each posterior
– Count vector: $(e_0, e_1, e_2, \ldots, e_{M(M+1)/2})$
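To illustrate how a count vector characterizes a posterior, assume it takes the form $p(R_j) \propto R_j^{e_0} \prod_i (1 - \beta_i R_j)^{e_i}$, where $e_0$ counts clicks, each $e_i$ counts skips in one $(r, d)$ combination, and the $\beta_i$ are position-dependent examination probabilities (the $\beta_i$ values below are made up for illustration; the transcript does not give them):

```python
# Sketch: evaluate an unnormalized per-document posterior from its
# count vector (e0, e1, ..., eK).  The betas are assumed, illustrative
# examination probabilities -- not values from the paper.
def unnormalized_posterior(r, counts, betas):
    """counts[0] = #clicks; counts[i] = #skips in (r, d) combination i."""
    p = r ** counts[0]
    for e_i, beta_i in zip(counts[1:], betas):
        p *= (1.0 - beta_i * r) ** e_i
    return p

counts = [2, 1, 3]          # 2 clicks, then skip counts per (r, d) bin
betas = [0.9, 0.5]          # hypothetical examination probabilities
grid = [i / 100 for i in range(101)]
mode = max(grid, key=lambda r: unnormalized_posterior(r, counts, betas))
print(mode)                 # posterior mode on a 0.01 grid
```

This is why "at most M(M+1)/2 + 1 numbers" suffice: the density is a polynomial in $R_j$ fully determined by the counts.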
Example on MapReduce
[Figure: three mappers each emit (url, count) pairs — (U1, 0)(U2, 4)(U3, 0); (U1, 1)(U3, 0)(U4, 7); (U1, 1)(U3, 0)(U4, 0). The reducer groups the counts per url — (U1, 0, 1, 1), (U2, 4), (U3, 0, 0, 0), (U4, 0, 7) — and emits each document's posterior, e.g. $p(R_1) \propto R_1(1 - R_1)^2$ and $p(R_4) \propto R_4(1 - R_4)$.]
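The map/reduce flow in the example can be mimicked with an in-memory sketch (illustrative only; the paper's experiments ran on SCOPE, not this code). Mappers emit (url, count) pairs and the reducer concatenates each url's counts into the vector that characterizes its posterior:

```python
# Sketch: group per-mapper (url, count) pairs by url, as the
# reducer in the slide's example does.
from collections import defaultdict

def reduce_counts(mapper_outputs):
    grouped = defaultdict(list)
    for pairs in mapper_outputs:       # one list of (url, count) per mapper
        for url, count in pairs:
            grouped[url].append(count)
    return dict(grouped)

mapper_outputs = [
    [("U1", 0), ("U2", 4), ("U3", 0)],
    [("U1", 1), ("U3", 0), ("U4", 7)],
    [("U1", 1), ("U3", 0), ("U4", 0)],
]
print(reduce_counts(mapper_outputs))
# {'U1': [0, 1, 1], 'U2': [4], 'U3': [0, 0, 0], 'U4': [7, 0]}
```

The grouping is commutative and associative, which is what makes the inference map-reducible.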
Experiments
• Compare with the User Browsing Model (Dupret and Piwowarski, SIGIR'08)
– The same dependence structure
– But point estimation of document relevance rather than Bayesian inference
– Approximate inference through iterations
• Data
– Collected from Aug and Sept 2008
– 10 algorithmic results only
– Split into training/test sets by time stamp for each query
– 51 million search instances of 1.15 million distinct queries, 10x larger than the SIGIR'08 study
Overall Comparison on Log-Likelihood
• Experiments in 20 batches
• LL Improvement Ratio $= (e^{ll_2 - ll_1} - 1) \times 100\%$
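With $ll_1$ the per-instance average log-likelihood of the baseline and $ll_2$ that of BBM, the ratio is straightforward to compute (the numbers below are illustrative, not results from the paper):

```python
# Sketch: log-likelihood improvement ratio from the slide's formula.
import math

def ll_improvement_ratio(ll1, ll2):
    """(e^(ll2 - ll1) - 1) * 100: relative gain in average per-instance likelihood."""
    return (math.exp(ll2 - ll1) - 1.0) * 100.0

print(round(ll_improvement_ratio(-1.20, -1.10), 2))  # 10.52
```

Exponentiating the log-likelihood difference converts it into a multiplicative gain in average likelihood, which is easier to read as a percentage.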
Comparison w.r.t. Frequency
• Intuition
– Hard to predict clicks for infrequent queries
– Easy for frequent ones
Petabyte-Scale Experiment
• Setup
– 8 weeks of data, 8 jobs
– Job k takes the first k weeks of data
• Experiment platform
– SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets [Chaiken et al., VLDB'08]
Scalability of BBM
• Increasing computation load: more queries, more urls, more impressions
• Near-constant elapsed time
[Figure: two charts — computation load, and elapsed time on SCOPE]
• 3 hours to scan 265 terabytes of data
• Full posteriors for 1.15 billion (query, url) pairs
Conclusions
• Bayesian Browsing Model for search streams
– Exact Bayesian inference
– Joint posterior in closed form
– A single pass suffices
– Map-reducible for parallelism
– Admits incremental updates
– Perfect for mining click streams
• Models for other stream data
– Browsing, twittering, Web 2.0, etc.?