Fast Katz and Commuters

Post on 13-Dec-2014

700 views 0 download

description

 

Transcript of Fast Katz and Commuters

FAST KATZ AND COMMUTERS

Quadrature Rules and Sparse Linear Solvers

for Link Prediction Heuristics

David F. Gleich

Sandia National Labs

Berkeley LAPACK Seminar

February 3, 2011

With Pooya Esfandiar, Francesco Bonchi, Chen Grief,

Laks V. S. Lakshmanan, and Byung-Won On

David F. Gleich (Sandia) Berkeley SCMC Seminar

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin

Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

1 / 43

MAIN RESULTS

A – adjacency matrix

L – Laplacian matrix

Katz score :

                                               

Commute time:

                             

For Commute

Compute one       fast

David F. Gleich (Sandia) Berkeley SCMC Seminar

For Katz Compute one       fast Compute top       fast

2 of 43

OUTLINE

Why study these measures?

Katz Rank and Commute Time

How else do people compute them?

Quadrature rules for pairwise scores

Sparse linear systems solves for top-k

As many results as we have time for…

David F. Gleich (Sandia) Berkeley SCMC Seminar 3 of 43

WHY? LINK PREDICTION

David F. Gleich (Sandia) Berkeley SCMC Seminar

Liben-Nowell and Kleinberg 2003, 2006 found that path based link prediction was more efficient

Neighborhood based

Path based

4 of 43

NOTE

All graphs are undirected

All graphs are connected

David F. Gleich (Sandia) Berkeley SCMC Seminar 5 of 43

LEO KATZ

David F. Gleich (Sandia) Berkeley SCMC Seminar 6 of 43

NOT QUITE, WIKIPEDIA

    : adjacency,                     : random walk

PageRank                          

Katz                          

These are equivalent if     has constant degree

David F. Gleich (Sandia) Berkeley SCMC Seminar 7 of 43

WHAT KATZ ACTUALLY SAID

Leo Katz 1953, A New Status Index Derived from Sociometric Analysis, Psychometria 18(1):39-43

“we assume that each link independently has the

same probability of being effective” …

“we conceive a constant     , depending

on the group and the context of the particular

investigation, which has the force of a probability

of effectiveness of a single link. A k-step chain

then, has probability       of being effective.”

“We wish to find the column sums of the matrix”

David F. Gleich (Sandia) Berkeley SCMC Seminar 8 of 43

A MODERN TAKE

The Katz score (node-based) is

                   

The Katz score (edge-based) is

                   

David F. Gleich (Sandia) Berkeley SCMC Seminar 9 of 43

RETURNING TO THE MATRIX

                     

                                                     

Carl Neumann                

                                    David F. Gleich (Sandia) Berkeley SCMC Seminar 10 of 43

Carl Neumann

I’ve heard the Neumann series called the “von Neumann”

series more than I’d like! In fact, the von Neumann kernel

of a graph should be named the “Neumann” kernel!

David F. Gleich (Sandia) Berkeley SCMC Seminar

Wikipedia page

11 / 43

PROPERTIES OF KATZ’S MATRIX

    is symmetric

    exists when                      

                is spd when                      

Note that         1/max-degree suffices

David F. Gleich (Sandia) Berkeley SCMC Seminar 12 of 43

COMMUTE TIME

David F. Gleich (Sandia) Berkeley SCMC Seminar

Picture taken from Google images, seems to be

Bay Bridge Traffic by Jim M. Goldstein.

13 / 43

COMMUTE TIME

Consider a uniform random walk on a graph                    

                                                       

                                               

David F. Gleich (Sandia) Berkeley SCMC Seminar

Also called the hitting

time from node i to j, or

the first transition time

14 of 43

SKIPPING DETAILS

                    : graph Laplacian

                                                             

              is the only null-vector

David F. Gleich (Sandia) Berkeley SCMC Seminar 15 of 43

WHAT DO OTHER PEOPLE DO?

1) Just work with the linear algebra formulations

2) For Katz, Truncate the Neumann series as a few (3-5) terms

3) Use low-rank approximations from EVD(A) or EVD(L)

4) For commute, use Johnson-Lindenstrauss inspired random sampling

5) Approximately decompose into smaller problems

David F. Gleich (Sandia) Berkeley SCMC Seminar

Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009,

Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007, Wang et al. ICDM2007 16 of 43

THE PROBLEM

All of these techniques are

preprocessing based because

most people’s goal is to compute

all the scores.

We want to avoid

preprocessing the graph.

David F. Gleich (Sandia) Berkeley SCMC Seminar

There are a few caveats here! i.e. one could solve the system instead of looking for the matrix inverse

17 of 43

WHY NO PREPROCESSING?

The graph is constantly changing

as I rate new movies.

David F. Gleich (Sandia) Berkeley SCMC Seminar 18 of 43

WHY NO PREPROCESSING?

David F. Gleich (Sandia) Berkeley SCMC Seminar

Top-k predicted “links”

are movies to watch!

Pairwise scores give

user similarity

19 of 43

PAIRWISE ALGORITHMS

Katz                      

Commute                                        

David F. Gleich (Sandia) Berkeley SCMC Seminar

Golub and Meurant

to the rescue!

20 of 43

MMQ - THE BIG IDEA

Quadratic form                    

Weighted sum                      

Stieltjes integral                      

Quadrature approximation                      

Matrix equation               David F. Gleich (Sandia) Berkeley SCMC Seminar

Think                    

A is s.p.d. use EVD

“A tautology”

Lanczos

21 of 43

LANCZOS

        , k-steps of the Lanczos method produce

                                    and                      

David F. Gleich (Sandia) Berkeley SCMC Seminar

                    =

             

22 of 43

PRACTICAL LANCZOS

Only need to store the last 2 vectors in      

Updating requires O(matvec) work

      is not orthogonal

David F. Gleich (Sandia) Berkeley SCMC Seminar 23 of 43

MMQ PROCEDURE

Goal                        

Given                        

1. Run k-steps of Lanczos on     starting with    

2. Compute       ,     with an additional eigenvalue at     ,

set                     3. Compute     ,     with an additional eigenvalue at   , set

                  4. Output               as lower and upper bounds on    

David F. Gleich (Sandia) Berkeley SCMC Seminar

Correspond to a Gauss-Radau rule, with

u as a prescribed node

Correspond to a Gauss-Radau rule, with

l as a prescribed node

24 of 43

PRACTICAL MMQ

Increase k to become more accurate

Bad eigenvalue bounds yield worse results

      and     are easy to compute

      not required, we can iteratively

update it’s LU factorization

David F. Gleich (Sandia) Berkeley SCMC Seminar 25 of 43

PRACTICAL MMQ

David F. Gleich (Sandia) Berkeley SCMC Seminar 26 of 43

ONE LAST STEP FOR KATZ

Katz                      

           

                                                                   

                                         

David F. Gleich (Sandia) Berkeley SCMC Seminar 27 of 43

KATZ SCORES ARE LOCALIZED

David F. Gleich (Sandia) Berkeley SCMC Seminar

Up to 50 neighbors is 99.65% of the total mass

28 of 43

TOP-K ALGORITHM FOR KATZ

Approximate    

                           

where     is sparse

Keep     sparse too

Ideally, don’t “touch” all of    

David F. Gleich (Sandia) Berkeley SCMC Seminar 29 of 43

INSPIRATION - PAGERANK

Approximate    

                           

where     is sparse

Keep     sparse too? YES!

Ideally, don’t “touch” all of     ? YES!

David F. Gleich (Sandia) Berkeley SCMC Seminar

McSherry WWW2005, Berkhin 2007, Anderson et al. FOCS2008 – Thanks to Reid Anderson for telling me McSherry did this too.

30 of 43

THE ALGORITHM - MCSHERRY

For              

Start with the Richardson iteration

                                                     

Rewrite

                                     

Richardson converges if                        

David F. Gleich (Sandia) Berkeley SCMC Seminar 31 of 43

THE ALGORITHM

Note     is sparse.

If                 , then         is sparse.

Idea

only add one component of         to        

David F. Gleich (Sandia) Berkeley SCMC Seminar 32 of 43

THE ALGORITHM

For              

Init:                                

                               

                               

How to pick   ?

David F. Gleich (Sandia) Berkeley SCMC Seminar 33 of 43

THE ALGORITHM FOR KATZ

For                            

Init:                                  

                             

                                           

Pick   as max       David F. Gleich (Sandia) Berkeley SCMC Seminar

Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for more

34 of 43

CONVERGENCE?

The “standard” proof works for         1/max-degree.

For                       , then for symmetric     , this algorithm is the Gauss-Southwell procedure and it still converges.

David F. Gleich (Sandia) Berkeley SCMC Seminar

35 of 43

RESULTS – DATA, PARAMETERS

All unweighted, connected graphs

Easy                                    

Hard                            

David F. Gleich (Sandia) Berkeley SCMC Seminar 36 of 43

KATZ BOUND CONVERGENCE

David F. Gleich (Sandia) Berkeley SCMC Seminar 37 of 43

COMMUTE BOUND CONVERG.

David F. Gleich (Sandia) Berkeley SCMC Seminar 38 of 43

KATZ SET CONVERGENCE

David F. Gleich (Sandia) Berkeley SCMC Seminar

For arXiv graph.

39 of 43

TIMING

David F. Gleich (Sandia) Berkeley SCMC Seminar 40 of 43

CONCLUSIONS

These algorithms are faster than many alternatives.

For pairwise commute, stopping criteria are simpler with bounds

For top-k problems, we often need less than 1 matvec for good enough results

David F. Gleich (Sandia) Berkeley SCMC Seminar 41 of 43

WARTS AND TODOS

Stopping criteria on our top-k algorithm can be a bit hairy… we should refine it.

The top-k approach doesn’t work right for commute time… we have an alternative.

                               

      requires the entire diagonal of      

Take advantage of new research to “seed” commute time better! David F. Gleich (Sandia) Berkeley SCMC Seminar

von Luxburg et al. NIPS2010

42 of 43

By AngryDogDesign on DeviantArt

Paper at WAW2010

Slides should be online soon

Code is online already

stanford.edu/~dgleich/

publications/2010/codes/fast-katz

David F. Gleich (Sandia) Berkeley SCMC Seminar 43

F-MEASURE

David F. Gleich (Sandia) Berkeley SCMC Seminar 44 of 43

MAIN RESULTS – SLIDE ONE

A – adjacency matrix

L – Laplacian matrix

Katz score :

                                               

Commute time:

                               

David F. Gleich (Sandia) Berkeley SCMC Seminar 45 of 43

MAIN RESULTS – SLIDE TWO

For Katz Compute one       fast

Compute top       fast

For Commute

Compute one       fast

For almost commute

Compute top $F_{i,:}$ fast

David F. Gleich (Sandia) Berkeley SCMC Seminar 46 of 43

MAIN RESULTS – SLIDE THREE

David F. Gleich (Sandia) Berkeley SCMC Seminar 47 of 43

KATZ BOUND CONVERGENCE

David F. Gleich (Sandia) Berkeley SCMC Seminar 48 of 43

KATZ ERROR CONVERGENCE

David F. Gleich (Sandia) Berkeley SCMC Seminar 49 of 43

COMMUTE BOUND CONVERG.

David F. Gleich (Sandia) Berkeley SCMC Seminar 50 of 43

COMMUTE ERROR CONVERG.

David F. Gleich (Sandia) Berkeley SCMC Seminar 51 of 43

KATZ SET CONVERGENCE

David F. Gleich (Sandia) Berkeley SCMC Seminar 52 of 43

KATZ ORDER CONVERGENCE

David F. Gleich (Sandia) Berkeley SCMC Seminar 53 of 43