The Power of Locality for Network Algorithms
JENNIFER CHAYES, MICROSOFT RESEARCH
CHRISTIAN BORGS, MICHAEL BRAUTBAR, SANJEEV KHANNA, BRENDAN LUCIER, AND SHANGHUA TENG

Description: Workshop on Random Graphs, 24-26 October. Talk by Jennifer Chayes.

Transcript

Page 1

The Power of Locality for Network Algorithms

JENNIFER CHAYES, MICROSOFT RESEARCH

CHRISTIAN BORGS, MICHAEL BRAUTBAR, SANJEEV KHANNA, BRENDAN LUCIER, AND SHANGHUA TENG

Page 2

Online Networks

Online networks are often massive

WWW has trillions of (static) sites

Facebook has over a billion users

3D representation of WWW by Opte

Small piece of FB mapped by Nexus

Page 3

Algorithmic Network Questions

• Ranking of the sites (e.g., PageRank)

• Finding the most influential site (or k most influential sites) under various definitions of influence

– Most highly connected

– Most influential under a certain model, e.g., KKT independent cascade model

• Covering the graph via local moves (the recruiter problem)

Page 4

Constraints

• Limitations on network visibility: e.g., Facebook, LinkedIn, etc. only let you see one or two hops away on the graph.

• Limitations on compute time, especially relevant for online computation on massive graphs

Need local (approximation) algorithms!

We want local (approximation) algorithms to be efficient, at the expense of the approximation factor if necessary.

Page 5

Outline of the Talk

I. Network algorithms with local access constraints
   – Context: local information algorithms
   – Algorithms on preferential attachment networks
   – Algorithms on general networks

II. Using locality to get sublinear algorithms without a priori access constraints
   – PageRank problem
   – Finding the most influential nodes (viral marketing in the independent cascade model)

Borgs, Brautbar, C, Lucier, Khanna : WINE ‘12

Borgs, Brautbar, C, Teng : WAW ’12 & Internet Mathematics

Borgs, Brautbar, C, Lucier: SODA ‘14

Page 6

A Networking Problem with Local Access Constraints

Goal: Meet the most influential people.

Pages 7-8

A Networking Problem

Goal: Meet the most influential people.

Find the highest-degree vertex.

Pages 9-13

A Networking Problem with Local Access Constraints

Page 14

Motivating Question

How well can a graph algorithm perform when it has only local visibility of the network structure?

… on “natural” networks?

… as a function of the “level” of visibility?

Page 15

Online Social Networks

Social network applications differ in what is visible:

Examples: Facebook, LinkedIn, Orkut, Google+.

Question: what is the impact of this design choice?

Page 16

Local Algorithms

More generally:

Search Problems: find the highest-degree node, the most central, …

Coverage Problems: minimum dominating set, maximum k-coverage, …

Connectivity Problems: shortest path, multicast, …

“Local”: Graph topology is revealed locally as the algorithm builds its output set.

Page 17

Outline of Part I: Algorithms with Local Access Constraints

1. A model of local information algorithms

2. Algorithms for preferential attachment networks

3. Minimum dominating set problem on general networks

Page 18

Local Information Algorithms

Input: Graph G = (V,E), initially unknown.
Output: a subset S of the vertices (e.g., find a feasible S minimizing |S|).

Two operations:
1. Add a random node to S.
2. Add any visible node to S.

Visible region (r-local algorithm): all nodes at distance ≤ r from S, plus the induced subgraph, plus the degrees of the outermost nodes.

Note: To map this onto Facebook and LinkedIn, think of r as the distance out from your current set of connections, i.e., your set of friends.
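To make the access model concrete, here is a minimal Python sketch of this interface (the class and method names are illustrative, not from the talk); the true adjacency structure is hidden, and the algorithm may only use the two operations and the visible region:

```python
import random

class LocalGraphAccess:
    """Minimal sketch of the r-local information model.

    An algorithm may only (1) add a uniformly random node to S, or
    (2) add a node it can currently see. It sees all nodes within
    distance r of S plus the degrees of the outermost visible nodes.
    """

    def __init__(self, adj, r):
        self._adj = adj            # hidden: node -> set of neighbors
        self.r = r
        self.S = set()

    def add_random_node(self):
        """Operation 1: add a uniformly random node to S."""
        v = random.choice(list(self._adj))
        self.S.add(v)
        return v

    def add_visible_node(self, v):
        """Operation 2: add any currently visible node to S."""
        assert v in self.visible_region(), "node must be within distance r of S"
        self.S.add(v)

    def visible_region(self):
        """All nodes at distance <= r from S (truncated BFS from S)."""
        seen = set(self.S)
        frontier = set(self.S)
        for _ in range(self.r):
            frontier = {w for v in frontier for w in self._adj[v]} - seen
            seen |= frontier
        return seen

    def degree(self, v):
        """Degrees are revealed even for the outermost visible nodes."""
        assert v in self.visible_region()
        return len(self._adj[v])
```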

Pages 19-22

1-Local Algorithm

Pages 23-24

2-Local Algorithm

This talk: focus mainly on 1-local algorithms.

Page 25

Preferential Attachment Networks

Page 26

Preferential Attachment

Random network growth model [BA’99,BR’00,…]

1. Begin with small fixed graph (e.g. clique).

2. Each new node v connects to m ≥ 2 previous nodes at random, proportional to their degrees:

   Pr[v connects to j] ∝ deg(j)
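A minimal sketch of this growth process (my own implementation; conventions for the initial graph and for repeated targets vary across the literature):

```python
import random

def preferential_attachment(n, m=2):
    """Grow a preferential attachment graph on n nodes: each arriving
    node attaches to m distinct earlier nodes, chosen with probability
    proportional to their current degrees."""
    # One common convention: start from a clique on m + 1 nodes.
    adj = {i: set() for i in range(m + 1)}
    ends = []                      # node u appears deg(u) times in this list
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            adj[i].add(j); adj[j].add(i)
            ends += [i, j]
    for v in range(m + 1, n):
        adj[v] = set()
        # Sampling a uniform entry of `ends` picks a node with
        # probability proportional to its degree.
        targets = set()
        while len(targets) < m:
            targets.add(random.choice(ends))
        for u in targets:
            adj[v].add(u); adj[u].add(v)
            ends += [u, v]
    return adj
```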

Page 27

Preferential Attachment

Properties:

• Connected (with high probability)
• Small diameter: O(log n / log log n)
• Power-law degree sequence: P(k) ~ k^(-3)
• Older nodes tend to have higher degree: E[deg(i)] ≈ (n/i)^(1/2)

Page 28

Finding the Root

Problem: Return a set S containing node 1.

Opportunistic algorithm (1-local):
  Initialize S to an arbitrary node.
  While S does not contain node 1:
    Add the node v ∈ N(S) with the largest degree.

Note: it is possible to remove the assumption that the algorithm can detect node 1.
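A short Python sketch of the opportunistic algorithm; the simulation consults the full graph for convenience, but it only uses frontier membership and frontier degrees, i.e., 1-local information:

```python
def find_root(adj, start):
    """Opportunistic 1-local search: repeatedly absorb the visible
    neighbor of largest degree until node 1 (the root) is in S."""
    S = {start}
    queries = 0
    while 1 not in S:
        frontier = {w for v in S for w in adj[v]} - S
        v = max(frontier, key=lambda w: len(adj[w]))   # largest visible degree
        S.add(v)
        queries += 1
    return S, queries
```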

Page 29

Finding the Root

Theorem: The opportunistic algorithm finds node 1 in O(log^4 n) queries, with high probability over the random graph process.

Compare: opportunistic algorithm O(log^4 n) vs. random walk Ω(n^(1/2)).

Note: a random walk requires O*(n^(1/2)) queries.

Page 30

Applications

s-t connectivity: O(log^4 n)
  – connect s and t to node 1
  – can connect k terminals in O(k log^4 n)

Find the k nodes of largest degree: O(log^4 n + k)
  – find node 1, but don't stop the algorithm

Page 31

Proof Sketch

Theorem: The greedy algorithm finds node 1 in time O(log^4 n).

The hope: the algorithm reaches a node of degree 2^k after k · polylog(n) iterations.

The problem: reaching a node of high degree does not necessarily imply progress.

(Figure: a low-degree bottleneck separating the current set from node 1.)

Page 32

Proof Sketch

Observation: if there is a path connecting S to node 1 with all nodes of degree ≥ d, then the algorithm never queries a node of degree < d.

Q: How common are these "good" paths?

A: For m ≥ 2, most nodes lie on good paths with constant probability. (Proof: detailed probabilistic analysis.)

Page 33

General Graphs

Page 34

Minimum Dominating Set

Problem: find smallest set S s.t. N(S) ∪ S = V.

Page 35

Minimum Dominating Set

Problem: find smallest set S s.t. N(S) ∪ S = V.

Lower bound: Ω(log Δ), from set cover, where Δ is the max degree.

Upper bound: O(log Δ), 3-local [GK'98].

How well can a 1-local algorithm perform?

Page 36

A local algorithm

Greedy Algorithm:
  Initialize S to a random node.
  While |D(S)| < n:
    Add the node v ∈ N(S) that maximizes |D(S ∪ {v})|.

(Here D(S) = S ∪ N(S) denotes the set of nodes dominated by S.)

Page 37

A local algorithm

Greedy Algorithm:
  Initialize S to a random node.
  While |D(S)| < n:
    Add the node v ∈ N(S) that maximizes |D(S ∪ {v})|.

(Figure: a bad example where the optimum is O(1) while the greedy algorithm needs Ω(n) nodes.)

Page 38

A local algorithm

Greedy-Random Algorithm:
  Initialize S to a random node.
  While |D(S)| < n:
    Add the node v ∈ N(S) that maximizes |D(S ∪ {v})|.
    Add a random node from N(v) \ D(S).

Theorem: The greedy-random algorithm obtains a (1 + 2 log Δ)-approximation (in expectation and w.h.p.).
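A sketch of the greedy-random algorithm in Python (simulated over a known adjacency dict for brevity; function and variable names are mine):

```python
import random

def greedy_random_mds(adj):
    """Sketch of the greedy-random dominating set algorithm on a
    connected graph. D(S) is the dominated set: S plus all neighbors
    of S."""
    nodes = set(adj)
    v0 = random.choice(list(nodes))
    S = {v0}
    dominated = {v0} | adj[v0]
    while len(dominated) < len(nodes):
        frontier = set().union(*(adj[v] for v in S)) - S
        # Greedy step: the visible node maximizing |D(S U {v})|, i.e.
        # the one that newly dominates the most vertices.
        v = max(frontier, key=lambda w: len(adj[w] - dominated))
        newly = adj[v] - dominated        # N(v) \ D(S), before v joins S
        S.add(v)
        dominated |= {v} | adj[v]
        # Random step: add a random node from N(v) \ D(S).
        if newly:
            u = random.choice(list(newly))
            S.add(u)
            dominated |= {u} | adj[u]
    return S
```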

Pages 39-41

Analysis

Pick v in OPT. We wish to show that our algorithm does not waste too many steps covering N(v). Consider some x chosen greedily by the algorithm which covers some vertex in N(v).

Case 1: v was visible. Then x must cover many nodes, due to greediness.

Case 2: v was not visible. If x covers few nodes, there is a good chance of revealing v on the "random" step.

Page 42

Conclusions from Part I:

• Local information algorithms: sequential decisions with limited network visibility.

• Many problems can be solved locally and efficiently in preferential attachment and general networks.

• The level of visibility can have a strong impact on the approximability of a problem.

Page 43

Part II: Using Locality to Get Sublinear Algorithms

• Sublinear algorithms to find high-PageRank nodes

• Sublinear algorithms for influence maximization

Page 44

Sublinear Algorithms for PageRank: Definitions

• View the WWW as a directed graph G = (V,E), with V the webpages and E the (directed) hyperlinks.

• PageRank: Do a random walk on the webgraph, restarting at a uniformly random site with probability a at each step (so on average every 1/a steps). The relative weight of a page in the stationary distribution is the PageRank of that page.

Page 45

PageRank: Definitions

• Random walk matrix M: M_uv = (1/d_out(u)) A_uv, where d_out(u) is the out-degree of vertex u and A_uv is the adjacency matrix of the directed graph.

• Stationary distribution p: p = p M.

• PageRank matrix P: P = a·I + (1 - a) P M.

• Personalized PageRank vector p(u): p(u) = e_u P = P_{u,·} (the walk always restarts at u).

• Contribution vector c(v): c(v) = e_v P^T = P_{·,v} (all contributions to v).

• PageRank: PR_v = Σ_u P_uv.

• Random walk with restart at u: in each step, do one random walk step with probability 1 - a, and jump to u with probability a:

  p(u) = a·e_u + (1 - a) p(u) M
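For concreteness, a small dense power-iteration sketch of the restart recursion above (illustrative only; it assumes every node has at least one out-link):

```python
import numpy as np

def personalized_pagerank(A, u, a=0.15, iters=100):
    """Iterate p(u) = a*e_u + (1 - a) * p(u) @ M to a fixed point.
    A is the (dense) adjacency matrix of a directed graph."""
    n = A.shape[0]
    M = A / A.sum(axis=1, keepdims=True)   # row-stochastic random walk matrix
    e_u = np.zeros(n)
    e_u[u] = 1.0
    p = e_u.copy()
    for _ in range(iters):
        p = a * e_u + (1 - a) * (p @ M)
    return p                                # row u of the PageRank matrix P

# PR_v is then sum over u of personalized_pagerank(A, u)[v],
# matching the definition PR_v = sum_u P_uv above.
```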

Page 46

Computing PageRank

• Take G = (V,E) with |V| = n and |E| = m.

• Significant PageRank Problem (SPP): Find all nodes with PR > D, and no nodes with PR < D/2.

• Previous results on running time:
  – Power iteration method (Bianchi et al. '03): O*(m)
  – Linear algebra improvement (Langville & Meyer '04): O*(n)

• Lower bound on running time: Ω(n/D). (Roughly: from n/D sites with PR = D and all other sites with PR = 0.)

• Approximate SPP: Can we use locality to get an (additive) ε-approximation which essentially matches the lower bound, i.e., time O*(n/D)?

Page 47

Roadmap for Approximate SPP

• Steps of the calculation:

  i.  Compute each P_uv.
  ii. Each PR_v is the sum of the n terms in the contribution (column) vector: PR_v = Σ_u P_uv.
  iii. Do this for all n nodes, i.e., for all contribution vectors.

• A priori, each step should take Ω(n) time.

Page 48

i. Local calculation of ε-approximate P_uv

• Previous results
  – Deterministic (bad if the in-degree is unbounded):
    • Jeh-Widom '03: O((log n) ε^(-1) max-in-degree)
    • Andersen et al. '06: O(ε^(-1) max-in-degree)
  – Random:
    • Fogaras et al. '05: Monte Carlo based approach which removed the dependence on max-in-degree, but gave multiplicative rather than additive error, and where the approximation depended on P_uv.

• Our approach
  – Modification of Fogaras et al.
  – Handled concentration better, to remove the dependence on P_uv.

Page 49

i. Local calculation of (ε, δ)-approximate P_uv

• Local method: uses a Terminating Random Walk, a RW which terminates with probability a at each step, and with probability 1 - a moves uniformly to a random outlink of the current node.

• Algorithm:
  – For O(ε^(-1) δ^(-2) log n) iterations do:
    – Run a new terminating RW starting at u, up to a maximum (capping) length of log_{1/(1-a)}(1/ε).
    – If the walk terminates before reaching the capping length, add one to the counter of the node the walk last visited before terminating.
  – Output the average count accumulated at each node.

• Running time: O(ε^(-1) δ^(-2) log n · log ε^(-1)) ~ O*(ε^(-1) δ^(-2)).

Note: The probability that a terminating RW starting at u happens to end at v is exactly P_uv.
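A Monte Carlo sketch of the terminating-walk estimator (the default walk count and cap are placeholders; the real choices are dictated by the bounds above):

```python
import random
from collections import Counter

def estimate_ppr_row(adj_out, u, a=0.15, num_walks=10000, cap=50):
    """Estimate the personalized PageRank row P_u: run terminating
    random walks from u; the fraction of walks that terminate at v
    estimates P_uv. Walks hitting the cap are discarded, which
    introduces only a small additive error."""
    counts = Counter()
    for _ in range(num_walks):
        node, steps = u, 0
        while True:
            if random.random() < a:
                counts[node] += 1            # walk terminated at `node`
                break
            if steps >= cap:
                break                        # capped: discard this walk
            nxt = adj_out[node]
            if not nxt:
                counts[node] += 1            # dangling node: treat as termination
                break
            node = random.choice(nxt)
            steps += 1
    return {v: c / num_walks for v, c in counts.items()}
```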

Page 50

ii. From P_uv to an approximation of PR_v = Σ_u P_uv

• Obviously we can't just sum (takes time n).

• Alternative: naïve sampling
  – Pick L random u_i ∈_R {1, …, n}.
  – Check whether the sum of these L terms, Σ_i P_{u_i v}, is large or small with respect to D·L/n.

• Problem with naïve sampling:
  – To make sure the error does not drown out the expectation, need ε = O(D/n).
  – To get concentration (Chernoff), need L = O*(n/D).
  – Runtime = L · O*(1/ε) = O*(n²/D²), rather than O*(n/D).

Page 51

ii. Multiscale Sampling

• Choose many scales ε_t = 2^(-t).

• For each scale, estimate how many entries P_{·,v} of the contribution vector c(v) lie in the interval (ε_t, 2ε_t).

• We end up spending most of our time (lots of work) on the estimates of the larger entries P_{·,v}.

• Estimate whether PR_v > D in running time O*(n/D).

Page 52

iii. From the question "PR_v > D" for one v to all v

• Key: Use sparse matrix methods to do all n columns in parallel.

• Maintain running time O*(n/D).

Page 53

Conclusion for PageRank

Locality + multiscale analysis + sparse matrix methods:

Running time O*(n/D) to find an approximation of all nodes with significant PageRank PR_v.

The running time is sublinear in n for D = Θ(n^p), 0 < p < 1.

Page 54

Final Topic: Sublinear Algorithms for Models of Influence Maximization

Page 55

Influence Maximization: Definitions

Independent Cascade Model (Kempe, Kleinberg, Tardos ‘03)

Introduced as a model of viral marketing.

• G = (V,E) an oriented graph with |V| = n, |E| = m, and edge weights {p_e | e ∈ E}

• p_e = probability the infection spreads along edge e

• I(S) = (random) size of the set that is eventually infected, starting from seed set S ⊆ V

Problem: For fixed k = |S|, find the seed set S which maximizes the expected influence E[I(S)].
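A sketch of estimating E[I(S)] by direct simulation of the independent cascade (the data-structure choices are mine):

```python
import random

def simulate_cascade(adj_out, p_edge, seeds):
    """One run of the independent cascade: each newly infected node u
    gets one chance to infect each out-neighbor v, succeeding with
    probability p_edge[(u, v)]."""
    infected = set(seeds)
    frontier = list(seeds)
    while frontier:
        u = frontier.pop()
        for v in adj_out[u]:
            if v not in infected and random.random() < p_edge[(u, v)]:
                infected.add(v)
                frontier.append(v)
    return len(infected)

def estimate_influence(adj_out, p_edge, seeds, trials=1000):
    """Monte Carlo estimate of the expected influence E[I(S)]."""
    return sum(simulate_cascade(adj_out, p_edge, seeds)
               for _ in range(trials)) / trials
```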

Page 56

KKT Model: Previous Results

• KKT: E[I(S)] is submodular, so maximizing E[I(S)] can be approximated to within (1 - 1/e) via the greedy algorithm.

• With oracle access to E[I(S)], the greedy algorithm has runtime O(kn).

• Oracle access can be simulated. Total runtime: O*(mnk · poly(ε^(-1))), i.e., even on a sparse graph, at least quadratic in n.

Page 57

Influence Maximization: Our Results

• Nearly linear time algorithm: We can find an approximately optimal seed set with an approximation factor of (1 - 1/e - ε) in time O*((m + n) ε^(-3)).
  – Note: There is a lower bound of Ω(m + n), so this is essentially optimal.

• Sublinear time algorithm: We can find an approximately optimal seed set with an approximation factor of (1/β) in time O*(n α(G)/β), where α(G) is the arboricity* of the graph G.
  – Taking β = Θ(n^p), 0 < p < 1, we get a running time sublinear in n.

*The arboricity of G is the minimum number of spanning forests necessary to cover all edges of G. Roughly speaking, arboricity corresponds to the density of the graph.

Page 58

Key Elements of the Proof

• Key idea: Preprocess G with random sampling to build a sparse hypergraph representation which retains the influence characteristics of high-influence nodes.
  – Each hypergraph edge represents a set of nodes influenced by a random node in the transpose graph.
  – The degree of a set S in the hypergraph is approximately proportional to the influence of S in the original graph.
  – This allows us to efficiently estimate marginal influence in the original diffusion process with very few samples.

• Local and applicable in many access models: the only operations are accessing a random vertex and traversing edges incident to a previously accessed vertex.

Page 59

Key Elements of the Proof

• Sublinear variant: Construct two candidate seed sets: one using a greedy algorithm on the constructed hypergraph, and the other a singleton selected at random according to the hypergraph degree distribution.

Page 60

Conclusions

• Local network algorithms may either be required due to local information access constraints, or simply desirable for increased runtime efficiency.

• Recurring elements in the sublinear network algorithms:
  – Sampling (sometimes at multiple scales) rather than probing all elements.
  – Interspersing greedy steps with random steps to see more of the space.
  – Maintaining locality by using backwards random walks, transposes of matrices, etc., to find large contributors.

Page 61

Conclusions

• With local network methods, it is possible to get sublinear time algorithms with reasonable approximation ratios for questions of interest in massive networks:
  – Finding the most highly connected node or nodes
  – Finding connections between nodes
  – Covering the network (the "recruiter problem")
  – Ranking of sites on the network (significant PageRank problem)
  – Finding sets of maximum influence in the independent cascade model

Page 62

Thanks for your attention!