Distributed algorithms for finding local clusters using heat
kernel pagerank
WAW 2015
EINDHOVEN, NETHERLANDS
Olivia Simpson, UC San Diego
Fan Chung, UC San Diego
Distributed graph processing
Allows for analysis of data too large to store on one machine
Need for adapting classical graph algorithms to distributed setting
Local cluster detection
• Find a cluster of well-connected, well-separated vertices near a particular vertex
• Use only local information; avoid querying the whole graph
Local cluster detection in the wild
• Identify competitors in an ad campaign
• Annotate protein structure
• Assign nodes to a particular clusterhead in a wireless sensor network
• Identify bottlenecks in a computer network
• Community detection
• Subroutines for bigger clustering tasks
  • Global clustering (YouTube topic discovery)
  • Overlapping communities (grow local communities)
Distributed computation
• Data is distributed across nodes (machines) of a network
• Nodes communicate over specified communication links in rounds
• Nodes are allowed to send only small messages over the links
• Initially nodes know their identities and the identities of their neighbors
• Complete data is never known by any individual machine; no shared memory
Distributed algorithms
• Running time in terms of rounds of communication required for computation over arbitrary input
• Local communication is free
• Local computation is free
• Goal: optimize number of rounds
Note on notation
• “Data”
  • Graph: vertices, edges, |V| = n, |E| = m
  • Undirected, uniformly weighted
• “Network”
  • Nodes (machines), links
  • Bidirectional communication
• A graph is the input instance of a problem to be solved over the machines of a network
CONGEST model
Communication links are the edges of the input graph
Vertices of the graph are mapped to dedicated machines
Only allowed to send messages of size O(log n) bits
[Pandurangan, Khan ‘10], [Peleg ‘00]: Introduced to simulate bandwidth restrictions across a network
Application: compute bottlenecks in a network
k-machine model
A number of vertices may be mapped to a single machine
Network is “fixed”
Each machine executes an instance of distributed algorithm
Solution to a full problem is a configuration of outputs of each of the machines
Model simulates distributed graph computation systems like Pregel, Dato
Roadmap
1. Local cluster detection in the CONGEST distributed model
2. Conversion Theorem of [Klauck et al. ‘15] to convert algorithm to k-machine model
Diffusion-based local cluster detection
• Find a cluster of well-connected, well-separated vertices near a particular vertex
• Use only local information; avoid querying the whole graph
Want a set S with small Cheeger ratio (conductance):
ɸ(S) = (# edges along the border of the set) / (sum of degrees within the set)
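The ratio above can be computed directly from an adjacency list. The following is a minimal sketch, not code from the talk: the function name `cheeger_ratio`, the dictionary representation `adj`, and the toy graph are all assumptions made for illustration, and the graph is taken to be undirected and unweighted as in the notation slide.

```python
def cheeger_ratio(adj, S):
    """Boundary edges of S divided by the sum of degrees inside S."""
    S = set(S)
    # Each boundary edge has exactly one endpoint in S, so it is counted once.
    boundary = sum(1 for u in S for v in adj[u] if v not in S)
    volume = sum(len(adj[u]) for u in S)
    return boundary / volume


# Hypothetical graph: two triangles joined by the single edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
ratio = cheeger_ratio(adj, {0, 1, 2})  # one boundary edge, degrees 2 + 2 + 3
```

A well-separated triangle gives a small ratio here because only one of its seven edge endpoints leaves the set.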
A local clustering algorithm is based on the following guarantee:
If there exists a set of vertices S such that ɸ(S) ≤ ϕ, then many vertices in S may serve as seeds for finding a set T with Cheeger ratio close to ϕ.
Use a diffusion process from the “seed” vertex to keep operations local
• “Score” vertices according to the probability that accumulates after some amount of diffusion from the seed vertex
The Sweep Algorithm
1. “Score” vertices according to the probability that accumulates after some amount of diffusion from the seed vertex
• Let N be the number of vertices with nonzero score, let s(v) denote the score of vertex v
2. Order vertices by score normalized by degree
s(v1)/d(v1) ≥ s(v2)/d(v2) ≥ … ≥ s(vN)/d(vN)
3. Check Cheeger ratio of each of the subsets induced by first j vertices (“sweep sets”) in the ordering
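The three steps above can be sketched as follows. This is an illustrative centralized version under stated assumptions: the names `sweep`, `adj`, and `scores`, the toy graph, and the score values are hypothetical, and skipping the sweep set that covers the whole graph is a choice made here rather than something stated on the slide.

```python
def sweep(adj, scores):
    """Return (best_set, best_ratio) over the sweep sets of the scored vertices."""
    # Step 2: order vertices with nonzero score by score / degree, descending.
    order = sorted(
        (v for v in scores if scores[v] > 0),
        key=lambda v: scores[v] / len(adj[v]),
        reverse=True,
    )
    in_set = set()
    boundary = 0  # edges leaving the current sweep set
    volume = 0    # sum of degrees inside the current sweep set
    best_ratio, best_j = float("inf"), 0
    # Step 3: grow the sweep set one vertex at a time and track its ratio.
    for j, v in enumerate(order, start=1):
        inside = sum(1 for u in adj[v] if u in in_set)
        outside = len(adj[v]) - inside
        boundary += outside - inside
        volume += len(adj[v])
        in_set.add(v)
        if len(in_set) == len(adj):
            break  # the whole graph is not a meaningful cut
        ratio = boundary / volume
        if ratio < best_ratio:
            best_ratio, best_j = ratio, j
    return set(order[:best_j]), best_ratio


# Hypothetical graph (two triangles joined by edge 2-3) and made-up scores.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
scores = {0: 0.4, 1: 0.25, 2: 0.3, 3: 0.05}
best_set, best_ratio = sweep(adj, scores)
```

Updating the boundary and volume incrementally means each vertex's neighbor list is scanned once, rather than recomputing every sweep set from scratch.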
Diffusions used to score vertices
• Lazy random walk [Lovász, Simonovits ‘90, ‘93]
• Truncated lazy random walk [Spielman, Teng ‘04]
• PageRank [Andersen et al., ‘06]
• Evolving cluster sets with Markov chains [Andersen, Peres ‘09]
• Lazy random walks + evolving sets [Gharan, Trevisan ‘12]
The Distributed Sweep Algorithm
1. Compute scores for each vertex in some number of rounds
2. Upcast scores and broadcast ordering to every node in O(n) rounds
3. Upcast (place in order, Lj, Rj) to a master node in O(n) rounds
4. Compute Cheeger ratio of each of the n – 1 cuts locally at master node
[Figure: sweep sets S_{j−1} and S_j, vertex s_j, with Lj and Rj marked]
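Step 4 can be sketched as below. The slide does not define Lj and Rj, so this sketch assumes that Lj is the number of neighbors of the j-th vertex that come earlier in the ordering and Rj is the number of its remaining neighbors; under that reading the master node can rebuild every Cheeger ratio from the upcast triples alone. The function name `cheeger_ratios_at_master` and the sample reports are hypothetical.

```python
def cheeger_ratios_at_master(reports):
    """reports: iterable of (place, L, R) triples, in any arrival order.

    Assumed meaning: L = neighbors earlier in the ordering, R = all other
    neighbors. Returns the ratio of each sweep set, in order.
    """
    ratios = []
    boundary = 0  # edges leaving the current sweep set
    volume = 0    # sum of degrees inside the current sweep set
    for _place, left, right in sorted(reports):
        # Adding the vertex closes its `left` edges and opens its `right` edges.
        boundary += right - left
        volume += left + right
        ratios.append(boundary / volume)
    return ratios


# Hypothetical reports for four scored vertices, arriving out of order.
reports = [(3, 2, 1), (1, 0, 2), (4, 1, 2), (2, 1, 1)]
ratios = cheeger_ratios_at_master(reports)
```

If every vertex of the graph reports, the last sweep set is the whole graph and its ratio is 0, which is why only n – 1 cuts are of interest.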
Distributed Sweep with PageRank
1. Compute scores for each vertex in some number of rounds
2. Upcast scores and broadcast ordering to every other node in O(n) rounds
3. Upcast (place in order, Lj, Rj) to a master node in O(n) rounds
4. Compute Cheeger ratio of each of the n – 1 cuts locally at master node
[Das Sarma et al. ‘15] with PageRank: O((1/α) log^2 n + n log n) rounds of communication for any reset constant 0 < α < 1
Our diffusion: PHKPR
Personalized heat kernel pagerank (PHKPR) is the expected distribution of the following “heat kernel random walk” process:
“take k random walk steps from the seed vertex with probability e^(−t) t^k / k!”
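Written as a formula, the process above gives the following distribution. The symbols here are assumptions that do not appear on the slide: $s$ is taken to be the indicator distribution of the seed vertex (as a row vector), $P$ the transition matrix of a simple random walk on the graph, and $\rho_{t,s}$ the resulting PHKPR vector.

```latex
\rho_{t,s} \;=\; \sum_{k=0}^{\infty} e^{-t}\,\frac{t^{k}}{k!}\; s P^{k}
```

In other words, the number of steps k follows a Poisson distribution with mean t, and the term $s P^{k}$ is the distribution of a walk from the seed after exactly k steps.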
A Monte Carlo method for computing PHKPR in a centralized setting
Since this is the expected distribution of a random walk process, approximate by sampling random walks. Call these values PHKPR scores.
ϕ = desired Cheeger ratio
t = f(ϕ)
for r times:
    perform a heat kernel random walk (t) from the seed
return the number of times a walk ends at vertex v divided by r as the PHKPR score for v
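The pseudocode above could be realized roughly as follows. This is a sketch under assumptions, not the authors' implementation: the function names, the adjacency-list input `adj`, the optional `max_steps` cap, and the choice of Knuth's multiplication method for drawing the walk length are all introduced here. The mapping t = f(ϕ) is not specified on the slide, so t is passed in directly.

```python
import math
import random
from collections import Counter


def sample_poisson(t, rng):
    """Draw k with probability e^{-t} t^k / k! (Knuth's multiplication method)."""
    threshold = math.exp(-t)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def heat_kernel_walk(adj, seed, t, rng, max_steps=None):
    """Take a Poisson(t) number of uniform random steps from the seed."""
    k = sample_poisson(t, rng)
    if max_steps is not None:
        k = min(k, max_steps)  # optional truncation of long walks
    v = seed
    for _ in range(k):
        v = rng.choice(adj[v])
    return v


def phkpr_scores(adj, seed, t, r, rng=None, max_steps=None):
    """Fraction of the r sampled walks that end at each vertex."""
    rng = rng or random.Random()
    ends = Counter(
        heat_kernel_walk(adj, seed, t, rng, max_steps) for _ in range(r)
    )
    return {v: count / r for v, count in ends.items()}


# Hypothetical graph: two triangles joined by the single edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
scores = phkpr_scores(adj, seed=0, t=2.0, r=2000, rng=random.Random(0))
```

Vertices that no walk ends at are simply absent from the result, which matches the idea that only a small number of vertices receive a nonzero score.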
Sweep with PHKPR in a centralized setting
[Chung, S. ‘14]
Sampling r = (16/ε^3) log n random walks, each limited to at most K = O(log(1/ε) / log log(1/ε)) steps, captures PHKPR scores > ε.
In particular, at most 1/ε vertices have non-zero scores.
[Chung, S. ‘14]
WHP, a sweep using PHKPR will return a set with Cheeger ratio O(ϕ^(1/2)).
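To get a feel for the sizes involved, the two quantities above can be evaluated numerically. This is only a rough illustration: the slide does not state the base of the logarithm or the constant hidden in the O(·) for K, so the sketch assumes natural logarithms and a constant of 1 (the hypothetical parameter `k_constant`). The function name and the sample values of ε and n are also assumptions.

```python
import math


def phkpr_parameters(eps, n, k_constant=1.0):
    """Evaluate r = (16 / eps^3) log n and K ~ log(1/eps) / log log(1/eps)."""
    inv = 1 / eps
    if inv <= math.e:
        raise ValueError("need 1/eps > e so that log log(1/eps) is positive")
    r = math.ceil(16 / eps**3 * math.log(n))
    K = k_constant * math.log(inv) / math.log(math.log(inv))
    return r, K


# Illustrative values only: eps = 0.1 on a graph with a million vertices.
r, K = phkpr_parameters(eps=0.1, n=10**6)
```

With these inputs the number of samples is in the hundreds of thousands while the walk length stays in the low single digits, which is the contrast the bounds are pointing at.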
Distributed PHKPR scores
Launch r heat kernel random walks in parallel, each with its own random length k
1. Seed node initializes r tokens, each of which holds a random variable k and a counter
2. Continue until the counter reaches k: in each round, nodes holding tokens pass them to random neighbors and increment the corresponding counter
3. At the end of K rounds, each node takes the number of tokens it holds divided by r as its PHKPR score
K = O(log(1/ε) / log log(1/ε)) rounds
No congestion: in the worst case, all r = O(log n) messages are sent over one edge
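The token-passing procedure above can be simulated on a single machine, one loop iteration per communication round. This is a sketch under assumptions rather than the paper's algorithm: the function names, the `(k, counter)` token layout, truncating each k at K, and the `max_edge_load` diagnostic are introduced here for illustration.

```python
import math
import random
from collections import Counter


def sample_poisson(t, rng):
    """Draw k with probability e^{-t} t^k / k! (Knuth's multiplication method)."""
    threshold = math.exp(-t)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def distributed_phkpr(adj, seed, t, r, K, rng=None):
    """Simulate K rounds of token passing; return (scores, max_edge_load)."""
    rng = rng or random.Random()
    # Step 1: the seed creates r tokens, each a (walk length k, counter) pair.
    holding = {seed: [(min(sample_poisson(t, rng), K), 0) for _ in range(r)]}
    max_edge_load = 0
    for _ in range(K):  # one iteration models one communication round
        arriving = {}
        edge_load = Counter()
        for node, tokens in holding.items():
            for k, counter in tokens:
                if counter < k:
                    # Step 2: forward the token to a uniformly random neighbor.
                    nxt = rng.choice(adj[node])
                    edge_load[(node, nxt)] += 1
                    arriving.setdefault(nxt, []).append((k, counter + 1))
                else:
                    # Finished tokens stay where they are.
                    arriving.setdefault(node, []).append((k, counter))
        holding = arriving
        max_edge_load = max(max_edge_load, max(edge_load.values(), default=0))
    # Step 3: each node's score is its share of the r tokens.
    scores = {v: len(tokens) / r for v, tokens in holding.items()}
    return scores, max_edge_load


# Hypothetical graph: two triangles joined by the single edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
scores, load = distributed_phkpr(adj, seed=0, t=2.0, r=50, K=5,
                                 rng=random.Random(1))
```

The returned `max_edge_load` is the largest number of tokens that crossed a single directed link in one round; it can never exceed r, which is the worst case the slide refers to.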
Distributed local cluster detection with PHKPR
1. Compute PHKPR scores for each vertex in O(K) rounds
• N = O(1/ε) nodes have non-zero scores
2. Upcast scores and broadcast ordering to every other node in O(N) rounds
3. Upcast (place in order, Lj, Rj) to a master node in O(N) rounds
4. Compute Cheeger ratio of each of the N – 1 cuts locally at master node
This paper with PHKPR: O(K + N) rounds of communication if ϕ is given
[Das Sarma et al. ‘15] with PageRank: O((1/α) log n + n) rounds of communication for any reset constant 0 < α < 1 if ϕ is given
Speed to the next stop…
1. Local cluster detection in the CONGEST distributed model
2. Conversion Theorem of [Klauck et al. ‘15] to convert algorithm to k-machine model
Summary
• Two distributed models of computation:
  • Conversion Theorem transforms algorithm in CONGEST model to equivalent algorithm in k-machine model
• Local cluster detection:
  • Some number of rounds to compute scores
  • As long as messages are small, broadcasting and upcasting O(n) messages take O(n) rounds
  • Cheeger ratios for all cuts can be computed locally
• Computing scores:
  • Based on sampling short random walks, perfect for parallelizing
  • Bottleneck becomes length of random walks, not number of samples