
Graph Exploration:

How to do better than the random walk?

Adrian Kosowski

INRIA Bordeaux Sud-Ouest

Réunion Displexity – La Rochelle, April 4, 2013

A. Kosowski Graph exploration…

Talk outline

• Introduction to network exploration

• The random walk model

• When is the random walk good enough?

• Overview of basic properties and applications

• When is the random walk not good enough?

• Trying to do better in practice and in theory:

• Biasing probabilities and Metropolis-Hastings walks


Introduction to Network Exploration



Graph exploration: definition and motivation

• A walker is placed on some node of the input network

• The walker is allowed to traverse edges (links) of the network

• The goal is to perpetually visit all the nodes of the graph

• Different possible optimization criteria:

• we would like the walker to complete its "first exploration" as quickly as possible.

• we would like to guarantee regularity of exploration.

• Motivation: crawling webs, sampling of nodes and gathering statistics, ranking nodes, performing network maintenance,…

• Depending on the scenario, walkers may vary in terms of physical implementation, capabilities, and available memory/resources.

• 40 years of CS theory behind the problem (automata on graphs, L=SL,…)

Introduction



The network:

• Think locally: the global topology of the network is not known

• The network may potentially change in time

• We may possibly know some global parameters

• a bound on n – the number of network nodes

• we may have a rough idea of the degree distribution in the graph

• Network links are undirected! (like Facebook)

The walker:

• For most of the time, we see the walker as a "crawler process" (a bit like GoogleBot)

• When visiting a node, we learn its neighbors

• this comes with a fixed cost

• Only following links is possible – teleportation is not allowed (e.g. no numeric node ID-s)

The setting of this talk



(Gjoka et al., INFOCOM’10)

Modeling network data

Example. A view of data accessible by crawling the Facebook frontend.




Example. A graph-based model.

explicit port labeling






Objectives of exploration

Several parameters/properties to consider:

1. Time until all nodes have been visited at least once (cover time)

2. Time between two subsequent visits to a node, in the limit (refresh time)

3. Convergence to some limit frequency of visits to specific nodes/edges

• Time after which the limit frequency has been approximately reached (blanket time)

• Time after which the walk reaches a probability distribution on nodes/edges corresponding to the limit frequency (mixing time)

4. Properties characterizing short walks:

• A short walk quickly discovers many nodes

• A short walk samples nodes/edges with a specified probability

Convention: worst-case set-up + averaging over all possible runs of the walk algorithm.

The Random Walk



What is the random walk?

The random walk model

• We are lost in an unknown network G = (V,E)…

• We leave each node along one of the adjacent links, chosen uniformly at random

• The process is Markovian: the next step of exploration does not depend on the exploration history
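The model above can be sketched in a few lines of Python. This is only an illustration: the adjacency-list representation, the function name `random_walk_cover_steps` and the 4-cycle test graph are assumptions made here, not part of the talk.

```python
import random

def random_walk_cover_steps(adj, start, rng):
    """Run a simple random walk until every node has been visited.

    adj maps each node to the list of its neighbours (undirected graph).
    Returns the number of steps taken, i.e. one sample of the cover time.
    """
    visited = {start}
    current = start
    steps = 0
    while len(visited) < len(adj):
        # Markovian step: the choice depends only on the current node.
        current = rng.choice(adj[current])
        visited.add(current)
        steps += 1
    return steps

# A 4-cycle as a toy network (hypothetical example data).
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
rng = random.Random(2013)
samples = [random_walk_cover_steps(cycle, 0, rng) for _ in range(1000)]
average_cover = sum(samples) / len(samples)
```

Averaging many runs gives an empirical estimate of the expected cover time from the chosen start node.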

http://informatyka.wroc.pl/node/144




Inspired by nature: Brownian motion




A geometric setting:

Paths taken by the Roomba™

(Artwork by IBRumba)


• Information search/packet circulation in p2p networks – an alternative to flooding [Gkantsidis 2004]

• Sampling of nodes in a web or social network (to be discussed later)

• Self-stabilizing mutual exclusion (tokens following random walks meet and coalesce, until 1 remains) [Israeli-Jalfon 1990]

• Picking a spanning tree of a graph uniformly at random (applications of loop-erased walks) [Broder 1988, Wilson 1995]


Classical networking applications?



• Advantages?

• Simple, resource-efficient, independent of network location

• Equitable – uses all edges fairly (1/|E| frequency in the limit)

• Recovers quickly after a slight modification of the graph

• Covers web-type graphs quickly in almost linear time

• Parallel walks are faster than one [Alon et al. 2008, Sauerwald et al. 2008, Cooper et al. 2009]

• By deploying several independent walks, the cover time is reduced

• Sometimes the total number of steps made by all walks before each node of the network has been visited is reduced!

• Disadvantages?

• Completely useless in terms of worst-case performance

• Expected cover time of $\Theta(n^3)$ for some graphs

• Short walks may get stuck in local network neighborhoods




How to analyze the random walk?


• First parameter: hitting time H(u,v)

• What is the expected number of steps for a random walk to reach node v from node u?
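For small graphs, hitting times can be computed exactly rather than simulated. The sketch below assumes a connected graph given as adjacency lists and solves the standard first-step equations h(target) = 0, h(x) = 1 + average of h over the neighbours of x; the helper name `hitting_times_to` is hypothetical.

```python
from fractions import Fraction

def hitting_times_to(adj, target):
    """Return {u: H(u, target)} for a connected undirected graph.

    Solves h(target) = 0 and h(x) = 1 + average of h over neighbours of x
    by Gauss-Jordan elimination over exact fractions.
    """
    nodes = [x for x in adj if x != target]
    index = {x: i for i, x in enumerate(nodes)}
    size = len(nodes)
    # Augmented matrix of the system  h(x) - sum_y h(y)/deg(x) = 1.
    rows = []
    for x in nodes:
        row = [Fraction(0)] * (size + 1)
        row[index[x]] += 1
        for y in adj[x]:
            if y != target:
                row[index[y]] -= Fraction(1, len(adj[x]))
        row[size] = Fraction(1)
        rows.append(row)
    for col in range(size):
        pivot = next(r for r in range(col, size) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [value / rows[col][col] for value in rows[col]]
        for r in range(size):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    result = {x: rows[index[x]][size] for x in nodes}
    result[target] = Fraction(0)
    return result

# Path on three nodes: 0 - 1 - 2 (toy example).
path = {0: [1], 1: [0, 2], 2: [1]}
h = hitting_times_to(path, 2)
```

On this path, walking from one end to the other takes 4 steps in expectation, in line with the quadratic cover time of paths quoted later in the talk.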





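• Second parameter: commute time Com(u,v) = H(u,v) + H(v,u)

• What is the expected number of steps for a random walk to reach node v from node u, and then return to node u?

• Theorem [Chandra, Raghavan, Ruzzo, Smolensky, Tiwari 1989]:

Com(u,v) = 2 |E| R(u,v),

where R(u,v) is the effective resistance between u and v when every edge of the graph is replaced by a unit resistor.

• Theorem [Foster 1949]:

$\sum_{\{u,v\} \in E} R(u,v) = n - 1$.

[Figure: a resistor network with u and v as the + and − terminals]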
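As a quick sanity check of both theorems (a worked example that is not part of the original slides), take the path u – w – v on n = 3 nodes, viewed as two unit resistors in series:

```latex
\begin{align*}
|E| &= 2, \qquad R(u,v) = 1 + 1 = 2, \\
\mathrm{Com}(u,v) &= 2\,|E|\,R(u,v) = 2 \cdot 2 \cdot 2 = 8 = H(u,v) + H(v,u) = 4 + 4, \\
\sum_{\{x,y\} \in E} R(x,y) &= R(u,w) + R(w,v) = 1 + 1 = 2 = n - 1 .
\end{align*}
```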

The electrical resistor network analogy





• Third parameter: cover time Cov

• Cov(u): what is the expected number of steps for a random walk to reach all nodes of the graph, starting from node u?

• $\mathrm{Cov} = \max_{u \in V} \mathrm{Cov}(u)$

• Theorem [Aleliunas, Karp, Lipton, Lovász, Rackoff 1979]

The cover time is upper bounded by the sum of commute times along the edges of any spanning tree of the graph.

• Theorem [Feige 1995]

By using the best spanning tree, we obtain for any graph: $\mathrm{Cov} \le 4n^3/27 + o(n^3)$

• the bound is tight, and the worst-case example is precisely known
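Combined with the commute time identity Com(u,v) = 2|E| R(u,v), the spanning tree theorem already gives a short polynomial bound. The derivation below is a sketch and uses one fact not stated on the slides: R(u,v) ≤ 1 whenever u and v are adjacent, because the direct unit edge is in parallel with the rest of the network. T denotes any spanning tree of the graph.

```latex
\[
\mathrm{Cov} \;\le\; \sum_{\{u,v\} \in T} \mathrm{Com}(u,v)
\;=\; 2\,|E| \sum_{\{u,v\} \in T} R(u,v)
\;\le\; 2\,|E|\,(n-1) \;<\; n^3 .
\]
```

The constant 4/27 in the Feige bound comes from choosing the spanning tree more carefully.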





• The lollipop graph – worst-case cover time $4n^3/27 - o(n^3)$

[Figure: the lollipop graph, a clique on 2n/3 nodes joined to a path on n/3 nodes]




• Order of the cover time for different graph classes

• Cliques: $n \log n$

• Paths, cycles: $n^2$

• 2-dimensional grids: $n \log^2 n$

• 3-dimensional grids: $n \log n$

• Complete k-ary trees: $n \log^2 n / \log k$

• Expanders: $n \log n$

• Regular graphs: not more than $n^2$




• In the limit, the random walk visits all edges with the same frequency (1/|E|)

• If the graph is not bipartite, then in the limit, at any given moment: The probability of finding ourselves at a vertex v is proportional to the degree of v.

• Fourth parameter: blanket time B

• Intuitively, what is the expected number of steps of a random walk before all edges of the graph have been visited a similar number of times?

• Theorem [Ding, Lee, Peres 2011]

$\mathrm{Cov} \le B \le \mathrm{const} \cdot \mathrm{Cov}$
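The claim that the limit distribution is proportional to node degree can be checked mechanically: the distribution pi(v) = deg(v) / (2|E|) is left unchanged by one step of the walk. The sketch below uses exact fractions; the function names and the example graph are illustrative assumptions.

```python
from fractions import Fraction

def stationary_by_degree(adj):
    """Candidate limit distribution: pi(v) = deg(v) / (2|E|)."""
    total_degree = sum(len(neighbours) for neighbours in adj.values())
    return {v: Fraction(len(adj[v]), total_degree) for v in adj}

def one_step(adj, dist):
    """Push a probability distribution through one random-walk step."""
    new_dist = {v: Fraction(0) for v in adj}
    for v, mass in dist.items():
        for u in adj[v]:
            new_dist[u] += mass / len(adj[v])
    return new_dist

# A triangle with a pendant node (hypothetical, non-bipartite example).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
pi = stationary_by_degree(graph)
```

Under this distribution every edge direction carries probability pi(v)/deg(v) = 1/(2|E|) per step, which is the equal edge frequency mentioned above.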


Can we do better than the Random Walk?



Disadvantages of the random walk…


Partial remedy: use a deterministic strategy instead (at the cost of a factor $n^2$ time overhead)

Tweaking the random walk…






• Expected cover time of $\Theta(n^3)$ for some weakly connected graphs

Partial remedy: use Metropolis-Hastings biasing [TBD]

"Trick": try to tweak the input graph… [Zhue et al. 2012]




• Short walks may get stuck in local network neighborhoods

More precisely: a random walk of small length t is expected to visit about √t edges [Broder et al. 1994], but may possibly visit very few nodes


"Trick": if you feel you are stuck, teleport yourself to a random location… [Jin et al. 2011]



Biased walks and the Metropolis-Hastings algorithm



• A biased walk is one in which the next node is chosen by the walker from among its neighbors, but transition probabilities need not be equal.

• The bias can be:

• Topological – based on the structure of the graph, degrees/importance of nodes, etc.,

• Dependent on exploration history – e.g. walks which never back-track to the node they have just come from

• In general, we want to keep the process as close to reversible Markovian as possible

• A simple way to obtain the desired form of bias (Markovian, reversible):

• put positive real-valued weights on edges

• at each step, choose an incident edge with probability proportional to its weight (relative to the sum of all weights of incident edges)
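A minimal sketch of such an edge-weighted walk follows. The nested-dictionary layout `weights[v][u]` and all function names are assumptions made for illustration; the weights must be symmetric for the walk to be reversible.

```python
import random
from fractions import Fraction

def transition_probabilities(weights, v):
    """Probabilities of leaving v along each incident edge.

    weights[v][u] is the positive weight of edge {v, u}; it must be
    symmetric (weights[v][u] == weights[u][v]) for reversibility.
    """
    total = sum(weights[v].values())
    return {u: Fraction(w) / total for u, w in weights[v].items()}

def weighted_stationary(weights):
    """pi(v) proportional to the total weight of edges incident to v."""
    strength = {v: sum(weights[v].values()) for v in weights}
    total = sum(strength.values())
    return {v: Fraction(s) / total for v, s in strength.items()}

def step(weights, v, rng):
    """Sample the next node with probability proportional to edge weight."""
    neighbours = list(weights[v])
    return rng.choices(neighbours, weights=[weights[v][u] for u in neighbours])[0]

# A weighted triangle (hypothetical example data).
weights = {
    "a": {"b": 3, "c": 1},
    "b": {"a": 3, "c": 2},
    "c": {"a": 1, "b": 2},
}
pi = weighted_stationary(weights)
```

Reversibility here means pi(v) P(v,u) = pi(u) P(u,v) for every edge, which holds because both sides equal the edge weight divided by the total weight.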

Biased walks



• There have been several recent papers showing how to bias random walks, given helper information about the topology of the graph, etc.

• The effort required to collect this information means that effectively a "normal" walker needs to do $\Omega(n^3)$ steps, anyway.

• There is an exception: the Metropolis-Hastings walk weighted by node degrees [Metropolis 1959, Nonaka et al. 2010]:

• For each edge connecting u and v, put the following weight on it: min { 1 / deg(u), 1 / deg(v) }

• Add self-loops at each node, so that the weights of all incident edges sum to 1, for all nodes.

• Consequence: all nodes will be visited equally often during exploration

Metropolis walks


NextState (v: node)

u <- neighbor of v in G chosen uniformly at random;

with probability min{deg(v)/deg(u), 1} move to u;

otherwise remain at v;
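The NextState rule translates almost directly into Python. The sketch below assumes a simple graph stored as adjacency lists; `metropolis_transition` is a hypothetical helper that computes the exact one-step probabilities so the edge weights min{1/deg(u), 1/deg(v)} can be checked.

```python
import random
from fractions import Fraction

def metropolis_next_state(adj, v, rng):
    """One step of the degree-based Metropolis walk (the NextState rule)."""
    u = rng.choice(adj[v])                       # uniform proposal
    accept = min(len(adj[v]) / len(adj[u]), 1.0)
    return u if rng.random() < accept else v     # otherwise remain at v

def metropolis_transition(adj, v):
    """Exact one-step distribution from v, including the self-loop."""
    deg_v = len(adj[v])
    probs = {u: min(Fraction(1, deg_v), Fraction(1, len(adj[u]))) for u in adj[v]}
    probs[v] = 1 - sum(probs.values())
    return probs

# A star with centre 0 and three leaves (toy example).
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
```

Because the transition probabilities are symmetric, the uniform distribution is stationary. On the star, a leaf stays put with probability 2/3, which illustrates why this walk can be slower than the plain random walk.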

• The walk can be implemented with the NextState rule given above.

Metropolis walks

[Lee et al. 2012, K. 2013]


A Metropolis walker explores a graph in $O(n^2 \log n)$ steps w.h.p.

• The Metropolis walk has a worst-case performance superior to that of the random walk

• Note: the random walk does not carry any state when traversing edges. A little bit of memory is necessary to implement Metropolis-Hastings.

Metropolis walks

Any strategy with $o(n^3)$ cover time requires some state memory carried over edges.

A Metropolis walker can be implemented in O(log n) bits of memory.

[Nonaka et al. 2010]


A variant of the Metropolis walk explores all graphs in $O(n^2 \log n)$ steps w.h.p., and not more slowly (up to a factor of 2) than the random walk.

• Note: the Metropolis walk is slower than the random walk on many graphs – even for the star.

• Is this strategy of practical importance?

• Yes. There are several elegant ways of combining the Metropolis walk with the random walk.

Combining Metropolis walks and Random walks


• The method below relies on knowledge of the average degree d = 2m/n (this requirement can be avoided).

NextState (v: node)

u <- neighbor of v in G chosen uniformly at random;

with probability min{(1/deg(u) + 1/d) / (1/deg(v) + 1/d), 1} move to u;

otherwise remain at v;
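The combined rule can be sketched as follows. A short calculation, not stated on the slides, suggests that this walk is reversible with a stationary distribution proportional to deg(v) + d, that is, a blend of the uniform and the degree-proportional distributions. The function names and the star example are assumptions for illustration.

```python
from fractions import Fraction

def combined_acceptance(adj, v, u, avg_degree):
    """Acceptance probability for a proposed move v -> u."""
    score_u = Fraction(1, len(adj[u])) + 1 / Fraction(avg_degree)
    score_v = Fraction(1, len(adj[v])) + 1 / Fraction(avg_degree)
    return min(score_u / score_v, Fraction(1))

def combined_transition(adj, v, avg_degree):
    """Probability of moving from v to each neighbour u (self-loop omitted)."""
    return {
        u: combined_acceptance(adj, v, u, avg_degree) / len(adj[v])
        for u in adj[v]
    }

# Star with centre 0 and three leaves: m = 3 edges, n = 4 nodes.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
d = Fraction(2 * 3, 4)  # average degree d = 2m/n
```

On this star a leaf moves to the centre with probability 3/5, compared with 1/3 under the pure Metropolis rule, so the walker spends less time idling at low-degree nodes.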

Combining Metropolis walks and Random walks



• In expectation, a Metropolis walk of length $t \le D^2$ discovers at least $t^{1/2}$ nodes. [K. 2013]

• Potentially useful property in local searches around network neighborhoods.

• Trick: subdivide nodes of high degree to achieve better discovery rate.

• E.g. combine the above with a "landmark distribution scheme" [Broder et al. 1994]

Short Metropolis walks



Algorithm (Broder et al.): Short walks from landmarks

Space-time tradeoffs for s-t connectivity


1. Pick k landmark nodes in the graph. Add s and t to the set of landmark nodes.

2. Repeat: [a polylogarithmic number of times]

• From each landmark, run a random walk of length t.

• If a walk starting from landmark l1 visits a landmark l2, mark them as belonging to the same component of landmarks (SET UNION operation).

3. Return "YES", if s and t belong to the same component of landmarks. Return "NO", otherwise.
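The three steps above can be sketched with a union-find structure. Several details are assumptions made here rather than taken from the slides: landmarks are picked uniformly at random, the graph has no isolated nodes, and the parameters `k`, `walk_length` and `rounds` are left to the caller. A False answer may be wrong for a connected pair if the walks are too short; a True answer is always correct.

```python
import random

def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def landmark_connectivity(adj, s, t, k, walk_length, rounds, rng):
    """Sketch of the landmark scheme; may answer False for a connected pair."""
    others = [v for v in adj if v not in (s, t)]
    landmarks = set(rng.sample(others, min(k, len(others)))) | {s, t}
    parent = {l: l for l in landmarks}
    for _ in range(rounds):
        for start in landmarks:
            current = start
            for _ in range(walk_length):
                current = rng.choice(adj[current])
                if current in landmarks:
                    # SET UNION: start and current lie in the same component.
                    parent[find(parent, start)] = find(parent, current)
    return find(parent, s) == find(parent, t)
```

Only the landmark set and the union-find table are stored, which is where the space saving over BFS/DFS comes from.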

[Figure: the graph G with the nodes s and t marked, shown over several animation steps]



Improvement: Replacing the random walk by the Metropolis walk

• From each landmark, run a random Metropolis walk of length t.

… and fixing some details to make it work

• The above process can be used to test if a graph is connected.

• For a well chosen value of k, we obtain the following theoretical result:

In the RAM memory model, given S log n bits of space, one can test Undirected s-t Connectivity (USTCON) in time $T = \tilde{O}(\max\{n^2/S, m\})$.

• Corollary: an almost-linear time algorithm for checking if a graph is connected, running in space $S = n^2/m$ – more space-efficient than BFS/DFS!
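To see where the corollary comes from, one can substitute the space bound into the theorem (a one-line check, with polylogarithmic factors hidden in the Õ notation):

```latex
\[
S = \frac{n^2}{m}
\;\Longrightarrow\;
T = \tilde{O}\!\left(\max\left\{\frac{n^2}{S},\, m\right\}\right)
  = \tilde{O}\left(\max\{m,\, m\}\right)
  = \tilde{O}(m).
\]
```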

Short Metropolis walks

[K. 2013]



• Advantages?

• Simple, resource-efficient, independent of network location

• Equitable – uses all nodes fairly

• Recovers quickly after a slight modification of the graph

• Covers web-type graphs quickly in almost linear time

• Parallel walks are faster than one

• Expected cover time not worse than O(n2 log n)

• After some fine-tuning, short Metropolis walks visit nodes more quickly than short random walks

• Disadvantages?

• Unbounded worst-case (pessimistic) cover time

• In practical scenarios, a little slower than the Random Walk

Metropolis walks


Applications of biased walks in node sampling and ranking



• Goal: we would like to measure some metric on the nodes of a network

• E.g. what is the number of connections people have on average on Facebook/Google+?

• We need to find a representative subset of nodes of the network

• The optimal solution: just pick network nodes uniformly at random

• Unfortunately, not feasible – we do not know all people / p2p hosts on the network

• In many cases, numeric ID-s may not be relied upon

• A first attempt: take a network subset with BFS. Often unsatisfactory.

• A feasible solution: run a walk in the network!

• Approach 1: run a short random walk starting from a random node, then fix it to account for over-representation of high degree nodes

• Approach 2: run a short Metropolis-Hastings walk.
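For Approach 1, one standard way to compensate is to weight each sample by the inverse of its degree, since a random walk visits nodes proportionally to their degree. The sketch below is an illustration under that assumption; the function name `reweighted_mean` is hypothetical and the slides do not specify the exact estimator.

```python
from fractions import Fraction

def reweighted_mean(values, degrees):
    """Estimate the uniform node average of a metric from random-walk samples.

    values[i] and degrees[i] describe the i-th node visited by the walk.
    A random walk visits nodes proportionally to their degree, so each
    sample is down-weighted by 1/degree.
    """
    numerator = sum(Fraction(x) / d for x, d in zip(values, degrees))
    denominator = sum(Fraction(1, d) for d in degrees)
    return numerator / denominator

# Idealised sample from a star with three leaves: the centre (degree 3)
# appears three times as often as each leaf (degree 1).
sampled_degrees = [3, 3, 3, 1, 1, 1]
estimate = reweighted_mean(sampled_degrees, sampled_degrees)
naive = Fraction(sum(sampled_degrees), len(sampled_degrees))
```

In this idealised sample the compensated estimate recovers the true average degree 3/2 of the star, while the naive sample mean gives 2.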

Uniform sampling problem



Uniform sampling problem

Example: sampling degree distribution of Facebook nodes, Spring 2009

[Results and figures of: Gjoka et al., INFOCOM’10]

• Metropolis-Hastings: not bad

• Random Walk with compensation: good

• Random Walk, no compensation: bad

• BFS, no compensation: bad

General conclusion from the literature: the compensated Random Walk seems to slightly outperform Metropolis-Hastings in most tests.



Goal: we would like our walker to visit some nodes more often than others

• Scenario 1: A walk to estimate Google PageRank. Propose a walk on the web which converges to a limit distribution in which more important websites are visited more often than less important ones. [Agarwal Chakrabarti 2006]

• Scenario 2: A walk to recommend new Facebook links. A "supervised walk", finding interactions between nodes which are likely to exist in reality, but missing from the social network. [Backstrom Leskovec 2011]

• Scenario 3: A walk to sample non-uniform populations. "Suppose we want to compare the mean income of social network users in China and the Vatican. We need a sample of 100 users from China and 100 users from the Vatican. How to get one quickly with a walk on the social network?" [Kurant et al. 2011]

The solution to all cases: biased walks with weights on edges dynamically adapted in a learning process. Fine-tuning is quite tricky.

Non-uniform network exploration



What may the future bring?

New application domains: walks in the real-time analysis of live information

Following news as it spreads virally through the social web

Walks traversing different media, e.g.: a re-tweeted FB post linking to a blog entry based on live TV news coverage…

Walks to evaluate the extent/threat due to a spreading rumour, detect the source of viral info

Walks policing the web for brand infringement & copyright enforcement

New challenges:

A clear need for biased walks of different types

A clear need to better understand how such walks parallelize

An opportunity to use the abstraction of a walk in the modeling and analysis of heterogeneous networks/webs

Possibly, a need for coordination of multiple walks, resources shared among walks, etc.

Empirical studies of walks in evolving networks (e.g. networks growing by several percent during the duration of the walk).



What may the future bring?

New techniques – developing a combination of:

probabilistic analysis

spectral graph theory

computational complexity

modeling of dynamic systems

rumour spreading models

network optimization

distributed computing

machine learning

game/equilibria theory

sampling & statistics

…


Deterministic Graph Exploration: Is there still time?


Assumptions of the labeled graph model

• The explored graph G = (V,E) is simple, undirected, and connected

• The nodes of the graph do not have any labels or colors which are known to the agent (anonymous graph property)

• When located at a vertex, the agent can distinguish among the edges adjacent to the current node

• The agent is aware of the edge by which it entered the current node

• There are two distinct types of local orientations of edges at a node: explicit port labeling and implicit cyclic ordering

The labeled graph model


• In the anonymous model, the agent is an automaton with state memory

• No identifiers, no global knowledge

• Rationale: testing limits of computability, profound implications in other areas: log-space complexity theory, fault-tolerant routing, token distribution schemes…

Focus: computations in anonymous networks

[Figure: the explored graph and the information accessible to the agent (its "view")]

f ( STATE, IN-PORT) = ( STATE’, OUT-PORT )


How to make the random walk deterministic?

De-randomizing random walks

• We perform an exploration using a robot equipped with some memory (state) and knowledge of the ports in the graph:

f ( STATE, IN-PORT, DEGREE ) = ( STATE’, OUT-PORT )

• The following properties are extremely desirable:

• The memory size of the robot should be as small as possible

• The worst-case cover time of the robot should be polynomial

• If possible, other properties should be retained (e.g. equity of edge visits)

• First variant: we assume nothing about the port labeling of the graph (i.e., worst case labeling)

• First approach: sequences of port numbers that work for any graph…



Universal Traversal Sequences (UTS-s)

Universal Sequences

• A UTS(n,d) is a sequence of numbers $(t_1, \ldots, t_k)$ with each $t_i \in \{1, \ldots, d\}$, such that the robot f ( STEP_i, PORT ?, DEGREE d ) = ( STEP_{i+1}, PORT t_i ) covers any d-regular graph of (at most) n vertices in at most k steps.

• Theorem [Aleliunas, Karp, Lipton, Lovász, Rackoff 1979] For any n, there exists a UTS(n,d) of length $k = O(n^5 \log n)$

• Proof: the probabilistic method

• Fix a sequence S of length $k = n^5 \log n$, chosen uniformly at random

• Let G be an arbitrary graph. Then a random walk following S explores G with probability $1 - \varepsilon$, where $\varepsilon = O(2^{-n^2 \log n})$.

• Let F(G) be the set of all sequences that fail to explore G. We have: $|F(G)| \le \varepsilon d^k$.

• Let $\mathcal{G}_n$ be the set of all graphs of order at most n.

• The total number of sequences which fail for at least one graph from $\mathcal{G}_n$ is at most $|\mathcal{G}_n| \varepsilon d^k$, which is less than $d^k$. So, at least one sequence succeeds for all graphs!
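The counting step can be written out as a union bound. This is a sketch: ε denotes the probability that a random sequence fails on one fixed graph, and the crude estimate of the number of graphs (each of at most nd port slots points to one of at most n nodes) is an assumption made here for illustration, not a figure taken from the slides.

```latex
\begin{align*}
\Bigl|\bigcup_{G \in \mathcal{G}_n} F(G)\Bigr|
  &\le \sum_{G \in \mathcal{G}_n} |F(G)|
   \le |\mathcal{G}_n| \,\varepsilon\, d^k , \\
|\mathcal{G}_n| &\le n \cdot n^{nd} = 2^{O(n^2 \log n)} \qquad (d \le n).
\end{align*}
```

So fewer than $d^k$ sequences fail as soon as ε is smaller than $1/|\mathcal{G}_n|$, which is what the choice of the length k is meant to guarantee.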



How much memory is required to construct a UTS efficiently?

Universal Sequences

• Nisan’s generic derandomizer (1992): $O(\log^2 n)$ memory

• but the length of the sequence is no longer polynomial – $n^{O(\log n)}$

• Not clear even if a sequence of polynomial length can be constructed in polynomial time…

• Some explicit constructions are known, e.g. for cycles…

• It turns out that it is easier to apply UXS-s instead!

Universal Exploration Sequences (UXS-s)

• A UXS(n,d) is a sequence of numbers $(x_1, \ldots, x_k)$ with each $x_i \in \{1, \ldots, d\}$, such that the robot f ( STEP_i, PORT p, DEGREE d ) = ( STEP_{i+1}, PORT [ p + x_i ] ) covers any d-regular graph of (at most) n vertices in at most k steps (the port number p + x_i being taken modulo d).
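The difference from a traversal sequence is that each entry is an offset relative to the port of entry. A minimal sketch of a robot following such a sequence is given below; the data layout `ports[v][p] = (u, q)`, the 0-based port numbering and the convention that the robot starts as if it had entered through port 0 are all assumptions made for this illustration.

```python
def follow_uxs(ports, start, sequence):
    """Walk a port-labelled d-regular graph by following an exploration sequence.

    ports[v][p] = (u, q): leaving v by port p arrives at u through port q.
    Ports are numbered 0..d-1 here (the slides use 1..d), and the walker
    is assumed to start as if it had entered `start` through port 0.
    Returns the set of visited nodes.
    """
    node, in_port = start, 0
    visited = {node}
    for x in sequence:
        degree = len(ports[node])
        out_port = (in_port + x) % degree   # offset relative to the entry port
        node, in_port = ports[node][out_port]
        visited.add(node)
    return visited

# Triangle with port 0 pointing "clockwise" and port 1 "counter-clockwise".
triangle = {v: [((v + 1) % 3, 1), ((v - 1) % 3, 0)] for v in range(3)}
```

On this triangle the offset sequence [1, 1] covers all three nodes, whereas a sequence of zeros makes the robot backtrack forever between two nodes.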



The L = SL complexity class problem

Touching the foundations of computer science

• L is the complexity class containing decision problems which can be solved by a deterministic Turing machine using a logarithmic amount of memory space.

• SL (Symmetric Logspace) is the complexity class of problems log-space reducible to USTCON (undirected s-t connectivity), which is the problem of determining whether two vertices of a graph are in the same connected component

A positive answer [Reingold, STOC 2005]

• UXS(n,d) can be constructed by a machine equipped with O(log n) memory

• By applying a slight modification of the sequence, a robot can explore any (not necessarily regular) graph of order at most n, thus solving USTCON.

• Note: the problem for the related oblivious (UTS-based) variant is open!

• Guide the agent along the least often used edge [Ilcinkas et al., 2010]

• Guide the agent along the edge not in use for the longest time

• Guide the agent along the port not in use for the longest time: "rotor-router" introduced by [Yanovski et al. 2003], also [Bampas et al. 2009]

• Improves previous bound of 4n (Ilcinkas ’06)

Explores the graph periodically, with an exploration period of O(m D) in graphs of diameter D with m edges.

A fast and robust exploration strategy (w.r.t. changes of graph structure), stabilizing to a periodic traversal of an Eulerian cycle within $\Theta(mD)$ steps.

A poor strategy, with an exponential exploration time.
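The rotor-router rule is simple enough to simulate directly. The sketch below assumes a connected graph given as adjacency lists with all pointers initially at position 0; both the representation and the function name are illustrative choices, not taken from the cited papers.

```python
def rotor_router_cover_steps(adj, start):
    """Number of steps the rotor-router needs to visit every node.

    Each node keeps a pointer into its (arbitrarily ordered) neighbour list;
    the agent leaves along the pointed port and advances the pointer, so the
    port used is always the one not used for the longest time at that node.
    """
    pointer = {v: 0 for v in adj}
    visited = {start}
    current, steps = start, 0
    while len(visited) < len(adj):
        nxt = adj[current][pointer[current]]
        pointer[current] = (pointer[current] + 1) % len(adj[current])
        current = nxt
        visited.add(current)
        steps += 1
    return steps

# Path on three nodes: 0 - 1 - 2 (toy example, m = 2 edges, diameter D = 2).
path = {0: [1], 1: [0, 2], 2: [1]}
```

No randomness is involved: the same graph, port order and start node always produce the same traversal.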


Helping the robot: guiding using counters



Thank you.
