Local Computations in Large-Scale Networks

Idit Keidar, Technion

Material

I. Keidar and A. Schuster: “Want Scalable Computing? Speculate!” SIGACT News, Sep 2006. http://www.ee.technion.ac.il/people/idish/ftp/speculate.pdf
Y. Birk, I. Keidar, L. Liss, A. Schuster, and R. Wolff: “Veracity Radius - Capturing the Locality of Distributed Computations”. PODC'06. http://www.ee.technion.ac.il/people/idish/ftp/veracity_radius.pdf
Y. Birk, I. Keidar, L. Liss, and A. Schuster: “Efficient Dynamic Aggregation”. DISC'06. http://www.ee.technion.ac.il/people/idish/ftp/eff_dyn_agg.pdf
E. Bortnikov, I. Cidon, and I. Keidar: “Scalable Load-Distance Balancing in Large Networks”. DISC'07. http://www.ee.technion.ac.il/people/idish/ftp/LD-Balancing.pdf

Brave New Distributed Systems

Large-scale: thousands of nodes and more ...

Dynamic: ... coming and going at will ...

Computations: ... while actually computing something together.

This is the new part.

Today’s Huge Dist. Systems

Wireless sensor networks – Thousands of nodes, tens of thousands coming soon

P2P systems – Reporting millions online (eMule)

Computation grids – Harnessing thousands of machines (Condor)

Publish-subscribe (pub-sub) infrastructures – Sending lots of stock data to lots of traders

Not Computing Together Yet

Wireless sensor networks – Typically disseminate information to central location

P2P & pub-sub systems
– Simple file sharing, content distribution
– Topology does not adapt to global considerations
– Offline optimizations (e.g., clustering)

Computation grids – “Embarrassingly parallel” computations

Emerging Dist. Systems – Examples

Autonomous sensor networks
– Computations inside the network, e.g., detecting trouble

Wireless mesh network (WMN) management
– Topology control
– Assignment of users to gateways

Adapting p2p overlays based on global considerations

Data grids (information retrieval)

Autonomous Sensor Networks

[Figure: sensors talking to each other: “The data center is too hot!”, “Let’s all reduce power”, “Let’s turn on the sprinklers (need to backup first)”]

Autonomous Sensor Networks

Complex autonomous decision making
– Detection of over-heating in data centers
– Disaster alerts during earthquakes
– Biological habitat monitoring

Collaboratively computing functions
– Does the number of sensors reporting a problem exceed a threshold?
– Are the gaps between temperature reads too large?

Wireless Mesh Networks

Infrastructure (unlike MANET)
City-wide coverage
Supports wireless devices
Connections to Mesh and out to the Internet
– “The last mile”

Cheap
– Commodity wireless routers (hot spots)
– Few Internet connections

Decisions, Decisions

Assigning users to gateways
– QoS for real-time media applications
– Network distance is important
– So is load

Topology control
– Which links to set up out of many “radio link” options
– Which nodes connect to Internet (act as gateways)
– Adapt to varying load

Centralized Solutions Don’t Cut It

Load
Communication costs
Delays
Fault-tolerance

Classical Dist. Solutions Don’t Cut It

Global agreement / synchronization before any output
Repeated invocations to continuously adapt to changes
High latency, high load
By the time synchronization is done, the input may have changed … the result is irrelevant
Frequent changes -> computation based on inconsistent snapshot of system state
Synchronizing invocations initiated at multiple locations typically relies on a common sequencer (leader) – difficult and costly to maintain

Locality to the Rescue!

Nodes make local decisions based on communication (or synchronization) with some proximate nodes, rather than the entire network

Infinitely scalable
Fast, low overhead, low power, …

The Locality Hype

Locality plays a crucial role in real life large scale distributed systems

C. Intanagonwiwat et al., on sensor networks: “An important feature of directed diffusion is that … are determined by localized interactions...”

N. Harvey et al., on scalable DHTs: “The basic philosophy of SkipNet is to enable systems to preserve useful content and path locality…”

John Kubiatowicz et al., on global storage: “In a system as large as OceanStore, locality is of extreme importance…”

What is Locality?

Worst case view
– O(1) in problem size [Naor & Stockmeyer, 1993]
– Less than the graph diameter [Linial, 1992]
– Often applicable only to simplistic problems or approximations

Average case view
– Requires an a priori distribution of the inputs

To be continued…

Interesting Problems Have Inherently Global Instances

WMN gateway assignment: arbitrarily high load near one gateway
– Need to offload as far as the end of the network

Percentage of nodes whose input exceeds threshold in sensor networks: near-tie situation
– All “votes” need to be counted

Fortunately, they don’t happen too often

Speculation is the Key to Locality

We want solutions to be “as local as possible”
WMN gateway assignment example:
– Fast decision and quiescence under even load
– Computation time and communication adaptive to distance to which we need to offload

A node cannot locally know whether the problem instance is local
– Load may be at other end of the network

Can speculate that it is (optimism)

Computations are Never “Done”

Speculative output may be over-ruled
Good for ever-changing inputs
– Sensor readings, user loads, …

Computing ever-changing outputs
– User never knows if output will change
  • due to bad speculation or unreflected input change
– Reflecting changes faster is better

If input changes cease, output will eventually be correct
– With speculation same as without

Summary: Prerequisites for Speculation

Global synchronization is prohibitive
Many instances amenable to local solutions
Eventual correctness acceptable
– No meaningful notion of a “correct answer” at every point in time
– When the system stabilizes for “long enough”, the output should converge to the correct one

The Challenge: Find a Meaningful Notion for Locality

Many real world problems are trivially global in the worst case

Yet, practical algorithms have been shown to be local most of the time!

The challenge: find a theoretical metric that captures this empirical behavior

Reminder: Naïve Locality Definitions

Worst case view
– Often applicable only to simplistic problems or approximations

Average case view
– Requires an a priori distribution of the inputs

Instance-Locality

Formal instance-based locality:
– Local fault mending [Kutten, Peleg 95; Kutten, Patt-Shamir 97]
– Growth-restricted graphs [Kuhn, Moscibroda, Wattenhofer 05]
– MST [Elkin 04]

Empirical locality: voting in sensor networks
– Although some instances require global computation, most can stabilize (and become quiescent) locally
– In small neighborhood, independent of graph size
– [Wolff, Schuster 03; Liss, Birk, Wolff, Schuster 04]

“Per-Instance” Optimality Too Strong

Instance: assignment of inputs to nodes
For a given instance I, algorithm A_I does:
– if (my input is as in I) output f(I); else send message with input to neighbor
– Upon receiving message, flood it
– Upon collecting info from the whole graph, output f(I)

Convergence and output stabilization in zero time on I
Can you beat that?

Need to measure optimality per-class, not per-instance
Challenge: capture attainable locality

Local Complexity [BKLSW’06]

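Let
– $\mathcal{G}$ be a family of graphs
– $P$ be a problem on $\mathcal{G}$
– $M$ be a performance measure
– $\mathcal{C}_G$ be a classification of inputs to $P$ on a graph $G$ into classes $C$
– For a class of inputs $C$, $M_{LB}(C)$ be a lower bound for computing $P$ on all inputs in $C$

Locality: $\forall G \in \mathcal{G}\ \ \forall C \in \mathcal{C}_G\ \ \forall I \in C:\ M_A(I) \le \mathrm{const} \cdot M_{LB}(C)$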

A lower bound on a single instance is meaningless!

The Trick is in The Classification

Classification based on parameters
– Peak load in WMN
– Proximity to threshold in “voting”

Independent of system size
Practical solutions show clear relation between these parameters and costs
Parameters not always easy to pinpoint
– Harder in more general problems
– Like “general aggregation function”

Veracity Radius – Capturing the Locality of Distributed Computations

Yitzhak Birk, Idit Keidar, Liran Liss, Assaf Schuster, and Ran Wolff

Dynamic Aggregation

Continuous monitoring of aggregate value over changing inputs
Examples:
– More than 10% of sensors report seismic activity
– Maximum temperature in data center
– Average load in computation grid

The Setting

Large graph (e.g., sensor network)
– Direct communication only between neighbors

Each node has a changing input
Inputs change more frequently than topology
– Consider topology as static

Aggregate function f on multiplicity of inputs
– Oblivious to locations

Aggregate result computed at all nodes

Goals for Dynamic Aggregation

Fast convergence
– If from some time t onward inputs do not change …
  • Output stabilization time from t
  • Quiescence time from t
  • Note: nodes do not know when stabilization and quiescence are achieved
– If after stabilization input changes abruptly…

Efficient communication
– Zero communication when there are zero changes
– Small changes ⇒ little communication

Standard Aggregation Solution: Spanning Tree

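[Figure: votes aggregated up a spanning tree; partial counts such as “2 black”, “1 black”, and “7 black, 1 white” reach the root, which totals “20 black, 12 white” and announces “black!”. Global communication!]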
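The spanning-tree approach can be sketched in a few lines. This is only an illustration, not code from the talk: the tree layout, the vote values, and the name `aggregate_on_tree` are assumptions. The point is the message count. Every tree edge carries one report up and one result down, no matter how lopsided the vote is.

```python
from collections import Counter

def aggregate_on_tree(children, votes, root):
    """Convergecast vote counts to the root, then broadcast the winner.

    children: dict node -> list of child nodes (hypothetical tree layout)
    votes:    dict node -> vote value
    Returns (totals, winner, messages).
    """
    messages = 0

    def subtree_counts(node):
        nonlocal messages
        counts = Counter({votes[node]: 1})
        for child in children.get(node, []):
            counts += subtree_counts(child)
            messages += 1  # child reports its partial count to its parent
        return counts

    totals = subtree_counts(root)
    winner = totals.most_common(1)[0][0]
    messages += len(votes) - 1  # result travels back down every tree edge
    return totals, winner, messages

# Hypothetical 7-node binary tree with 5 black and 2 white votes.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
votes = {0: "black", 1: "black", 2: "white", 3: "black",
         4: "white", 5: "black", 6: "black"}
totals, winner, messages = aggregate_on_tree(children, votes, 0)
```

Here all 6 tree edges are used twice, so 12 messages are sent even though the outcome is clear from most small neighborhoods.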

Spanning Tree: Value Change

[Figure: a single vote changes; a subtree count becomes “6 black, 2 white” and the root total becomes “19 black, 13 white”. Global communication!]

The Bad News

Virtually every aggregation function has instances that cannot be computed without communicating with the whole graph
– E.g., majority voting when close to the threshold: “every vote counts”

Worst case analysis: convergence and quiescence times are Ω(diameter)

Local Aggregation – Intuition

Example – Majority Voting:
Consider a partition in which every set has the same aggregate result (e.g., >50% of the votes are for ‘1’)

Obviously, this result is also the global one!

[Figure: the graph partitioned into sets with majorities of 51%, 59%, 84%, 88%, 80%, 98%, 76%, 57%, 91%, 73%, and 93%]
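The reason is simple counting: if every set S_i holds more than |S_i|/2 votes for ‘1’, then summing over the partition gives more than n/2 votes for ‘1’ overall. A small check of this argument, with made-up vote lists and hypothetical helper names:

```python
def majority(bits):
    """True if strictly more than half of the 0/1 votes are 1."""
    return 2 * sum(bits) > len(bits)

def every_part_agrees(parts):
    """True if each set of the partition has a strict majority of 1s."""
    return all(majority(part) for part in parts)

# Hypothetical partition of 12 votes into three sets.
parts = [[1, 1, 0], [1, 0, 1, 1], [1, 1, 1, 0, 0]]
all_votes = [vote for part in parts for vote in part]

parts_agree = every_part_agrees(parts)
global_result = majority(all_votes)
```

The converse does not hold: a global majority can coexist with sets that locally vote the other way, which is what the veracity radius measures.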

Veracity Radius (VR) for One-Shot Aggregation [BKLSW,PODC’06]

Roughly speaking: the min radius r0 such that ∀r ≥ r0: all r-neighborhoods have the same result

Example: majority

[Figure: at radius 1 some neighborhood yields the wrong result; at radius 2 all neighborhoods yield the correct result; VR=2]
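A brute-force sketch of the rough definition, for majority voting on a small connected graph. It ignores slack and reads "r-neighborhood" as all nodes within r hops; the graph, the inputs, and the function names are illustrative assumptions, not the paper's formal definition.

```python
from collections import deque

def ball(adj, v, r):
    """Nodes within r hops of v, found by breadth-first search."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue  # do not expand beyond radius r
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def majority(inputs, nodes):
    """True if strictly more than half of the given nodes hold input 1."""
    return 2 * sum(inputs[u] for u in nodes) > len(nodes)

def veracity_radius(adj, inputs):
    """Smallest r0 such that every r-neighborhood with r >= r0 yields the
    global majority. Assumes a connected graph."""
    global_result = majority(inputs, adj.keys())
    vr = 0
    for r in range(len(adj)):  # radius n-1 already covers the whole graph
        if any(majority(inputs, ball(adj, v, r)) != global_result for v in adj):
            vr = r + 1
    return vr

# Hypothetical path of 7 nodes; node 1 is the only dissenting input.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
vr_mixed = veracity_radius(path, [1, 0, 1, 1, 1, 1, 1])
vr_uniform = veracity_radius(path, [1] * 7)
```

On this path the radius-1 neighborhood of node 0 is a tie and so gives the wrong answer, while every radius-2 neighborhood agrees with the global result, matching the "radius 1 wrong, radius 2 correct" pattern of the example.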

Introducing Slack

Examine “neighborhood-like” environments that:
– (1) include an α(r)-neighborhood for some α(r) < r
– (2) are included in an r-neighborhood

Example: α(r) = max{r-1, r/2}

[Figure: with slack, an environment at r = 2 still yields the wrong result; labels: “Global result”, “VR”]

VR Yields a Class-Based Lower Bound

VR for both input assignments is r
Node v cannot distinguish between I and I’ in fewer than r steps
Lower bound of r on both output stabilization and quiescence
Trivially tight bound for output stabilization

[Figure: input assignments I and I’ around node v with radius r-1; both contain n1 a’s and n2 b’s, and the remainder holds only b’s in I and only a’s in I’]

Veracity Radius Captures the Locality of One-Shot Aggregation

[BKLSW, PODC’06]
I-LEAG (Instance-Local Efficient Aggregation on Graphs)
– Quiescence and output stabilization proportional to VR
– Per-class within a factor of optimal
– Local: depends on VR, not graph size!

Note: nodes do not know VR or when stabilization and quiescence are achieved
– Can’t expect to know you’re “done” in dynamic aggregation…

Local Partition Hierarchy

Topology static
– Input changes more frequently

Build structure to assist aggregation
– Once per topology change
– Spanning tree, but with locality properties

Minimal Slack LPH for Meshes with α(r) = max(r-1, r/2)

[Figure: LPH on a mesh; legend: level 0, 1, and 2 pivots; level 0, 1, and 2 edges; mesh edges]

Another View

The I-LEAG Algorithm

Phases correspond to LPH levels

Communication occurs within a cluster only if there are nodes with conflicting outputs
– All of the cluster’s nodes hold the same output when the phase completes
– All clusters’ neighbors know the cluster’s output

Conflicts are detected without communication
– I-LEAG reaches quiescence once the last conflict is detected

I-LEAG’s Operation (Majority Voting)

[Figure legend: input, output, message, tree edge, conflict (!)]

Initialization: Node’s output is its input

Startup: Communication Among Tree Neighbors

Recall neighbor values will be used in all phases

Phase 0 Conflict Detection

[Figure: conflicts (!) marked at several places in the graph]

Phase 0 Conflict Resolution

Updates sent by clusters that had conflicts

No new Communication

Updates sent by clusters that had conflicts

Using information sent at phase 0

No Communication

No conflicts found, no need for resolution

This region has been idle since phase 0

Simulation Study

VR also explains the locality of previous algorithms

Efficient Dynamic Aggregation

Yitzhak Birk, Idit Keidar, Liran Liss, and Assaf Schuster

Naïve Dynamic Aggregation

Periodically,
– Each node samples input, initiates I-LEAG
– Each instance I of I-LEAG takes O(VR(I)) time, but sends Ω(|V|) messages

Sends messages even when no input changes
– Costly in sensor networks

To save messages, must compromise freshness of result

Dynamic Aggregation at Two Timescales

Efficient multi-shot aggregation algorithm (MultI-LEAG)
– Converges to correct result before sampling the inputs again
– Sampling time may be proportional to graph size

Efficient dynamic aggregation algorithm (DynI-LEAG)
– Sampling time is independent of graph size
– Algorithm tracks global result as close as possible

Dynamic Lower Bound

Previous sample (instance) also plays a role
– Example (majority voting):

Multi-shot lower bound: max{VRprev,VR}
– On quiescence and output stabilization
– Assumes sending zero messages when there are zero changes

I1 (VR2)

I2 (0 changes)

I3 (VR=0)

?!

Dynamic Aggregation: Take II

Initially, run local one-shot algorithm A
– Store the distance that information travels in this instance, dist

Let D = A’s worst-case convergence time
Every D time, run a new iteration (MULTI-A)
– If input did not change, do nothing
– If input changed, run full information protocol up to dist
– If new instance’s VR isn’t reached, invoke A anew
– Update dist

[Slide annotations: ~(VR); (~VRprev); ~(VR)]

Matches max{VRprev,VR} lower bound within same factor as A

A is for I-LEAG

I-LEAG uses a pre-computed partition hierarchy
– LPH: Local Partition Hierarchy – cluster sizes bounded both from above and from below (doubling sizes)
– Spanning tree in each cluster, rooted at pivot
– Computed once per topology

I-LEAG phases correspond to LPH levels
– Active phase: full-information from cluster pivot
– Phase result communicated to cluster and its neighbors
– Phase active only if there is a conflict in the previous level
– Conflicts detected without new communication

MultI-LEAG

The Veracity Level (VL) of node v is the highest LPH level in which v’s cluster has a conflict (VL < log VR + 1)
A MultI-LEAG iteration’s phases correspond to LPH levels:
– Phase level < VL: propagate changes (if any) to pivot
  • active only if there are changes
– Phase level ≥ VL: fall back to I-LEAG
  • active only if new VR is larger than previous
– Cache partial aggregate results in pivot nodes
  • allows conflict detection between active and passive clusters

MultI-LEAG Operation

[Figure legend: physical nodes, pivot nodes, veracity level]

MultI-LEAG Operation

Case I: No changes

… no changes to report

… no conflicts

All is quiet…

Input Change

!

New veracity level

no conflicts, no communication

Abrupt Change Flips Outcome


Clusters at VL recalculate, others forward up


New Veracity level

no conflicts, no communication

MultI-LEAG Observations

O(max{VRprev,VR}) output stabilization and quiescence
Message efficient:
– Communication only in clusters with changes, only when radius < max{VRprev,VR}

Sampling time is O(Diameter)
– Good for cheap periodic aggregation
– Can we do closer monitoring?

Dynamic Aggregation Take III: DynI-LEAG

Sample inputs every O(1) link delays
– Close monitoring, rapidly converges to correct result

Run multiple MultI-LEAG iterations concurrently
Challenges:
– Pipelining phases with different (doubling) durations
– Intricate interaction among concurrent instances
  • E.g., which phase 4 updates are used in a given phase 5 ..
– Avoiding state explosion for multiple concurrent instances

Ruler Pipelining

Partial iterations, fewer in every level
Changes only communicated once

[Figure: timeline (t) of phases 0, 1, and 2 per sampling interval, distinguishing full iterations from partial iterations]

Memory usage: O(log(Diameter))
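One plausible reading of "fewer in every level" is the classic ruler sequence: the iteration sampled at tick t runs phases 0 through the number of times 2 divides t, capped at the top level. The sketch below illustrates that reading only. The function name `top_phase`, the tick numbering, and the cap are assumptions, not the DynI-LEAG schedule as specified in the paper.

```python
def top_phase(t, max_level):
    """Highest phase run by the iteration sampled at tick t >= 1.

    Assumed ruler pattern: phase k is included only when t is a multiple
    of 2**k, so each level runs half as often as the one below it.
    """
    level = 0
    while level < max_level and t % (2 ** (level + 1)) == 0:
        level += 1
    return level

# Eight sampling ticks in a hierarchy whose top level is 2.
schedule = [top_phase(t, 2) for t in range(1, 9)]
full_iterations = [t for t in range(1, 9) if top_phase(t, 2) == 2]
```

Under this pattern only every fourth tick starts a full iteration, and at most one iteration per level is in flight at a time, which is consistent with the logarithmic memory bound stated above.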

VL and Output Estimation

Problem: correct output and VL of an iteration is guaranteed only after O(Diameter) time
– cannot wait that long…

Solution: choose iteration with highest VL according to most recent information
– Use this VL for new iterations and its output as MultI-LEAG’s current output estimation

Eventual convergence and correctness guaranteed

DynI-LEAG Operation

[Figure: timeline (t) of phases 0, 1, and 2, marking phases below VL and phases above VL, with “Previous VL” = 2]

The influence of a conflict is proportional to its level

Dynamic Aggregation: Conclusions

Local operation is possible
– in dynamic systems
– that solve inherently global problems

MultI-LEAG delivers periodic correct snapshots at minimal cost
DynI-LEAG responds immediately to input changes with a slightly higher message rate

Scalable Load-Distance Balancing

Edward Bortnikov, Israel Cidon, Idit Keidar

Load-Distance Balancing

Two sources of service delay
– Network delay (depends on distance to server)
– Congestion delay (depends on server load)
– Total = Network + Congestion

Input
– Network distances and congestion functions

Optimization goal
– Minimize the maximum total delay

NP-complete, 2-approximation exists
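The objective can be made concrete with a toy instance. The sketch below is an illustration under stated assumptions: the distance matrix, the linear congestion functions, and all function names are invented for the example, and the exhaustive search is only feasible for tiny inputs (it is not the 2-approximation mentioned above).

```python
from itertools import product

def max_total_delay(assignment, dist, congestion):
    """Cost of an assignment: worst user delay, network plus congestion.

    assignment[u] is the server of user u, dist[u][s] the network delay,
    congestion[s](load) the congestion delay of server s under that load.
    """
    load = [0] * len(congestion)
    for server in assignment:
        load[server] += 1
    return max(dist[u][s] + congestion[s](load[s])
               for u, s in enumerate(assignment))

def nearest_server(dist):
    """Purely local assignment: every user picks its closest server."""
    return [min(range(len(row)), key=row.__getitem__) for row in dist]

def brute_force_optimum(dist, congestion):
    """Exhaustive search over all assignments (exponential; toy sizes only)."""
    servers = range(len(congestion))
    return min(product(servers, repeat=len(dist)),
               key=lambda a: max_total_delay(a, dist, congestion))

# Hypothetical crowd: 4 users close to server 0, far from server 1.
dist = [[1, 3]] * 4
congestion = [lambda k: k, lambda k: k]  # assumed linear in the load

local_cost = max_total_delay(nearest_server(dist), dist, congestion)
best_cost = max_total_delay(brute_force_optimum(dist, congestion), dist, congestion)
```

Sending the whole crowd to the nearest server costs 1 + 4 = 5, while moving one user to the distant server gives a worst delay of 4 on both sides.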

Distributed Setting

Synchronous
Distributed assignment computation
– Initially, users report location to the closest servers
– Servers communicate and compute the assignment

Requirements:
– Eventual quiescence
– Eventual stability of assignment
– Constant approximation of the optimal cost (parameter)

Impact of Locality

Extreme global solution
– Collect all data and compute assignment centrally
– Guarantees optimal cost
– Excessive communication/network latency

Extreme local solution
– Nearest-Server assignment
– No communication
– No approximation guarantee (can’t handle crowds)

No “one-size-fits-all”?

Workload-Sensitive Locality

The cost function is distance-sensitive
– Most assignments go to the near servers
– … except for dissipating congestion peaks

Key to distributed solution
– Start from the Nearest-Server assignment
– Load-balance congestion among near servers

Communication locality is workload-sensitive
– Go as far as needed …
– … to achieve the required approximation

Uniform Load

Skewed Load

[Figures: the skewed-load example repeated across slides, with a “Load-Balance” step applied]

Tree Clustering

As long as some cluster has improvable cost
– Double it (merge with hierarchy neighbor)

Clusters aligned at 2^i indices
Simple, O(log N) convergence time
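The doubling rule can be sketched on a line of servers indexed 0..N-1, with N a power of two. This is a centralized illustration under assumptions. The predicate `improvable` is a hypothetical stand-in for "this cluster could lower its cost by growing", and the real algorithm is distributed; only the alignment and doubling logic is shown.

```python
def tree_clustering(n, improvable):
    """Grow aligned clusters [lo, hi) by doubling while they are improvable.

    n:          number of servers, assumed to be a power of two
    improvable: hypothetical predicate improvable(lo, hi) -> bool
    Returns (sorted clusters, number of doubling rounds).
    """
    clusters = {(i, i + 1) for i in range(n)}
    rounds = 0
    while True:
        grow = sorted(c for c in clusters
                      if c[1] - c[0] < n and improvable(*c))
        if not grow:
            return sorted(clusters), rounds
        for lo, hi in grow:
            if (lo, hi) not in clusters:
                continue  # already absorbed by a merge earlier this round
            size = 2 * (hi - lo)
            new_lo = (lo // size) * size  # align at a multiple of 2^(i+1)
            new_hi = new_lo + size
            # The doubled cluster absorbs everything inside its interval.
            clusters = {c for c in clusters
                        if not (new_lo <= c[0] and c[1] <= new_hi)}
            clusters.add((new_lo, new_hi))
        rounds += 1

# Hypothetical load peak at server 5 that needs a cluster of 4 servers.
def peak_needs_four(lo, hi):
    return lo <= 5 < hi and hi - lo < 4

clusters, rounds = tree_clustering(8, peak_needs_four)
```

Each round doubles every improvable cluster, so a cluster can grow at most log2(N) times. In the example the peak's cluster grows from (5, 6) to (4, 6) to (4, 8) in two rounds, and the servers far from the peak never merge.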

Ripple Clustering

Adaptive merging
– Better cost in practice

As long as some cluster is improvable
– Merge with smaller-cost neighbors

Conflicts possible
– A B C
– A B C
– Random tie-breaking to resolve
– Many race conditions (we love it)

Scalability: Cost

cost = Euclidean distance + linear load

Scalability: Locality