Distributed Top-K Monitoring. Outline Introduction Related work Algorithm for distributed Top-K...

Distributed Top-K Monitoring

Outline
Introduction
Related work
Algorithm for distributed Top-K monitoring
Experiments
Summary

Introduction

Many online monitoring applications involve monitoring answers to continuous queries over data streams produced at physically distributed locations. Most previous approaches require the streams to be transmitted to a single location for centralized processing.

Unfortunately, continually transmitting a large number of rapid data streams to a central location can be impractical or expensive.

Introduction

We study a useful class of queries that continuously report the k largest values obtained from distributed data streams ("top-k monitoring queries"). These queries are of particular interest because they can be used to reduce the overhead incurred while running other types of monitoring queries.

For applications requiring 100% accuracy, our algorithm provides the exact top-k set at all times. In many online monitoring applications approximate answers suffice, and our algorithm is able to reduce costs further by providing an approximation to the top-k set that is guaranteed to be accurate within a pre-specified error tolerance. The error tolerance can be adjusted dynamically as needed, with more permissive error tolerances incurring lower costs to the monitoring infrastructure.

Introduction

Example: The organizers of the 1998 FIFA Soccer World Cup, one of the world's largest sporting events, maintained a popular web site.

The web site was served to the public by 30 servers, each with identical copies of the web content, distributed among 4 geographic locations around the world.

Following are two continuous monitoring queries that the administrators of the World Cup web site might have liked to pose:

Monitoring Query 1. Which web documents are currently the most popular, across all servers?

Monitoring Query 2. Within the local cluster of web servers at each of the four geographic locations, which server in the cluster has the lowest current load?

Introduction

Formal Problem Definition: We consider a distributed online monitoring environment with m + 1 nodes: a central coordinator node N0, and m remote monitor nodes N1, N2, ..., Nm.

The monitor nodes monitor a set U of n logical data objects U = {O1, O2, ..., On}, which have associated numeric (real) values V1, V2, ..., Vn.

The meaning of the tuple <Oi, Nj, Δ> is that monitor node Nj detects a change of Δ, which may be positive or negative, in the value of object Oi.

Introduction

For each monitor node Nj, we define partial data values V1,j, V2,j, ..., Vn,j representing Nj's view of the data stream, where

$$V_{i,j} = \sum_{\langle O_i, N_j, \Delta \rangle} \Delta$$

so that the overall value of each object is the sum of its partial values, $V_i = \sum_{1 \le j \le m} V_{i,j}$.

In Query 1, each page request to the jth server for the ith object (web document) is represented as a tuple <Oi, Nj, 1>.

In Query 2, minimizing total server load is the same as maximizing (−1 × load), and we could measure load as the number of hits in the last 15 minutes. Each page request to the jth server then corresponds to a tuple <Nj, Nj, −1>, followed 15 minutes later by a canceling tuple <Nj, Nj, 1> once the page request falls outside the sliding window of current activity.
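The way update tuples accumulate into partial values can be sketched in a few lines of Python. This is only an illustration: the function names and the dictionary layout are assumptions, and the tuple order (object, node, change) follows the <Oi, Nj, 1> form used for Query 1.

```python
from collections import defaultdict

def partial_values(stream):
    """Accumulate V[obj][node]: the sum of changes node has seen for obj."""
    V = defaultdict(lambda: defaultdict(float))
    for obj, node, delta in stream:
        V[obj][node] += delta
    return V

def overall_value(V, obj):
    """Overall value of an object: its partial values summed over all nodes."""
    return sum(V[obj].values())

# Query 1 style: three requests for document O1 (two at N1, one at N2)
# and one request for document O2 at N2.
stream = [("O1", "N1", 1), ("O1", "N2", 1), ("O1", "N1", 1), ("O2", "N2", 1)]
V = partial_values(stream)
print(V["O1"]["N1"], overall_value(V, "O1"))  # 2.0 3.0

# Query 2 style: a hit on server N1 and its canceling tuple 15 minutes later.
window = partial_values([("N1", "N1", -1), ("N1", "N1", 1)])
print(overall_value(window, "N1"))  # 0.0
```

For Query 2 the object and the node coincide, so the same accumulation tracks the negated load of each server.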

Introduction

The coordinator node N0 must maintain and continuously report a set T ⊆ U of logical data objects of size |T| = k.

T is called the approximate top-k set, and is considered valid if and only if:

$$\forall O_t \in T,\ \forall O_s \in U - T:\quad V_t + \epsilon \ge V_s$$

where ε ≥ 0 is a user-specified approximation parameter. If ε = 0, then the coordinator must continuously report the exact top-k set. For non-zero values of ε, a corresponding degree of error is permitted in the reported top-k set.
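A minimal sketch of the validity test, assuming the condition is that every object in T, after adding the tolerance, is at least as large as every object outside T. The function name and the sample values are hypothetical.

```python
def is_valid_top_k(values, T, eps=0.0):
    """values maps object -> overall value; T is the reported set of objects."""
    outside = set(values) - set(T)
    return all(values[t] + eps >= values[s] for t in T for s in outside)

values = {"O1": 10, "O2": 4, "O3": 9}
print(is_valid_top_k(values, {"O1"}))         # True
print(is_valid_top_k(values, {"O3"}))         # False: O1 exceeds O3 by 1
print(is_valid_top_k(values, {"O3"}, eps=1))  # True: within the tolerance
```

With a tolerance of 1, reporting O3 instead of O1 is acceptable because the two values differ by no more than the permitted error.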


Introduction

Adjustment Factors and Slack

To bring the local top-k set at each node into alignment with the overall top-k set, we associate with each partial data value Vi,j a numeric adjustment factor δi,j that is added to Vi,j before the constraints are evaluated at the monitor nodes.

The purpose of the adjustment factors is to redistribute the data values more evenly across the monitor nodes so the k largest adjusted partial data values at each monitor node correspond to the current top-k set T maintained by the coordinator.

To ensure correctness, we maintain the invariant that the adjustment factors for each data object Oi sum to zero.
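One way to see why the zero-sum invariant preserves correctness, in the exact case with m monitor nodes, is to sum the local comparisons. The notation here is an assumption: $\delta_{i,j}$ denotes the adjustment factor of object $O_i$ at node $N_j$, and the overall value is taken to be $V_i = \sum_{j=1}^{m} V_{i,j}$. Suppose every monitor node ranks a top-k object $O_t$ at or above a non-top-k object $O_s$ after adjustment, that is, $V_{t,j} + \delta_{t,j} \ge V_{s,j} + \delta_{s,j}$ for all $j$. Then:

```latex
\begin{align*}
\sum_{j=1}^{m} \left( V_{t,j} + \delta_{t,j} \right)
  &\ge \sum_{j=1}^{m} \left( V_{s,j} + \delta_{s,j} \right) \\
V_t + \sum_{j=1}^{m} \delta_{t,j}
  &\ge V_s + \sum_{j=1}^{m} \delta_{s,j} \\
V_t &\ge V_s
  \qquad \text{since } \sum_{j=1}^{m} \delta_{t,j} = \sum_{j=1}^{m} \delta_{s,j} = 0 .
\end{align*}
```

So if the adjusted local rankings agree with T at every node, the true overall ranking agrees with T as well.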

Introduction

A simple example: suppose the current partial data values at N1 are V1,1 = 9 and V2,1 = 1 and at N2 are V1,2 = 1 and V2,2 = 3, and let k = 1. What is the current top-k set? Since V1 = 9 + 1 = 10 and V2 = 1 + 3 = 4, T = {O1}.

Introduction

In particular, the distribution of "slack" in the local arithmetic constraints (the numeric gap between the two sides of the inequality) can be controlled.

In the example above, the total amount of slack available to be distributed between the two local arithmetic constraints is V1 − V2 = 6 units.

To distribute the slack evenly between the two local constraints at 3 units apiece, we could set δ1,1 = −3, δ2,1 = 2, δ1,2 = 3, and δ2,2 = −2, for instance.
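The arithmetic of this example can be checked with a short Python sketch. The adjustment factors in `even_split` are one illustrative zero-sum choice that yields 3 units of slack at each node; other choices are possible, and all names here are hypothetical.

```python
# Partial values from the example above, indexed as V[object][node].
V = {"O1": {"N1": 9, "N2": 1}, "O2": {"N1": 1, "N2": 3}}

# No adjustment versus one illustrative zero-sum choice of adjustment factors.
no_adjustment = {"O1": {"N1": 0, "N2": 0}, "O2": {"N1": 0, "N2": 0}}
even_split = {"O1": {"N1": -3, "N2": 3}, "O2": {"N1": 2, "N2": -2}}

def slack(node, delta, top="O1", other="O2"):
    """Adjusted value of the top-k object minus that of the other object."""
    return (V[top][node] + delta[top][node]) - (V[other][node] + delta[other][node])

def sums_to_zero(delta):
    """Check that each object's adjustment factors sum to zero across nodes."""
    return all(sum(per_node.values()) == 0 for per_node in delta.values())

print(slack("N1", no_adjustment), slack("N2", no_adjustment))  # 8 -2
print(slack("N1", even_split), slack("N2", even_split))        # 3 3
print(sums_to_zero(even_split))                                 # True
```

Without adjustment, N2 ranks O2 above O1 locally (slack of -2) even though O1 is the overall top object. With the adjustment factors applied, both nodes agree with T and the 6 units of slack are split evenly.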

Introduction

Intuitively, distributing slack evenly among all the constraints after each round of resolution seems like a good way to prolong the time before any constraint becomes violated again, making resolution infrequent.

However, the best allocation of slack depends on characteristics of the data such as change rates.

Introduction

Approximate Answers

To permit a degree of error in the top-k set T of up to ε, we associate additional adjustment factors δ1,0, δ2,0, ..., δn,0 with the coordinator node N0 (retaining the invariant that all the adjustment factors for each object Oi sum to zero), and introduce the additional stipulation that for each pair of objects Ot ∈ T and Os ∈ U − T, δt,0 + ε ≥ δs,0.

Related work

[13] Rank aggregation methods for the web. In Proc. WWW10, 2001.

[15] Comparing top k lists. In Proc. SODA, 2003.

The above work focuses on combining relative orderings from multiple lists and does not perform numeric aggregation of values across multiple data sources.

It also does not consider communication costs to retrieve data, and it focuses on one-time queries rather than online monitoring.

Related work

[8] Monitoring streams - a new class of data management applications. In Proc. VLDB, 2002.

[9] Finding frequent items in data streams. In Proc. Twenty-Ninth International Colloquium on Automata Languages and Programming, 2002.

[27] Approximate frequency counts over data streams. In Proc. VLDB, 2002.

The above work considers only a single data stream rather than distributed data streams, and concentrates on reducing memory requirements rather than communication costs.

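Algorithm for distributed Top-K monitoring

In addition to maintaining the top-k set, the coordinator also maintains n(m + 1) numeric adjustment factors, labeled δi,j, one corresponding to each pair of object Oi and node Nj, which must at all times satisfy the following two adjustment factor invariants:

Invariant 1: For each object Oi, the corresponding adjustment factors sum to zero: $\sum_{0 \le j \le m} \delta_{i,j} = 0$.

Invariant 2: For all pairs Ot ∈ T, Os ∈ U − T: $\delta_{t,0} + \epsilon \ge \delta_{s,0}$.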
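A sketch of how a coordinator might check its adjustment factors. It assumes the two invariants are (1) each object's factors, including the coordinator's, sum to zero, and (2) for a top-k object Ot and a non-top-k object Os, the coordinator's factor for Ot plus the tolerance is at least its factor for Os. The list layout and function name are hypothetical.

```python
def invariants_hold(delta, T, eps=0.0, tol=1e-9):
    """delta[obj] is a list [d_0, d_1, ..., d_m]; index 0 is the coordinator."""
    # Assumed invariant 1: each object's factors sum to zero over all nodes.
    zero_sum = all(abs(sum(factors)) <= tol for factors in delta.values())
    # Assumed invariant 2: coordinator factors respect the error tolerance.
    outside = set(delta) - set(T)
    coordinator_ok = all(
        delta[t][0] + eps >= delta[s][0] for t in T for s in outside
    )
    return zero_sum and coordinator_ok

exact = {"O1": [0, -3, 3], "O2": [0, 2, -2]}
print(invariants_hold(exact, {"O1"}))  # True

shifted = {"O1": [-1, 1, 0], "O2": [1, -1, 0]}
print(invariants_hold(shifted, {"O1"}, eps=1))  # False: -1 + 1 < 1
print(invariants_hold(shifted, {"O1"}, eps=2))  # True
```

The small tolerance `tol` only guards against floating-point rounding in the zero-sum check.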


Algorithm for distributed Top-K monitoring

The steps of distributed Top-K monitoring

At the outset, the coordinator initializes the approximate top-k set by running an efficient algorithm for one-time top-k queries.

Once the approximate top-k set T has been initialized, the coordinator selects new values for some of the adjustment factors (using the reallocation subroutine) and sends to each monitor node Nj a message containing T along with all new adjustment factors corresponding to Nj.
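For illustration only, the initialization step can be mimicked with a naive stand-in that pulls every partial value to the coordinator and sums them. This is not the efficient one-time top-k algorithm the slide refers to, and the function name and data layout are assumptions.

```python
import heapq

def naive_initial_top_k(partial, k):
    """partial[node][obj] is a partial value; return the k objects with the largest totals."""
    totals = {}
    for node_values in partial.values():
        for obj, value in node_values.items():
            totals[obj] = totals.get(obj, 0) + value
    return set(heapq.nlargest(k, totals, key=totals.get))

partial = {"N1": {"O1": 9, "O2": 1}, "N2": {"O1": 1, "O2": 3}}
print(naive_initial_top_k(partial, 1))  # {'O1'}
```

A real deployment would avoid shipping all partial values, since communication cost is what the monitoring algorithm is trying to reduce.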

Algorithm for distributed Top-K monitoring

Specifically, for each pair <Ot, Os> of objects straddling T (that is, Ot ∈ T and Os ∈ U − T), node Nj creates a constraint specifying the following arithmetic condition regarding the adjusted partial values of the two objects at Nj:

$$V_{t,j} + \delta_{t,j} \ge V_{s,j} + \delta_{s,j}$$

As long as every local constraint holds at every monitor node and the adjustment factor invariants hold, T is guaranteed to remain valid. On the other hand, if one or more of the local constraints is violated, T may have become invalid, depending on the current partial data values at other nodes. Whenever local constraints are violated, a distributed process called resolution is initiated to determine whether the current approximate top-k set T is still valid and rectify the situation if not.
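A sketch of the check a monitor node might run after each local update, assuming each constraint requires the adjusted partial value of a top-k object to be at least that of a non-top-k object. The function name, data layout, and adjustment factors are illustrative.

```python
def violated_constraints(V_j, delta_j, T):
    """Return pairs (t, s), t in T and s not in T, whose local constraint fails.

    V_j and delta_j map object -> partial value / adjustment factor at one node.
    """
    adjusted = {obj: V_j[obj] + delta_j.get(obj, 0) for obj in V_j}
    outside = [obj for obj in V_j if obj not in T]
    return [(t, s) for t in T for s in outside if adjusted[t] < adjusted[s]]

# A node holding partial values 1 and 3, with illustrative factors 3 and -2.
V_2 = {"O1": 1, "O2": 3}
delta_2 = {"O1": 3, "O2": -2}
print(violated_constraints(V_2, delta_2, {"O1"}))  # []

V_2["O2"] += 4  # a burst of updates for O2 arrives at this node
print(violated_constraints(V_2, delta_2, {"O1"}))  # [('O1', 'O2')]
```

An empty result means the node stays silent. A non-empty result is what would trigger resolution.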

Algorithm for distributed Top-K monitoring

Resolution

Let F be the set of objects whose partial values at Nf are involved in violated constraints. (F contains one or more objects from T plus one or more objects not in T.)

Phase 1: The node at which one or more constraints have failed, Nf, sends a message to the coordinator N0 containing a list of failed constraints, a subset of its current partial data values, and a special "border value" it computes from its partial data values.

Phase 2: The coordinator determines whether all invalidations can be ruled out based on information from nodes Nf and N0 alone. If so, the coordinator performs reallocation to update the adjustment factors pertaining to those two nodes to reestablish all arithmetic constraints, and notifies Nf of its new adjustment factors. If not, resolution proceeds to Phase 3.

Phase 3: The coordinator requests relevant partial data values and a border value from all other nodes and then computes a new top-k set defining a new set of constraints, performs reallocation across all nodes to establish new adjustment factors to serve as parameters for those constraints, and notifies all monitor nodes of a potentially new approximate top-k set T′ and the new adjustment factors.


In Phase 1, node Nf computes its border value from its partial data values.


In Phase 2

Each violated constraint at Nf represents a potential invalidation of the approximate top-k set T that needs to be dealt with. The coordinator considers each pair of objects whose constraint has been violated and applies a validation test to attempt to rule out invalidations.

If all invalidations are ruled out, the top-k set T is unchanged, and the coordinator proceeds to adjustment factor reallocation.

If some invalidation cannot be ruled out, T may have changed, and the coordinator proceeds to Phase 3 to compute a new top-k set T′.

Algorithm for distributed Top-K monitoring

Adjustment Factor Reallocation

Reallocation of the adjustment factors is performed once when the initial top-k set is computed, and again during each round of resolution, either in Phase 2 or in Phase 3.

In Phase 2, reallocation updates only the adjustment factors pertaining to Nf and N0.

In Phase 3, reallocation is performed across all nodes.

Algorithm for distributed Top-K monitoring

The flexibility in adjustment factors is captured in a notion we call object leeway. The leeway λt of an object Ot in the top-k set is a measure of the overall amount of "slack" in the arithmetic constraints (the numeric gap between the two sides of the inequality) involving partial values from the participating nodes.

The allocation of object leeway to adjustment factors at different nodes is governed by a set of m + 1 allocation parameters F0,F1,……,Fm that are required to satisfy the following restrictions:
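The restrictions themselves are not preserved in this transcript. As a hedged illustration, the sketch below assumes each allocation parameter lies in [0, 1] and that they sum to 1, so each node receives a fraction of the object's leeway. It computes only each node's share of the leeway, not the resulting adjustment factors, and the function name is hypothetical.

```python
def split_leeway(leeway, fractions, tol=1e-9):
    """Divide an object's leeway among nodes N0..Nm in proportion to F_0..F_m."""
    # Assumed restriction: every allocation parameter is a fraction in [0, 1].
    if any(f < 0 or f > 1 for f in fractions):
        raise ValueError("each allocation parameter must lie in [0, 1]")
    # Assumed restriction: the allocation parameters sum to 1.
    if abs(sum(fractions) - 1.0) > tol:
        raise ValueError("allocation parameters must sum to 1")
    return [f * leeway for f in fractions]

# Six units of leeway, none kept at the coordinator, split evenly over two nodes.
print(split_leeway(6.0, [0.0, 0.5, 0.5]))  # [0.0, 3.0, 3.0]
```

Giving the coordinator a non-zero share would keep some slack centrally, which matters when trading off how often resolution can stop at Phase 2.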


Experiments

For Monitoring Query 1, we used a 24-hour time slice of the server log data consisting of 52 million page requests distributed across 27 servers that were active during that period and serving some 20,000 distinct files.

Monitoring Query 3. Which destination host receives the most TCP packets? We used a 15-minute sliding window over packet counts for this query.

For Monitoring Query 2 we used k = 1 (i.e., we monitored the single least loaded server), and for the other two queries we used k = 20.

Experiments

Message Size

The largest message exchanged in our experiments on Monitoring Queries 1 and 2 contained only k + 2 entries. Larger messages did occasionally occur for Monitoring Query 3.

However, the average number of entries per message remained small, ranging from k + 1.0 to k + 1.2, depending on the choice of parameters used when running the algorithm.

No message contained more than k + 39 numeric entries, and fewer than 5% of messages contained more than k+3 entries.


Summary

The coordinator computes an initial top-k set by querying the monitor nodes, and then installs arithmetic constraints at the monitor nodes that ensure the continuing accuracy of the initial top-k set to within the user-supplied error tolerance.

When a constraint at a monitor node becomes violated, the node notifies the coordinator. Upon receiving notification of a constraint violation, the coordinator determines whether the top-k set is still accurate, selects a new one if necessary, and then modifies the constraints as needed at a subset of the monitor nodes. No further action is required until another constraint is violated.

Summary

Using efficient top-k monitoring techniques, the scope of detailed analysis can be limited to just the relevant data, thereby achieving a significant reduction in the overall cost to monitor anomalous behavior.
