Summary-based Routing for Content-based Event Distribution...

Summary-based Routing for Content-based EventDistribution Networks

Yi-Min Wang, Lili Qiu, Chad Verbowski, Dimitris Achlioptas, Gautam Das, and Paul LarsonMicrosoft Research, Redmond, WA, USA

Abstract— Providing scalable distributed Web-based eventingservices has been an important research topic. It is desirable tohave an effective mechanism for the servers to summarize theirfilters for in-network preprocessing in order to optimize systemperformance. In this paper, we propose a summary-based rout-ing mechanism and introduce the notion of imprecise summariesto provide a trade-off between routing overhead and event traffic.Our system uses similarity-based filter clustering to reduce overallevent traffic and performs self-tuning summary precision selectionto optimize throughput. We have implemented summary-basedrouting on top of an XML-based infrastructure that closely followsthe proposed Web services standards. Measurements from the ac-tual implementation validate our analytical and simulation results,and demonstrate the practical benefits of the proposed techniques.

I. INTRODUCTION

Today, World Wide Web use is predominantly based on thesynchronous polling model: a Web user either visits a Webpage using its URL or goes through a search engine to findpages of interest. A complementary model that is becom-ing increasingly popular is the asynchronous publish/subscribe(pub/sub) event notification model: a user subscribes to a serverby specifying events of interest, and later receives notificationswhen any of those events are published. The Instant Messag-ing buddy/contact lists, the Amazon shopping alerts, the EBayoutbid notifications, and the stock quotes and news alerts fromYahoo! Alerts and MSN Mobile are all examples of this model.

In addition, new applications with subscribers being non-human entities are also emerging. User-agent applications thatcommunicate with sensors and devices to perform automationtasks and B2B Web applications that communicate with eachother to conduct day-to-day businesses will both rely heavily onthe event notification model. Non-human subscribers presentadditional scalability challenges to the design of pub/sub sys-tems because they can significantly increase the total numberof participants, handle much more complex subscriptions, andreceive and process notifications at a much higher rate.

Pub/sub systems are typically classified into two categories:topic-based and content-based. In a topic-based system, eachevent is associated with one of a set of predefined topics and asubscriber receives notifications for all events belonging to thesubscribed topics. A content-based system supports more fine-grained selection of events by defining an event schema undera topic, which specifies the names and types of the attributesthat appear in an event. Subscription filters are then specified asconjunctions of predicates [11] on a subset of those attributes.For example, under the topic of “stock quotes”, we may havean event schema that defines three attributes: Symbol, Price,

and Volume. An example event would be (Symbol=GE andPrice=29.3 and Volume=30,000,000); an example subscriptionfilter would be (Symbol=GE and Price>30.0).

A centralized, content-based pub/sub system consists of aserver that receives and stores subscriptions from all sub-scribers, receives events from all publishers, performs matchingof each event against the subscriptions, and sends notificationsto all subscribers with matching subscriptions.

To provide scalability, several previous papers have proposeda distributed architecture that uses a network of servers [9] (alsocalled proxies [31] or brokers [4]), which can be either physi-cally co-located or distributed in geographically diverse loca-tions, to route events between publishers and subscribers. Anaive way of performing distributed pub/sub operations in sucha network is to propagate every event message to every server,and have all content-based matching operations performed lo-cally at each server. A natural optimization is to allow eachserver to “summarize” the entire set of local subscription filtersand propagate that summary information along the reverse pathof event dissemination in order for the upstream routers (or for-warders) [8] to block unnecessary event traffic at the earliestpoint [15], [4], [31]. The SIENA system [9] formalized this no-tion by arranging the subscription filters as a partially orderedset (poset), and using the set of root filters (i.e., maximal posetelements) as the summary. We refer to such a summary as aprecise summary because it precisely captures the set of eventsthat are needed by any downstream servers.

Two issues related to the use of summaries need to be ad-dressed. First, if the subscription filters submitted to the sameserver have poor “locality” (i.e., the sets of events they match donot have much overlap), then the summaries may be too broadto allow efficient routing and effective traffic reduction.

Second, although precise summaries minimize event traffic,they may not optimize overall system throughput. Dependingon the distribution of subscription filters, precise summariesmay be so complex that the routers become the bottleneck, de-grading system throughput. While empowering the routers withbetter hardware helps to alleviate the throughput degradation tosome extent, it is not cost-effective to provision routers basedon peak load, which can be orders of magnitudes higher thanthe typical load. By properly adjusting summary precision, wecan dynamically optimize the overall throughput using the ex-isting hardware under varying loads.

In this paper, we propose subscription partitioning andsummary-based routing to address the above issues, and eval-uate their effectiveness using analysis, simulation, and imple-mentation. As the first step to gain insights into subscription

ACM SIGCOMM Computer Communications Review Volume 34, Number 5: October 200459

partitioning and summary-based routing, we make the follow-ing simplifications and assumptions in this paper.

First, we focus on the case where the servers are physicallyco-located. As described in Section III-D, our idea can be gen-eralized to handle widely distributed servers by incorporatingnetwork distance among servers.

Second, different from many previous work, which used re-verse path forwarding to set up forwarding table (e.g., [9]), ourstudy focuses on event notification systems that decouple thesubscription paths from the notification paths. Many existingcommercial systems adopt this model. For example, a userwould go to a Web page to enter his or her subscriptions andspecify that the notifications should be sent to his or her cellphone. Such architecture gives us the flexibility to route andcluster the subscriptions once they enter the network, and al-lows us to focus on optimizing the system throughput where thesystem excludes the notification paths (e.g., the path in a cellu-lar network towards a subscriber’s cell phone). In cases whereit is desirable to have servers handle subscriptions for nearbyclients, we can extend our scheme by considering subscribers’locations as additional dimensions for clustering as describedin Section III-D.

Third, our work focus on the distribution topology that hasone level of branching as shown in Figure 1, where an eventdispatcher is connected directly to all servers. We use Ns todenote the number of servers throughout the paper. The re-maining symbols in Figure 1 will be explained in Section IVand Appendix . We leave extension of our approach to handlegeneral distribution topologies with potentially multiple levelsof branching as part of our future work.

Ns servers Ts F

Dispatcher Td

R Tl

Tl

Tp Publishers

Fig. 1. System architecture evaluated in this paper.

Our key contributions and results can be summarized as fol-lows.

First, we propose a summary-based routing framework andintroduce the notion of imprecise summaries to provide a trade-off between routing efficiency and event traffic: summaries witha lower precision would allow more efficient routing at the costof a higher amount of false-positive event traffic. The frame-work includes previous work on precise summaries and no sum-maries as two extremes of the spectrum.

Second, to make summaries more compact and effective, weuse similarity-based filter clustering for offline subscription par-titioning and online subscription routing. We evaluate the ben-efit of clustering using both uniform and Zipf-like subscriptionand event distributions.

Third, to evaluate the impact of summary precisions on over-all system throughput, we present analysis and simulation re-sults to show that varying summary precisions can provideload balancing among system components to optimize systemthroughput.

Fourth, to allow dynamic adaptation to changing system andnetwork characteristics, we present a self-tuning algorithm thatperforms ongoing monitoring of throughput bottleneck and au-tomatically adjusts summary precision to bring the system to-wards a new optimal operating point.

Finally, we describe an implementation of summary-basedrouting on an XML/SOAP messaging infrastructure conform-ing to a set of proposed Web Services standards [30]. With ac-tual parameters from the implementation and the system setup,we give a detailed discussion on when imprecise summariesare useful in practice. We present experimental results from theimplementation to validate our analysis and to demonstrate thecapability of our self-tuning algorithm to dynamically maintainoptimal throughput.

This paper is organized as follows. In Section II, we give anarchitecture overview of a content-based event distribution net-work. In Section III, we describe and evaluate summary-basedsubscription and event routing. In Section IV, we present ananalysis and simulation results for throughput evaluation, anddescribe a self-tuning algorithm for adaptive throughput opti-mization. In Section V, we describe the architecture of our WebServices-based implementation and discuss actual experimentalresults. We review previous work in Section VI, and concludethe paper in Section VII.

II. SYSTEM ARCHITECTURE

Just as Content Distribution Networks (CDNs) have been de-ployed to provide scalable Web information dissemination, wepropose building Event Distribution Networks (EDNs) to pro-vide scalable event dissemination. An EDN will be built as aself-configuring overlay network of servers. As shown in theleft-hand side of Figure 2, a geographically distributed subsetof nodes, called edge servers, are deployed to provide a low-latency front-end interface to geographically distributed pub-lishers and subscribers. Other servers reside inside the networkand may host subscriptions or route traffic or both. Through theedge servers, event sources (i.e., publishers) publish events andsubscribers submit subscriptions for events of interest.

Notification Routing Service

1. Submit subscription 2. Subscription routing 3. Summary reporting

4. Summary exchange 5. Event routing 6. Notification delivery

Edge Servers

Summary- Based Router

Summary Manager

Single-Node

Filtering Engine

Summary

Subscription

Event

EDN Node

EDN Node

Event Sources

Subscriber

1

2 3

3

4

5

5

6

6

Subscriber

1 6

6

Fig. 2. EDN network architecture. (Shaded squares are edge servers; non-shaded squares are internal servers; the right-hand side is the node architecturefor an internal server.)

The following definitions will be used throughout the paper:a filter f is any expression that defines a set of eventsEf , and an


event e matches the filter f if e ∈ Ef . A filter f ′ is said to coveranother filter f if Ef ⊆ Ef ′ . A similar covering relationshipis defined between two sets of filters. A subscription consistsof a notification address and a subscription filter, which is afilter in the form of a conjunction of predicates conforming to apre-defined event schema. When a server hosting a subscriptionreceives a published event that matches the subscription’s filter,it sends a notification to the corresponding notification address.Given a set of subscription filters F , a summary S of F is a setof filters that covers F . A filter f ∈ F is maximal if there is nof ′ ∈ F such that f ′ covers f . Clearly, the set of all maximalfilters in F covers F .

Any subscription submitted to an edge server is forwarded toan internal server. (We will use the term “server” to mean “in-ternal server”.) In this paper, we focus on the following type ofnotification addresses. The subscriber specifies the address ofa notification routing service followed by a device-independentID such as a Yahoo! ID or a .NET Passport ID. A notificationis first sent to the routing service, which resolves the ID andforwards the notification to the subscribing user via instant mes-saging, email, cell phone SMS, etc. [27]. When such type of no-tification addresses is in use, the internal servers can effectivelybatch notifications, which alleviates the scalability problem onthe notification delivery side and allows us to concentrate on theevent propagation side.

Upon receiving a subscription from an edge server, an inter-nal server may further forward it to another server whose sum-mary overlaps the most with (and hopefully already covers) thenew filter, thus minimizing the additional event traffic handledby each server. The server makes the routing decision based onthe subscription routing summaries. Each subscription serverknows about all other subscription servers in the network, andperiodically sends them a summary of its subscription set forrouting subscriptions. The right-hand side of Figure 2 illus-trates the architecture of an internal server. The single-nodefiltering engine can be any pub/sub engine that stores subscrip-tions in efficient data structures for fast matching [14], [2],[11]. The summary manager is responsible for maintainingsummaries for the entire sub-tree rooted at the current node.1 It intercepts all new subscriptions entering the local filteringengine, and updates the local summary accordingly. It also re-ceives summaries from all child nodes. All these summaries aremade available to the summary-based router, which maintainsthem in an efficient data structure for fast event and subscrip-tion routing. The summary manager combines all summariesinto a sub-tree summary and reports it to the parent node. Uponreceiving an event, a summary-based router forwards it downonly those sub-trees whose summaries the event matches.

The format of the summaries can be different from that ofthe subscription filters, and thus the data structure used by thesummary-based routers can be different from that used by thefiltering engines. For example, when shopping at an online auc-tion site, users may subscribe notifications by specifying ranges

1For ease of presentation, we assume throughout the paper that the serversare statically configured into a tree. The same concept of summary-based rout-ing can be applied to systems that dynamically construct trees through eitheradvertisement forwarding [9] or SUBSCRIBE message forwarding [24] in apeer-to-peer routing system.

for price, seller rating, shipping cost, and time to arrive. Eachsubscription is then represented as a multi-dimensional rectan-gle. Instead of keeping rectangles for every subscription, wecan summarize them using a small number of bounding rect-angles to speed up routing. For equality predicates (e.g., Sym-bol=GE) in subscription filters, we can summarize them usinga Bloom filter [6], [28]. Another way of summarizing subscrip-tions is to use only a selected subset of attributes in the eventschema.

Let S denote a summary of the set of filters F representingall the subscriptions hosted by a single server. The summary Sis precise if EF = ES ; otherwise, EF ⊂ ES and it is called animprecise summary. Clearly, the set of all maximal filters of Fforms a precise summary of F . The most imprecise summarycovers the entire event space, which corresponds to the caseof not using summaries, but forwarding every event to everyserver. Reducing summary precision, i.e., increasing the sizeof ES \ EF (where “\” denotes set-minus), would introducemore “false-positive traffic” that invokes wasteful operationsat some downstream filtering engines to produce zero match.However, it would typically reduce per-event processing timeat the summary-based routers. Summary-based routing thus al-lows tunable in-network pre-processing of event traffic to opti-mize overall network performance. We note that subscriptionrouting summaries need not be the same as those used for eventrouting: while the latter must guarantee correctness, the formerare only used for optimization and so they can be in a less com-plex format and exchanged less frequently.

III. SUMMARY-BASED ROUTING

In this section, we describe how to route incoming subscrip-tions and events.

A. Subscription Routing & Partitioning

In general, there are two approaches to partitioning the over-all pub/sub operations among multiple servers: we either par-tition the event space or partition the set of subscription fil-ters [28]. We call the former Event Space Partitioning (ESP),and the latter Filter Set Partitioning (FSP). As discussed in [28],the ESP approach in general needs to replicate subscriptions tomultiple servers, making it difficult to support stateful updateand subscription deletion. Therefore, we focus on the FSP ap-proach in this paper to simplify subscription management.

1) Range Predicates: We have investigated ESP-based sub-scription partitioning of equality predicates in [28]. In this pa-per, we study subscription partitioning of range predicates usingthe FSP approach. Given an event schema with d attributes, itis convenient to model each event as a point and each subscrip-tion filter as a rectangle in a d-dimensional space where eachdimension corresponds to an attribute.

The objective of subscription partitioning is to minimize thetotal event traffic and the associated server load, while avoidingoverloading any individual server. There are two versions of theproblem: offline and online. In the offline version, we are givena set of existing subscription filters to be partitioned among aset of servers. Such subscriptions are usually available when asuccessful event service provider upgrades its pub/sub system


(e.g., upgrade from a single-node centralized architecture to adistributed one in order to accommodate growing demands). Aservice provider can also periodically re-partition the subscrip-tions using the offline scheme when the performance of the on-line version degrades to a certain threshold. In the online ver-sion, each server, upon receiving an incoming new subscription,makes a local decision based on the subscription routing sum-maries to route the subscription to the “best” server. To avoidserver overload, only those servers with load below the pre-specified threshold are considered candidate targets for routing.Intuitively, an online algorithm would perform better if it startswith an initial good partitioning provided by an offline scheme.

Below we use similarity-based filter clustering for subscrip-tion partitioning and routing. We will evaluate the effective-ness of the clustering techniques using realistic subscription andevent distributions in Section III-C.

Random routing: The random algorithm is oblivious to thesimilarity among different subscriptions. It randomly routes asubscription to a server. Subscriptions that have large overlapwith each other may be assigned to different servers, and thusincur significant duplicate traffic and server processing load forevents falling in the intersection region.

R-tree based routing: An R-tree [16] is a dynamic indexstructure for multi-dimensional data rectangles. It is a height-balanced tree similar to a B-tree [5], with all data rectanglesresiding at the leaf nodes. Typically, an R-tree algorithm triesto grow the tree in such a way that rectangles with large inter-sections reside at leaf nodes that are close to each other.

An offline R-tree algorithm, such as the bulk-loading algo-rithm described in [13], builds an R-tree in a top-down fashion.At each level, it sorts the rectangles based on their minimum,maximum, and center coordinates of each dimension, consid-ers all cuts orthogonal to the coordinate axes that would alignwith the eventual partition boundaries, greedily picks the cutthat minimizes a cost function, and then recursively applies thecuts to the smaller partitions until the desired number of rect-angles per partition is reached. We apply the bulk-loading al-gorithm to the offline partitioning problem by forcing the num-ber of children of the root node to be equal to the number ofservers. Each sub-tree rooted at a child is assigned to a server.The cost function is chosen to be the sum of the volumes of thebounding rectangles of the two partitions on the two sides ofthe cut. Assuming that events are uniformly distributed, mini-mizing this sum of volumes corresponds well to minimizing thesum of event traffic.

In the online version, a new subscription is checked againsteach server’s summary R-tree and forwarded to the server thatowns a rectangle that has the maximum overlap with the sub-scription rectangle, and is not overloaded (i.e., current load isbelow a threshold).

K-Mean clustering: K-Mean [18] is a well-known cluster-ing technique to group points into clusters based on their prox-imity. It starts with an arbitrary initial cluster assignment. Thenit assigns each point to its closest cluster centroid, re-computesthe centroids after all assignments, and iterates until the totaldistance between the new clusters’ centroids and the old clus-ters’ centroids is within δ (In our experiment, δ = 1). By rep-resenting each subscription rectangle using its centroid, we can

Algorithm Storage Amortized Total TrafficSearch Time (Server Load)

Random 0 O(1) HighOffline R-Tree O(S) O(log(S)) LowestOnline R-Tree O(Ns ·Nb) O(Ns ·Nb) Low

Offline K-Mean O(S +Ns) O(I ·Ns) MediumOnline K-Mean O(Ns) O(Ns) Medium

TABLE ICOMPARISON OF DIFFERENT PARTITION ALGORITHMS, WHERE THE

NUMBER OF DIMENSIONS IN THE DATA IS A SMALL CONSTANT.

apply K-Mean clustering to partitioning subscriptions offline.We refer to this scheme as offline K-Mean.

In the online version, we route an incoming subscription Ato the server i whose subscriptions’ centroid is closest to A’scentroid, and then we update the subscriptions’ centroid at theserver i.

Summary: Table I compares the cost and performance ofdifferent partitioning algorithms, where Ns is the number ofservers, S is the total number of subscriptions, I is the numberof iterations in K-Mean, Nb is the number of summary bound-ing rectangles in the R-Tree for each server, and the number ofdimensions of rectangles is a constant. Note that Ns ∗Nb ≤ S.In Section III-C, we will evaluate the performance of these al-gorithms in more details.

Table I shows the trade-off between the complexity and per-formance of the algorithms. Random assignment is cheapest,but yields the highest amount of traffic and server load; on theother hand, R-Tree reduces the total traffic and server load atthe cost of more expensive computation; and K-Mean falls inbetween. A nice feature of the R-Tree approach is that we cansmoothly trade off the complexity for performance by adjust-ing the number of summary bounding rectangles Nb we use forsubscription routing.

B. Event Routing

We use the R-tree algorithm to route incoming events. Givenan event e, it finds all the servers which have at least onesummary bounding rectangle containing e. The pseudo-codefor event routing is shown below. Note that the event routingscheme is independent of the subscription routing scheme, andwe use the R-tree for event routing regardless of which sub-scription routing scheme is used.

Online R-tree based event routing

Input: An incoming event, eOutput: The IDs of the servers to which the event is forwarded

count = 0;foreach server sfoundMatch = SearchAnyContaining(e, RTree[s]);if (foundMatch = TRUE)

serverIDs[count++] = s;end

end

C. Performance Evaluation

In this subsection, we compare the performance of differ-ent subscription partitioning algorithms. Our evaluation differsfrom previous studies on clustering algorithms in the followingways.


First, according to our application requirement, we use eventtraffic as a performance metric for comparing different parti-tioning schemes. Specifically, we use the average per-server hitratio as the performance metric, which is defined as the totalnumber of forwarded events by the dispatcher divided by thetotal number of published events and the number of serversNs.Given a set of subscription rectangles, a lower hit ratio is desir-able as it indicates more effective traffic reduction.

Second, we examine the effectiveness of partitioning algo-rithms under different event and subscription distributions. Inparticular, we study the impact of realistic event and subscrip-tion distributions on the amount of event traffic generated.

Third, we investigate the effect of subscription summary pre-cision on the traffic reduction.

Unless otherwise specified, we use the following settings:each server hosts 10,000 subscriptions on average, with a ca-pacity constraint (i.e., load threshold) of 20,000 subscriptions.Events and subscriptions are both generated randomly in a 4-dimensional space with coordinate values uniformly distributedin the range of 0 to 10 in each dimension. In most experiments,the R-tree schemes use precise summaries for subscription rout-ing. In addition, we also evaluate the effects of imprecise sum-maries, Zipf-like subscription and event distributions, differentserver load thresholds, and different numbers of dimensions.

Varying the number of servers: Figure 3 compares the per-formance of different partitioning algorithms by varying thenumber of servers. With random partitioning, the per-serverhit ratio is the highest and close to 1.0. This means that almostevery incoming event is forwarded to all the servers due to poorclustering of subscription rectangles. In comparison, the offlineR-tree algorithm performs the best, and reduces the hit ratio by20% to 60%. Moreover, the benefit of offline R-Tree increaseswith the number of servers. The offline/online R-tree curve isobtained by using 50% of the subscriptions to build an R-treeoffline and then inserting the remaining subscriptions with theonline R-tree algorithm. It performs very close to the offlineR-tree. However the online R-tree performs noticeably worse:its hit ratio is 0.2 higher than that of the offline R-tree when thenumber of servers is large. This suggests a set of good initialpartitions obtained from the offline algorithm is important tothe performance of the R-tree. The performance of the K-Meanapproach falls between random partitioning and online R-tree.The R-tree approach outperforms K-Mean because the R-treetries to minimize the sum of rectangles volume of all server,which is a direct measure of event traffic.

Varying subscription routing summary precision: Nextwe evaluate the effect of summary precision on traffic reduc-tion. We control the precision by varying the depth of the sub-tree of the maximal R-tree that we use for subscription routing.Shallower trees with less precise summaries would allow fasterrouting, but potentially at the expense of less effective cluster-ing. Figure 4 shows the hit ratio versus the average depth ofthe summary R-Trees across all servers, when the number ofservers is 10 and 20. The leftmost data point corresponds toevery server using only the root node of the maximal R-tree;the rightmost point is for using the entire maximal R-tree as aprecise summary.

We make the following observations. First, the offline/online

0000..1100..2200..3300..4400..5500..6600..7700..8800..99

11

00 55 1100 1155 2200## sseerrvveerrss

HHiitt

rraatt

iioo pp

eerr ss

eerrvvee

rr

RRaannddoomm OOfffflliinnee RR--ttrreeeeOOnnlliinnee RR--ttrreeee OOfffflliinnee//OOnnlliinnee RR--ttrreeeeOOfffflliinnee KK--MMeeaann OOnnlliinnee KK--MMeeaannOOfffflliinnee//OOnnlliinnee KK--MMeeaann

Fig. 3. Performance of different partitioning algorithms as a function of thenumber of servers.

R-tree outperforms the corresponding online R-tree in all cases.Second, the hit ratio tends to decrease as we use more detailedR-trees. For example, relative to the case of depth-1 R-tree, theuse of precise summaries cuts down the hit ratio by 20% - 25%.

However, the decrease in hit ratio is not always monotonic.This is because the heuristic we use (i.e., route a subscriptionto the server that contains a leaf node that has the largest over-lap with it) is sometimes sub-optimal. For example, server 1may have two rectangles each overlapping 50% with the incom-ing subscription A, but the union of the two rectangle overlapswith A completely, while server 2 has one rectangle overlap-ping 80% with A. According to the heuristic, we would routeA to server 2, but this is not as good as routing it to server 1,which incurs no extra traffic.

0.4

0.5

0.6

0.7

0.8

0.9

1

1 3 5 7 9Depth of summary R-tree

Hit

rat

io p

er s

erve

r

Online R-tree (Ns=10) Online R-tree (Ns=20)

Offline/Online R-tree (Ns=10) Offline/Online R-tree (Ns=20)

Fig. 4. The effect of subscription routing summary precision.

Zipf-like subscription/event distribution: To examine theimpact of subscription distribution, below we compare the al-gorithms by generating subscriptions whose boundary valuesfollow a Zipf-like distribution.

A number of studies [3], [7], [21] report that the Web requestsfollow a Zipf-like distribution; that is, the number of requeststo the i-th popular document is proportional to 1

iα , where α is asmall constant. We analyzed the end-user subscriptions at a ma-jor stock-quote notification service provider, and also observeda Zipf-like distribution [28]. Based on the previous findings, itis likely that subscriptions in the form of range predicates mayexhibit a similar distribution.

We use the following approach to generate subscriptions withrange predicates that follow a Zipf-like distribution. Since theZipf-like distribution is only applicable for discrete values, wefirst discretize values in each dimension toB bins; then for sub-


scriptions with n attributes, we have Bn bins, where each binis numbered as x1x2..xd, with xi representing the value of thebin in the i-th dimension. Without loss of generality, we assignpopularity ranking to these bins according to their numericalorder (e.g., for a 2 dimensional space, the bins in the order ofdecreasing popularity are 00, 01, 02, ... , 10, 11, 12, ... ). Thisranking order assumes that some dimensions have greater im-pact on popularity than other dimensions, which is a reasonableassumption in practice. Next we assign the popularity distribu-tion to the bins according to 1

iα , where α is varied from 0.25to 4. Then we draw the boundary values of each subscriptionfrom the above distribution. Specifically, we pick two bins, de-noted as x1x2..xd and x′1x

′2..x′d. Then the new subscription’s

lower and upper bounds in the i-th dimension are defined bymin(xi, x′i) and max(xi, x′i), respectively.

Figure 5 shows the results when subscriptions follow a Zipf-like distribution and events are still uniformly distributed; andFigure 6 shows the results when subscriptions and events bothfollow a Zipf-like distribution with the same value of α but dif-ferent popularity ranking of the bins. 2

We make the following observations. First, in both cases theevent traffic decreases with increasing α for all partition algo-rithms. This occurs because as α increases, subscriptions aremore concentrated around certain area, and it becomes moredifficult for events with a different popularity distribution tofind matches, thereby reducing the hit ratio. Second, when αincreases, the performance of K-mean improves. This is be-cause rectangles’ lower and upper bounds are drawn from thesame popularity distribution, and an increase in α makes therectangle boundaries closer together, resulting in a small vol-ume. This increases the effectiveness of K-mean, which usesthe centriods’ positions for subscription routing. Finally, therandom algorithm performs much worse than the others. Forexample, it yields 3 - 5 times as much traffic as the offline R-tree. The clustering algorithms, such as offline R-tree, achievea higher traffic reduction rate when subscriptions follow a Zipf-like distribution because when the rectangles are clustered to-gether and have smaller volume, the effectiveness of clusteringis improved.

00

00..11

00..22

00..33

00..44

00..55

00..66

00..77

00 11 22 33 44

aallpphhaa iinn tthhee ZZiippff--ddiissttrriibbuuttiioonn

HHiitt

rraatt

iioo pp

eerr ss

eerrvvee

rr

RRaannddoomm OOfffflliinnee RR--ttrreeeeOOnnlliinnee RR--ttrreeee OOfffflliinnee//OOnnlliinnee RR--ttrreeeeOOfffflliinnee KK--mmeeaann OOnnlliinnee KK--mmeeaannOOfffflliinnee//OOnnlliinnee KK--mmeeaann

Fig. 5. The effect of Zipf-like subscription distributions, when subscriptionsfollow a Zipf-like distribution and events are uniformly distributed (Ns = 10).

Other Results: As mentioned earlier, subscription routingtakes load balancing into account by routing subscriptions only

2Since in practice the popularity rankings of events and subscriptions may notmatch, we generate the events using the same steps as we generate subscriptionsexcept that we assign the bins with a random popularity ranking instead of usingtheir numerical order.

00

00..11

00..22

00..33

00..44

00..55

00..66

00..77

00..88

00..99

11

00 00..55 11 11..55 22 22..55 33 33..55 44

aallpphhaa iinn tthhee ZZiippff--lliikkee ddiissttrriibbuuttiioonn

HHiitt

rraatt

iioo pp

eerr ss

eerrvvee

rr

RRaannddoomm OOfffflliinnee RR--ttrreeeeOOnnlliinnee RR--ttrreeee OOfffflliinnee//OOnnlliinnee RR--ttrreeeeOOfffflliinnee KK--mmeeaann OOnnlliinnee KK--mmeeaannOOfffflliinnee//OOnnlliinnee KK--mmeeaann

Fig. 6. The effect of Zipf-like subscription distributions when both subscrip-tions and events follow a Zipf-like distribution with the same value of α, butdifferent popularity ranking of the bins (Ns = 10).

to those servers whose load is below a certain threshold. Thethreshold can be used to control the degree of load balancing, sowe call it the load balancing factor, which is defined as the ra-tio of the maximum allowable server load to the average serverload. We see a clear trade-off between load balancing and thehit ratio: as we relax the load balancing constraint by increas-ing the load balancing factor, the hit ratio drops monotonicallybecause we are more likely to be able to route a subscriptionto the server closest to it in the event space rather than route itto an alternative server in order to avoid overloading the closestserver.

We also study the effectiveness in traffic reduction as a func-tion of the percentage of subscriptions used to build an R-treeoffline. In general, the hit ratio decreases as we increase thepercentage because more subscriptions are clustered better. Forexample, relative to the online R-tree data point, the hit ratiois reduced by 7%, 20% and 30% when 30%, 50%, and 100%subscriptions are used offline, respectively.

Finally, we vary the number of dimensions from 2 to 10,and observe consistent results with those from the 4-dimensioncase. The offline R-tree algorithm continues to reduce traffic by30% - 50% compared to random partitioning, and online/offlineR-tree performs close to offline R-tree, within 10% differencein hit ratio.

Summary: To summarize, our simulation results suggestthat R-trees are most effective in reducing event traffic. So wewill focus on the R-tree-based routing for the rest of the paper.

D. Extension to Widely Distributed Servers

So far, we have considered the case where servers are phys-ically co-located. We can generalize our clustering scheme tohandle widely distributed servers as follow.

First, when the paths from the dispatcher to different servershave different network distance, we can change the cost func-tion used in the offline R-tree algorithm to be a weighted sumof traffic and network distance. That is, we choose the cutthat minimizes the sum of the volumes of the bounding rect-angles of the two partitions on the two sides of the cut weightedby the network distance. Similarly, in the online R-tree algo-rithm, a new subscription is forwarded to the server that incursleast additional traffic weighted by the network distance, wherethe additional traffic incurred at a server is approximated bymini |NewSubscription \ BoundRectanglei |, and i’s are in-dices for the bounding rectangles at the server.


Second, for performance reasons it is sometimes desirable tohave servers handle subscriptions for nearby clients. In suchcases, we can estimate similarity between different subscrip-tions not only based on subscription filters but also based onsubscribers’ locations. This can be achieved in our system byconsidering the subscriber’s location (e.g., its latency towards aset of landmarks [22]) as additional dimensions for clustering.Subscription servers can then be seeded with an affinity towardsspecific coordinates in a given dimension. Clustering based onboth subscription filters and subscribers’ location is likely toproduce clusters with good locality, since subscriptions oftenexhibit spatial locality as shown in [1].

IV. THROUGHPUT OPTIMIZATION

In addition to event traffic reduction, overall system through-put is another important performance metric to optimize.Higher system throughput would mean that the event distribu-tion network can process incoming events at a higher rate, im-proving the scalability on the publisher side. Although eventrouting based on precise summaries minimizes event traffic, itdoes not necessarily optimize system throughput. Since thesize and complexity of precise summaries are determined bythe characteristics of the subscription filters and are not oth-erwise controllable, it is possible that the dispatcher may be-come the bottleneck and the system throughput is negativelyimpacted. The likelihood increases when the dispatcher servesa large number of servers, and precise summaries match only asmall percentage of overall event traffic.

We propose the use of imprecise summaries as a mecha-nism for balancing the load between the dispatcher, the net-work links, and servers to optimize system throughput. Theintuition is that, if event routing with precise summaries causesthe dispatcher’s CPU to become the bottleneck of the overallsystem, switching to less precise summaries would speed upevent routing and allow the dispatcher to accept more eventsper second, enhancing system throughput. This is, however,only true when the dispatcher remains the bottleneck. Obvi-ously, if either the dispatcher’s input link or the publishers be-come the bottleneck, there is no incentive for the dispatcher tofurther reduce summary precision to speed up event routing. Ifeither the dispatcher’s outgoing link or the servers become thebottleneck, further reducing summary precision would gener-ate more false positive traffic and cause more congestion at thebottleneck, degrading overall system throughput.

In this section, we first study system throughput using anal-ysis and simulation, and then describe a self-tuning algorithmto dynamically adjust summary precision for throughput opti-mization. Note that while our simulation and implementation(in the next section) use rectangle subscriptions as an exampleto demonstrate the benefit of a tunable summary precision forrouting, the idea is more general, and can be potentially appliedto other forms of subscriptions.

A. Throughput Analysis and Simulation Study

We analyze the impact of imprecise summaries on systemthroughput in Appendix . In this section, we summarize the keyresults and equations that will be used in the simulation study.

Referring to Figure 1, let TPD, TPOL, and TPS denote theindividual throughput components of the dispatcher CPU, theoutgoing link, and the servers, respectively. These componentsand the overall system throughput TP can be calculated as:

TPD = 1/(Td ·Ns)

TPOL = 1/(R · Tl ·Ns)TPS = 1/(R · Ts · F )

TP = min(TPD, TPOL, TPS)

where Td is dispatcher’s average per-server per-event routingtime, Ns is the total number of servers receiving event mes-sages from the dispatcher, R is the average hit ratio per server,Tl is the average per-message transmission time by dispatcher’soutgoing link, Ts is the average per-event processing time on asingle-node filtering engine, and F is the server’s load expan-sion factor as defined in the Appendix.

Let TP ′ = min(TPOL, TPS); that is, TP ′ is the through-put bottleneck downstream from the dispatcher CPU. If TP ′ <TPD for all summary precisions, then precise summary offersthe optimal throughput. If TP ′ > TPD for all summary pre-cisions, then the no-summary operating point is the optimalone. In the remaining case, the TP ′ and TPD curves intersectand the summary precision corresponding to the intersectionprovides the optimal throughput. If TP ′ = TPOL, the opti-mal Relative ThroughPut (RTP) equations with respect to theprecise-summary and no-summary operating points are

RTP op =Tdp

Ro · Tl, and (1)

RTP on =1/(Ro · Tl ·Ns)

1/(Tl ·Ns)=

1Ro

, (2)

respectively, where Ro is the average hit ratio per server at theoptimal operating point.

We use simulations to study the effect of the number of sub-scriptions, the number of dimensions, and subscription parti-tioning algorithms on the optimal relative throughput achiev-able by imprecise summaries. The impact of actual systems andnetwork parameters on the enhancement of absolute through-put will be discussed in the implementation section. Herewe assume that all communication links have a bandwidth of100Mbps or roughly 10MBps, and that the average size of eachevent message is 1KB. This translates into Tl = 1K

10M = 100 mi-croseconds per message. We present simulation results only forthe case of TP ′ = TPOL; results for the case of TP ′ = TPSare similar. The main performance metrics are therefore thetwo optimal relative throughput RTP op and RTP on defined inEq. 1 and Eq. 2, respectively. In our experiments, all eventsand subscriptions have 8 dimensions, and the values on eachdimension are uniformly distributed between 0 and 10, unlessotherwise specified.

Varying the number of rectangles: Figure 7 shows the rel-ative throughput as a function of summary precision, which isrepresented as the ratio between the number of bounding rect-angles and the total number of subscription rectangles. Eachcurve corresponds to a different number of subscription rectan-gles and is normalized by its own precise-summary throughput,


which is at the rightmost of the curve. As it shows, initiallywhen the system’s bottleneck is at the outgoing link, increasingthe summary precision reduces the amount of out-going traf-fic and alleviates the bottleneck, thereby improving the overallthroughput. On the other hand, an increase in the summary pre-cision also puts a higher load on the CPU. At some point whenthe CPU becomes a bottleneck, a further increase in summaryprecision aggravates the bottleneck, and decreases the overallthroughput. The peak of each curve marks the intersection ofthe TPOL and TPD curves (i.e., when the throughput of dis-patcher CPU equals to that of outgoing link), and thus the opti-mal operating point, and its Y-axis value gives the RTP op mea-sure. The RTP on measure can be calculated based on the hitratio Ro shown in the legend. (Rp is the average hit ratio perserver when precise summaries are used.)

0.5

1

1.5

2

2.5

3

0% 20% 40% 60% 80% 100%

Summary precision

Rel

ativ

e th

rou

gh

pu

t

100,000 rectangles (Rp=0.75;Ro=0.97) 50,000 rectangles (Rp=0.67;Ro=0.89)20,000 rectangles (Rp=0.54;Ro=0.82) 10,000 rectangles (Rp=0.42;Ro=0.73)

Fig. 7. System throughput for different number of rectangles.

As the number of rectangles increases, both the number ofmaximal rectangles and the hit ratio R increase, which to-gether cause the average precise-summary routing time Tdp toincrease. As a result, RTP op increases from 1.55 for 10,000rectangles to 3.01 for 100,000 rectangles, a 200% increase insystem throughput over precise summaries. This demonstratesthat, compared to precise summaries, imprecise summaries areespecially useful when the number of subscriptions is large.

On the other hand, the optimal relative throughput with re-spect to the no-summary operating point is RTP on = 1

Roand

the value drops from 1.37 to 1.03 as the number of rectanglesincreases. Fortunately, Figure 8 shows that this negligible 3%throughput gain (for 100,000 rectangles) is not inherent as wescale up to a large number of subscription rectangles. Whenwe limit the volume of each rectangle and thus reduce the hitratio, the use of imprecise summaries can still increase systemthroughput by 67% ( 1

0.60 − 1) compared to no summaries, asshown by the lower curve in Figure 8. The shape of the curvein fact resembles the lower curves in Figure 7.

Varying the number of dimensions: Figure 9 studies the ef-fect of the number of dimensions on relative throughput whilekeeping a constant number of 10,000 subscription rectangles.As the number of dimensions increases, the hit ratio Rp de-creases because the volume of the event space increases expo-nentially, while the routing time Tdp increases due to a higheroverhead incurred in checking if a point belongs to a rectan-gle. As a result, the dispatcher’s CPU is more likely to be thebottleneck, and RTP op tends to increase as the number of di-mensions increases, as shown by the two upper curves. In con-

0.5

1

1.5

2

2.5

3

0% 20% 40% 60% 80% 100%

Summary precision

Rel

ativ

e th

rou

gh

pu

t

Rp=0.75;Ro=0.92 Rp=0.37;Ro=0.60

Fig. 8. Effect of hit ratios on optimal throughput gain (100,000 rectangles).

trast, the two lower curves illustrate the cases where imprecisesummaries are not useful: the low route time and high hit ra-tio result in TPOL < TPD for all summary precisions, and soprecise summaries provide the optimal throughput.

0.5

1

1.5

2

0% 20% 40% 60% 80% 100%

Summary precision

Rel

ativ

e th

rou

gh

pu

t

10 dims (Rp=0.11;Ro=0.76) 8 dims (Rp=0.42;Ro=0.73)6 dims (Rp=0.81) 4 dims (Rp=0.98)

Fig. 9. System throughput for different numbers of dimensions.

Subscription locality and partitioning: The results pre-sented so far have assumes that randomly generated subscrip-tion rectangles are randomly partitioned among the servers. Wenext evaluate the effect of subscription locality and partition-ing on system throughput. We simulate locality by generatingthe same random subscriptions within a subspace of the eventspace. The results (not shown here) are very close to the follow-ing analytical results. Suppose the ratio of the subspace volumeto the entire event space volume is x. Clearly, the hit ratio Rwill drop by that ratio. The routing time Td also drops by ap-proximately the same ratio because events within the subspaceare routed using the same R-tree (with all rectangles proportion-ally smaller) and those outside the subspace can be filtered outvery quickly by the top-level bounding rectangle. The absolutethroughput thus increases by a ratio of 1

x , but the optimal op-erating point stays the same and so RTP op remains unchanged.The relative throughput RTP on increases by a ratio of 1

x .A very similar behavior is observed when random subscrip-

tions are partitioned offline using a bulk-loading R-tree algo-rithm. Figure 10 shows the relative throughput for 100,000 rect-angles on 10 servers with and without offline partitioning. Thetop two curves (with square points) are normalized against thesame precise-summary throughput value to illustrate the bene-fit of offline partitioning in terms of increased system through-put. In this case, offline partitioning reduces the routing timeTd by approximately 50% and hence doubles the throughput.The bottom curve is constructed by normalizing the top curve


against its own precise-summary throughput. Since offline par-titioning also reduces the hit ratioR by approximately 50%, thiscurve matches the without-partitioning curve very well, whichindicates that imprecise summaries remain effective in furtherenhancing the already increased system throughput.

0.5

1

1.5

2

2.5

3

0% 20% 40% 60% 80% 100%

Summary precision

Rel

ativ

e th

rou

gh

pu

t

With partitioning (w.r.t. without partitioning)Without partitioningWith partitioning

Fig. 10. Effect of subscription partitioning on system throughput.

B. Self-tuning

In addition to demonstrating the benefits of imprecise sum-maries for throughput optimization, the analysis and simulationresults presented so far can also be used to perform offline se-lection of optimal summary precision, if a sufficient numberof representative subscriptions are available. However, if thoseare not available or if the optimal operating point shifts due tochanges in the event or subscription distributions or in networkconditions, the system must be able to automatically determinethe (new) optimal operating point and self-adapt using the op-timal summary precision for routing. We describe such a self-tuning algorithm in this subsection.

The main idea behind the algorithm is based on the followingobservation: if the dispatcher’s CPU is the bottleneck, we canmove towards the optimal operating point, OPT , by reducingsummary precision; otherwise, either the dispatcher’s outgoinglink or the servers are the bottleneck and we can move towardsOPT by increasing summary precision. Instead of measuringthe various parameters used in the throughput equations, ourself-tuning algorithm simply relies on the dispatcher to monitorits own CPU load and declare itself the bottleneck if and only ifit is fully loaded. In practice, the dispatcher periodically sam-ples its CPU usage and compares the average CPU load duringthe last time interval (e.g., 1 minute) with a threshold (e.g., 96%usage). A threshold lower than 100% is necessary to accountfor measurement errors and small fluctuations in CPU load.

The following pseudo-code describes our self-tuning algo-rithm. When either the (variable) timeout t occurs or a signif-icant change in system throughput is detected, the dispatcherincrements the self-tuning step index p, changes the summaryprecision by x in the direction discussed previously, and in-vokes the update t() function. Depending on the supportedgranularity of x (which may be determined based on imple-mentation efficiency and complexity 3), the limited number of

3To improve adaptivity, we can use a larger step size when we are fartheraway from the optimal point, and use a smaller step size when we get closer tothe optimal.

operating points may force the algorithm to oscillate around theoptimal point. The update t() function tries to capture the factthat the best achievable operating point has been reached by de-tecting the number of repeated oscillations exceeding a thresh-old (e.g, 3 in the pseudo-code). When that happens, the timeoutvalue t is increased to n ∗ t, n ≥ 1, (subject to an upper boundof maxT ) to dampen the oscillation. The factor n provides atrade-off between stability and adaptivity: a smaller n wouldallow the system to detect changes in optimal operating pointfaster, achieving greater adaptivity; this would however causethe oscillation to happen at a higher frequency, and thus lessstability.

Self-tuning algorithmwait for timeout t or throughputChangeTriggerp++;if (dispatcher CPU-bound)

decrease summary precision by xelse

increase summary precision by xendupdate t();

function update t()if (p < 3)t = 1;

else if (precision[p] = precision[p− 2] andprecision[p− 1] = precision[p− 3])if (precision[p] > precision[p− 1])numOscillations++;if (numOscillations >= 3 and n ∗ t ≤ maxT )t = n ∗ t;numOscillations = 0;

endend

elsenumOscillations = 0;t = 1;

end

V. AN XML-BASED IMPLEMENTATION

We next describe an implementation of summary-based sub-scription and event routing in an XML-based Web Servicesframework. The framework implements a set of SOAP-basedprotocols that are being proposed to W3C as standards [30].

A. Design & Implementation

To allow experimentations with different algorithms, wefirst modified the framework to support extensibility at vari-ous points of the system, and then plugged in our routing andmatching modules through the extensibility mechanism.

Figure 11 illustrates the high-level software architecture ofthe extensible framework as well as EDN-specific components(in shaded boxes). The Messaging Layer provides the infras-tructure for sending and receiving XML messages between WebServices endpoints. The Namespace Binding Layer maintainsa hierarchical namespace (for event topics or routing table en-tries) and associates each name entry with a matcher class thatwould be instantiated to store the filters contained in messagessent to that name. When a new route or topic entry is createdin the namespace, the creation message has the option of spec-ifying a URI that identifies the matching engine to use for han-dling any filter operations associated with the topic/route. The


default matcher class in the framework is the standard XPathfilter matcher, which is also used by the Messaging Layer formessage dispatching.

The Base Route Manager registers a handler with the Mes-saging Layer to receive all incoming messages, and uses anamespace layer instance and the matcher instances associatedwith its name entries to make routing decisions. It also regis-ters another handler to receive route administration messagesfor creating, deleting, and enumerating route information. TheBase Subscription Manager registers a handler to receive allmessages related to topic management, subscriptions, and eventpublications. In addition to supporting pluggable matcher andnamespace implementations, the extensibility also allows appli-cations to extend the base classes in order to add custom XMLelements to base XML administrative messages and to overrideor include additional logic in route or topic management.

EDN R-tree

Matcher

XPath Filter

Matcher

Messaging Layer

Namespace Binding Layer

Base Route Manager

Base Subscription Manager

EDN Route Manager

EDN Subscription Manager

Fig. 11. Software architecture of extensible Web Services framework.

In an R-tree-based EDN system, the dispatcher runs an ex-tended Route Manager with namespace entries associated withthe RTreeRouteSet class (instead of the default XPath filtermatcher class); an RTreeRouteSet instance holds the set ofrouting R-trees, one from each server, matches an incomingevent against all the trees, and returns a list of servers thathave a match. Each EDN server runs an extended Route Man-ager for random, R-tree, or K-Mean-based subscription rout-ing; it also runs an extended Subscription Manager with names-pace entries associated with the RTreeMatchingEngine class.An RTreeMatchingEngine instance uses an R-tree matcher asthe single-node filtering engine and another R-tree matcheras the summary manager that maintains the maximal R-tree.When the server’s routing R-tree is changed due to the inser-tion of new subscriptions, the updated R-tree is sent to the dis-patcher’s Route Manager as a route update message. Alterna-tively, an RTreeMatchingEngine instance can use the XPath fil-ter matcher as the filtering engine and extract the “most distin-guishing” range predicates from each new subscription for usein the R-tree summary manager; that is, it allows the use ofdifferent formats for filtering and routing.

We have set up a network of 10 servers for running exper-iments. All of them are 1.7GHz Pentium 4 PCs with 1GBof memory. The results on subscription partitioning with of-fline R-trees matched the simulation results. The hit ratios fromthe online R-tree and offline/online R-tree experiments differedslightly from the simulation results due to the nondeterminis-tic relative timing between subscription arrivals and summaryupdates. We will focus on the system throughput issues in theremainder of this section.

B. When is Imprecise Summary Useful?

A key question to ask is: “in practice, when is imprecisesummary useful, and how to choose the precision to optimizesystem throughput?” We answer the question in three steps.First, we extend the dispatcher CPU-bound throughput equa-tion to accommodate various system parameters from actualimplementations. Next, we show that imprecise summaries in-deed provide throughput gains for our Web Services-based im-plementation, and the actual throughput measurements closelymatch the calculated curves from the equations. We then usethe calculated curves to explore the “what if” scenarios as wevary system parameters such as messaging overhead, networkbandwidth, and CPU speed.

The throughput equation TPD = 1Td·Ns assumes that mes-

saging overhead is negligible with respect to routing overhead.This may not be true in XML-based implementations, which tryto provide self-describing messages and interoperability at thecost of increased message size and message processing over-head. We extend the equation as follows:

TPD =1

Tf + (Td ·Ns) + Tc · (R ·Ns)(3)

where Tf is the per-event fixed overhead for the receivingand sending operations, and Tc is the per-copy sending over-head. The network-bound throughput equation is re-expressedin terms of the network bandwidth B and average message sizeSm as follows.

TPOL =B

R · Sm ·Ns(4)

In our base-line implementation, we have Tf = 1.6ms, Tc = 0(with an efficient multicast repeater), and Sm = 3KB. Inour experimental setup, we have Ns = 10, CPU speed of1.7GHz, and achievable network bandwidth B = 80Mbps ona 100Mbps link. The route time Td and hit ratio R are obtainedfrom offline simulations. (Note that Td and R can only be ob-tained when a set of existing subscriptions are available. In theabsence of such information, we will rely on the self-tuning al-gorithm to automatically determine the optimal summary preci-sion, as will be demonstrated in the next subsection. Also notethat we will focus on the above two throughput equations inthe remainder of this section; server CPU-bound scenarios thatinvolve TPS have similar behaviors.)

The “TP-Measured” curve in Figure 12 illustrates the ac-tual throughput gain measured from the system. The optimalsummary precision is around 50%, and the optimal throughputis 438 messages/sec, which is 31% and 27% higher than theprecise-summary and no-summary throughput, respectively. Itis clear from the figure that the “TP-Measured” curve closelymatches the calculated curve, which is the minimum of the“TP OL” curve (i.e., Eq. (4) with B = 80Mbps) and the“TP D-1.6ms” curve (i.e., Eq. (3) with Tf = 1.6ms).

The other two curves in Figure 12 study the throughput be-havior as the per-event fixed overhead Tf changes and the“TP OL” curve remains the same. The “TP D-0.2ms” curverepresents, for example, the case when a binary wire formatbecomes a standard and significantly reduces the messagingoverhead. This would push the CPU-bound throughput curve


upwards and move the optimal operating point towards pre-cise summary. In contrast, as Tf increases due to, for exam-ple, message authentication and integrity checking overhead,the optimal operating point would move towards no summary.In the extreme case where Tf is so high that the CPU-boundcurve no longer intersects the network-bound curve, impre-cise summaries would no longer be useful and the dispatchershould multicast every event to all servers without performingany summary-based routing.

225500

330000

335500

440000

445500

550000

555500

00%% 2200%% 4400%% 6600%% 8800%% 110000%%

SSuummmmaarryy pprreecciissiioonn

TThh

rroouu

gghh

ppuu

tt ((mm

ssggss//

sseecc))

TTPP__OOLL TTPP__DD--00..22 mmss TTPP__DD--11..66 mmss TTPP__DD--22..66 mmss TTPP--MMeeaassuurreedd

Fig. 12. System throughput as per-event overhead at the dispatcher changes.

Figure 13 examines the effect of available network band-width on system throughput, while keeping the CPU-boundcurve unchanged (which is the same as the “TP D-1.6ms” curvein Figure 12). The figure shows that, as network bandwidthdecreases from 100Mbps, 80Mbps, 40Mbps, to 10Mbps, theoptimal operating point moves towards the right. The “TP OL-10Mbps” curve no longer intersects the “TP D” curve, makingthe precise summary the optimal operating point. Figure 13demonstrates the importance of self-tuning: even when the dis-tributions of events and subscriptions remain stable, changesin network conditions may affect the available bandwidth andcause the optimal operating point to drift. The system must beable to adapt to such changes to maintain optimal throughput.

00

110000

220000

330000

440000

550000

660000

770000

00%% 2200%% 4400%% 6600%% 8800%% 110000%%


TThh

rroouu

gghh

ppuu

tt ((mm

ssggss//

sseecc))

TTPP__DD TTPP__OOLL--1100 MMbbppss TTPP__OOLL--4400 MMbbppssTTPP__OOLL--8800 MMbbppss TTPP__OOLL--110000 MMbbppss

Fig. 13. System throughput as available network bandwidth changes.

We next study the throughput behavior as the (single-processor) dispatcher CPU speed changes. The lowest curvein Figure 14 shows that, with the XML messaging layer that weare currently using and with 80Mbps available network band-width, a 800MHz dispatcher would have been unable to catchup with the network and shifted the optimal operating point tothe far left and render summary-based routing not useful. Incontrast, a 3GHz dispatcher will move the CPU-bound through-put curve upwards. This allows the system to scale up to handleevent schemas with more attributes and hence higher routing

time Td, which would bring the CPU-bound throughput curveback down and be able to benefit from the enhancement pro-vided by imprecise summaries.

110000

220000

330000

440000

550000

660000

770000

00%% 2200%% 4400%% 6600%% 8800%% 110000%%


TThh

rroouu

gghh

ppuu

tt ((mm

ssggss//

sseecc))

TTPP__OOLL TTPP__DD--880000MMHHzz TTPP__DD--11..77GGHHzz TTPP__DD--33GGHHzz

Fig. 14. System throughput as dispatcher CPU speed changes.

Finally, the two lower curves in Figure 15 illustrate the effectof non-trivial per-copy sending overhead Tc on system through-put. This represents the case where either the messaging layerdoes not optimize for multicast sending, or the application re-quires different processing for events forwarded to differentservers (such as encrypting each message with a different key).In addition to increasing per-event processing time and hencelowering the throughput curve, a higher Tc has a secondary ef-fect that decreases the slope of the curve: if a significant portionof the time that we save from routing with a lower precision is“wasted” on the Tc for forwarding the additional false-positivetraffic, the benefits of imprecise summaries may be greatly re-duced.

225500

330000

335500

440000

445500

550000

555500

00%% 2200%% 4400%% 6600%% 8800%% 110000%%


TThh

rroouu

gghh

ppuu

tt ((mm

ssggss//

sseecc))

TTPP__OOLL TTPP__DD--00 mmss TTPP__DD--00..0055 mmss TTPP__DD--00..1155 mmss

Fig. 15. System throughput as per-copy overhead in multicast changes.

C. Self-tuning Experimental Results

We illustrate the self-tuning behavior of our system with twosets of experiments, both of which use the setting correspondingto the “TP-Measured” curve in Figure 12 as the basis. We setthe granularity x of summary precision change to 10%, and use96% as the threshold for determining whether the dispatcheris CPU-bound. For simplicity, we chose n = 1 in the self-tuning algorithm, which corresponds to not dampening oscilla-tions. (In the final version, we will include the results using thedampening optimization described in Section IV-B.)

In the first set of experiments, we started the system with10% and 100% precisions, and expected the self-tuning algo-rithm to bring the system to the optimal operating point of


around 50% precision. Figure 16 illustrates the throughput andsummary precision behavior for the two experiments. Whenstarted with a 10% precision, the system was quickly brought tothe 50% point and stayed around the 40% and 50% precisions.Occasionally, the CPU usage at the 50% point dropped belowthe threshold and prompted the algorithm to try the 60% pointto see if the optimal operating point had shifted (e.g., Steps 14and 15). Since we did not vary any system or network param-eters in this experiment, the algorithm correctly switched backto the near-optimal precision points. The second experimentstarting with a 100% precision, shown in Figure 16(b), took abit longer to converge. Between Steps 6 and 7, a transient con-dition falsely suggested that the optimal operating point mightbe between 60% and 50%. Once that transient condition disap-peared, the algorithm correctly stabilized around the 40% and50% precisions.

330000

335500

440000

445500

550000

555500

00 55 1100 1155 2200 2255

SSeellff--ttuunniinngg sstteeppss

TThh

rroouu

gghh

ppuu

tt ((mm

ssggss//

sseecc))

00

1100

2200

3300

4400

5500

6600SS

uumm

mmaarr

yy pp

rreeccii

ssiioo

nn ((

%%))

TThhrroouugghhppuutt SSuummmmaarryy pprreecciissiioonn

(a) Starting from 10% precision

330000

335500

440000

445500

550000

555500

00 55 1100 1155 2200


TThh

rroouu

gghh

ppuu

tt ((mm

ssggss//

sseecc))

3300

4400

5500

6600

7700

8800

9900

110000

SSuu

mmmm

aarryy

pprree

cciissii

oonn

((%%

))


(b) Starting from 100% precision

Fig. 16. Self-tuning system throughput and summary precision.

In the second set of experiments, we changed the event dis-tribution from uniform to a normal distribution with a meanof 9.5 and a standard deviation of 0.25, and expected the self-tuning algorithm to bring the system to a new optimal operat-ing point. As shown in Figure 17, the system was started witha 10% precision and brought to the near-optimal precisions of40% and 50%; then, the event distribution was changed aroundSteps 13 and 14. The disturbance triggered the algorithm tosearch for a new optimal point, first incorrectly towards higherprecisions and then in the reversed direction. Eventually, thealgorithm stabilized around the 10% and 20% summary preci-sions, achieving a higher throughput than that at the previousoptimal operating point.

330000

335500

440000

445500

550000

555500

00 55 1100 1155 2200 2255


TThh

rroouu

gghh

ppuu

tt ((mm

ssggss//

sseecc))

00

1100

2200

3300

4400

5500

6600

7700

SSuu

mmmm

aarryy

pprree

cciissii

oonn

((%%

))


Fig. 17. Self-tuning system throughput and summary precision in response tochanging event distribution.

VI. RELATED WORK

Related work on pub/sub systems can be classified into fourcategories: (1) commercial products such as TIBCO’s TIB Ren-dezvous, Talarian’s SmartSockets, etc.; (2) interface standardsin CORBA and Java; (3) single-node subscription filtering algo-rithms such as [14], [2], [11], which are orthogonal to our work;(4) prototype event notification services such as Elvin [25],Gryphon [4], Hierarchical Proxy Architecture [31], Ready [15],SCRIBE [24], and SIENA [9].

The EDN summary-based routing was inspired by thequench expressions in Elvin, the hybrid matching schemes inReady, the vector annotation in Gryphon, the filters poset inSIENA, and the subscription merging in Hierarchical Proxy Ar-chitecture. Similar ideas have also appeared outside the pub/subliterature [26]. The more recent SIENA fast forwarding pa-per [8] described a fast content-based forwarder, which per-forms particularly well when the total number of distinct at-tribute names is large and the number of attributes per messageis relatively small.

Almost all of these work, however, couple the subscriptionpaths tightly with the notification paths (e.g., use reverse pathforwarding to set up forwarding table). In comparison, we focuson event notification systems in which the subscription pathsare completely decoupled from the notification paths. The lat-ter architecture is commonly used in many commercial pub/subsystems. Under this new architecture model, we generalize theexisting proposals by supporting imprecise summaries to opti-mize an additional performance metric – system throughput.

C. Y. Chan et al. [10] proposes a new index structure for reg-ular expressions, called RE-tree. RE-trees is similar in spiritto R-trees, but handle regular expressions rather than multi-dimensional rectangles. The notion of imprecise summary isalso useful to RE-trees, when they are used in a distributed fash-ion.

There have been several recent papers on using similarity-based clustering in the pub/sub setting. All of them are con-cerned with reducing the total number of required IP multicastaddresses for notification delivery, while the EDN subscriptionpartitioning algorithms aim at enabling compact summaries forevent traffic reduction, before notifications are generated. TheGroup Approximation Algorithm described in [20] tries to com-bine actual multicast groups into approximate groups, while in-troducing the least amount of false-positive traffic. For multi-party applications, [29] proposed using the k-means method togroup subscribers with similar sets of publishers that they are


interested in, so as to minimize overall wasted event traffic. Thegrid-based clustering framework in [23] partitioned the eventspace into cells, and associated a feature vector with each cellto indicate the set of subscribers interested in events falling intothat cell. The cells are then clustered to minimize the expectedwaste of event traffic. The paper briefly mentioned the use ofR-trees to map an event to a multicast group.

Finally, in [28] we have examined how to partition equalitypredicates to reduce event traffic.

VII. CONCLUSIONS & FUTURE WORK

In this paper, we propose a summary-based routing frame-work for content-based event distribution.

To make summary compact and effective, we cluster sub-scriptions based on their similarity. Our evaluation shows thatfor range predicates, the R-tree algorithms are most effective:compared to random partition, the R-tree algorithms cut downevent traffic by 20% to 60% when subscriptions follow a uni-form distribution, and by up to 5 times when subscriptions fol-low a Zipf-like distribution.

To optimize system throughput, we propose using imprecisesummary, which generalizes previous work on precise sum-maries and provides an important trade-off between routingoverhead and false-positive traffic. Simulation results showedthat summary-based routing could increase throughput by up to200% and 67% compared to the uses of precise summaries andno summaries, respectively.

To assess the practical benefits of summary-based routingin the context of Web-based eventing, we have implementedthe proposed techniques in an XML/SOAP-based frameworkthat conforms to the proposed Web Services standards. Perfor-mance measurements from the actual implementation largelyvalidated our throughput analysis, which was extended to ac-commodate various implementation parameters.

To help system engineers determine whether imprecise sum-maries are useful for their particular implementations, we givea detailed discussion on the behavior of system throughput asmessaging overhead, available network bandwidth, and CPUspeed varies. Most importantly, we have designed a self-tuningalgorithm that can dynamically maintain optimal throughput byautomatically adapting to changing operation conditions. Ex-perimental results from the implementation confirm the self-tuning capability in practice. Although we have focused onrange predicates in this paper, the general concept of self-tuning, imprecise summary-based subscription and event rout-ing should be applicable to other types of publish/subscribe sys-tems as well.

A number of avenues for future work remain. First, we arecurrently investigating another approach to subscription parti-tioning. To minimize the average number of servers that anevent needs to reach, we can formulate the problem as a graphpartitioning problem: each vertex represents a rectangle andthe edge cost between two vertices reflects the penalty if thetwo rectangles reside on different servers. Given the numberof servers Ns, the goal is to cut the graph into Ns clustersso that the sum of edge costs in the cut is minimized whileavoiding overloading any server. This is a well-studied prob-lem known as Capacitated Graph Partitioning [12] or Min-Cut

Clustering [19]. There are good algorithms to achieve this, suchas the multilevel Kernighan-Lin algorithm [17].

Second, we have focused on the distribution topology thathas only one level of branching. It is interesting to extendsummary-based routing for general distribution topologies.

Finally, so far we considered providing summaries for sub-scriptions after all the dimensions have been defined and sub-scriptions have been constructed in that space. An interestingdirection for future work would be to consider how our sum-mary mechanism can guide the definition of the space by offer-ing system throughput as an additional criterion for evaluatingdesign alternatives of the dimensions.

REFERENCES

[1] A. Adya, P. Bahl, and L. Qiu. Characterizing alert and browse servicesfor mobile clients. In Proc. of USENIX Annual Conference, Jun. 2002.

[2] M. K. Aguilera, R. E. Strom, D. C. Sturman, M. Astley, and T. D. Chan-dra. Matching events in a content-based subscription system. In Sympo-sium on Principles of Distributed Computing, pages 53–61, 1999.

[3] M. F. Arlitt and C. L. Williamson. Internet Web Servers: Workload Char-acterization and Performance Implications. In IEEE/ACM Transactionson Networking, pages 631–645, 1997.

[4] G. Banavar, T. D. Chandra, B. Mukherjee, J. Nagarajarao, R. E. Strom,and D. C. Sturman. An efficient multicast protocol for content-basedpublish-subscribe systems. In International Conference on DistributedComputing Systems, pages 262–272, 1999.

[5] R. Bayer and E. McCreight. Organization and maintenance of large or-dered indices. In Proc. 1970 ACM-SIGFIDET Workshop on Data De-scription and Access, Nov. 1970.

[6] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors.Communications of the ACM, 13(7):422–426, 1970.

[7] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web Cachingand Zipf-like Distributions: Evidence and Implications. In Proceedingsof INFOCOMM ’99, March 1999.

[8] A. Carzaniga, J. Deng, and A. L. Wolf. Fast forwarding for content-based networking. In Technical Report CU-CS-922-01, Department ofComputer Science, University of Colorado, Nov. 2001.

[9] A. Carzaniga, D. S. Rosenblum, and A. L. Wolf. Design and evaluationof a wide-area event notification service. ACM Transactions on ComputerSystems, 19(3):332–383, 2001.

[10] C. Y. Chan, M. Garofalakis, and R. Rastogi. RE-Tree: An efficient indexstructure for regular expressions. In Proc. of VLDB ’2002, 2002.

[11] F. Fabret, H. A. Jacobsen, F. Llirbat, J. Pereira, K. A. Ross, and D. Shasha.Filtering algorithms and implementation for very fast publish/subscribe.In Proc. of SIGMOD Conference, 2001.

[12] C. E. Ferreira, A. Martin, C. C. de Souza, R. Weismantel, and L. A.Wolsey. The node capacitated graph partitioning problem: A computa-tional study. In Mathematical Programming, 1996.

[13] Y.J. Garcia, M.A. Lopez, and S.T. Leutenegger. A greedy algorithm forbulk loading R-trees. In Univ. of Denver Computer Science Tech. Report#97-02, 1997.

[14] J. Gough and G. Smith. Efficient recognition of events in a distributedsystem. In Proc. of the 18 th Australasian Computer Science Conference,Feb. 1995.

[15] R. Gruber, B. Krishnamurthy, and E. Panagos. The architecture of theready event notification service. In Proc. of the 19th IEEE InternationalConference on Distributed Computing Systems Middleware Workshop,1999.

[16] A. Guttman. R-trees: A dynamic index structure for spatial searching. InProc. of ACM SIGMOD Conference, 1984.

[17] B. Hendrickson and R. Leland. A multilevel algorithm for partitioninggraphs. In Proc. of Supercomputing, 1995.

[18] A. K. Jain and Dubes. Algorithms for clustering data. 1988.[19] Weighted min cut. http://riot.ieor.berkeley.edu/riot/Applications/

WeightedMinCut/.[20] L. Opyrchal, M. Astley, J. S. Auerbach, G. Banavar, R. E. Strom,

and D. C. Sturman. Exploiting IP multicast in content-based publish-subscribe systems. In Proc. of Middleware, pages 185–207, 2000.

[21] V. N. Padmanabhan and L. Qiu. The Content and Access Dynamics ofa Busy Web Site: Findings and Implications. In Proceedings ACM SIG-COMM 2000, August 2000.

[22] S. Ratnasamy, M. Handley, R. Karp, and S. Shenker. Topologically-awareoverlay construction and server selection. In Proc. of IEEE INFOCOM,Jun. 2002.


[23] A. Riabov, Z. Liu, J. L. Wolf, P. S. Yu, and L. Zhang. Clustering al-gorithms for content-based publication-subscription systems. In Proc. ofICDCS, 2002.

[24] A. I. T. Rowstron, A. M. Kermarrec, M. Castro, and P. Druschel. SCRIBE:The design of a large-scale event notification infrastructure. In Inter-national Workshop on Networked Group Communication, pages 30–43,2001.

[25] B. Segall and D. Arnold. Elvin has left the building: A publish /subscribenotification service with quenching. In Proc. of AUUG, 1997.

[26] Alex C. Snoeren, Kenneth Conley, and David K. Gifford. Mesh-basedcontent routing using xml. In ACM Symposium on Operating SystemPrinciples, Oct. 2001.

[27] Y. M. Wang, P. Bahl, and W. Russell. The SIMBA user alert service archi-tecture for dependable alert delivery. In IEEE Int. Conf. on DependableSystems and Networks (DSN), 2001.

[28] Y. M. Wang, L. Qiu, D. Achlioptas, G. Das, P. Larson, and H. J. Wang.Subscription partitioning and routing in content-based publish/subscribenetworks (Brief Announcement). In Proc. of DISC, Oct. 2002.

[29] T. Wong, R. H. Katz, and S. McCanne. An evaluation on using preferenceclustering in large-scale multicast applications. In Proc. of INFOCOM,pages 451–460, 2000.

[30] Web Services standards. http://www.webservices.org/index.php/standards/.[31] H. Yu, D. Estrin, and R. Govindan. A hierarchical proxy architecture for

internet-scale event services. In Proc. of WETICE, 1999.

APPENDIXWe use the following notation to analyze system throughput

(referring to Figure 1):• Ns: the total number of servers receiving event messages

from the dispatcher.• Td: dispatcher’s average per-server per-event routing time.

It is calculated as the average per-event routing time of thesummary-based router, divided by Ns.

• R: the average hit ratio per server (always less than one).It is calculated as the per-event average number of serversreceiving events from the dispatcher, divided by Ns.

• Ts: average per-event processing time on a single-nodefiltering engine.

• F : server’s load expansion factor. In addition to invokingthe filtering engine to process each event, a server needsto generate and deliver notifications to all subscribers witha matching subscription. We use Ts · F to represent theoverall per-event processing time at the server. The fac-tor F is close to the minimum value of 1 if Ts dominatesthe processing time. This corresponds to the case whereeither a separate scalable notification delivery network isavailable and so the servers only need to send a copy ofthe event message and the IDs of the matching subscribersto the delivery network, or the application allows delayingnotification deliveries until the system is less loaded, forexample, during off-peak hours.

• Tdp, Rp and Tsp: the values of Td, R, and Ts, respectively,when precise summaries are used.

• Tdo,Ro, and Tso: the values of Td,R, and Ts, respectively,at the optimal operating point.

• Tdn and Tsn: the values of Td and Ts, respectively, whenno summary is used. (Note that Rn = 1.)

• Tl: the average per-message transmission time from acommunication link. Let B be the link bandwidth ex-pressed as the number of bytes per second; let Sm be theaverage message size in terms of number of bytes. Wehave Tl = Sm

B . For ease of presentation, we will assumethat all servers and the dispatcher have the same band-width, and hence the same Tl for all incoming and out-

going links. The analysis can be easily extended to ac-commodate different values of Tl for different links.

• Tp: the average time interval between two consecutiveevent messages sent to the dispatcher by the publisher(s).In the analysis below, we assume that all published eventsare sent to a single dispatcher. Multiple dispatchers can beused to further enhance throughput either by partitioningthe set of events or by partitioning the set of servers amongthe dispatchers. The analysis would be similar.

Let TPIL, TPD, TPOL, and TPS denote the individualthroughput components of the incoming link, the dispatcherCPU, the outgoing link, and the servers, respectively. The over-all system throughput TP is then determined by the minimumof these four components, which are calculated as follows.

TPIL = 1/max(Tp, Tl)

TPD = 1/(Td ·Ns)

TPOL = 1/(R · Tl ·Ns)

TPS = 1/(R · Ts · F )

TP = min(TPIL, TPD, TPOL, TPS)

We would like to support as high event rate from the pub-lisher as possible, so we can assume Tp < Tl. We are mainlyinterested in systems with Ns ≥ 10 and R ≥ 0.1, so we haveTPOL = 1

R·Tl·Ns ≤1Tl

= TPIL and

TP = min(1

Td ·Ns,

1R · Tl ·Ns

,1

R · Ts · F)

Clearly, the first term (inverse of Td) decreases monotonicallyand the second term (inverse of R) increases monotonically assummary precision increases. The third component has botha hit ratio and a time component Ts. We will show that italso increases monotonically with summary precision. Givena hit ratio R2 corresponding to server processing time Ts2,suppose the hit ratio increases by x percent by using a lowersummary precision, R1. The new Ts1 can be calculated asTs2·100+Ts′ ·x

100+x > Ts2·100100+x where Ts′ is the average per-event pro-

cessing time for those additional x percent of false-positive traf-fic. We then have R1 · Ts1 > R2·(100+x)

100 · Ts2·100100+x = R2 · Ts2.

Let TP ′ = min(TPOL, TPS), i.e., the minimum of the twomonotonically increasing terms. We distinguish three cases.First, if TP ′ < TPD for all summary precisions, then we haveTP = TP ′ and precise summary offers the optimal through-put. In this case, the dispatcher’s CPU is not the bottleneckeven with precise summaries, so imprecise summaries are notuseful. Second, if TP ′ > TPD for all summary precisions,then we have TP = TPD and the no-summary operating pointis the optimal one. (When the route time Td approaches zero,the messaging layer overhead for receiving and sending is nolonger negligible. This will be discussed in more details in theimplementation section.)

In the third case, the TP ′ and TPD curves intersect and thesummary precision corresponding to the intersection providesthe optimal throughput. If TP ′ = TPOL, the optimal operat-ing point happens at 1

Tdo·Ns = 1Ro·Tl·Ns . The optimal Relative


ThroughPut (RTP) with respect to the precise-summary operat-ing point is

RTP op =Optimal throughput

Precise summary throughput

=Tdp

Ro · Tl(5)

(where the subscript “p” stands for “precise summary” andthe superscript “o” indicates that TPOL is considered), whichwould be more significant when Tdp is large and imprecise sum-maries can reduce it to Tdo (which must be less than Tl becauseRo ≤ 1) while maintaining a low Ro.

The optimal relative throughput with respect to the no-summary operating point (i.e., R = 1) is

RTP on =Optimal throughput

No summary throughput

=1/(Ro · Tl ·Ns)

1/(Tl ·Ns)=

1Ro

(6)

Note that, since we have 1 ≥ Ro ≥ Rp, RTP on would be in-significant if Rp is already close to 1.

Similarly, if TP ′ = TPS , the intersection occurs at1

Tdo·Ns = 1Ro·Tso·F , and the optimal RTP’s are RTP sp = Tdp

Tdo

and RTP sn = TsnRo·Tso .


Summary-based Routing for Content-based Event Distribution...

Documents

Transcript of Summary-based Routing for Content-based Event Distribution...