
A Scalable and Reliable Matching Service for Content-based Publish/Subscribe Systems

Xingkong Ma, Student Member, IEEE, Yijie Wang, Member, IEEE, and Xiaoqiang Pei

Abstract—Characterized by an increasing arrival rate of live content, emergency applications pose a great challenge: how to disseminate large-scale live content to interested users in a scalable and reliable manner. The publish/subscribe (pub/sub) model is widely used for data dissemination because of its capacity to seamlessly expand the system to massive size. However, most event matching services of existing pub/sub systems either lead to low matching throughput when matching a large number of skewed subscriptions, or interrupt dissemination when a large number of servers fail. Cloud computing provides great opportunities for the requirements of complex computing and reliable communication. In this paper, we propose SREM, a scalable and reliable event matching service for content-based pub/sub systems in a cloud computing environment. To achieve low routing latency and reliable links among servers, we propose a distributed overlay, SkipCloud, to organize the servers of SREM. Through a hybrid space partitioning technique, HPartition, large-scale skewed subscriptions are mapped into multiple subspaces, which ensures high matching throughput and provides multiple candidate servers for each event. Moreover, a series of dynamics maintenance mechanisms are extensively studied. To evaluate the performance of SREM, 64 servers are deployed and millions of live content items are tested in a CloudStack testbed. Under various parameter settings, the experimental results demonstrate that the traffic overhead of routing events in SkipCloud is at least 60% smaller than in a Chord overlay, and the matching rate in SREM is at least 3.7 times and at most 40.4 times larger than that of the single-dimensional partitioning technique of BlueDove. Besides, SREM enables the event loss rate to drop back to 0 within tens of seconds even if a large number of servers fail simultaneously.

Index Terms—Publish/Subscribe, Event Matching, Overlay Construction, Content Space Partitioning, Cloud Computing

1 INTRODUCTION

Because of its importance in helping users make real-time decisions, data dissemination has become dramatically significant in many large-scale emergency applications, such as earthquake monitoring, disaster weather warning, and status updates in social networks. Recently, data dissemination in these emergency applications has presented a number of fresh trends. One is the rapid growth of live content. For instance, Facebook users publish over 600,000 pieces of content and Twitter users send over 100,000 tweets on average per minute [1]. The other is the highly dynamic network environment. For instance, measurement studies indicate that most users' sessions in social networks only last several minutes [2]. In emergency scenarios, sudden disasters like earthquakes or bad weather may lead to the failure of a large number of users instantaneously.

These characteristics require the data dissemination system to be scalable and reliable. Firstly, the system must be scalable to support the large amount of live content. The key is to offer a scalable event matching service to filter out irrelevant users. Otherwise, the content may have to traverse a large number of

• All authors are with the Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology, Changsha, Hunan, P. R. China, 410073.

• Yijie Wang is the corresponding author.
• E-mail: {maxingkong, wangyijie, xiaoqiangpei}@nudt.edu.cn

uninterested users before it reaches interested users. Secondly, given the dynamic network environment, it is necessary to provide reliable schemes to maintain a continuous data dissemination capacity. Otherwise, system interruptions may cause the live content to become obsolete.

Driven by these requirements, the publish/subscribe (pub/sub) pattern is widely used to disseminate data due to its flexibility, scalability, and efficient support of complex event processing. In pub/sub systems (pub/subs), a receiver (subscriber) registers its interest in the form of a subscription. Events are published by senders to the pub/sub system. The system matches events against subscriptions and disseminates them to interested subscribers.

In traditional data dissemination applications, live content is generated by publishers at a low speed, which leads many pub/subs to adopt multi-hop routing techniques to disseminate events. A large body of broker-based pub/subs forward events and subscriptions by organizing nodes into diverse distributed overlays, such as tree-based designs [3]–[6], cluster-based designs [7], [8] and DHT-based designs [9]–[11]. However, the multi-hop routing techniques in these broker-based systems lead to a low matching throughput, which is inadequate for the current high arrival rate of live content.

Recently, cloud computing has provided great opportunities for applications requiring complex computing and high-speed communication [12]: the servers are connected by high-speed networks and have powerful computing and storage capacities.


A number of pub/sub services based on the cloud computing environment have been proposed, such as Move [13], BlueDove [14] and SEMAS [15]. However, most of them cannot completely meet the requirements of both scalability and reliability when matching large-scale live content under highly dynamic environments. This mainly stems from the following facts: 1) Most of them are unsuited to matching live content with high data dimensionality due to the limitations of their subscription space partitioning techniques, which bring either low matching throughput or high memory overhead. 2) These systems adopt the one-hop lookup technique [16] among servers to reduce routing latency. In spite of its high efficiency, it requires each dispatching server to have the same view of the matching servers. Otherwise, subscriptions or events may be assigned to the wrong matching servers, which creates an availability problem in the face of concurrent joining or crashing of matching servers. A number of schemes can be used to keep a consistent view, like periodically sending heartbeat messages to dispatching servers or exchanging messages among matching servers. However, these extra schemes may bring a large traffic overhead or interrupt the event matching service.

Motivated by these factors, we propose a scalable and reliable matching service for content-based pub/sub systems in cloud computing environments, called SREM. Specifically, we mainly focus on two problems: one is how to organize servers in the cloud computing environment to achieve scalable and reliable routing; the other is how to manage subscriptions and events to achieve parallel matching among these servers. Generally speaking, we provide the following contributions:

• We propose a distributed overlay protocol, called SkipCloud, to organize servers in the cloud computing environment. SkipCloud enables subscriptions and events to be forwarded among brokers in a scalable and reliable manner. It is also easy to implement and maintain.

• To achieve scalable and reliable event matching among multiple servers, we propose a hybrid multi-dimensional space partitioning technique, called HPartition. It allows similar subscriptions to be gathered on the same server and provides multiple candidate matching servers for each event. Moreover, it adaptively alleviates hot spots and keeps the workload balanced among all servers.

• We conduct extensive experiments on a CloudStack testbed to verify the performance of SREM under various parameter settings.

The rest of this paper is organized as follows. Section 2 introduces the content-based data model and the framework of SREM. Section 3 presents the SkipCloud overlay in detail. Section 4 describes HPartition in detail.

Section 5 discusses the dynamics maintenance mechanisms of SREM. We evaluate the performance of SREM in Section 6. In Section 7, we review the related work on the matching of existing content-based pub/subs. Finally, we conclude the paper and outline our future work in Section 8.

2 DESIGN OF SREM

2.1 Content-Based Data Model
SREM uses a multi-dimensional content-based data model. Consider a data model consisting of k dimensions A1, A2, ..., Ak. Let Ri be the ordered set of all possible values of Ai. So, Ω = R1 × R2 × ... × Rk is the entire content space. A subscription is a conjunction of predicates over one or more dimensions. Each predicate Pi specifies a continuous range for a dimension Ai, and it can be described by the tuple (Ai, vi, Oi), where vi ∈ Ri and Oi is a relational operator (<, ≤, =, ≥, >, etc.). The general form of a subscription is S = ∧_{i=1}^{k} Pi. An event is a point within the content space Ω. It can be represented as k dimension-value pairs, i.e., e = ∧_{j=1}^{k} (Aj, vj). We say a pair (Aj, vj) satisfies a predicate (Ai, vi, Oi) if Aj = Ai and vj Oi vi. By this definition, an event e matches S if each predicate of S is satisfied by some pair of e.
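For concreteness, the matching rule above can be expressed in a few lines of Java. The sketch below is only an illustration of the data model in this section; the class and method names are hypothetical and not taken from the SREM prototype.

import java.util.List;
import java.util.Map;

// A predicate (Ai, vi, Oi), reduced here to a continuous range [low, high] over dimension i.
class Predicate {
    final int dim;          // index of dimension Ai
    final double low, high; // continuous range derived from (vi, Oi)
    Predicate(int dim, double low, double high) { this.dim = dim; this.low = low; this.high = high; }
    boolean satisfiedBy(double value) { return value >= low && value <= high; }
}

// A subscription S = P1 ∧ ... ∧ Pk: one predicate per constrained dimension.
class Subscription {
    final List<Predicate> predicates;
    Subscription(List<Predicate> predicates) { this.predicates = predicates; }
}

// An event e: one value per dimension, i.e., a point in the content space Ω.
class Event {
    final Map<Integer, Double> values; // dimension index -> value
    Event(Map<Integer, Double> values) { this.values = values; }
}

// e matches S iff every predicate of S is satisfied by the event's value on that dimension.
final class Matcher {
    static boolean matches(Event e, Subscription s) {
        for (Predicate p : s.predicates) {
            Double v = e.values.get(p.dim);
            if (v == null || !p.satisfiedBy(v)) return false;
        }
        return true;
    }
}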

2.2 Overview of SREM

Fig. 1: System Framework. Brokers B1 to B8 across four datacenters are organized into SkipCloud; subscribers (S1, S2) and publishers (P) connect to brokers directly, subscriptions and events are forwarded through SkipCloud, and matched events are sent to the interested subscribers.

To support large-scale users, we consider a cloud computing environment with a set of geographically distributed datacenters connected through the Internet. Each datacenter contains a large number of servers (brokers), which are managed by a datacenter management service such as Amazon EC2 or OpenStack.

We illustrate a simple overview of SREM in Figure 1. All brokers in SREM are exposed to the Internet as the front-end, and any subscriber and publisher can connect to them directly. To achieve reliable connectivity and low routing latency, these brokers are connected through a distributed overlay, called SkipCloud. The entire content space is partitioned into disjoint subspaces, each of which is managed by a number of brokers. Subscriptions and events are dispatched to the subspaces that overlap with them through SkipCloud.


TABLE 1: Notations in SkipCloud
Nb: the number of brokers in SkipCloud
m: the number of levels in SkipCloud
Dc: the average degree in each cluster of SkipCloud
Nc: the number of top clusters in SkipCloud

Thus, subscriptions and events falling into the same subspace are matched on the same broker. After the matching process completes, events are broadcast to the corresponding interested subscribers. As shown in Figure 1, the subscriptions generated by subscribers S1 and S2 are dispatched to brokers B2 and B5, respectively. Upon receiving events from publishers, B2 and B5 will send matched events to S1 and S2, respectively.

One may argue that different datacenters could be responsible for subsets of the subscriptions according to geographical location, such that we do not really need much collaboration among the servers [3], [4]. In this case, since the pub/sub system needs to find all the matched subscribers, each event has to be matched in all datacenters, which leads to a large traffic overhead as the number of datacenters and the arrival rate of live content increase. Besides, it is hard to achieve workload balance among the servers of all datacenters due to the various skewed distributions of users' interests. Another question is why we need a distributed overlay like SkipCloud to ensure reliable logical connectivity in a datacenter environment, where servers are more stable than the peers in P2P networks. This is because, as the number of servers in datacenters increases, node failure becomes the norm rather than a rare exception [17]. Node failures may lead to unreliable and inefficient routing among servers. To this end, we organize servers into SkipCloud to reduce the routing latency in a scalable and reliable manner.

Such a framework offers a number of advantages for real-time and reliable data dissemination. First, it allows the system to quickly group similar subscriptions onto the same broker due to the high bandwidth among brokers in the cloud computing environment, such that the local searching time can be greatly reduced. This is critical to reaching a high matching throughput. Second, since each subspace is managed by multiple brokers, the framework is fault-tolerant even if a large number of brokers crash instantaneously. Third, because the datacenter management service provides scalable and elastic servers, the system can be easily expanded to Internet scale.

3 SKIPCLOUD

3.1 Topology Construction

Generally speaking, SkipCloud organizes all brokers into levels of clusters. As shown in Figure 2, the clusters at each level of SkipCloud can be treated as a partition of the whole broker set.

Table 1 shows the key notations used in this section.

Fig. 2: An example of SkipCloud with 8 brokers and 3 levels. Each broker is identified by a binary string; level 2 holds the top clusters, level 0 is the global cluster, and each cluster is labeled by a ClusterID.

At the top level, brokers are organized into multiple clusters whose topologies are complete graphs. Each cluster at this level is called a top cluster. It contains a leader broker which generates a unique b-ary identifier of length m using a hash function (e.g., MD5). This identifier is called the ClusterID. Correspondingly, each broker's identifier is a unique string of length m + 1 that shares a common prefix of length m with its ClusterID. At this level, brokers in the same cluster are responsible for the same content subspaces, which provides multiple matching candidates for each event. Since brokers in the same top cluster communicate frequently among themselves, such as when updating subscriptions and dispatching events, they are organized into a complete graph so that they can reach each other in one hop.

After the top clusters have been organized, the clusters at the remaining levels can be generated level by level. Specifically, each broker decides which cluster to join at every level. The brokers whose identifiers share a common prefix of length i join the same cluster at level i, and the common prefix is referred to as the ClusterID at level i. That is, the clusters at level i + 1 can be regarded as a b-partition of the clusters at level i, so the number of clusters shrinks by a factor of b at each lower level. Let ε be the empty identifier; all brokers at level 0 join one single cluster, called the global cluster, whose ClusterID is ε. Therefore, there are b^i clusters at level i. Figure 2 shows an example of how SkipCloud organizes 8 brokers into 3 levels of clusters by binary identifiers.
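To make this prefix relationship concrete, the following Java sketch (illustrative names only, assuming binary identifiers as in Figure 2) derives the ClusterIDs that a single broker joins, one per level: simply the prefixes of its own identifier.

public final class ClusterIds {
    // The cluster a broker joins at level i is named by the prefix of length i of its identifier;
    // level 0 yields the empty string ε (the global cluster), level m yields its top cluster.
    static String[] clusterIdsPerLevel(String brokerId, int m) {
        String[] ids = new String[m + 1];
        for (int level = 0; level <= m; level++) {
            ids[level] = brokerId.substring(0, level);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Broker "101" in Figure 2 (m = 2): global cluster "", cluster "1" at level 1, top cluster "10".
        String[] ids = clusterIdsPerLevel("101", 2);
        for (int i = 0; i < ids.length; i++) {
            System.out.println("level " + i + ": cluster \"" + ids[i] + "\"");
        }
    }
}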

Algorithm 1: Neighbor List Maintenance
Input: views: the neighbor lists; m: the total number of levels in SkipCloud; cycle: the current cycle.
1   j = cycle % (m + 1);
2   for each i in [0, m-1] do
3       update views[i] by the peer sampling service based on Cyclon;
4   for each i in [0, m-1] do
5       if views[i] contains empty slots then
6           fill these empty slots with other levels' items that share a common prefix of length i with the ClusterID of views[i];


To organize the clusters of the non-top levels, we employ a lightweight peer sampling protocol based on Cyclon [18], which provides robust connectivity within each cluster. Suppose there are m levels in SkipCloud. Specifically, each cluster runs Cyclon to keep reliable connectivity among the brokers in the cluster. Since each broker falls into a cluster at each level, it maintains m neighbor lists. The neighbor list of the cluster at level i samples an equal number of neighbors from the corresponding children clusters at level i + 1. This ensures that routing from the bottom level can always find a broker pointing to a higher level. Because brokers maintain multiple levels of neighbor lists, they update the neighbor list of one level and the subsequent ones in a ring to reduce the traffic cost. The pseudo-code of the view maintenance algorithm is shown in Algorithm 1. The topology of the multi-level neighbor lists is similar to Tapestry [19]. Compared with Tapestry, SkipCloud uses multiple brokers in top clusters as targets to ensure reliable routing.

3.2 Prefix Routing
Prefix routing in SkipCloud is mainly used to efficiently route subscriptions and events to the top clusters. Note that the cluster identifiers at level i + 1 are generated by appending one b-ary digit to the identifiers of the corresponding clusters at level i. This relation between cluster identifiers is the foundation of routing to target clusters. Briefly, when receiving a routing request to a specific cluster, a broker examines its neighbor lists of all levels and chooses the neighbor that shares the longest common prefix with the target ClusterID as the next hop. The routing operation repeats until a broker cannot find a neighbor whose identifier is closer to the target than its own. Algorithm 2 describes the prefix routing algorithm in pseudo-code.

Algorithm 2: Prefix Routing
1   l = commonPrefixLength(self.ID, event.ClusterID);
2   if (l == m) then
3       process(event);
4   else
5       destB ← the broker whose identifier matches event.ClusterID with the longest common prefix, from self.views;
6       lmax = commonPrefixLength(destB.identifier, event.ClusterID);
7       if (lmax ≤ l) then
8           destB ← the broker whose identifier is closest to event.ClusterID, from views[l];
9           if (destB and myself are in the same cluster of level l) then
10              process(event);
11      forwardTo(destB, event);
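In essence, each hop of Algorithm 2 is a longest-common-prefix comparison over all neighbor lists. The following Java fragment is a simplified sketch of that next-hop choice (hypothetical names; it omits the tie-breaking against views[l] performed in lines 7-10).

import java.util.List;

final class PrefixRouting {
    // Length of the common prefix of two b-ary identifier strings.
    static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length()), i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // Among all neighbors of all levels, pick the one sharing the longest common prefix
    // with the target ClusterID; null means no neighbor is closer than this broker,
    // so the request is handled locally.
    static String nextHop(String selfId, List<String> allNeighbors, String targetClusterId) {
        int best = commonPrefixLength(selfId, targetClusterId);
        String bestId = null;
        for (String nb : allNeighbors) {
            int l = commonPrefixLength(nb, targetClusterId);
            if (l > best) { best = l; bestId = nb; }
        }
        return bestId;
    }
}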

Since a neighbor list in a cluster at level i is a uniform sample from the corresponding children clusters at level i + 1, each broker can find a neighbor whose identifier shares at least one more digit of common prefix with the target identifier before reaching the target cluster. Therefore, the prefix routing in Algorithm 2 guarantees that any top cluster will be reached in at most log_b(Nc) hops, where Nc is the number of top clusters. Assume the average degree of brokers in each cluster is Dc. Thus, each broker only needs to keep Dc · log_b(Nc) neighbors.

TABLE 2: Notations in HPartition
Ω: the entire content space
k: the number of dimensions of Ω
Ai: the dimension i, i ∈ [1, k]
Ri: the range of Ai
Pi: the predicate on Ai
Nseg: the number of segments on Ri
N′seg: the number of segments of hot spots
α: the minimum size of the hot cluster
Nsub: the number of subscriptions
Gx1···xk: the subspace with SubspaceID x1···xk

4 HPARTITION

In order to take advantage of multiple distributed brokers, SREM divides the entire content space among the top clusters of SkipCloud, so that each top cluster only handles a subset of the entire space and searches a small number of candidate subscriptions. SREM employs a hybrid multi-dimensional space partitioning technique, called HPartition, to achieve scalable and reliable event matching. Generally speaking, HPartition divides the entire content space into disjoint subspaces (Section 4.1). Subscriptions and events with overlapping subspaces are dispatched and matched on the same top cluster of SkipCloud (Sections 4.2 and 4.3). To keep workload balance among servers, HPartition divides the hot spots into multiple cold spots in an adaptive manner (Section 4.4). Table 2 shows the key notations used in this section.

4.1 Logical Space Construction

Our idea of logical space construction is inspired by existing single-dimensional space partitioning (called SPartition) and all-dimensional space partitioning (called APartition). Let k be the number of dimensions in the entire space Ω and Ri be the ordered set of values of the dimension Ai. Each Ri is split into Nseg continuous and disjoint segments R_i^j, j = 1, 2, ..., Nseg, where j is the segment identifier (called SegID) of the dimension Ai.

Fig. 3: Comparison among three different logical space construction techniques over dimensions A1 to A4: (a) SPartition, (b) APartition, and (c) HPartition.


The basic idea of SPartition, as in BlueDove [14], is to treat each dimension as a separate space, as shown in Figure 3(a). Specifically, the range of each dimension is divided into Nseg segments, each of which is regarded as a separate subspace. Thus, the entire space Ω is divided into k·Nseg subspaces. Subscriptions and events falling into the same subspace are matched against each other. Due to the coarse-grained partitioning of SPartition, each subscription falls into a small number of subspaces, which brings multiple candidate brokers for each event and low memory overhead. On the other hand, SPartition may easily form a hot spot if a large number of subscribers are interested in the same range of a dimension.

In contrast, the idea of APartition, as in SEMAS [15], is to treat the combination of all dimensions as a separate space, as shown in Figure 3(b). Formally, the whole space Ω is partitioned into (Nseg)^k subspaces. Compared with SPartition, APartition leads to a smaller number of hot spots, since a hot subspace would be formed only if all its segments are subscribed by a large number of subscriptions. On the other hand, each event in APartition has only one candidate broker. Compared with SPartition, each subscription in APartition falls into more subspaces, which leads to a higher memory cost.

Inspired by both SPartition and APartition, HPartition provides a flexible manner to construct the logical space. Its basic idea is to divide all dimensions into a number of groups and treat each group as a separate space, as shown in Figure 3(c). Formally, the k dimensions in HPartition are classified into t groups, each of which contains ki dimensions, where ∑_{i=1}^{t} ki = k. That is, Ω is split into t separate spaces Ωi, i ∈ [1, t]. HPartition adopts APartition to divide each Ωi. Thus, Ω is divided into ∑_{i=1}^{t} (Nseg)^{ki} subspaces. For each space Ωi, let G_{x1···xk}, xi ∈ [0, Nseg], be its subspace, where x1···xk is the concatenation of the SegIDs of the subspace, called the SubspaceID. For the dimensions that a subspace does not contain, the corresponding positions of the SubspaceID are set to 0.

Each subspace of HPartition is managed by a top cluster of SkipCloud. To dispatch each subspace G_{x1···xk} to its corresponding top cluster of SkipCloud, HPartition uses a hash function like MurmurHash [20] to map each SubspaceID x1···xk to a b-ary identifier of length m, represented by H(x1···xk). According to the prefix routing in Algorithm 2, each subspace will be forwarded to the top cluster whose ClusterID is nearest to H(x1···xk). Note that the brokers in the same top cluster are in charge of the same set of subspaces, which ensures the reliability of each subspace.
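A minimal sketch of this dispatching step is given below. The paper names MurmurHash [20] as the hash function; the SHA-1 digest used here is only a stand-in, and the method assumes b ≤ 10 so that each digit fits in one character.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

final class SubspaceMapping {
    // Maps a SubspaceID such as "1200" to a b-ary identifier of length m,
    // which prefix routing (Algorithm 2) then delivers to the nearest top cluster.
    static String toClusterKey(String subspaceId, int b, int m) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(subspaceId.getBytes(StandardCharsets.UTF_8));
        BigInteger h = new BigInteger(1, digest);
        BigInteger base = BigInteger.valueOf(b);
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < m; i++) {
            key.append(h.mod(base)); // one b-ary digit per level
            h = h.divide(base);
        }
        return key.toString();
    }
}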

4.2 Subscription Installation

A user can specify a subscription by defining ranges over one or more dimensions. We treat S = ∧_{i=1}^{k} Pi as the general form of a subscription, where Pi is a predicate on the dimension Ai. If S does not contain a dimension Ai, the corresponding Pi is the entire range of Ai. When receiving a subscription S, the broker first obtains all subspaces G_{x1···xk} which overlap with S. Based on the logical space construction of HPartition, all dimensions are classified into t individual spaces Ωj, j ∈ [1, t]. For each G_{x1···xk} of Ωi, it satisfies

    xi = s, if Pi ≠ Ri and R_i^s ∩ Pi ≠ ∅
    xi = 0, if Pi = Ri                                  (1)

As shown in Figure 4, Ω consists of 2 groups, and each range is divided into 2 segments. For a subscription S1 = (20 ≤ A1 ≤ 80) ∧ (80 ≤ A2 ≤ 100) ∧ (105 ≤ A3 ≤ 170) ∧ (92 ≤ A4 ≤ 115), its matched subspaces are G1200, G1100 and G0022.

Fig. 4: An example of subscription assignment. Each dimension ranges over [0, 180] and is cut into 2 segments at 90; S1 falls into the subspaces G1200, G1100, and G0022.
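To make Eq. (1) concrete, the sketch below (hypothetical names, assuming equal-width segments over each range as in Figure 4) enumerates the SegIDs of one dimension that a predicate overlaps; a predicate covering the entire range yields the wildcard SegID 0.

import java.util.ArrayList;
import java.util.List;

final class SubscriptionAssignment {
    // SegIDs (1..nSeg) of the segments intersecting the predicate [low, high];
    // per Eq. (1), a predicate equal to the whole range yields [0].
    static List<Integer> overlappingSegments(double low, double high,
                                             double rangeLow, double rangeHigh, int nSeg) {
        List<Integer> segs = new ArrayList<>();
        if (low <= rangeLow && high >= rangeHigh) { segs.add(0); return segs; }
        double width = (rangeHigh - rangeLow) / nSeg;
        for (int s = 1; s <= nSeg; s++) {
            double segLow = rangeLow + (s - 1) * width;
            double segHigh = segLow + width;
            if (high >= segLow && low < segHigh) segs.add(s);
        }
        return segs;
    }

    public static void main(String[] args) {
        // A2 in [80, 100] over range [0, 180] with Nseg = 2 overlaps segments 1 and 2,
        // which is why S1 falls into both G1100 and G1200 in the example above.
        System.out.println(overlappingSegments(80, 100, 0, 180, 2)); // [1, 2]
    }
}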

For the subscription S1, each of its matched subspaces G_{x1···xk} is hashed to a b-ary identifier of length m, represented by H(x1···xk). Thus, S1 is forwarded to the top cluster whose ClusterID is nearest to H(x1···xk). When a broker in the top cluster receives the subscription S1, it broadcasts S1 to the other brokers in the same cluster, such that the top cluster provides reliable event matching and balanced workloads among its brokers.

4.3 Event Assignment and Forwarding Strategy
Upon receiving an event, the broker forwards this event to its corresponding subspaces. Specifically, consider an event e = ∧_{i=1}^{k}(Ai, vi). Each subspace G_{x1···xk} that e falls into should satisfy

    xi ∈ {0, s}, where vi ∈ R_i^s                       (2)

For instance, let e be (A1, 30) ∧ (A2, 100) ∧ (A3, 80) ∧ (A4, 80). Based on the logical partitioning in Figure 4, the matched subspaces of e are G1200 and G0011.
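The per-dimension computation behind Eq. (2) is a single arithmetic step, sketched below with hypothetical names (dimensions outside the current group, and wildcard subspaces with 0 at a position, are handled separately as described above).

final class EventAssignment {
    // SegID of the segment that a value falls into, assuming equal-width segments.
    static int segmentOf(double value, double rangeLow, double rangeHigh, int nSeg) {
        double width = (rangeHigh - rangeLow) / nSeg;
        int s = (int) ((value - rangeLow) / width) + 1;
        return Math.min(s, nSeg); // clamp the upper range boundary into the last segment
    }

    public static void main(String[] args) {
        // Event (A1, 30) ∧ (A2, 100) over ranges [0, 180] with Nseg = 2 falls into
        // segments 1 and 2, giving the group-1 subspace G1200 as in the text.
        int x1 = segmentOf(30, 0, 180, 2);   // 1
        int x2 = segmentOf(100, 0, 180, 2);  // 2
        System.out.println("G" + x1 + x2 + "00");
    }
}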

Recall that HPartition divides the whole space Ω into t separate spaces Ωi, which indicates that there are t candidate subspaces Gi, 1 ≤ i ≤ t, for each event.


Because of the different workloads, the strategy of how to select one candidate subspace for each event may greatly affect the matching rate. For an event e, suppose each candidate subspace Gi contains Ni subscriptions. The most intuitive approach is the least first strategy, which chooses the subspace with the least number of candidate subscriptions. Correspondingly, its average searching subscription amount is Nleast = min_{i∈[1,t]} Ni. However, when a large number of skewed events fall into the same subspace, this approach may lead to unbalanced workloads among brokers. Another is the random strategy, which selects each candidate subspace with equal probability. It ensures balanced workloads among brokers. However, the average searching subscription amount increases to Navg = (1/t) ∑_{i=1}^{t} Ni. In HPartition, we adopt a probability based forwarding strategy to dispatch events. Formally, the probability of forwarding e to the space Ωi is pi = (1 − Ni/N)/(t − 1), where N = ∑_{i=1}^{t} Ni. Thus, the average searching subscription amount is Nprob = (N − ∑_{i=1}^{t} Ni^2/N)/(t − 1). Note that when every subspace has the same size, Nprob reaches its maximal value, i.e., Nprob ≤ Navg. Therefore, this approach achieves better workload balance than the least first strategy and lower searching latency than the random strategy.

4.4 Hot Spots Alleviation
In real applications, a large number of subscriptions may fall into a small number of subspaces, which are called hot spots. Matching in hot spots leads to a high searching latency. To alleviate the hot spots, an obvious approach is to add brokers to the top clusters containing hot spots. However, this approach raises the total price of the service and wastes resources of the clusters that only contain cold spots.

To utilize the multiple distributed clusters, a better solution is to balance the workloads among clusters through partitioning and migrating hot spots. The gain of the partitioning technique is greatly affected by the distribution of subscriptions in the hot spot. To this end, HPartition divides each hot spot into a number of cold spots through two partitioning techniques: hierarchical subspace partitioning and subscription set partitioning. The first aims to partition the hot spots whose subscriptions are diffused over the whole space, and the second aims to partition the hot spots whose subscriptions fall into a narrow space.

4.4.1 Hierarchical Subspace Partitioning
To alleviate the hot spots whose subscriptions diffuse over their whole space, we propose a hierarchical subspace partitioning technique, called HSPartition. Its basic idea is to divide the space of a hot spot into a number of smaller subspaces and reassign its subscriptions into these newly generated subspaces.

Step one, each hot spot is divided along the ranges of its dimensions. Assume Ωi contains ki dimensions, and one of its hot spots is G_{x1···xk}.

Fig. 5: An example of dividing hot spots. Two hot spots, G1200 and G2100, are divided by HSPartition and SSPartition, respectively.

For each xi ≠ 0, the corresponding range R_i^{xi} is divided into N′seg segments, each of which is denoted by R_i^{xi,yi} (1 ≤ xi ≤ Nseg, 1 ≤ yi ≤ N′seg). Therefore, G_{x1···xk} is divided into (N′seg)^{ki} subspaces, each of which is represented by G_{x1···xk,y1···yk}.

Step two, subscriptions in G_{x1···xk} are reassigned to these new subspaces according to the subscription installation in Section 4.2. Because each new subspace occupies a small piece of the space of G_{x1···xk}, a subscription S of G_{x1···xk} may fall into only a part of these subspaces, which reduces the number of subscriptions that each event matches against on the hot spot. HSPartition enables the hot spots to be divided into much smaller subspaces iteratively. As shown in the top left corner of Figure 5, the hot spot G1200 is divided into 4 smaller subspaces by HSPartition, and the maximum number of subscriptions in this space decreases from 9 to 4.

To avoid concurrent partitioning of one hot spot by multiple brokers in the same top cluster, the leader broker of each top cluster is responsible for periodically checking and partitioning hot spots. When a broker receives a subscription, it hierarchically divides the space of the subscription until each subspace cannot be divided further. Therefore, the leader broker that is in charge of dividing a hot spot G_{x1···xk} disseminates an update message including a three-tuple ⟨"Subspace Partition", x1···xk, N′seg⟩ to all the other brokers. Because the global cluster at level 0 of SkipCloud organizes all brokers into a random graph, we can utilize a gossip-based multicast method [21] to disseminate the update message to other brokers in O(log Nb) hops, where Nb is the total broker size.

4.4.2 Subscription Set Partitioning
To alleviate the hot spots whose subscriptions fall into a narrow space, we propose a subscription set partitioning technique, called SSPartition. Its basic idea is to divide the subscription set of a hot spot into a number of subsets and scatter these subsets to multiple top clusters. Assume G_{x1···xk} is a hot spot and N0 is the size of each subscription subset.

Step one, divide the subscription set of G_{x1···xk} into n subsets, where n = |G_{x1···xk}|/N0.


Each subset is assigned a new SubspaceID x1···xk-i (1 ≤ i ≤ |G_{x1···xk}|/N0), and a subscription with identifier SubID is dispatched to the subset G_{x1···xk-i} if H(SubID) % n = i.

Step two, each subset G_{x1···xk-i} is forwarded to its corresponding top cluster with ClusterID H(x1···xk-i). Since these non-overlapping subsets are scattered to multiple top clusters, events falling into the hot spots can be matched in parallel on multiple brokers, which brings much lower matching latency. Similar to the process of HSPartition, the leader broker that is responsible for dividing a hot spot G_{x1···xk} disseminates a three-tuple ⟨"Subscription Partition", x1···xk, n⟩ to all the other brokers to support further partitioning operations. As shown in the bottom right corner of Figure 5, the hot spot G2100 is divided into 3 smaller subsets G2100-1, G2100-2 and G2100-3 by SSPartition. Correspondingly, the maximal number of subscriptions of the subspace decreases from 11 to 4.

4.4.3 Adaptive Selection Algorithm
Because of the diverse distributions of subscriptions, neither HSPartition nor SSPartition can replace the other. On one hand, HSPartition is attractive for dividing hot spots whose subscriptions are scattered uniformly over the space. However, it is inappropriate for hot spots whose subscriptions all appear at exactly the same point. On the other hand, SSPartition can divide any kind of hot spot into multiple subsets even if all subscriptions fall into the same single point. However, compared with HSPartition, it has to dispatch an event to multiple subspaces, which brings a higher traffic overhead.

To achieve balanced workloads among brokers, we propose an adaptive selection algorithm that selects either HSPartition or SSPartition to alleviate hot spots. The selection is based on the similarity of the subscriptions in the same hot spot. Specifically, assume G_{x1···xk,y1···yk} is the subspace with the maximal number of subscriptions under HSPartition, and α is a threshold value which represents the similarity degree of the subscriptions' spaces in a hot spot. We choose HSPartition as the partitioning algorithm if |G_{x1···xk,y1···yk}|/|G_{x1···xk}| < α; otherwise, we choose SSPartition. By combining both partitioning techniques, this selection algorithm can alleviate hot spots in an adaptive manner.
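The selection rule itself is a one-line comparison, sketched below with hypothetical names; the two inputs are the size of the hot spot and the size of the largest child subspace that HSPartition would produce for it.

final class AdaptiveSelection {
    enum Strategy { HSPARTITION, SSPARTITION }

    // Choose HSPartition when the largest would-be child subspace still holds less than
    // a fraction alpha of the hot spot's subscriptions (subscriptions are spread out);
    // otherwise the subscriptions are too similar and SSPartition is used instead.
    static Strategy choose(long maxChildSubspaceSize, long hotSpotSize, double alpha) {
        return (double) maxChildSubspaceSize / hotSpotSize < alpha
                ? Strategy.HSPARTITION
                : Strategy.SSPARTITION;
    }
}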

4.5 Performance Analysis

4.5.1 The Average Searching Size
Since each event is matched in one of its candidate subspaces, the average number of subscriptions over all subspaces, called the average searching size, is critical to reducing the matching latency. We give a formal analysis as follows.

Theorem 1: Suppose the percentage of each predicate's range of each subscription is λ. Then the average searching size Nprob in HPartition is not more than (Nsub/t) ∑_{i=1}^{t} λ^{ki} if Nseg → ∞, where t is the number of groups in the entire space, ki is the number of dimensions in each group, and Nsub is the number of subscriptions.

Proof: For each dimension Ai, i ∈ [1, k], the range of Ai is Ri, and the length of the corresponding predicate Pi is λ∥Ri∥. Since each dimension is divided into Nseg segments, the length of each segment is ∥Ri∥/Nseg. Thus, the expected number of segments that Pi falls into is ⌈λNseg⌉. A space Ωi contains (Nseg)^{ki} subspaces. Then the average searching size in Ωi is Ni = ⌈λNseg⌉^{ki} Nsub / (Nseg)^{ki}. When Nseg → ∞, we have ⌈λNseg⌉/Nseg → λ, and thus Ni = λ^{ki} Nsub. According to the probability based forwarding strategy in Section 4.3, the average searching size is Nprob ≤ (1/t) ∑_{i=1}^{t} Ni. That is, Nprob ≤ (Nsub/t) ∑_{i=1}^{t} λ^{ki}.

According to the result of Theorem 1, the average searching size Nprob decreases with the reduction of λ. That is, a smaller λ brings less matching time in HPartition. Fortunately, the subscription distribution in real-world applications is often skewed, and most predicates occupy small ranges, which guarantees a small average searching size. Besides, note that (Nsub/t) ∑_{i=1}^{t} λ^{ki} reaches its minimal value Nsub·λ^{k/t} if ki = kj for all i ∈ [1, t] and j ∈ [1, t]. This indicates that the upper bound of Nprob can further decrease to Nsub·λ^{k/t} if each group has the same number of dimensions.
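For intuition, the bound can be instantiated with the default experimental settings of Section 6 (Nsub = 40,000 subscriptions, k = 8 dimensions, and HPartition-4's t = 2 groups of ki = 4 dimensions) and an assumed predicate width of λ = 0.1 of each range; this worked example is illustrative and not taken from the paper's evaluation.

% Illustrative instantiation of Theorem 1 (assumed lambda = 0.1):
\[
N_{prob} \;\le\; \frac{N_{sub}}{t}\sum_{i=1}^{t}\lambda^{k_i}
        \;=\; \frac{40000}{2}\left(0.1^{4}+0.1^{4}\right)
        \;=\; 20000 \times 2\times 10^{-4} \;=\; 4 .
\]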

4.5.2 Event Matching Reliability
To achieve reliable event matching, each event should have multiple candidates in the system. In this section, we give a formal analysis of the event matching availability as follows.

Theorem 2: SREM promises event matching availability in the face of concurrent crash failures of up to δNtop(1 − e^{−t/Ntop}) − 1 brokers, where δ is the number of brokers in each top cluster of SkipCloud, Ntop is the number of top clusters, and t is the number of groups in HPartition.

Proof: Based on HPartition, each event has t candidate subspaces which are diffused into Ntop top clusters. Over n boxes, distributing m balls at random, the expectation of the number of empty boxes is n·e^{−m/n}. Similarly, over Ntop top clusters, distributing t candidate subspaces at random, the expectation of the number of non-empty top clusters is Ntop(1 − e^{−t/Ntop}). Since each top cluster contains δ brokers which manage the same set of subspaces, the expectation of the number of non-empty brokers for each event is δNtop(1 − e^{−t/Ntop}). Thus, SREM ensures available event matching in the face of concurrent crash failures of up to δNtop(1 − e^{−t/Ntop}) − 1 brokers.

According to Theorem 2, the event matching availability of SREM is affected by δ and t. That is, both SkipCloud and HPartition provide flexible schemes to ensure reliable event matching.
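As a rough illustration, instantiating the bound with the default testbed parameters of Section 6 (Nb = 64 brokers, δ = 2 brokers per top cluster, hence Ntop = 32 top clusters) and HPartition-1's t = 8 groups gives the following; these particular numbers are an assumed example rather than a result reported in the paper.

% Illustrative instantiation of Theorem 2 (assumed N_top = N_b / delta = 32, t = 8):
\[
\delta N_{top}\left(1 - e^{-t/N_{top}}\right) - 1
  \;=\; 2 \cdot 32 \cdot \left(1 - e^{-8/32}\right) - 1
  \;\approx\; 64 \cdot 0.221 - 1 \;\approx\; 13 ,
\]
so each event keeps at least one live candidate broker even if about 13 brokers crash concurrently.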


5 PEER MANAGEMENT

In SREM, there are mainly three roles: clients, brokers, and clusters. Brokers are responsible for managing all of them. Since the joining or leaving of these roles may lead to inefficient and unreliable data dissemination, we discuss the dynamics maintenance mechanisms used by brokers in this section.

5.1 Subscriber Dynamics
To detect the status of subscribers, each subscriber establishes affinity with a broker (called its home broker), and periodically sends its subscription as a heartbeat message to its home broker. The home broker maintains a timer for each of its buffered subscriptions. If the broker has not received a heartbeat message from a subscriber for more than Tout time, the subscriber is assumed to be offline. The home broker then removes this subscription from its buffer and notifies the brokers holding the failed subscription to remove it.

5.2 Broker Dynamics
Broker dynamics may lead to new clusters joining or old clusters leaving. In this section, we mainly consider brokers joining or leaving existing clusters, rather than changes of the cluster size.

When a new broker is created by its datacenter management service, it first sends a "Broker Join" message to the leader broker of its top cluster. The leader broker returns its top cluster identifier, its neighbor lists at all levels of SkipCloud, and all of its subspaces including the corresponding subscriptions. The new broker generates its own identifier by appending a b-ary digit to the top cluster identifier and takes the received items of each level as its initial neighbors.

There is no particular mechanism to handle broker departure from a cluster. In a top cluster, the leader broker can easily monitor the status of the other brokers. For the clusters at the remaining levels, the sampling service guarantees that the older items of each neighbor list are replaced by fresh ones first during the view shuffling operation, so failed brokers are removed from the system quickly. From the perspective of event matching, all brokers in the same top cluster hold the same subspaces of subscriptions, which means that broker failures do not interrupt the event matching operation as long as at least one broker in each top cluster is alive.

5.3 Cluster Dynamics
Broker dynamics may also lead to new clusters joining or old clusters leaving. Since each subspace is managed by the top cluster whose identifier is closest to that of the subspace, it is necessary to adaptively migrate a number of subspaces from old clusters to newly joined clusters. Specifically, the leader broker of the new cluster delivers its top ClusterID, carried in a "Cluster Join"

message, to the other clusters. The leader brokers in all other clusters find out the subspaces whose identifiers are closer to the new ClusterID than to their own cluster identifiers, and migrate them to the new cluster.

Since each subspace is stored in one cluster, a cluster departure incurs subscription loss. The peer sampling service of SkipCloud can be used to detect failed clusters. To recover lost subscriptions, a simple method is to redirect the lost subscriptions via their owners' heartbeat messages. Due to the unreliable links between subscribers and brokers, this approach may lead to a long repair latency. To this end, we store all subscriptions on a number of well-known servers of the datacenters. When these servers learn of the failed clusters, they dispatch the subscriptions in those failed clusters to the corresponding live clusters.

Besides, the number of levels m in SREM is adjusted adaptively with the change of the broker size Nb to ensure m = ⌈log_b(Nb)⌉, where b is the number of children clusters of each non-top cluster. Each leader broker runs a gossip-based aggregation service [22] at the global cluster to estimate the total broker size. When the estimated broker size N^e_d is enough to change m, each leader broker notifies other brokers to update their neighbor lists through a gossip-based multicast with a probability of 1/N^e_d. If a number of leader brokers disseminate the "Update Neighbor Lists" messages simultaneously, the earlier messages will be stopped when they collide with later ones, which ensures that all leader brokers have the same view of m. When m decreases, each broker removes its neighbor list of the top level directly. When m increases, a new top level is initialized by filling appropriate neighbors from the neighbor lists at other levels. Due to the logarithmic relation between m and Nb, only a significant change of Nb can change m. So, the level dynamics brings quite low traffic cost.
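The level count itself follows directly from the broker size; the small Java sketch below (hypothetical names) computes m = ⌈log_b(Nb)⌉ with integer arithmetic to avoid floating-point rounding.

final class LevelCount {
    // Smallest m such that b^m >= nb, i.e., ceil(log_b(nb)) for nb >= 1.
    static int levels(int nb, int b) {
        int m = 0;
        long capacity = 1;
        while (capacity < nb) { capacity *= b; m++; }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(levels(64, 2)); // 6 levels for 64 brokers with b = 2
    }
}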

6 EXPERIMENT

6.1 Implementation
To take advantage of the reliable links and high bandwidth among servers in the cloud computing environment, we choose the CloudStack [23] testbed to design and implement our prototype. To keep the prototype modular and portable, we use ICE [24] as the fundamental communication platform. ICE provides a communication solution that is easy to program with and allows developers to focus only on their application logic. On top of ICE, we add about 11,000 lines of Java code.

To evaluate the performance of SkipCloud, we implement both SkipCloud and Chord to forward subscriptions and messages. To evaluate the performance of HPartition, the prototype supports different space partitioning policies. Moreover, the prototype provides three different message forwarding strategies, i.e., least subscription amount forwarding, random forwarding, and probability based forwarding, as described in Section 4.3.

TABLE 3: Default Parameters in SREM
Nb = 64; δ = 2; Dc = 6; k = 8; ki = 1, 2, 4; Nseg = 10
N′seg = 2; Nsub = 40K; σsub = 50; Nhot = 1,000; N0 = 500; β = 0.7

6.2 Parameters and Metrics

We use a group of virtual machines (VMs) in the CloudStack testbed to evaluate the performance of SREM. Each VM runs on an exclusive physical machine, and we use 64 VMs as brokers. Each VM is equipped with four processor cores and 8 GB of memory, and is connected to Gigabit Ethernet switches.

In SkipCloud, the number of brokers in each top cluster δ is set to 2. To ensure reliable connectivity of clusters, the average degree of brokers in each cluster Dc is 6. In HPartition, the entire subscription space consists of 8 dimensions, each of which has a range from 0 to 500. The range of each dimension is cut into 10 segments by default. For each hotspot, the range of each of its dimensions is cut into 2 segments iteratively.

In most experiments, 40,000 subscriptions are generated and dispatched to their corresponding brokers. The ranges of the predicates of subscriptions follow a normal distribution with a standard deviation of 50, represented by σsub. Besides, two million events are generated by all brokers. For the events, the value of each pair follows a uniform distribution along the entire range of the corresponding dimension. A subspace is labeled as a hotspot if its subscription amount is over 1,000. Besides, the size of the basic subscription subset N0 in Section 4.4.2 is set to 500, and the threshold value α in Section 4.4.3 is set to 0.7. The default parameters are listed in Table 3.

Recall that HPartition divides all dimensions into t individual spaces Ωi, each of which contains ki dimensions. We set ki to 1, 2, and 4, respectively. These three partitioning policies are called HPartition-1, HPartition-2, and HPartition-4, respectively. Note that HPartition-1 represents the single-dimensional partitioning (mentioned in Section 4.1), which is adopted by BlueDove [14]. The details of the implemented methods are shown in Table 4, where SREM-ki and Chord-ki represent HPartition-ki under SkipCloud and Chord, respectively. In the following experiments, we evaluate the performance of SkipCloud by comparing SREM-ki with Chord-ki, and evaluate the performance of HPartition by comparing HPartition-4 and HPartition-2 with the single-dimensional partitioning of BlueDove, i.e., HPartition-1. Besides, we do not use APartition (mentioned in Section 4.1) in the experiments, mainly because its fine-grained partitioning technique leads to extremely high memory cost.

TABLE 4: List of Implemented Methods
SREM-4: SkipCloud overlay, HPartition-4
SREM-2: SkipCloud overlay, HPartition-2
SREM-1: SkipCloud overlay, HPartition-1 (SPartition, BlueDove)
Chord-4: Chord overlay, HPartition-4
Chord-2: Chord overlay, HPartition-2
Chord-1: Chord overlay, HPartition-1 (SPartition, BlueDove)

We evaluate the performance of SREM through a number of metrics.
• Subscription Searching Size: the number of subscriptions that need to be searched on each broker for matching a message.
• Matching rate: the number of matched events per second. Suppose the first event is matched at moment T1 and the last one at moment T2. Thus, the matching rate is Ne/(T2 − T1), where Ne is the number of matched events.
• Event loss rate: the percentage of lost events in a specified time period.

6.3 Space Partitioning Policy
The space partitioning policy determines the number of subscriptions searched for matching a message. It further affects the matching rate and the workload allocation among brokers. In this section, we test three different space partitioning policies: HPartition-1, HPartition-2, and HPartition-4.

Fig. 6: Distribution of subscription searching sizes. (a) Cumulative distribution function of the subscription searching size; (b) average subscription searching size per matcher ID; both for HPartition-4, HPartition-2, and HPartition-1.

Figure 6 (a) shows the cumulative distribution function (CDF) of subscription searching sizes on all brokers. Most subscription searching sizes of HPartition-1 are smaller than the threshold value of Nhot. In Figure 6 (a), the percentages of “cold” subspaces in HPartition-4, HPartition-2, and HPartition-1 are 99.7%, 95.1% and 83.9%, respectively. As HPartition-4 splits the entire space into more fine-grained subspaces, subscriptions fall into the same subspace only if they are interested in the same range of 4 dimensions. In contrast, since HPartition-1 or HPartition-2 splits the entire space into less fine-grained subspaces, subscriptions are dispatched to the same subspace with a higher probability. Besides, the CDF of each approach shows a sharp increase at the subscription searching size of 500, which is caused by SSPartition in Section 4.4.2.


Figure 6 (b) shows the average subscription searching size of each broker. The maximal average subscription searching sizes of HPartition-4, HPartition-2, and HPartition-1 are 437, 666 and 980, respectively. The corresponding normalized standard deviations (standard deviation divided by the average) of the average subscription searching sizes are 0.026, 0.057 and 0.093, respectively. This indicates that brokers of HPartition-4 have smaller subscription searching sizes and better workload balance, which mainly lies in its more fine-grained space partitioning.

In conclusion, as the partitioning granularity increases, HPartition shows better workload balance among brokers at the cost of higher memory consumption.

6.4 Forwarding Policy and Workload Balance

6.4.1 Impact of Message Forwarding Policy

The forwarding policy determines to which broker each message is forwarded, which significantly affects the workload balance among brokers. Figure 7 shows the matching rates of three policies: probability-based forwarding, least subscription amount forwarding, and random forwarding.

As shown in Figure 7, the probability-based forwarding policy has the highest matching rate in various scenarios. This mainly lies in its better trade-off between workload balance and searching latency. For instance, when we use HPartition-4 under SkipCloud, the matching rate of the probability-based policy is 1.27× and 1.51× that of the random policy and the least subscription amount forwarding policy, respectively. When we use HPartition-4 under Chord, the corresponding gains of the probability-based policy become 1.26× and 1.57×, respectively. A simple sketch of these three selection strategies is shown below.
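The following sketch contrasts the three selection strategies over a set of candidate brokers. It is illustrative only: the inverse-load weighting shown for the probability-based policy is our own assumption, and SREM's actual weighting, defined earlier in the paper, may differ.

```java
import java.util.List;
import java.util.Random;

// Illustrative selection of a candidate broker for an event under the three
// forwarding policies compared in Figure 7.
public class ForwardingPolicies {
    private static final Random rnd = new Random();

    // Random forwarding: pick any candidate uniformly.
    static int randomPolicy(List<Integer> candidateLoads) {
        return rnd.nextInt(candidateLoads.size());
    }

    // Least subscription amount forwarding: pick the candidate with the fewest
    // subscriptions to search.
    static int leastLoadPolicy(List<Integer> candidateLoads) {
        int best = 0;
        for (int i = 1; i < candidateLoads.size(); i++) {
            if (candidateLoads.get(i) < candidateLoads.get(best)) best = i;
        }
        return best;
    }

    // Probability-based forwarding (assumed weighting): pick candidate i with
    // probability proportional to 1 / (load_i + 1), so lightly loaded brokers are
    // preferred without always overloading the single least-loaded one.
    static int probabilityPolicy(List<Integer> candidateLoads) {
        double[] weights = new double[candidateLoads.size()];
        double sum = 0;
        for (int i = 0; i < weights.length; i++) {
            weights[i] = 1.0 / (candidateLoads.get(i) + 1);
            sum += weights[i];
        }
        double r = rnd.nextDouble() * sum;
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r <= 0) return i;
        }
        return weights.length - 1;
    }
}
```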

6.4.2 Workload Balance of Brokers

We compare the workload balance of brokers under different partitioning strategies and overlays. In this experiment, we use the probability-based forwarding policy to dispatch events. One million events are forwarded and matched against 40,000 subscriptions. We use the index β(Nb) [25] to evaluate the load balance among brokers, where

β(Nb) = (Σ_{i=1}^{Nb} Load_i)^2 / (Nb · Σ_{i=1}^{Nb} Load_i^2),

and Load_i is the workload of broker i. The value of β(Nb) lies between 0 and 1, and a higher value means better workload balance among brokers.
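As a concrete reading of this index, the sketch below (with illustrative class and method names) computes β over a vector of per-broker workloads; perfectly balanced loads yield 1.0, while a single overloaded broker pulls the value down.

```java
// Computes the load-balance index beta(N_b) = (sum Load_i)^2 / (N_b * sum Load_i^2)
// from [25]; a value close to 1 means evenly balanced workloads across brokers.
public class LoadBalanceIndex {
    static double beta(double[] loads) {
        double sum = 0, sumSquares = 0;
        for (double load : loads) {
            sum += load;
            sumSquares += load * load;
        }
        return (sum * sum) / (loads.length * sumSquares);
    }

    public static void main(String[] args) {
        // Perfectly balanced workloads give beta = 1.0.
        System.out.println(beta(new double[]{100, 100, 100, 100}));
        // A single hotspot broker lowers the index noticeably (about 0.29 here).
        System.out.println(beta(new double[]{400, 10, 10, 10}));
    }
}
```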

Figure 9 (a) shows the distributions of the number of forwarding events among brokers. The number of forwarding events in SREM-ki is smaller than that in Chord-ki, because SkipCloud spends fewer routing hops than Chord. The corresponding values of β(Nb) of SREM-4, SREM-2, SREM-1, Chord-4, Chord-2 and Chord-1 are 0.99, 0.96, 0.98, 0.99, 0.99 and 0.98, respectively. These values indicate quite balanced workloads of forwarding events among brokers.

Fig. 9: Workload Balance. (a) The number of forwarding events (×10³) per matcher; (b) matching rate (×10³ events/s) per matcher, for SREM-4, SREM-2, SREM-1, Chord-4, Chord-2 and Chord-1.

Besides, the number of forwarding events in SREM is at least 60% smaller than in Chord. Figure 9 (b) shows the distributions of matching rates among brokers. The corresponding values of β(Nb) are 0.98, 0.99, 0.99, 0.99, 0.97 and 0.89, respectively. The balanced matching rates among brokers are mainly caused by the fine-grained partitioning of HPartition and the probability-based forwarding policy.

6.5 Scalability

In this section, we evaluate the scalability of all approaches by measuring how the matching rate changes with different values of Nb, Nsub, k and Nseg. In each experiment, only one parameter in Table 3 is changed to validate its impact.

Fig. 10: Scalability. Matching rate (×10³ events/s) versus (a) the number of brokers, (b) the number of subscriptions, (c) the number of dimensions, and (d) the number of segments, for SREM-4, SREM-2, SREM-1, Chord-4, Chord-2 and Chord-1.

We first change the number of brokers Nb. As shown in Figure 10 (a), the matching rate of each approach increases linearly with the growth of Nb. As Nb increases from 4 to 64, the gains in the matching rate of SREM-4, SREM-2, SREM-1, Chord-4, Chord-2 and Chord-1 are 12.8×, 11.2×, 9.6×, 10.0×, 8.4× and 5.8×, respectively. Compared with Chord-ki, the higher increasing rate of SREM-ki is mainly caused by the fewer routing hops of SkipCloud. Besides, SREM-4 presents the highest matching rate for various values of Nb due to the fine-grained partitioning technique of HPartition and the fewer forwarding hops of SkipCloud.


Fig. 7: Forwarding Policy. Matching rate (×10³ events/s) under the probability-based, random, and least subscription amount forwarding policies, for SREM-4, SREM-2, SREM-1, Chord-4, Chord-2 and Chord-1.

Fig. 8: The Impact of Workload Characteristics. Matching rate (×10³ events/s) versus (a) the standard deviation of subscriptions, (b) the standard deviation of events, and (c) the number of subscription patterns, for SREM-4, SREM-2, SREM-1, Chord-4, Chord-2 and Chord-1.

We change the number of subscriptions Nsub in Figure 10 (b). As Nsub increases from 40K to 80K, the matching rates of SREM-4, SREM-2, SREM-1, Chord-4, Chord-2 and Chord-1 decrease by 61.2%, 58.7%, 39.3%, 51.5%, 51.0% and 38.6%, respectively. Compared with SREM-1, each subscription in SREM-4 may fall into more subspaces due to its fine-grained partitioning, which leads to a higher increasing rate of the average subscription searching size in SREM-4. That is why the decreasing percentages of the matching rates in SREM-ki and Chord-ki increase with the growth of ki. In spite of the higher decreasing percentage, the matching rates of SREM-4 and Chord-4 are 27.4× and 33.7× that of SREM-1 and Chord-1, respectively, when Nsub equals 80,000.

We increase the number of dimensions k from 8 to 20 in Figure 10 (c). The matching rates of SREM-4, SREM-2, SREM-1, Chord-4, Chord-2 and Chord-1 decrease by 83.4%, 31.3%, 25.7%, 81.2%, 44.0% and 15.0%, respectively. Similar to the phenomenon in Figure 10 (b), the decreasing percentages of the matching rates in SREM-ki and Chord-ki increase with the growth of ki. This is because the fine-grained partitioning of HPartition-4 leads to faster growth of the average subscription searching size. When k equals 20, the matching rates of SREM-4 and Chord-4 are 8.7× and 9.4× that of SREM-1 and Chord-1, respectively.

We evaluate the impact of different numbers of segments Nseg in Figure 10 (d). As Nseg increases from 4 to 10, the corresponding matching rates of SREM-4, SREM-2, SREM-1, Chord-4, Chord-2 and Chord-1 increase by 82.7%, 179.1%, 44.7%, 98.3%, 286.1% and 48.4%, respectively. These numbers indicate that a bigger Nseg brings more fine-grained partitioning and a smaller average subscription searching size.

In conclusion, the matching rate of SREM increases linearly with the growth of Nb, and SREM-4 presents the highest matching rate in various scenarios due to its more fine-grained partitioning technique and the fewer routing hops of SkipCloud.

6.6 Reliability

In this section, we evaluate the reliability of SREM-4 by testing its ability to recover from server failures. During the experiment, the 64 brokers generate 120 million messages in total and dispatch them to their corresponding subspaces. After 10 seconds, a subset of the brokers is shut down simultaneously.

Fig. 11: Reliability. (a) Event loss rate (%) over time; (b) matching rate (×10⁶ events/s) over time, when 4, 8, 16 and 32 brokers fail.

Figure 11 (a) shows how the event loss rates of SREM change when 4, 8, 16 and 32 brokers fail. From the moment when the brokers fail, the corresponding event loss rates increase to 3%, 8%, 18% and 37%, respectively, within 10 seconds, and drop back to 0 within 20 seconds. This recovery ability mainly lies in the quick failure detection of the peer sampling service mentioned in Section 3.1 and the quick reassignment of lost subscriptions by the well-known servers mentioned in Section 5.3. Note that the maximal event loss rate in each case is less than the percentage of lost brokers. This is because each top cluster of SkipCloud initially has two brokers, and an event will not be dropped if it is dispatched to a live broker of its top cluster.

Figure 11 (b) shows how the matching rates change when a number of brokers fail. When the brokers fail at the moment of 10 seconds, there is an apparent drop in the matching rate during the following tens of seconds. Note that the dropping interval increases with the number of failed brokers, because the failure of more brokers leads to a higher latency in detecting these brokers and recovering the lost subscriptions. After the failed brokers are handled, the matching rate in each situation increases to a higher level. This indicates that SREM can function normally even if a large number of brokers fail simultaneously.

In conclusion, the peer sampling service of SkipCloud ensures a continuous matching service even if a large number of brokers fail simultaneously. Through buffering subscriptions in the well-known servers, lost subscriptions can be dispatched to the corresponding live brokers in tens of seconds.

6.7 Workload Characteristics

In this section, we evaluate how workload characteristics affect the matching rate of each approach.

First, we evaluate the influence of different subscription distributions by changing the standard deviation σsub from 50 to 200. Figure 8 (a) shows that the matching rate of each approach decreases as σsub increases. As σsub increases, the predicates of each subscription occupy larger ranges of the corresponding dimensions, which causes the subscription to be dispatched to more subspaces.

We then evaluate the performance of each approach under different event distributions. In this experiment, the standard deviation of the event distribution σe changes from 50 to 200. Note that skewed events lead to two scenarios. One is that the events are skewed in the same way as the subscriptions; thus, the hot events coincide with the hotspots, which severely hurts the matching rate. The other is that the events are skewed in the opposite way from the subscriptions, which benefits the matching rate. Figure 8 (b) shows how the adversely skewed events affect the performance of each approach. As σe decreases from 200 to 50, the matching rates of SREM-4, SREM-2, SREM-1, Chord-4, Chord-2 and Chord-1 decrease by 79.4%, 63.2%, 47.0%, 77.2%, 59.1% and 52.6%, respectively. Although the matching rate of SREM-4 decreases greatly as σe decreases, it is still higher than that of the other approaches under different values of σe.

We evaluate the performance of each approach under different combinations of dimensions (called subscription patterns) in Figure 8 (c). All the experiments mentioned above generate subscriptions using one subscription pattern, where each predicate of a dimension follows the normal distribution. Here, we uniformly select a group of patterns to generate subscriptions. For the dimensions that a subscription does not contain, the ranges of its predicates are the whole ranges of the corresponding dimensions. Figure 8 (c) shows that the matching rate decreases with the number of subscription patterns. As the number of subscription patterns grows from 1 to 16, the matching rates of SREM-4, SREM-2, SREM-1, Chord-4, Chord-2 and Chord-1 decrease by 86.7%, 63.8%, 2.3%, 90.3%, 65.9% and 10.3%, respectively. The sharp decline of the matching rates of SREM-4 and Chord-4 is mainly caused by the rapid increase of the average searching size. However, the matching rates of SREM-4 and Chord-4 are still much higher than those of the other approaches.

In conclusion, the matching throughput of each approach decreases greatly as the skewness of subscriptions or events increases. Compared with other approaches, the fine-grained partitioning of SREM-4 ensures higher matching throughput under various parameter settings.

6.8 Memory Overhead

Storing subscriptions is the main memory overhead of each broker. Since each subscription may fall into multiple subspaces that are managed by the same broker, each broker only needs to store one real copy of a subscription and a group of its identifications to reduce the memory overhead. We use the average number of subscription identifications that each broker stores, denoted by Nsid, as the criterion. Recall that a predicate specifies a continuous range of a dimension. We represent a continuous range by two numerals, each of which occupies 8 bytes. Suppose that each subscription identification uses 8 bytes. Thus, the maximal average memory overhead of managing subscriptions on each broker is 16kNsub + 8Nsid bytes, where k is the number of dimensions and Nsub is the total number of subscriptions. As mentioned in Section 4.1, HPartition-ki brings a large memory cost and partitioning latency as ki increases. When we test the performance of HPartition-8, the Java heap space runs out of memory.
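As an illustrative check of this bound (the value of Nsid used here is a hypothetical figure of the same order as those in Figure 12, not a separately measured one): with k = 8 and Nsub = 100,000,

16 · k · Nsub = 16 · 8 · 100,000 bytes ≈ 12.8 MB for the stored predicate ranges,
8 · Nsid ≈ 8 · 0.9 × 10⁶ bytes ≈ 7.2 MB for the subscription identifications,

giving a total of roughly 20 MB, which is consistent with the maximal average overhead of HPartition-4 reported below.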

Fig. 12: Memory Overhead. Nsid (×10⁶) versus (a) the number of subscriptions and (b) the standard deviation of subscriptions, for HPartition-4, HPartition-2, and HPartition-1.

We first evaluate the average memory cost of brokers as Nsub changes in Figure 12 (a). The value of Nsid of each partitioning technique increases linearly with the growth of Nsub. As Nsub increases from 40K to 100K, the value of Nsid in HPartition-4, HPartition-2 and HPartition-1 grows by 143%, 151% and 150%, respectively. HPartition-4 presents the highest Nsid among the approaches, because the number of subspaces that a subscription falls into increases as each subspace becomes smaller. In spite of the high Nsid in HPartition-4, its maximal average overhead on each broker is only 20.2 MB when Nsub reaches 100K.

We then evaluate how the memory overhead of brokers changes with different values of σsub. As σsub increases, each predicate of a subscription spans a wider range, which leads to much more memory overhead. As σsub increases from 50 to 200, the value of Nsid in HPartition-4, HPartition-2 and HPartition-1 grows by 86%, 60% and 27%, respectively. Unsurprisingly, Nsid in HPartition-4 is still the largest. When σsub equals 200, the maximal average memory overhead of HPartition-4 is 6.2 MB, which is a small memory overhead for each broker.

In conclusion, the memory overhead of HPartition-ki increases slowly with the growth of Nsub and σsub, as long as ki is no more than 4.

7 RELATED WORK

A large body of efforts on broker-based pub/subs have been proposed in recent years. One method is to organize brokers into a tree overlay, such that events can be delivered to all relevant brokers without duplicate transmissions. Besides, data replication schemes [26] are employed to ensure reliable event matching. For instance, Siena [3] advertises subscriptions to the whole network. When receiving an event, each broker determines whether to forward the event to the corresponding broker according to its routing table. Atmosphere [4] dynamically identifies entourages of publishers and subscribers to transmit events with low latency. It is appropriate for scenarios with a small number of subscribers. As the number of subscribers increases, the over-overlays constructed in Atmosphere probably exhibit a latency similar to that of Siena. To ensure reliable routing, Kazemzadeh et al. [5] propose a δ-fault-tolerance algorithm to handle concurrent crash failures of up to δ brokers. Brokers are required to maintain a partial view of this tree that includes all brokers within distance δ+1. Zhao et al. [6] propose a hybrid network architecture, where a tree overlay and a DHT overlay work together to guarantee high performance of normal operations and high reliability in the presence of failures. The multi-hop routing techniques in these tree-based pub/subs lead to a high routing latency. Besides, skewed subscriptions and events lead to unbalanced workloads among brokers, which may severely reduce the matching throughput. In contrast, SREM uses SkipCloud to reduce the routing latency and HPartition to balance the workloads of brokers.

Another method is to divide brokers into multiple clusters through unstructured overlays. Brokers in each cluster are connected through reliable topologies. For instance, brokers in Kyra [7] are grouped into cliques based on their network proximity. Each clique divides the whole content space into non-overlapping zones based on the number of its brokers. After that, the brokers in different cliques which are responsible for similar zones are connected by a multicast tree. Thus, events are forwarded through the corresponding multicast tree. Sub-2-Sub [8] implements epidemic-based clustering to partition all subscriptions into disjoint subspaces. The nodes in each subspace are organized into a bidirectional ring. Due to the long delay of routing events in unstructured overlays, most of these approaches are inadequate for scalable event matching. In contrast, SREM uses SkipCloud to organize brokers, which uses the prefix routing technique to achieve low routing latency.

To reduce the number of routing hops, a number of methods organize brokers through structured overlays, which commonly need Θ(log N) hops to locate a broker.

Subscriptions and events falling into the same subspace are sent to and matched on a rendezvous broker. For instance, PastryString [9] constructs a distributed index tree for each dimension to support both numerical and string dimensions. The resource discovery service proposed by Ranjan et al. [10] maps events and subscriptions into d-dimensional indexes and hashes these indexes onto a DHT network. To ensure a reliable pub/sub service, each broker in Meghdoot [11] has a backup which is used when the primary broker fails. Compared with these DHT-based approaches, SREM ensures a smaller forwarding latency through the prefix routing of SkipCloud, and higher event matching reliability through the multiple brokers in each top cluster of SkipCloud and the multiple candidate groups of HPartition.

Recently, a number of cloud providers have offered a series of pub/sub services. For instance, Move [13] provides highly available key-value storage and matching, respectively, based on one-hop lookup [16]. BlueDove [14] adopts a single-dimensional partitioning technique to divide the entire space and a performance-aware forwarding scheme to select a candidate matcher for each event. Its scalability is limited by the coarse-grained clustering technique. SEMAS [15] proposes a fine-grained partitioning technique to achieve a high matching rate. However, this partitioning technique only provides one candidate for each event and may lead to a large memory cost as the number of data dimensions increases. In contrast, HPartition makes a better trade-off between matching throughput and reliability through a flexible manner of constructing the logical space.

8 CONCLUSIONS AND FUTURE WORK

This paper introduces SREM, a scalable and reliable event matching service for content-based pub/sub systems in cloud computing environments. SREM connects the brokers through a distributed overlay, SkipCloud, which ensures reliable connectivity among brokers through its multi-level clusters and brings low routing latency through a prefix routing algorithm. Through a hybrid multi-dimensional space partitioning technique, SREM achieves scalable and balanced clustering of high-dimensional skewed subscriptions, and each event is allowed to be matched on any of its candidate servers. Extensive experiments with a real deployment based on a CloudStack testbed are conducted, producing results which demonstrate that SREM is effective and practical, and also presents good workload balance, scalability and reliability under various parameter settings.

Although our proposed event matching service can efficiently filter out irrelevant users from a big data volume, there are still a number of problems we need to solve. Firstly, we do not provide elastic resource provisioning strategies in this paper to obtain a good performance-price ratio. We plan to design and implement elastic strategies for adjusting the scale of servers based on the churn of workloads. Secondly, it is not guaranteed that the brokers disseminate large live content with various data sizes to the corresponding subscribers in a real-time manner. For the dissemination of bulk content, the upload capacity becomes the main bottleneck. Based on our proposed event matching service, we will consider utilizing a cloud-assisted technique to realize a general and scalable data dissemination service over live content with various data sizes.

ACKNOWLEDGMENTS

This work was supported by the National Grand Fundamental Research 973 Program of China (Grant No. 2011CB302601), the National Natural Science Foundation of China (Grant No. 61379052), the National High Technology Research and Development 863 Program of China (Grant No. 2013AA01A213), the Natural Science Foundation for Distinguished Young Scholars of Hunan Province (Grant No. S2010J5050), and the Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20124307110015).

REFERENCES

[1] Data per minute. [Online]. Available: http://www.domo.com/blog/2012/06/how-much-data-is-created-every-minute/
[2] F. Benevenuto, T. Rodrigues, M. Cha, and V. Almeida, "Characterizing user behavior in online social networks," in Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement. ACM, 2009, pp. 49–62.
[3] A. Carzaniga, "Architectures for an event notification service scalable to wide-area networks," Ph.D. dissertation, Politecnico di Milano, 1998.
[4] P. Eugster and J. Stephen, "Universal cross-cloud communication," IEEE Transactions on Cloud Computing, 2014.
[5] R. S. Kazemzadeh and H.-A. Jacobsen, "Reliable and highly available distributed publish/subscribe service," in Reliable Distributed Systems, 2009. SRDS'09. 28th IEEE International Symposium on. IEEE, 2009, pp. 41–50.
[6] Y. Zhao and J. Wu, "Building a reliable and high-performance content-based publish/subscribe system," J. Parallel Distrib. Comput., vol. 73, no. 4, pp. 371–382, 2013.
[7] F. Cao and J. P. Singh, "Efficient event routing in content-based publish/subscribe service network," in INFOCOM, 2004, pp. 929–940.
[8] S. Voulgaris, E. Riviere, A. Kermarrec, M. van Steen et al., "Sub-2-Sub: Self-organizing content-based publish and subscribe for dynamic and large scale collaborative networks," Research Report RR5772, INRIA, Rennes, France, 2005.
[9] I. Aekaterinidis and P. Triantafillou, "PastryStrings: A comprehensive content-based publish/subscribe DHT network," in ICDCS, 2006, pp. 23–32.
[10] R. Ranjan, L. Chan, A. Harwood, S. Karunasekera, and R. Buyya, "Decentralised resource discovery service for large scale federated grids," in e-Science and Grid Computing, IEEE International Conference on. IEEE, 2007, pp. 379–387.
[11] A. Gupta, O. D. Sahin, D. Agrawal, and A. El Abbadi, "Meghdoot: Content-based publish/subscribe over P2P networks," in Middleware, 2004, pp. 254–273.
[12] Y. Wang, X. Li, X. Li, and Y. Wang, "A survey of queries over uncertain data," Knowledge and Information Systems, vol. 37, no. 3, pp. 485–530, 2013.
[13] W. Rao, L. Chen, P. Hui, and S. Tarkoma, "Move: A large scale keyword-based content filtering and dissemination system," in ICDCS, 2012, pp. 445–454.
[14] M. Li, F. Ye, M. Kim, H. Chen, and H. Lei, "A scalable and elastic publish/subscribe service," in IPDPS, 2011, pp. 1254–1265.
[15] X. Ma, Y. Wang, Q. Qiu, W. Sun, and X. Pei, "Scalable and elastic event matching for attribute-based publish/subscribe systems," Future Generation Computer Systems, 2013.
[16] A. Lakshman and P. Malik, "Cassandra: A decentralized structured storage system," Operating Systems Review, vol. 44, no. 2, pp. 35–40, 2010.
[17] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur, "XORing elephants: Novel erasure codes for big data," in Proceedings of the 39th International Conference on Very Large Data Bases. VLDB Endowment, 2013, pp. 325–336.
[18] S. Voulgaris, D. Gavidia, and M. van Steen, "Cyclon: Inexpensive membership management for unstructured P2P overlays," J. Network Syst. Manage., vol. 13, no. 2, pp. 197–217, 2005.
[19] B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. D. Kubiatowicz, "Tapestry: A resilient global-scale overlay for service deployment," IEEE Journal on Selected Areas in Communications, vol. 22, no. 1, pp. 41–53, 2004.
[20] MurmurHash. [Online]. Available: http://burtleburtle.net/bob/hash/doobs.html
[21] A.-M. Kermarrec, L. Massoulie, and A. J. Ganesh, "Probabilistic reliable dissemination in large-scale systems," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 3, pp. 248–258, 2003.
[22] M. Jelasity, A. Montresor, and O. Babaoglu, "Gossip-based aggregation in large dynamic networks," ACM Trans. Comput. Syst., vol. 23, no. 3, pp. 219–252, 2005.
[23] CloudStack. [Online]. Available: http://cloudstack.apache.org/
[24] ZeroC. [Online]. Available: http://www.zeroc.com/
[25] B. He, D. Sun, and D. P. Agrawal, "Diffusion based distributed internet gateway load balancing in a wireless mesh network," in Global Telecommunications Conference (GLOBECOM). IEEE, 2009, pp. 1–6.
[26] Y. Wang and S. Li, "Research and performance evaluation of data replication technology in distributed storage systems," Computers & Mathematics with Applications, vol. 51, no. 11, pp. 1625–1632, 2006.

Xingkong Ma received the B.S. degree in computer science and technology from the School of Computer of Shandong University, China, in 2007, and the M.S. degree in computer science and technology from National University of Defense Technology, China, in 2009. He is currently a Ph.D. candidate in the School of Computer of National University of Defense Technology. His current research interests lie in the areas of data dissemination and publish/subscribe.

Yijie Wang received the Ph.D. degree in computer science and technology from the National University of Defense Technology, China, in 1998. She is now a Professor in the Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology. Her research interests include network computing, massive data processing, and parallel and distributed processing.

Xiaoqiang Pei received the B.S. and M.S. degrees in computer science and technology from National University of Defense Technology, China, in 2009 and 2011, respectively. He is currently a Ph.D. candidate in the School of Computer of National University of Defense Technology. His current research interests lie in the areas of fault-tolerant storage and cloud computing.