

A-DSP: An Adaptive Join Algorithm for Dynamic Data Stream on Cloud System

Junhua Fang, Rong Zhang, Yan Zhao, Kai Zheng, Xiaofang Zhou, Fellow, IEEE, and Aoying Zhou

Abstract—Join operations, including both equi- and non-equi joins, are essential to complex data analytics in the big data era. However, they are not inherently supported by existing DSPEs (Distributed Stream Processing Engines). The state-of-the-art join solutions on DSPEs rely on either complicated routing strategies or resource-inefficient processing structures, which are susceptible to dynamic workloads, especially when the DSPEs face various join predicate operations and skewed data distributions. In this paper, we propose a new cost-effective stream join framework, named A-DSP (Adaptive Dimensional Space Processing), which enhances the adaptability of the real-time join model and minimizes the resources used over dynamic workloads. Our proposal includes 1) a join model generation algorithm devised to adaptively switch between different join schemes so as to minimize the number of processing tasks required; 2) a load-balancing mechanism which maximizes the processing throughput; and 3) a lightweight algorithm designed to cut down unnecessary migration cost. Extensive experiments are conducted to compare our proposal against state-of-the-art solutions on both benchmark and real-world workloads. The experimental results verify the effectiveness of our method, especially in reducing the operational cost under a pay-as-you-go pricing scheme.

Index Terms—Distributed stream join; Theta-join; Cost effective


1 INTRODUCTION

THE big data era has urged the development of techniques for real-time data processing in various domains, such as stock trading analysis, traffic monitoring, and network measurement. The introduction of distributed and parallel processing frameworks enables the efficient processing of massive streams. However, the load balancing issue [16], [21], [23], [34], [35] is emerging as a common problem for these parallel processing frameworks, and it deteriorates the performance of distributed platforms for big data processing. Extensive efforts have been devoted to tackling the load balancing issue for different operators, including summarization [6], aggregation [25], join [31], [35], [38], [39], etc. Among these operations, θ-join [9], [22], [26] is recognized as the most challenging one, due to the large variety of join predicate options {>, ≥, <, ≤, =, ≠}.

In practice, θ-join should be supported as a basic operation. For instance, band-joins are common in spatial applications [37], and similarity joins between data sets are also necessary in correlation analysis [26]. There are a handful of approaches designed for the join operator over fast and continuous streams, which fall into two categories: the 1-DSP (1-dimensional space processing) model [4], [5], [14], [18], [25] and the 2-DSP (2-dimensional space processing) model [9], [22], [26], [31]. We use m to denote the number of task instances in the DSPE. For 1-DSP, we map the key space K of any stream to m tasks by a dividing function F (e.g., a range function,

• Junhua Fang and Yan Zhao are with the Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University, China. E-mail: {jhfang, zhaoy}@suda.edu.cn

• Kai Zheng is with the University of Electronic Science and Technology of China, China. E-mail: [email protected]. (Corresponding author)

• Rong Zhang and Aoying Zhou are with the School of Data Science and Engineering, East China Normal University, China. E-mail: [email protected], [email protected]

• Xiaofang Zhou is with the School of Information Technology and Electrical Engineering, The University of Queensland, Australia. E-mail: [email protected].

consistent hashing, etc.), which is defined as F(k_i) ↦ m_k, with k_i ∈ K and 0 ≤ m_k ≤ m − 1. As shown in Fig.1(a), there are three tasks, numbered m_0∼m_2. Any key from either stream R or S is supposed to be assigned to one of the tasks. On the other hand, 2-DSP arranges tasks in a two-dimensional space, where each dimension represents the division of one stream. As in Fig.1(b), it organizes the tasks as a matrix, represented as m_ij, with 0 ≤ i ≤ 1 and 0 ≤ j ≤ 1, and streams S and R are divided along the horizontal and vertical dimensions, respectively. More specifically, the processing modes of these two models are explained as follows:

Fig. 1. Two Typical Paradigms of Existing Stream Join Models: (a) 1-DSP Model; (b) 2-DSP Model. (Legend: random routing, join routing, storage routing, storage-and-join task, join-only task.)

1-DSP. In Fig.1(a), stream R (S) is split into three sub-streams r_i (s_i) for join processing and stored locally in m_i, with 0 ≤ i ≤ 2. However, for a non-equi join, a tuple may be involved in multiple local join tasks according to the join predicate (θ). In addition to the locally stored sub-stream r_i in task m_i, an additional sub-stream r′_i is sent to m_i to be joined with the locally stored tuples s_i; it contains tuples to be joined that are not included in r_i. Therefore, 1-DSP has two routing methods, one for local storage (storage routing, i.e., R_L) and one for join routing (join routing, i.e., R_J) defined by the join predicate, as shown in


Fig.1(a). During the join process, the destination of an incoming tuple, e.g., r_i, is determined by two different routing strategies: storage routing R_L decides its storage task, while join routing R_J determines to which task(s) r_i shall be assigned to produce join results. In other words, R_J (S_J) acts only when R_L (S_L) cannot produce the complete results for a join predicate; otherwise, join routing is unnecessary and R_J = ∅ (S_J = ∅).

2-DSP. The join-matrix model is a typical representative of 2-DSP, in which a partitioning scheme is employed on each dimension to split the incoming stream data randomly into a set of mutually exclusive sub-streams so as to guarantee the balance requirement. As shown in Fig.1(b), stream R is randomly split into two sub-streams r_0 and r_1 along the vertical side, with r_0 ∩ r_1 = ∅ (s_0 and s_1 for S, respectively). Therefore, R ⋈_θ S is decomposed into four sub-tasks, each of which, m_ij (0 ≤ i, j ≤ 1), takes a pair of sub-streams from R and S. The join-matrix model can always return complete results for any join predicate, as it guarantees that each pair of tuples from the two data streams meets exactly once.
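To make the two paradigms concrete, the following minimal Python sketch (ours, not from the paper; the function and variable names are illustrative) contrasts the content-sensitive 1-DSP key routing with the content-insensitive 2-DSP matrix routing:

import hashlib
import random

def route_1dsp(key, m):
    # 1-DSP storage routing: a dividing function F maps each join key
    # to exactly one of m tasks (content-sensitive).
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % m

def route_2dsp(stream, alpha, beta):
    # 2-DSP routing: an R tuple is stored in one random row (hence sent to
    # all beta tasks of that row), an S tuple in one random column, so any
    # R/S pair meets in exactly one task regardless of the join predicate.
    if stream == 'R':
        i = random.randrange(alpha)
        return [(i, j) for j in range(beta)]
    j = random.randrange(beta)
    return [(i, j) for i in range(alpha)]

print(route_1dsp(42, 3))      # one of m_0..m_2, fixed for key 42
print(route_2dsp('R', 2, 2))  # e.g., [(1, 0), (1, 1)]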

TABLE 1
Summary of the advantages and disadvantages of different models

Model        | Resource saving | Arbitrary join | Workload balance
1-DSP        | ✓               | ×              | ×
2-DSP        | ×               | ✓              | ✓
A-DSP (ours) | ✓               | ✓              | ✓

In fact, the different processing models have their respective performance characteristics. Tab.1 exhibits the advantages and disadvantages of 1-DSP and 2-DSP from the perspectives of resource consumption, application scope, and operational status. From the routing strategies in Fig.1, it is not difficult to see that 1-DSP is a content-sensitive partitioning scheme, while 2-DSP uses a random partition strategy that makes it content-insensitive. This also explains the fundamental differences exhibited in Tab.1. Specifically, 1-DSP is a resource-saving join model because it need not maintain a specific processing architecture; instead, it consumes resources on demand. However, its content-sensitive distribution strategy means that 1-DSP is only well suited to small-scale joins (e.g., equi-joins) and is clumsy under dynamic distributions and skewed environments. For 2-DSP, although its content-insensitive partition strategy enables it to support arbitrary join predicates and handle data skewness perfectly, this comes at a high cost in task consumption, because the model must maintain a two-dimensional processing architecture.

In this paper, we propose a flexible and adaptive model, named A-DSP (Adaptive Dimensional Space Processing), for parallel stream join, which integrates the processing strategies of 1-DSP and 2-DSP. In other words, we break the limits of the processing schemes of 1-DSP and 2-DSP with a new solution for parallel θ-join that explores a much larger optimization space: maximizing resource utilization, supporting various join predicates, and guaranteeing workload balance among parallel tasks. In contrast to up-to-date studies, A-DSP has several advantages. First, the scheme achieves high flexibility by easily redirecting keys to new worker threads with simple edits to the routing function. Second, it is highly efficient when the system faces θ-joins and dynamic incoming workloads, by providing a general strategy to generate the partition function for data balance under a dynamic stream. Moreover, workload redistribution with the scheme is scalable and effective, allowing the system to respond promptly to short-term workload fluctuations even when heavy tuple traffic is present in the incoming data stream.

TABLE 2
Table of Notations

Notation | Description
R, S | the R and S streams
C / C̃ | tuple matrix without/with ordered join key
CA / C̃A | complete information area of C / C̃
ca / c̃a | coverage area without/with ordered join key
TaskA | the area that is feasible for a single task
c(i,j) | the cell located in the ith row and jth column
⌈c(i,j), c(i′,j′)⌋ | the rectangle from c(i,j) to c(i′,j′)
c̃a_x | the xth c̃a
TaskA_ij (TaskA^{ca_x}_ij) | the TaskA in the ith row and jth column of ca (ca_x)
α/β (α_{ca_x}/β_{ca_x}) | the number of TaskA rows/columns in ca (ca_x)
c̃a^r / c̃a^s / c̃a_→ | the tuple set of rows/columns/output set in area c̃a
V (V_h) | memory size of a task (half size V_h = V/2)

Although our preliminary work in [12] has already optimized the join performance by designing a varietal matrix join model based on 2-DSP, it still fails to get rid of the inherent and insurmountable disadvantages of the regular matrix model. To fully unleash the power of our A-DSP model, minimizing the operational cost while maximizing the processing throughput over various join predicate requirements, a suite of optimization techniques is introduced for our syncretic processing scheme to fully exploit the benefits of 1-DSP and 2-DSP. In summary, the major contributions of this work are:

• We propose the A-DSP join model to reduce the operational cost with full support for arbitrary join predicates while guaranteeing the correctness of join results.

• We devise the scheme generation algorithms for our A-DSP model. Moreover, we present a detailed theoretical analysis of the proposed algorithms and prove their usability and correctness.

• We design a lightweight computation model to support rapid migration plan generation, which incurs minimal data transmission overhead and processing latency.

• We empirically evaluate our model against state-of-the-art solutions on both standard benchmarks and real-world stream data workloads, with detailed explanations of the experimental results.

The remainder of this paper is organized as follows. Section 2 introduces the preliminaries. Section 3 describes how to generate the processing scheme. Section 4 explains the detailed implementation of our proposed method. Section 5 shows the experimental results. Section 6 reviews the existing studies on stream join processing in distributed systems. Section 7 finally concludes the paper and discusses future research directions.

2 PRELIMINARIES

2.1 Definitions

The join-matrix model organizes the parallel processing tasks as a matrix to handle the join operation between two incoming streams, where each side of the matrix corresponds to one stream. In order to describe our work more intuitively, we define the universal set of tuples arranged by the join-matrix model as the tuple matrix, denoted by C. The tuple matrix enumerates all tuple pairs of the different streams, that is, it arranges the Cartesian product of R and S (R ×


S) as a tuple matrix. We use cell c(i,j) to denote the tuple pair in the ith row and jth column of the tuple matrix. Tab.2 lists the notations commonly used in the rest of the paper.

Definition 1. Given the streams R and S, if their tuple matrix orders the tuples on both sides by their join key, then the matrix is defined as an Ordered tuple matrix, denoted by C̃.

Definition 2. In a tuple matrix C/C̃, we refer to the set of cells covered by a rectangle as a Coverage area, denoted by ca/c̃a. The rectangle is represented by ⌈c(i,j), c(i′,j′)⌋, meaning its upper left corner is c(i,j) and its lower right corner is c(i′,j′).

Definition 3. In C̃, we define the set of cells comprised of a series of 1) no-interval, 2) non-overlapping, and 3) result-complete c̃as as a Complete information area, denoted by C̃A. The constraints on the c̃as are defined as:

1) No-interval. The complete set of c̃as must cover all incoming tuples in both the horizontal and vertical directions, which can be expressed as ⋃_{c̃a∈C̃A} c̃a^r = R and ⋃_{c̃a∈C̃A} c̃a^s = S.

2) Non-overlapping. Each cell in C̃A must be covered by the complete set of c̃as at most once, which can be expressed as: ∀c̃a_i, c̃a_j ∈ C̃A with i ≠ j, c̃a_i ∩ c̃a_j = ∅.

3) Result completion. All join results in C̃ must be covered by C̃A, which can be expressed as ⋃_{c̃a∈C̃A} c̃a_→ = C̃_→.

Definition 4. In a tuple matrix, we call the coverage area assigned to a single task instance a Task area, denoted by TaskA.

Fig.2 shows a toy example of a join operation using the tuple matrix. Each stream (R/S) has 12 tuples, and their join keys are denoted by R.k/S.k. The join condition is |R.k − S.k| ≤ 2, and the cells which produce join results are marked in grey. The tuple matrix in Fig.2(a) can be seen as a C̃ according to Def.1, and in Fig.2(c), the union of c̃a_1 and c̃a_2 can be regarded as a C̃A according to Def.3. Furthermore, based on Def.3, C is treated as a complete information area, since its out-of-order tuples make each cell in C a potential result (CA = C), as reflected by Fig.2(b).
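As a quick check of the toy example, the grey result cells can be enumerated directly from the key lists of Fig.2 (a sketch of ours; the two key lists are read off the figure):

R_KEYS = [1, 1, 11, 11, 12, 22, 22, 26, 27, 27, 27, 34]
S_KEYS = [2, 2, 12, 13, 18, 19, 19, 19, 23, 25, 33, 33]

# A cell c(i,j) is a result cell iff the band predicate |R.k - S.k| <= 2 holds.
result_cells = [(i, j) for i, r in enumerate(R_KEYS)
                for j, s in enumerate(S_KEYS) if abs(r - s) <= 2]
print(len(result_cells))  # count of grey cells in the 12 x 12 matrix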

2.2 Models

Based on the above definitions, the essence of the parallel processing of the different methods is partitioning the whole tuple matrix into multiple TaskAs. As shown in Tab.3, 1-DSP and 2-DSP can be expressed as splitting C̃A and CA into multiple TaskAs, respectively. Fig.2(a) and Fig.2(b) show their data partition results. They also reflect the characteristics of the two methods shown in Tab.1, that is, 1-DSP uses fewer resources but suffers from workload imbalance, while 2-DSP is the opposite. Specifically, the workload of each task instance includes input workload and output workload, which can be measured by the number of tuples flowing in and out [31]. For example, in Fig.2, assume the maximum input workload is 8 and the maximum output workload is 5. Then the 4 tasks in Fig.2(a) exhibit an imbalanced status, while the 9 tasks in Fig.2(b) provide a perfect balance in both input and output workload. In fact, only the balance of the input workload needs to be taken into account when using 2-DSP, since its tuple-based random routing strategy can ensure that the output results are evenly distributed among tasks. Our subsequent approach inherits this advantage.

TABLE 3
Summary of methods.

Method       | Partition way                                    | Example
1-DSP        | C̃A → TaskAs                                      | Fig.2(a)
2-DSP        | CA → TaskAs                                      | Fig.2(b)
A-DSP (ours) | C̃A →(via P_C̃A) c̃as →(via P_ca) TaskAs           | Fig.2(c) & Fig.2(d)

The partitioning process of the A-DSP model mainly consists of two phases: partitioning C̃A into c̃as and partitioning each c̃a into TaskAs, denoted by P_C̃A and P_ca, respectively. For each c̃a generated by P_C̃A, P_ca takes it as an unordered ca and then partitions it into multiple TaskAs using 2-DSP. Intuitively, using 1-DSP to generate the c̃as in P_C̃A may cause workload imbalance. However, the c̃as in our A-DSP model are usually carried by multiple tasks, which can smooth out the workload imbalance. In other words, a c̃a in A-DSP can act as a cushioning layer against workload imbalance. This characteristic is similar to that of task grouping in [22]. Fig.2(c) shows that P_C̃A takes c̃a_1 and c̃a_2 as C̃A. Then P_ca in Fig.2(d) takes c̃a_1 as ca_1 and splits it into four TaskAs, and takes c̃a_2 as ca_2 and splits it into two TaskAs. Compared with 2-DSP, this scheme consumes fewer resources (6 task instances) while ensuring that no task's workload exceeds the ideal workload. However, there may be another partition manner which consumes only 4 tasks, i.e., Fig.5, and that is what we are after. Next, we define the optimization goal of our A-DSP.

Fig. 2. A Toy Example of Different Join Models: (a) 1-DSP: partition C̃A by TaskAs; (b) 2-DSP: partition CA by TaskAs; (c) P_C̃A: partition C̃A by c̃as; (d) P_ca: partition each ca by TaskAs.

2.3 Optimization Goal

For a ca, the resource cost includes memory, network, and CPU, which can be modeled as follows:

Memory. Since tuples are duplicated along rows or columns in P_ca, the values of α_ca and β_ca in ca determine the memory consumption C_m of the join operation, defined as

C_m = |ca^r| · β_ca · R^state_size + |ca^s| · α_ca · S^state_size,  (1)


where R^state_size and S^state_size are the state sizes of stream R and S, respectively.

Network. When we duplicate the tuples along rows or columns, it consumes network bandwidth C_n, which is the total communication cost of data transmission among tasks. C_n is defined as

C_n = |ca^r| · β_ca · R^tuple_size + |ca^s| · α_ca · S^tuple_size,  (2)

where R^tuple_size and S^tuple_size are the tuple sizes of stream R and stream S, respectively.

CPU. For a theta-join, we define the computation complexity as

C_c = |ca^r| · |ca^s|.  (3)

In general, R^tuple_size ∝ R^state_size and S^tuple_size ∝ S^state_size [14]. Then we get C_n ∝ C_m based on the above discussion and analysis. In other words, minimizing memory usage is equivalent to optimizing the network cost. Furthermore, C_c is a constant value, |ca^r| · |ca^s|, independent of α_ca and β_ca. Based on this, we conclude that optimizing memory is a representative choice.
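As a direct illustration of Equ.1∼3, the sketch below (ours; the numbers are arbitrary) computes the three costs of a ca for a given α_ca × β_ca scheme:

def memory_cost(car, cas, alpha, beta, r_state, s_state):
    # Equ.1: R tuples are replicated across the beta columns and
    # S tuples across the alpha rows.
    return car * beta * r_state + cas * alpha * s_state

def network_cost(car, cas, alpha, beta, r_tuple, s_tuple):
    # Equ.2: the same replication pattern, counted in tuple sizes,
    # hence C_n is proportional to C_m when sizes are proportional.
    return car * beta * r_tuple + cas * alpha * s_tuple

def cpu_cost(car, cas):
    # Equ.3: every R/S pair is checked exactly once, so the CPU cost
    # is independent of alpha and beta.
    return car * cas

print(memory_cost(6, 6, 2, 2, 1, 1))  # 24 units for a 2 x 2 scheme
print(cpu_cost(6, 6))                 # 36 units for any scheme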

Suppose the maximum memory size for each task is V and the whole tuple matrix is split into ε cas. We formulate our objective as an optimization problem defined as follows:

min Σ_{x∈[1,ε]} α_{ca_x} · β_{ca_x},  (4)

s.t. the constraints in Def.3;  (5)

∀x ∈ [1, ε], ∀i ∈ [1, α_{ca_x}], ∀j ∈ [1, β_{ca_x}]:
|(TaskA^{ca_x}_ij)^r| + |(TaskA^{ca_x}_ij)^s| ≤ V.  (6)

As shown in the above formulation, our optimization objective (Exp.4) is to minimize the resource cost, while (Exp.5) guarantees the correctness of processing and (Exp.6) the load balance among all task instances. In fact, under the premise of the workload constraint, determining the C̃A that ensures the minimum task consumption is a combinatorial NP-hard problem, as it can be reduced to the subset-sum problem [28].

3 SCHEME GENERATION

3.1 Partition ca into TaskAs

Before discussing how to partition a ca into TaskAs, we first introduce a theorem to guide the formulation of the partition plan.

Theorem 1. Assuming we use the matrix model to handle the stream join operation, if there exist α and β in ca with |ca^r|/α = |ca^s|/β = V_h, then the processing cost for ca^r ⋈_θ ca^s is minimal.

Proof. Assuming the length and width of a TaskA are a (a > 0) and b (b > 0), respectively, we have the following expression:

a · b ≤ (a + b)² / 4.  (7)

Equ.7 can be rewritten as:

a + b ≥ 2 · √(a · b).  (8)

In Equ.7∼8, a + b can be seen as the memory volume of a TaskA, and a · b represents the calculation amount. Each task has predefined CPU and memory resources. According to Equ.8, for a given CPU resource C_c (represented by the calculation area a · b), a square (a = b) has the least memory usage C_m. For a given memory C_m, a square (a = b) produces the fewest cells and maximizes the computation ability according to Equ.7. The network cost C_n can be represented by the memory usage C_m, as declared in Sec.2.3. Based on the above, we conclude that if the length (|ca^r|/α) and the width (|ca^s|/β) of the subranges derived from the ca partition are equal, that is, equal to half of the memory size (V_h), then the system consumes the least resources including CPU, memory, and network.

Fig. 3. A Toy Example of Scheme Partition at |ca^r| = 6 and |ca^s| = 6: (a) calculation area of the matrix; (b) 2 × 2 partition scheme M.

According to Theo.1, the numbers of rows and columns should be ⌈|ca^r|/V_h⌉ and ⌈|ca^s|/V_h⌉, respectively. Then the total number of tasks used in ca can be expressed as

N = ⌈|ca^r|/V_h⌉ · ⌈|ca^s|/V_h⌉.  (9)

Fig.3 shows a ca partition example where the stream volumes for R and S are |ca^r| = 6GB and |ca^s| = 6GB. Given task memory V = 10GB and V_h = 5GB, ca can be divided as shown in Fig.3(b). The generated scheme in Fig.3(b) takes up 4 tasks with two rows and two columns (4 divisions: s_0, s_1, r_0, and r_1). Among those TaskAs, we load data to rows and columns by V_h in a top-down manner. In the last row and column of ca, the TaskAs may have fewer data divisions than the previous TaskAs; these are called fragment tasks, filled with fragment data. In this example, TaskA_00 is first fed by V_h for both the ca^r and ca^s streams, so TaskA_00 = (5GB, 5GB); however, the last column and row cannot be filled up. Although we can calculate the number of tasks needed for join processing according to Equ.9, the workload is not evenly distributed among tasks because of the fragment data. Tasks holding fragment data may have free resources compared to other tasks, and this becomes more pronounced for a large matrix.
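A two-line rendering of Equ.9 (a sketch of ours) reproduces the Fig.3 and Fig.4(a) numbers and shows where fragment capacity appears:

import math

def regular_task_count(car, cas, v):
    # Equ.9: rows x columns, with each division holding at most V_h = V/2.
    vh = v / 2
    return math.ceil(car / vh) * math.ceil(cas / vh)

print(regular_task_count(6, 6, 10))  # 4 tasks, as in Fig.3(b)
print(regular_task_count(9, 7, 10))  # also 4, but the last row is only 60% full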

In order to facilitate the subsequent description, we name the two join streams the primary stream P and the secondary stream D. We take the primary stream P as the basis volume to calculate how much memory should be assigned to it in each task; accordingly, the secondary stream D gets the remaining portion of memory in each cell. Specifically, we use P_γ to represent the number of divisions generated by the primary stream; then for the secondary stream, its division number D^c_γ can be calculated as

D^c_γ = ⌈|D| / (V − |P|/P_γ)⌉.  (10)

P may be either stream R or S. We use N_c to represent the number of tasks for processing the join between P and D in ca, which is calculated as:

N_c = P_γ · D^c_γ = P_γ · ⌈|D| / (V − |P|/P_γ)⌉.  (11)


Fig. 4. Example with the Optimized Scheme: (a) regular matrix; (b) optimal matrix.

In Fig.3, if we take R as the primary stream P with the number of divisions P_γ = ⌈|ca^r|/V_h⌉, then the matrix scheme of the example in Fig.3(a) would have only one column with two cells. However, D^c_γ calculated by Equ.10 will yield fragment tasks if its rounded-up value is not equal to its rounded-down value. For example, in Fig.4(a), given V = 10GB, |ca^r| = 9GB, and |ca^s| = 7GB, a ca with α = 2 and β = 2 will be generated. Since S is the primary stream, each task first gets a data assignment from S of |ca^s|/β = 3.5GB, and the remaining space of 6.5GB can be used for divisions from ca^r. The memory utilization of the two tasks in the last row is 60%.

An optimized partition scheme for Fig.4(a) is shown in Fig.4(b), which takes up only 3 tasks for ca^r ⋈_θ ca^s. In Fig.4(b), the replica |TaskA^s_01| = 3.5GB of ca^s in TaskA_01 is sent to TaskA_10 to join with the subset |TaskA^r_10| = 2.5GB of ca^r, so as to guarantee the correctness and completeness of the join results. It is feasible to move the tuples of TaskA^s_01 to TaskA_10 to complete the join work, because the memory threshold V = 10GB remains satisfied.

Compared to the regular matrix scheme in Fig.4(a), the optimal workload assignment in Fig.4(b) makes better use of the resources among tasks. Let us take Fig.4(b) to explain the generation of the irregular line, which is the last row. Given the divisions for the primary stream S, i.e., β = P_γ = 2 in Fig.4(a), we divide the α rows for the secondary stream R into two types: regular ones (the first row) and adjustive ones (the second row). For the regular ones, we have α_f = D^f_γ lines of segments (labeled as regular in Fig.4(a)), where D^f_γ = ⌊|D| / (V − |P|/P_γ)⌋. Using only these regular divisions, we may not manage all the workload of the secondary stream R, since D^f_γ · (V − |P|/P_γ) ≤ |D|. For the remaining load D_L from D, calculated as in Equ.12, we add an additional P_δ (as in Equ.12) columns along the last row, with P_δ < P_γ:

D_L = |D| − D^f_γ · (V − |P|/P_γ),
P_δ = ⌈|P| / (V − D_L)⌉.  (12)

Unlike the previous regular matrix, we do not add one whole row (column) of tasks for the join work, so as to make full use of system resources. Accordingly, the optimal number of tasks N_o for R ⋈_θ S can be calculated as in Equ.13:

N_o = P_γ · D^f_γ,         if |D| = D^f_γ · (V − |P|/P_γ);
N_o = P_γ · D^f_γ + P_δ,   otherwise.  (13)

As described in Theo.1, we use V_h as the divisor to find a P_γ that generates an optimal scheme by probing all four cases. Specifically, we let P_γ ∈ {⌈|ca^r|/V_h⌉, ⌊|ca^r|/V_h⌋, ⌈|ca^s|/V_h⌉, ⌊|ca^s|/V_h⌋} in Equ.13.

The algorithm for finding an optimal partition scheme is described in Alg.1. First, the minimal number of tasks is determined in line 1 according to Equ.13, by enumerating all the rounding-up and rounding-down cases; then, in lines 2∼5, the numbers of rows α and columns β are calculated according to the values of P and P_γ. If P_γ ∈ {⌈|ca^r|/V_h⌉, ⌊|ca^r|/V_h⌋}, R is the primary stream; otherwise, S is the primary one. The varietal matrix scheme generated by Alg.1 can be expressed as <α, β, P_δ, +>, which means that the varietal matrix has α rows, β columns, and P_δ additional cells in + (+ ∈ {row, column}).

algorithm 1 Partition ca to TaskAs
input: Stream ca^r, Stream ca^s, Memory size V
output: Row number α, Column number β, additional cells P_δ, additional position +
1: N_o ← get the minimum task consumption according to Equ.13, where P_γ ∈ {⌈|ca^r|/V_h⌉, ⌊|ca^r|/V_h⌋, ⌈|ca^s|/V_h⌉, ⌊|ca^s|/V_h⌋}
2: if P_γ ∈ {⌈|ca^r|/V_h⌉, ⌊|ca^r|/V_h⌋} then
3:   α ← P_γ, β ← D^f_γ, P_δ ← N_o − P_γ · D^f_γ, + ← column
4: else
5:   α ← D^f_γ, β ← P_γ, P_δ ← N_o − P_γ · D^f_γ, + ← row
6: return α, β, P_δ, +
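A compact Python rendering of Alg.1 (ours; a sketch under the memory model of Equ.10∼13, not the authors' implementation) probes the four P_γ candidates and returns the varietal-matrix scheme <α, β, P_δ, +>:

import math

def scheme_cost(p_vol, d_vol, p_gamma, v):
    # Equ.12/13 for one candidate: the primary stream P is cut into
    # p_gamma divisions; D fills the leftover memory of each task.
    free = v - p_vol / p_gamma            # memory left per task for D
    if free <= 0:
        return None
    df = math.floor(d_vol / free)         # regular lines D^f_gamma
    dl = d_vol - df * free                # remaining load D_L
    pd = 0 if dl == 0 else math.ceil(p_vol / (v - dl))  # adjustive cells P_delta
    return p_gamma * df + pd, df, pd

def partition_ca(car, cas, v):
    # Alg.1: try all four P_gamma candidates (line 1) and keep the cheapest.
    vh = v / 2
    best = None
    for primary, p_vol, d_vol in (('R', car, cas), ('S', cas, car)):
        for p_gamma in (math.ceil(p_vol / vh), math.floor(p_vol / vh)):
            if p_gamma <= 0:
                continue
            res = scheme_cost(p_vol, d_vol, p_gamma, v)
            if res is None or (best and res[0] >= best[0]):
                continue
            no, df, pd = res
            if primary == 'R':   # lines 2-3: extra cells go into a column
                best = (no, p_gamma, df, pd, 'column')
            else:                # lines 4-5: extra cells go into a row
                best = (no, df, p_gamma, pd, 'row')
    return best  # (N_o, alpha, beta, P_delta, +)

print(partition_ca(9, 7, 10))  # (3, 1, 2, 1, 'row'), matching Fig.4(b)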

We again use Fig.4(b) to explain Alg.1. According to Equ.12, Equ.13, and line 1 of Alg.1, S is the primary stream and R is the secondary one, in which β = 2 and D^f_γ = ⌊9 / (10 − 7/2)⌋ = 1. TaskA_00 and TaskA_01 are supposed to receive regular workload assignments of 3.5GB from ca^s, and the remaining memory, 10 − 3.5 = 6.5GB, is used for ca^r. Since ca^r has a load of more than 6.5GB, we add one more row, which is the adjustive division for the remaining data of ca^r, D_L = 9 − 6.5 = 2.5GB. In such a case, the TaskAs in the regular division lines make full use of their memory, 6.5 + 3.5 = 10GB, as in TaskA_00 and TaskA_01. For the adjustive row, we add P_δ = 1 TaskA to join the remaining data D_L of R with S, which corresponds to TaskA_10 and TaskA_11 in Fig.4(a). In this adjustive row, D_L joins with the ca^s subsets of TaskA_10 and TaskA_11 (3.5GB each), so the workload of TaskA_10 comes to 9.5GB. The varietal matrix scheme shown in Fig.4(b) can be expressed as <1, 2, 1, row>.

Theorem 2. Alg.1 consumes fewer tasks than using Theo.1 directly, and it ensures the correctness of the operation when using the matrix model for ca^r ⋈_θ ca^s with per-task memory size V.

Proof. Assume that there exists another varietal matrix M′ with scheme <α′, β′, P′_δ, +′>, and that the number of tasks N′ used in M′ is smaller than N_o. To find the smallest N_o, Alg.1 tries all possible values that make |P|/P_γ nearest to V_h, according to Equ.12, Equ.13, and line 1. In other words, it is impossible for |ca^r|/α′ and |ca^s|/β′ to get closer to V_h than |ca^r|/α and |ca^s|/β. Since, according to Theo.1, the squared cell shape consumes the minimal resources, the assumption that such a matrix M′ exists does not hold.

3.2 Partition C̃ into c̃as (cas)

In this subsection, we introduce a heuristic algorithm to solve the optimization problem from a theoretical perspective. Specifically, we present how A-DSP provides a theoretically guaranteed upper bound on the task consumption when partitioning C̃ into c̃as (cas). When using the tuple matrix model to handle θ-joins, a key observation is that the distribution of C̃A_→ has the feature of monotonicity. The monotonicity can be expressed as follows: if c(i,j) is not a result cell, then either all cells c(k,l) with k ≤ i and l ≥ j, or all cells c(k,l) with k ≥ i and l ≤ j, are also not result cells [26], [31]. It should be pointed out that the ≠ operation is not monotonic, as its semantics make each cell in C̃ a potential result. In such cases, our A-DSP model adaptively adopts the 2-DSP architecture, which is reflected in Alg.2 later. Next, we define a partition method for C̃A as follows.

Definition 5. Split the area c̃a into two sub-areas c̃a_d and c̃a_d̄: if c̃a_d ∪ c̃a_d̄ = c̃a and c̃a_d ∩ c̃a_d̄ = ∅, we call this area partition action c̃a's area dichotomy, denoted as AD(c̃a).

Based on Def.5, we call d the split point, and the minimum task consumption of C̃A based on area dichotomy can be expressed as:

N_AD(C̃A) = min_{d∈C̃A_D} {N_AD(C̃A_d) + N_AD(C̃A_d̄)},  (14)

where C̃A_D is the universal set of split points in C̃A. Equ.14 is an iterative expression that enumerates all split schemes to find the minimal task consumption. Thus, we have the following theorem.

Theorem 3. Among all schemes generated by C̃A's area dichotomy, Equ.14 denotes the one with minimum task consumption.

Proof. Assume there is a partition scheme with split point set {d_i}, where d_i is the split point of the ith iteration of the partition, and let N_min be the task consumption of this partition scheme. If N_min < N_AD(C̃A), then we have {d_i} ⊄ C̃A_D, which contradicts the assumption in Equ.14 that C̃A_D is the universal set of split points in C̃A.

Essentially, Equ.14 is a combinatorial mathematics problem whose number of iterations is a Catalan number [17]. Directly adopting an exhaustive enumeration method to obtain the partition scheme would undoubtedly incur high computational complexity, reaching up to O(4^|C̃A_D| / (|C̃A_D|^{3/2} · √π)). Therefore, we define an iteration termination condition as follows.

Definition 6. Using area dichotomy to split the coverage area c̃a into two sub-areas c̃a_d and c̃a_d̄, the iteration termination condition is: ∀d ∈ c̃a_D, N^{c̃a} ≥ N^{c̃a_d} + N^{c̃a_d̄}.

Alg.2 lists the steps of the C̃A partition based on the above discussions. At the initialization stage of Alg.2, the matrix is generated from the Cartesian product of the two join streams. First, Alg.2 finds the split points according to Def.5 in line 4. Second, it judges whether to stop the iterative process and gathers the corresponding split results in lines 5∼10. Finally, it picks the partition result with the minimal task consumption as its output in line 11. We continue using the example in Fig.2 to illustrate Alg.2; the partition result is shown in Fig.5. Alg.2 takes the whole tuple matrix C̃ as the iterative entry and splits it into two areas, ca_1 (⌈c(1,1), c(5,5)⌋) and ca_2 (⌈c(6,6), c(12,12)⌋), using area dichotomy. For ca_1, Alg.2 stops splitting because a further split cannot reduce the task consumption. Since N^{ca_2} = 4 and N_AD(ca_2) = 2, Alg.2 picks the lower-consumption scheme by splitting ⌈c(6,6), c(12,12)⌋ into two sub-areas. Finally, Alg.2 generates a scheme that consumes 4 tasks.

Fig. 5. Partitioning the Example in Fig.2 using Alg.2.

Fig. 6. Illustration of the Worst Case of using Area Dichotomy.

algorithm 2 Partition C̃ to cas
input: C̃
output: cas
1: c̃a ← C̃A
2: function PARTITIONAREA(c̃a)
   /* c(b,b)/c(e,e): the beginning/ending cell of c̃a */
3:   foreach c(i,j), c(i′,j′) in c̃a do
4:     if ⌈c(b,b), c(i,j)⌋ ∪ ⌈c(i′,j′), c(e,e)⌋ = c̃a and ⌈c(b,b), c(i,j)⌋ ∩ ⌈c(i′,j′), c(e,e)⌋ = ∅ then  /* according to Def.5 */
5:       if N^{⌈c(b,b),c(i,j)⌋} + N^{⌈c(i′,j′),c(e,e)⌋} < N^{c̃a} then  /* according to Def.6 */
6:         PARTITIONAREA(⌈c(b,b), c(i,j)⌋)
7:         PARTITIONAREA(⌈c(i′,j′), c(e,e)⌋)
8:       else
9:         cas(i,j) ← ⌈c(b,b), c(e,e)⌋
10:        CAs ← cas(i,j)
   /* cas(i,j): the set of coverage areas derived from the first iteration; CAs: the set of cas(i,j) */
11: get the cas (cas ∈ CAs) with the minimal Σ_{ca∈cas} N^{ca}  /* according to Theo.3 */
12: return cas
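The recursion behind Alg.2 can be sketched as follows (ours; a simplified variant that only enumerates diagonal split points and ignores the result-completeness constraint of Def.3, so it lower-bounds the real scheme size):

import math
from functools import lru_cache

VH = 5  # V_h, half of the per-task memory V

@lru_cache(maxsize=None)
def n_tasks(nr, ns):
    # Equ.9 for an area holding nr row tuples and ns column tuples.
    return math.ceil(nr / VH) * math.ceil(ns / VH)

@lru_cache(maxsize=None)
def min_tasks(nr, ns):
    # Equ.14 over diagonal splits: keep dichotomizing an area into two
    # sub-areas while that reduces the task count (Def.6 stops otherwise).
    best = n_tasks(nr, ns)
    for i in range(1, nr):
        for j in range(1, ns):
            best = min(best, min_tasks(i, j) + min_tasks(nr - i, ns - j))
    return best

print(n_tasks(12, 12))    # 9 tasks if the Fig.2 matrix stays one ca (pure 2-DSP)
print(min_tasks(12, 12))  # 3 here; Fig.5 needs 4 because Alg.2 only allows
                          # splits that keep every result cell covered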

Next, we discuss the task consumption upper-bound of Alg.2.

Definition 7. Given an area c̃a, suppose the minimum task consumption is N^{c̃a}_1 and the task consumption by Alg.2 is N^{c̃a}_2; we refer to the ratio N^{c̃a}_1 / N^{c̃a}_2 as the consumption coefficient, denoted by ξ.

Lemma 1. For an area c̃a with α_{c̃a} ≥ 2 and β_{c̃a} ≥ 2, if Alg.2 stops splitting it, then the cell set RC_c in c̃a, where

RC_c = {c(k,l) | (α_{c̃a} − 1) · V_h + 1 ≥ k ≥ V_h, (β_{c̃a} − 1) · V_h + 1 ≥ l ≥ V_h} − c((α_{c̃a}−1)·V_h+1, (β_{c̃a}−1)·V_h+1),

should consist entirely of result cells.

Proof. To verify the worst case, we use the naive method to measure the task consumption according to Equ.9, where N^{c̃a} = α_{c̃a} · β_{c̃a} = ⌈|c̃a^r|/V_h⌉ · ⌈|c̃a^s|/V_h⌉. The formation of RC_c can be accumulated in three steps:

Step-1 (according to Def.6): The precondition of Lem.1 that Alg.2 stops splitting area c̃a means ∀d ∈ c̃a_D, N^{c̃a} < N^{c̃a_d} + N^{c̃a_d̄}. Then, the cells {c(k,l) | k%V_h = 0, l%V_h = 0, k ≠ |c̃a^r|, l ≠ |c̃a^s|} (e.g., the red cells in Fig.6) are definitely not split points; otherwise, the scheme could not be the worst case among all partition schemes. They should be contained in RC_c.

Step-2 (according to Def.5): Since area dichotomy splits the area c̃a into two sub-areas and c(i,j) in c̃a is not a split point, the cells c(i,j′) ∈ c̃a (j < j′) and c(i′,j) ∈ c̃a (i < i′) are both results. Based on this, the cells in RC_c are expanded to {c(k,l), c(k,l+1), c(k+1,l) | k%V_h = 0, l%V_h = 0, k ≠ |c̃a^r|, l ≠ |c̃a^s|}.

Step-3 (according to monotonicity): Monotonicity implies that the cell(s) between result cells are also results. Combining this, the result cells RC_c can be expressed as {c(k,l) | (α_{c̃a} − 1) · V_h + 1 ≥ k ≥ V_h, (β_{c̃a} − 1) · V_h + 1 ≥ l ≥ V_h} − c((α_{c̃a}−1)·V_h+1, (β_{c̃a}−1)·V_h+1), which is equivalent to the description of Lem.1 (surrounded by the blue dotted line in Fig.6).

Then, the upper-bound of ξ can be expressed as follows:

Theorem 4. Using Alg.2 to partition C ensures an upper bound on its task consumption: ξ ≤ (α_C · β_C) / (α_C · β_C − α_C − β_C + 3), where α_C ≥ 2 and β_C ≥ 2.

Proof. According to Lem.1, the minimal task consumption consists of three parts: ⌈c(1,1), c(V_h,V_h)⌋ consumes 1 task; ⌈c(V_h+1,V_h+1), c((α_{c̃a}−1)·V_h, (β_{c̃a}−1)·V_h)⌋ consumes (α_{c̃a} − 1) · (β_{c̃a} − 1) task(s); and ⌈c((α_{c̃a}−1)·V_h+1, (β_{c̃a}−1)·V_h+1), c(|R|,|S|)⌋ consumes 1 task. Then, we have N^{c̃a}_2 ≥ 1 + (α_{c̃a} − 1) · (β_{c̃a} − 1) + 1. Since ξ = N^{c̃a}_1 / N^{c̃a}_2 and N^{c̃a}_1 = α_{c̃a} · β_{c̃a}, we have ξ ≤ (α_{c̃a} · β_{c̃a}) / (1 + (α_{c̃a} − 1) · (β_{c̃a} − 1) + 1). Taking C as c̃a, Theo.4 is verified.

Theo.4 gives an estimate of, or a hint at, the worst case of task consumption. However, as the unique distribution shown in Fig.6 suggests, the probability of a worst-case scenario is negligible, which is validated by the experimental results in Sec.5. Alg.2 uses area dichotomy to split the coverage area c̃a into two sub-areas c̃a_d and c̃a_d̄, with the iteration termination condition ∀d ∈ c̃a_D, N^{c̃a} ≥ N^{c̃a_d} + N^{c̃a_d̄}; its computational complexity is at most T(|c̃a^r| · |c̃a^s|), where T(|c̃a^r| · |c̃a^s|) = κ · T(|c̃a^r| · |c̃a^s| / 2) + O(|c̃a^r| · |c̃a^s|). As illustrated in Fig.6 and Theo.4, the cells {c(k,l) | k%V_h = 0, l%V_h = 0, k ≠ |c̃a^r|, l ≠ |c̃a^s|} (the four red cells in Fig.6) are definitely not split points, hence κ = 4 and T(|c̃a^r| · |c̃a^s|) = O((|c̃a^r| · |c̃a^s|)²).

4 IMPLEMENTATION

The adaptive processing architecture is shown in Fig.7 and can be decomposed into the following five steps. Step-1: Monitoring workload. It collects information about the tuple matrix, the space occupied by the two join streams, and the task consumption of the existing processing scheme; on receiving the reported information, the controller first evaluates the task consumption according to Alg.1 and Alg.2. Step-2: Generating processing scheme. It produces a new scheme according to the workload and resource information, as presented in Sec.3. Step-3: Making task-load mapping. It finds the task-load mapping function that maximizes the non-moving data volume when transforming from the old scheme to the new scheme; this step is described in Sec.4.2. Step-4: Generating migration plan. It schedules the data migration among tasks according to the task-load mapping scheme; this step is described in detail in Sec.4.3. Step-5: Defining a consistency protocol according to the migration plan. This step defines consistency protocols to ensure the correctness of the system.

Fig. 7. Architecture of Adaptive Processing.

We omit the detailed description of Step-5 in this paper since it is the same as in [9]. Moreover, the implementation of our model includes two phases: P_C̃A (partition C̃A into c̃as) and P_ca (partition each c̃a (ca) into TaskAs). Once the final scheme is generated by Alg.1 and Alg.2, the workload adjustment strategy among cas is a key-based issue that has been explored in our previous work [11], so it is not reiterated in this paper. Since the routing strategy is the essential step for the incoming stream, we introduce it first in the following subsection.

4.1 Routing Strategy

Since our A-DSP model is a composite structure, its routing strategy includes two layers, as listed in Alg.3.

algorithm 3 Routing Strategy Algorithm
input: Input tuple: τ, Mapping function: F
output: Routing destination RD: {(ca, ‡, th)}
1: k ← E(τ)  /* extract the key of tuple τ */
2: /*** Step-1: Routing tuples in C̃ ***/
3: Ca ← ROUTINGINC̃(C̃, k, F)  /* Ca is the set of ca */
4: /*** Step-2: Routing tuples in ca ***/
5: RD ← ROUTINGINCA(Ca, τ)
6: return RD
7: function ROUTINGINC̃(C̃, k, F)
8:   K* ← F(k)
9:   foreach k* ∈ K* do
10:    x ← BasicPartition(k*), Ca ← ca_x
11:  return Ca
12: function ROUTINGINCA(Ca, τ)
13:  foreach ca* ∈ Ca do
14:    if τ ∈ R then
15:      ‡ ← row, th ← RandomInt[0 ∼ (α_{ca*} − 1)]
16:    else
17:      ‡ ← column, th ← RandomInt[0 ∼ (β_{ca*} − 1)]
18:    RD ← (ca*, ‡, th)
19:  return RD

Step-1: Routing tuples in C̃. This step is responsible for routing tuples into ca(s). To ensure that one tuple can be sent to various destinations, it uses a content-sensitive manner based on a basic mapping function F(k). To this end, Alg.3 first extracts the join key from each incoming tuple in line 1. Then it figures out which key(s) the tuple should pair with in line 8. According to the paired keys, Alg.3 generates the destination(s) for the incoming tuple based on a basic routing function (e.g., consistent hashing) in lines 9∼11.

Step-2: Routing tuples in ca. This step is responsible for determining which task(s) in a ca are destinations. Alg.3 first identifies which stream the input tuple comes from in lines 14∼17. Accordingly, all task(s) in the selected row or column are chosen as destination(s) in line 18. In Alg.3, a destination is denoted by (ca, ‡, th), meaning the input tuple should be sent to all tasks located at the th-th ‡ (‡ ∈ {row, column}, th ∈ N) in ca.

In the example of Fig.2(d), for a tuple from stream S with join key 18, Alg.3 first determines that the first ca (ca_1) is its direction, according to the join predicate expression (|R.k − S.k| ≤ 2) and the storage strategy (i.e., {k_1 ∼ k_20} ↦ ca_1, {k_21 ∼ k_40} ↦ ca_2). Since ca_1 has two columns, Alg.3 randomly selects the 0th column as its destination line (i.e., the destination is (ca_1, column, 0) in Alg.3).
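The two-layer routing of Alg.3 can be sketched as follows (ours; the key ranges and the 2 × 2 shape of ca_1 follow the Fig.2(d) example, while the shape assumed for ca_2 is illustrative):

import random

BAND = 2  # join predicate |R.k - S.k| <= 2

CAS = {  # storage strategy: key range covered by each ca, plus its shape
    'ca1': {'range': (1, 20),  'alpha': 2, 'beta': 2},
    'ca2': {'range': (21, 40), 'alpha': 1, 'beta': 2},  # assumed shape
}

def routing_in_c(key):
    # Step-1 (content-sensitive): a band tuple must reach every ca whose
    # key range intersects [key - BAND, key + BAND].
    lo, hi = key - BAND, key + BAND
    return [name for name, ca in CAS.items()
            if lo <= ca['range'][1] and hi >= ca['range'][0]]

def routing_in_ca(stream, key):
    # Step-2 (content-insensitive): inside each ca, an R tuple picks one
    # random row and an S tuple one random column (Alg.3, lines 12-18).
    dests = []
    for name in routing_in_c(key):
        ca = CAS[name]
        if stream == 'R':
            dests.append((name, 'row', random.randrange(ca['alpha'])))
        else:
            dests.append((name, 'column', random.randrange(ca['beta'])))
    return dests

print(routing_in_ca('S', 18))  # e.g., [('ca1', 'column', 0)], as in the text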

4.2 Scheme Changing

Two criteria should be guaranteed during the process of a ca's scheme change. The first is to ensure the correctness of processing, and the second is to lower the migration cost during the scheme change as far as possible. To simplify our description, some additional notations used in the following sections are summarized in Tab.4.

TABLE 4
Table of Additional Notations

Notation | Description
m_ij | task instance corresponding to TaskA_ij
h^{xR}_ij / h^{xS}_ij | the sub-stream of R/S that has been stored in m^x_ij, represented as a range [b, e]
M_o / M_n | old task matrix and new task matrix for ca
k / l | kth row and lth column in the new scheme
h^R_ij / h^S_ij | data stored in m_ij in ca
s^R_kl / s^S_kl | data for m_kl in new scheme M_n in ca, represented as a range [b, e] on stream ca^r/ca^s
tp_i | the mapping of a task pair between the old and new schemes
mp | the migration plan
TP / MP | the set of tp_i / mp

We now use m and M to denote a task instance and a task matrix, respectively, with m_ij corresponding to calculation area TaskA_ij in ca. Furthermore, we use m_ij and m_kl to represent the task instances in M_o and M_n, respectively, where i (k) and j (l) are the row and column numbers of M_o (M_n). We define the correlation coefficient λ^{ij}_{kl} to reflect the volume of data overlap between m_ij and m_kl, calculated as λ^{ij}_{kl} = |h^R_ij ∩ s^R_kl| + |h^S_ij ∩ s^S_kl|.

Alg.4 shows the task mapping method, which aims to minimize state migration when switching to the new scheme M_n. Specifically, Alg.4 first enumerates all possible mappings in the form tp_i = <m_ij, m_kl, λ^{ij}_{kl}>, as shown in lines 1∼4; it then selects the optimal mappings with the larger λ^{ij}_{kl} among all mapping sets in lines 6∼10.

algorithm 4 Task Mapping TP Generation
input: Old task matrix scheme M_o, New task matrix scheme M_n
output: Task mapping set TP
1: foreach m_ij in old scheme M_o do
2:   foreach m_kl in new scheme M_n do
3:     λ^{ij}_{kl} ← (h^R_ij ∩ s^R_kl) · |ca^r| + (h^S_ij ∩ s^S_kl) · |ca^s|
4:     TPI ← <m_ij, m_kl, λ^{ij}_{kl}>  /* TPI is a temporary set */
5: Initialize TP = Null
6: foreach tp_i in TPI in descending order of λ^{ij}_{kl} do
7:   if m_ij or m_kl in tp_i has already appeared in TP then
8:     continue
9:   <m_ij, m_kl> → TP
10: return TP

The task mapping TP generated by Alg.4 gathers the largest cumulative value of λ^{ij}_{kl}. Because each m_ij or m_kl appears in TP at most once, each correlation coefficient λ^{ij}_{kl} is independent of the others. Our algorithm always selects the mapping pair with the biggest λ^{ij}_{kl}, so it generates the maximum cumulative value of λ^{ij}_{kl}. In other words, Alg.4 produces the minimal migration cost during the scheme change.
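A minimal Python sketch of Alg.4 (ours; one-dimensional intervals stand in for the stored ranges h and the target ranges s of one ca):

def overlap(a, b):
    # Length of the intersection of ranges [a0, a1] and [b0, b1].
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def task_mapping(old, new):
    # Lines 1-4: enumerate every (old, new) pair with its correlation
    # coefficient lambda; lines 5-10: greedily keep the largest-lambda
    # pairs, using each old and each new task at most once.
    tpi = [(overlap(o['R'], n['R']) + overlap(o['S'], n['S']), oi, ni)
           for oi, o in old.items() for ni, n in new.items()]
    used_old, used_new, tp = set(), set(), []
    for lam, oi, ni in sorted(tpi, reverse=True):
        if oi in used_old or ni in used_new:
            continue
        used_old.add(oi); used_new.add(ni); tp.append((oi, ni))
    return tp

old = {'m00': {'R': (0, 1), 'S': (0.0, 0.5)}, 'm01': {'R': (0, 1), 'S': (0.5, 1.0)}}
new = {'m00': {'R': (0, 1), 'S': (0.0, 1/3)}, 'm01': {'R': (0, 1), 'S': (1/3, 2/3)},
       'm02': {'R': (0, 1), 'S': (2/3, 1.0)}}
print(task_mapping(old, new))  # keeps m01->m02 and m00->m00; the remaining
                               # new task m01 is filled purely by migration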

Considering the scheme generation in Sec.3 and the task mapping generation discussed above, we design an advanced scheme mapping for reducing the migration cost. For ease of explanation, we take a one-dimensional scheme division as an example and treat its total volume as unit "1", as shown in Fig.8, where h_i (s_i) represents the range processed by task i.

A naive method is to divide the stream into even ranges and assign each range to one task, as shown in Fig.8(a). In the old scheme, there are 4 tasks (No.0∼3) and each task manages 25% of the whole stream, represented as h_i (i ≤ 3). When it is scaled out to 5 tasks, 5 even ranges are generated and each range s_j (j ≤ 4) maps to one task. Based on Alg.4, we can figure out the best mapping from {h_i} to {s_j}, shown in Fig.8(a) by the dotted lines. The shadowed ranges in the old scheme will be migrated among tasks, and the newly added task will be assigned the data in range [2/5, 3/5], labeled s_4. The total volume of data migration is 1/20 + 1/5 + 1/20 = 3/10.

Fig. 8. Advanced Scheme Generation and Task Mapping: (a) original; (b) optimized. (Shadowed parts mark the migration ranges.)

Join correctness is independent of the order of tuples, as long as we can guarantee that the same row (column) has the same states, as described in Sec.2. In this context, we can optimize the task mapping with a start-point-alignment method as follows. We first align the range start points of {s_j} in the new scheme with those of {h_i} in the old scheme, if any. In Fig.8(b), the old scheme is scaled out to 5 tasks. We align the start points of the 4 ranges s_0 ∼ s_3 from the new scheme with h_0 ∼ h_3 from the old scheme. Each h_i cuts off the shadowed 1/20 (= 1/4 − 1/5) of its range and moves it to the newly inserted task, s_4. In such a case, the total migration volume is 1/20 + 1/20 + 1/20 + 1/20 = 1/5, which is less than the migration cost generated by Alg.4. Scaling down is almost the same: we align the start points of the new tasks with some of the old ones, and split the ranges managed by the removed nodes across the tasks kept in the new scheme.
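The migration-volume arithmetic of Fig.8 can be verified in a few lines (a sketch of ours; ranges are fractions of the unit stream):

def moved_volume(old, new, mapping):
    # Volume a mapped old task must ship out: its own range minus the
    # overlap with the new range it is mapped to.
    total = 0.0
    for oi, ni in mapping.items():
        (o0, o1), (n0, n1) = old[oi], new[ni]
        total += (o1 - o0) - max(0.0, min(o1, n1) - max(o0, n0))
    return total

old = {i: (i / 4, (i + 1) / 4) for i in range(4)}

# Fig.8(a): even 1/5-sized ranges; the new task s4 takes [2/5, 3/5].
even = {0: (0.0, 0.2), 1: (0.2, 0.4), 2: (0.6, 0.8), 3: (0.8, 1.0)}
print(moved_volume(old, even, {i: i for i in range(4)}))     # ~0.3, i.e., 3/10

# Fig.8(b): start-point alignment keeps each s_i = [i/4, i/4 + 1/5];
# the four 1/20 cuts at the range ends together form s4.
aligned = {i: (i / 4, i / 4 + 0.2) for i in range(4)}
print(moved_volume(old, aligned, {i: i for i in range(4)}))  # ~0.2, i.e., 1/5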

4.3 Migration Plan Generation

Figuring out which part of the data should be migrated among tasks is necessary information in the process of scheme changing. We use ♦ to represent the join stream, ♦ ∈ {ca^r, ca^s}, and n^♦_kl to denote the data that are lacking in m_kl; then the range of data n^♦_kl is:

n^♦_kl = s^♦_kl − (s^♦_kl ∩ h^♦_kl).  (15)

Then we define the migration plan as mp = <m_ij, m_kl, ν^♦>, which can be read as: for stream ♦, the data ν^♦ should be copied from m_ij in old scheme M_o to m_kl in new scheme M_n, where ν^♦ is:

ν^♦ = h^♦_ij ∩ n^♦_kl.  (16)

For each m_kl, we use d^♦_kl to represent the data which are no longer needed and should be moved out; d^♦_kl can be calculated as:

d^♦_kl = h^♦_kl − (s^♦_kl ∩ h^♦_kl).  (17)

In this case, the migration plan can be represented as mp = <⊘, m_kl, d^♦_kl>. A migration plan with mark ⊘ can be read as: for stream ♦, delete data d^♦_kl from m_kl.
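The range algebra of Equ.15∼17 reduces to simple interval arithmetic; a sketch (ours, using one stream ♦ with contiguous ranges):

def intersect(a, b):
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None

def subtract(a, b):
    # Range difference a - b as a list of disjoint pieces (b is a sub-range of a).
    if b is None:
        return [a]
    parts = []
    if a[0] < b[0]:
        parts.append((a[0], b[0]))
    if b[1] < a[1]:
        parts.append((b[1], a[1]))
    return parts

def n_kl(s_kl, h_kl):
    # Equ.15: data m_kl lacks under the new scheme.
    return subtract(s_kl, intersect(s_kl, h_kl))

def d_kl(s_kl, h_kl):
    # Equ.17: data m_kl holds but no longer needs.
    return subtract(h_kl, intersect(s_kl, h_kl))

# m_01 of the Fig.9 example on stream S: stored [1/2, 1], target [1/3, 2/3]:
print(n_kl((1/3, 2/3), (1/2, 1)))  # [(1/3, 1/2)] must be shipped in (Equ.16 picks sources)
print(d_kl((1/3, 2/3), (1/2, 1)))  # [(2/3, 1)] is removed by a <deletion, m_kl, d> plan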

The generation of the migration plan is described in Alg.5 and can be divided into four steps. Step-1: Discarding states that are no longer needed in M_n. As shown in line 1 of Alg.5, this step cleans the useless states generated by the old scheme M_o; some states are discarded when the scheme change happens. Step-2: Filling stream data for task instances in the new scheme. We can obtain the whole data set of stream R or S by combining the data from the first row or the first column of M_o. As in lines 2∼11, we fill each task instance in M_n using tuples stored in the first row or column of M_o. Step-3: Handling the last row or column of an irregular scheme. This step makes the scheme use the resources fully; it redistributes the subset of primary-stream data located in task instances P_δ ∼ P_γ in lines 12∼18. Step-4: Deleting useless tuples. It deletes the data marked with ⊘ in mp under the new scheme M_n in lines 19∼21.

Among the above steps, Step-1 is triggered when the old scheme M_o is an irregular matrix, and Step-3 is triggered when the new scheme M_n is irregular. To make this comprehensible, we first walk through an example with a regular scheme transformation (using just Step-2 and Step-4 of Alg.5) and then discuss the details of irregular scheme generation.

A scheme change from 2 × 2 to 2 × 3 is depicted in Fig.9, where we again treat the total volume of ca^r/ca^s as unit "1". In the old scheme M_o, each task manages half of the stream data from ca^r and ca^s, as shown in Fig.9(a): each row manages half of ca^r, with h^R_00 = h^R_01 = [0, 1/2] and h^R_10 = h^R_11 = [1/2, 1]; each column manages half of ca^s, with h^S_00 = h^S_10 = [0, 1/2] and h^S_01 = h^S_11 = [1/2, 1]. When the workload of stream ca^s increases, the system may scale out by adding one more column with two tasks, using a 2 × 3 scheme as shown in Fig.9(b). In this case, the data partitions of ca^r are unchanged: tasks in the first row still manage half of the volume (s^R_0j = [0, 1/2], j ∈ {0, 1, 2}) and tasks in the second row manage the other half (s^R_1j = [1/2, 1], j ∈ {0, 1, 2}). Stream ca^s is split into three partitions for the three columns, each of which manages a 1/3 range of data, that is, s^S_i0 = [0, 1/3], s^S_i1 = [1/3, 2/3], and s^S_i2 = [2/3, 1], with i ∈ {0, 1}.

algorithm 5 Migration Plan Generation
input: Old scheme M_o, New scheme M_n, Task mapping TP
output: Migration plan MP
1: Discard states with mark ⊘ for each task in the new scheme
2: foreach row i with column 0 in old scheme M_o do
3:   foreach task m_kl in new scheme M_n do
4:     if h^R_i0 ∩ n^R_kl ≠ ∅ then  /* according to Equ.15 & 16 */
5:       <m_i0, m_kl, h^R_i0 ∩ n^R_kl> → MP
6:       Update n^R_kl
7: foreach column j with row 0 in old scheme M_o do
8:   foreach task m_kl in new scheme M_n do
9:     if h^S_0j ∩ n^S_kl ≠ ∅ then
10:      <m_0j, m_kl, h^S_0j ∩ n^S_kl> → MP
11:      Update n^S_kl
12: foreach task x located between P_δ and P_γ do
13:   if P = R then  /* y ∈ [0, P_δ − 1] */
14:     <⊘, m_{x,D^f_γ}, m_{y,D^f_γ}, s^R_{x,D^f_γ}> → MP
15:     Set task m_{x,D^f_γ} in M_n as inactive
16:   else if P = S then
17:     <⊘, m_{D^f_γ,x}, m_{D^f_γ,y}, s^S_{D^f_γ,x}> → MP
18:     Set task m_{D^f_γ,x} in M_n as inactive
19: foreach task m_kl in new scheme M_n do
20:   <⊘, m_kl, d^R_kl> → MP  /* according to Equ.17 */
21:   <⊘, m_kl, d^S_kl> → MP
22: return MP

According to the discussion in Sec. 4.2, if we have the optimal partitioning scheme with minimal migration cost, TP is $\{\langle m^o_{00}, m^n_{00} \rangle, \langle m^o_{01}, m^n_{02} \rangle, \langle m^o_{10}, m^n_{10} \rangle, \langle m^o_{11}, m^n_{12} \rangle\}$.

In Fig.9(b), we pair the relevant tasks between Mo and Mn byassigning the same numbers for tasks. The tasks tagged with rednumbers 5© and 6© in m01 and m11 are not paired. m01 issupposed to manage data nR01 = [0, 1

2 ] and nS01 = [ 13 ,

23 ]; m11

is supposed to manage data nR11 = [ 12 , 1] and nS11 = [ 1

3 ,23 ].

According to Alg.5, in m01, sR01 is generated by duplicating car

data from m00. Since cas is reallocated by splitting into 3 partsfor the insertion of a new column, it first generates the completedata set by combining hS00 and hS01 in scheme Mo, and then sS01

is generated by replicating cas data from mS00 with hS00 by range

[ 13 ,

12 ] and from mS

01 with hS01 by range [ 12 ,

23 ]. Then the data

for sS01 is deleted from these task instances. m11 in Mn will beassigned data in the same way as m01 does.
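The regular-case behaviour of Alg. 5 (Step-2 and Step-4) can be sketched with interval arithmetic. The following Python fragment is a minimal illustration under our own assumptions, not the deployed implementation: key ranges are half-open intervals [lo, hi), row/column 0 of M_o jointly covers the stream, and the helper names (overlap, minus, migration_plan) are ours.

# Minimal sketch of the regular case of Alg. 5: copy plans (Equ. 16) and
# delete plans (Equ. 17) over half-open intervals [lo, hi). Illustrative only.

def overlap(a, b):
    # Intersection of two intervals, or None if they do not overlap.
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None

def minus(h, s):
    # Interval difference h - (s ∩ h) (Equ. 17); yields up to two pieces.
    inter = overlap(h, s)
    if inter is None:
        return [h]
    pieces = []
    if h[0] < inter[0]:
        pieces.append((h[0], inter[0]))
    if inter[1] < h[1]:
        pieces.append((inter[1], h[1]))
    return pieces

def migration_plan(old_row0, new_tasks):
    # old_row0: {source task: interval it holds} (row/column 0 of M_o).
    # new_tasks: {dest task: (n = still-needed range, s = assigned range,
    #                         h = currently held range or None)}.
    plans = []
    for dst, (n, s, h) in new_tasks.items():
        for src, held in old_row0.items():   # Step-2: copy plans, Equ. 16
            v = overlap(held, n)
            if v is not None:
                plans.append((src, dst, v))
        if h is not None:                     # Step-4: delete plans, Equ. 17
            for d in minus(h, s):
                plans.append(("DEL", dst, d))
    return plans

# Usage: the 2x2 -> 2x3 example for stream ca_s. Old column 0 holds [0, 1/2),
# old column 1 holds [1/2, 1); the new task m01 needs and keeps [1/3, 2/3)
# while currently holding [1/2, 1).
old = {"m00": (0.0, 0.5), "m01": (0.5, 1.0)}
new = {"m01_new": ((1/3, 2/3), (1/3, 2/3), (0.5, 1.0))}
print(migration_plan(old, new))
# copy [1/3, 1/2) from m00, copy [1/2, 2/3) from m01, delete [2/3, 1)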

5 EVALUATION

5.1 Experimental Setup

Environment: We implement the approaches and conduct all experiments on top of Apache Storm [1]. The Storm system is deployed on a cluster of 21 HP blade instances, each of which runs the CentOS 6.5 operating system and is equipped with two four-core Intel Xeon E5335 processors at 2.00 GHz and 32 GB RAM. Overall, 300 virtual machines are available exclusively for our experiments, each with 2 GB of dedicated memory. Data sets: We test the proposed algorithms on two types of data sets. The first is a TPC-H data set generated by the dbgen tool shipped with the TPC-H benchmark [2].


Fig. 9. Stream Distribution Example: (a) Old Scheme: Stream Distribution; (b) New Scheme: Stream Distribution.

Before feeding the data to the stream system, we pre-generate and pre-process all input data sets. We adjust the data sets to different degrees of skew on the join attributes under a Zipf distribution by choosing a value for the skew parameter z; by default, z = 1. The second data set is 10 GB of social data (from Weibo, the largest Chinese social media platform; http://open.weibo.com/wiki/2/statuses/user_timeline), consisting of 20,000,000 real feed tuples. We run a self-join on the social data to find the correlation degree among tuples.
Queries: We experiment with four join queries, namely one equi-join from the TPC-H benchmark and three synthetic band-joins. The equi-join, EQ5, represents the most expensive operation in query Q5 of the benchmark. BNCI, BMR and the Social data query are all band-joins, differing in the range of the join: BNCI is a θ-join query with a small range, while BMR is a θ-join query with a big range. BMR and the Social data query are full band-joins, which require each tuple from one stream to meet all tuples from the other stream. EQ5 and BNCI are also used in [9], [22].
Baseline Approaches: In A-DSP, we compress the tuple matrix in our experiments using the matrix compression method proposed in [31], which generates a sample matrix much smaller than the original while guaranteeing that the sample tuple matrix is of high fidelity. For comparison, we evaluate four distributed stream join algorithms as follows.

Square [12] determines the number of tasks in a simple and direct way, as defined in Equ. 9.

Dynamic [9] takes the join matrix as its processing model and assumes that the number of tasks in a matrix must be a power of 2. Since the model needs to maintain its matrix structure, if the workload of one stream increases twofold, Dynamic will double the cells along the side corresponding to this stream, while the cells along the other side are reduced by half. In addition, during system scale-out, Dynamic splits the state of every node into 4 nodes if the storage of any task exceeds half of the specified memory capacity.
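The resize rule just described can be summarized in a few lines. This is an illustrative sketch of the behaviour as stated in the text, not the authors' code; the function name and the row/column orientation (rows for stream R, columns for stream S) are our assumptions.

# Illustrative sketch of Dynamic's resize rule: doubling one stream's
# workload doubles that side of the matrix and halves the other, so the
# total task count stays a power of two.
def dynamic_resize(rows, cols, grew):
    return (rows * 2, cols // 2) if grew == "R" else (rows // 2, cols * 2)

print(dynamic_resize(4, 4, "S"))  # (2, 8): more columns for stream S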

Readj [14] is designed to minimize the load of restoring keys based on the hash function, implemented by rerouting the keys with maximal workload. The migration plan of keys for load balance is generated by pairing tasks and keys; for each task-key pair, the algorithm considers all possible swaps to find the best move alleviating workload imbalance. In Readj, σ is a configurable parameter deciding which keys take part in swap and move actions. Given a smaller σ, Readj tracks more candidate keys and thus finds better migration plans. To make the comparison fair, in each experiment we run Readj with different values of σ and report only the best result among all attempts.

Bi6 and Bi [22] handle θ-joins using a complete bipartite graph. On one side of the bipartite graph, a data stream R is decomposed into substreams by a key-based hash function, each partition of which is stored and maintained by a computation node. In the rest of this paper, we use Bi to denote the variant with only one subgroup on each side of the join-biclique and Bi6 to denote the variant with six subgroups on each side. For routing incoming tuples, they use a two-tier router architecture consisting of a shuffler and a dispatcher. The shuffler, implemented as a spout, ingests the input streams and forwards every incoming tuple to the dispatcher. The dispatcher, implemented as a bolt, forwards every tuple received from the shuffler to the corresponding units in the joiner for storage and joining.
Performance Metrics: We evaluate resource utilization and system performance through the following metrics.

Task number is the total number of tasks consumed by the system; each task is assigned a specified quota of memory space V.

Execution time is the time taken to process a certain amount of data.

Throughput is the average number of tuples processed by the system per second (time unit).

Average join request account is the average number of tuples that should be joined within each task.

Migration volume is the total number of tuples that must be moved to other tasks during a scheme change.

Resource cost (RC) is the cumulative usage of a certain resource, expressed as $RC = \sum_{r \in R} r \cdot t_r$, where r is a measurement of resource consumption (memory, network or CPU), R is the set of such measurements, and $t_r$ is the duration for which resource r is occupied.
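As a small illustration of how this metric accumulates, the snippet below computes RC from (measurement, duration) samples; the sample values are hypothetical.

# Illustrative computation of RC = sum over r in R of r * t_r.
samples = [
    (16, 120.0),   # hypothetical: 16 tasks occupied for 120 s
    (2.0, 120.0),  # hypothetical: 2 GB of memory occupied for 120 s
]
rc = sum(r * t for r, t in samples)
print(rc)  # 2160.0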

5.2 Full History Stream Joins

Fig. 10 shows the trend in task consumption and execution time while loading all 16 GB of data into the Storm system. The maximum input rate is set to consume all the computing power of each task, using sufficient parallelism for the spout in Storm.

During data loading, as shown in Fig. 10(a), our algorithm A-DSP exhibits stable performance, while Dynamic meets a sharp increase in task number. This is because Dynamic has the strict requirement that the number of tasks be a power of two; consequently, Dynamic must quadruple its tasks when the system scales out, which may waste resources. Contrarily, our algorithm A-DSP generates the processing scheme based on the current workload. Square requires its scheme to be square-shaped, and its direct division of $C_A$ makes it use more computing resources.


Fig. 10. Task Consumption and Execution Time of Full History Join: (a) Task Number vs. Loading Volume (GB); (b) Execution Time (seconds) for EQ5; (c) Execution Time for BNCI; (d) Execution Time for BMR, each under skew z ∈ {0.5, 1, 2}.

Since Bi is designed for memory optimization, it is no surprise that Bi uses the minimal number of tasks. Specifically, Bi6 sets up six subgroups for each stream, and every subgroup initially contains one task; as data flows in, the tasks inside subgroups scale out dynamically. Figs. 10(b), 10(c) and 10(d) depict processing efficiency under different data skewness when executing EQ5, BNCI and BMR. A-DSP and the matrix-join methods, including Square and Dynamic, process the join stably since data are distributed to tasks randomly. Although Bi takes up fewer tasks, its efficiency is almost three times lower than the others due to its lack of computing resources; in particular, when z = 2, severe skewness may cause tuple broadcast among groups. Since BMR is a wide-range band-join, the advantages of subgrouping in Bi6 completely disappear, as shown in Fig. 10(d). Owing to the random tuple distribution of join-matrix models, the execution time of Square and Dynamic is immune to data skewness; furthermore, our A-DSP uses a hybrid routing strategy that also benefits from this random distribution.

In Fig. 10, the execution time of Dynamic is similar to that of our method, but at the expense of more tasks. Since Bi and Bi6 take memory-usage minimization as their optimization goal, they may sacrifice processing efficiency when short of CPU resources, as shown in Fig. 10(d). From Fig. 10 we conclude that our algorithm is more scalable and efficient.

5.3 Window-based Stream Joins

For this group of experiments, we use 64 GB of data with a window size of 180 seconds and a slide size of 60 seconds. We set the stream inflow rate to about $3 \cdot 10^5$ tuples per second to make full use of CPU resources. To facilitate the description, we call a tuple that is distributed to the storage side for joining a join request. To further validate the effects of different band-joins and data skewness on system performance, we run window-based joins to test throughput under the different models.
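For intuition on the window bookkeeping behind these experiments, the sketch below keeps one stream's tuples inside a 180 s window that slides every 60 s, and treats every arriving tuple of the opposite stream as a join request probed against the stored tuples. It is a simplified single-task illustration under our own naming, not the deployed Storm topology.

# Simplified window-based store for one stream; window/slide match the
# experimental setting. Eviction happens once per slide, not per tuple.
from collections import deque

WINDOW, SLIDE = 180.0, 60.0  # seconds

class WindowedStore:
    def __init__(self):
        self.buf = deque()        # (timestamp, tuple), oldest first
        self.last_evict = 0.0

    def insert(self, ts, tup):
        self.buf.append((ts, tup))

    def probe(self, now, tup, theta):
        # A probe is one "join request": match against all stored tuples.
        if now - self.last_evict >= SLIDE:
            while self.buf and self.buf[0][0] <= now - WINDOW:
                self.buf.popleft()   # drop tuples that left the window
            self.last_evict = now
        return [s for _, s in self.buf if theta(tup, s)]

store = WindowedStore()
store.insert(0.0, "r1")
print(store.probe(10.0, "s1", lambda a, b: True))  # ['r1']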

Fig. 11. Throughput and Tuple Account of Window-based Join for Different Queries: (a) Throughput for EQ5; (b) Join Request Account for EQ5; (c) Throughput for BNCI; (d) Join Request Account for BNCI; (e) Throughput for BMR; (f) Join Request Account for BMR, each under skew z from 0.5 to 2.

Fig. 11 shows the system throughput and the average number of join requests received by each task under the different processing schemes for queries EQ5, BNCI and BMR with different data skewness. Figs. 11(a), 11(c) and 11(e) compare the throughput of each algorithm. The throughput of matrix-based algorithms is obviously higher than that of the bipartite-graph-based ones, which suffer from data broadcasting among groups and a lack of CPU resources. When executing EQ5 under low-skew data, Bi6 can route join requests more selectively to subgroups, guaranteeing execution efficiency within each task. With severer skewness, however, the amount of data broadcast among subgroups slows down system performance greatly. When executing BNCI and BMR, the band-join operations incur more broadcasts among subgroups and aggravate the CPU load compared to EQ5. This is obvious in Fig. 11(f), where the wide-range band-join requests make Bi6 lose the advantage of its selective grouping. Bi has only one subgroup for each stream, so it broadcasts the same amount of data for all three kinds of queries, as shown in Figs. 11(b), 11(d) and 11(f); the reason Bi6 shows no further variation under BMR is that BMR is a wide-range band-join. Matrix-based algorithms share the CPU load equally among tasks, but consume more tasks. In this group of experiments, the task usages of A-DSP, Square, Dynamic, Bi6 and Bi are 13, 16, 64, 12 and 6, respectively.


5.4 Adaptivity

To validate the adaptivity of our processing scheme, we simulate two scenarios: 1) we vary the selectivity of the join operation by altering the calculation range of the join predicate; 2) we keep varying the stream volume ratio $\varrho = |R| / |S|$ under a specified total stream volume of 40 GB. Both experiments are window-based, with all settings similar to Sec. 5.3. For the threshold controlling scale-out and scale-down, Dynamic uses its own setting from [9], while the other algorithms change scheme when the memory load is higher than eighty percent or lower than fifty percent of the quota configured for each task.
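The scale trigger just described reduces to a tiny rule. The 80%/50% thresholds come from the text; the function name and units are illustrative.

# Illustrative scale trigger: change scheme when memory load exceeds 80%
# or drops below 50% of the configured per-task quota.
def scale_decision(used_mb, quota_mb):
    ratio = used_mb / quota_mb
    if ratio > 0.8:
        return "scale-out"
    if ratio < 0.5:
        return "scale-in"
    return "keep"

print(scale_decision(1700, 2048))  # 'scale-out'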

Fig. 12. Performance under Various Selectivity: (a) Task Consumption; (b) Throughput Performance.

5.4.1 Adaptivity on Selectivity Transmission

To demonstrate that our A-DSP can adaptively change the dimensions of its processing architecture to satisfy various operation requirements, we disrupt the time series of the Social data and run the Social data query with various selectivities. The different selectivity scenarios are generated by changing the number of correlated days ds, where $ds = |S_1.date - S_2.date|$. Specifically, we use ds = 0, ds = 3 and ds = ∞ to denote the equivalent, range and full join operations, respectively. Fig. 12(a) shows the task consumption of the different approaches under these settings. Regardless of the join type, Square and Dynamic always consume the most tasks, because both are based on a fixed matrix architecture. Conversely, Bi and ReadJ consume the fewest tasks, for they do not need to store copies of the incoming stream; however, their complicated routing strategies and broadcast actions inevitably become an obstacle to throughput, as shown in Fig. 12(b). Fig. 12(a) shows that our A-DSP model consumes resources on demand, based on the specific join operation. Furthermore, Fig. 12(b) shows that A-DSP enables the system to provide stable, high throughput although it consumes fewer tasks.
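The three selectivity settings map to one band predicate on the date attribute. A small sketch follows; the field name "date" and the tuple representation are hypothetical.

# Band-join predicate for the Social data query: two tuples join when their
# dates differ by at most ds days. ds = 0 gives the equi-join case, ds = 3
# the range case, ds = infinity the full join case.
import math

def make_theta(ds):
    return lambda s1, s2: abs(s1["date"] - s2["date"]) <= ds

equi, band, full = make_theta(0), make_theta(3), make_theta(math.inf)
print(band({"date": 10}, {"date": 12}))  # True: |10 - 12| <= 3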

5.4.2 Adaptivity on Relative Stream Volume Change

In Fig. 13(a), we adjust $\varrho$ every three minutes; the resulting migration volumes of each approach are displayed in Fig. 13(b). Both Bi and Bi6 incur less migration cost because they hold less redundant data; however, this comes at the cost of higher CPU consumption, which inevitably decreases throughput. Dynamic suffers from high migration cost because the model requires its task number to be a power of two. Our A-DSP algorithm explores the most appropriate processing scheme, which incurs a relatively small migration cost.

5.5 Throughput and Latency

Our synthetic workload generator creates snapshots of tuples for discrete time intervals from an integer key domain K.

Fig. 13. Performance with Relative Stream Volume Change: (a) Relative Stream Volume Ratio $\varrho$ over time (minutes); (b) Migration Volume (GB) over the migration series.

The tuples follow Zipf distributions controlled by the skewness parameter z, generated using the popular generation tool from the Apache project. We use a parameter f to control the rate of distribution fluctuation across time intervals. At the beginning of a new interval, our generator keeps swapping frequencies between keys from different task instances until the change in workload is significant enough, i.e., $\frac{L_i(d) - L_{i-1}(d)}{\bar{L}_i} \geq f$. Here, $L_i(d)$ and $L_{i-1}(d)$ denote the total workload of task instance d in the i-th and (i-1)-th time interval (window), respectively, and $\bar{L}_i$ is the average load of all tasks in the i-th interval. Thus f captures the workload variability of task d between the current and the previous interval.
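A minimal sketch of this fluctuation control follows. The Zipf sampling and per-key swapping are simplified to task-level transfers, and the step size and iteration cap are our own choices; only the stopping test follows the formula above.

# Simplified per-interval behaviour of the workload generator: perturb
# task loads until some task's change exceeds f times the mean load.
import random

def fluctuate(load, f, max_steps=100_000):
    prev = dict(load)
    mean = sum(load.values()) / len(load)
    tasks = list(load)
    for _ in range(max_steps):
        if max(abs(load[d] - prev[d]) for d in tasks) / mean >= f:
            break  # change in workload is significant enough
        a, b = random.sample(tasks, 2)
        delta = 0.1 * min(load[a], load[b])  # illustrative step size
        load[a] += delta
        load[b] -= delta
    return load

random.seed(1)
print(fluctuate({"t0": 100.0, "t1": 100.0}, f=0.5))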

Fig. 14 shows the throughput and latency under varying distribution change frequency for EQ5 (Figs. 14(a) and 14(b)) and BMR (Figs. 14(c) and 14(d)). In Fig. 14, we draw the theoretical performance limit as the line labeled Ideal, which processes the corresponding query using Square under the condition that adequate resources are available. Obviously, Ideal always attains better throughput and lower processing latency than the others, but it cannot be used in real-world applications because of its resource waste. When the distribution change frequency f varies, both the throughput and latency of Bi6 change dramatically in Figs. 14(a) and 14(b); in particular, Bi6 works well only with little distribution variance (small f). Meanwhile, both Bi and Dynamic always show low throughput and high processing latency in Fig. 14, because of the broadcast action in Bi and the limited number of tasks used by Dynamic. On the other hand, our A-DSP algorithm always performs well, with performance very close to the optimal bound set by Ideal.

To expose the cost of our approach in finding the right processing scheme, Fig. 15 shows the adjustment times and migration latency of each approach under varying distribution change frequency. We run this group of experiments on BMR, with the relative stream volume change configured as in Fig. 13(a). Fig. 15(a) shows that A-DSP incurs more scheme adjustments as the data distribution changes. However, its migration latency remains acceptable to the system, as shown in Fig. 15(b), because we have designed a lightweight computation model (Alg. 4 and Alg. 5) to support rapid migration plan generation, which incurs minimal data transmission overhead and processing latency. Square uses the same migration plan generation strategy and shows similar performance. Dynamic incurs fewer adjustments since it adjusts its processing scheme only after the relative stream volume has changed; however, due to the fact that Dynamic needs to maintain a large amount of copied data, its migration latency is the largest, as shown in Fig. 15(b). Bi and Bi6, using the fewest processing tasks, generate a smaller migration volume and latency.


Fig. 14. Throughput and Latency with Varying Distribution Change Frequency f, where (a) and (b) run on EQ5, and (c) and (d) on BMR: (a)/(c) stream dynamics vs. throughput; (b)/(d) stream dynamics vs. latency (ms).

Fig. 15. Adjustment Times and Migration Latency with Varying Distribution Change Frequency f: (a) dynamics vs. adjustment times; (b) dynamics vs. migration latency (seconds).


5.6 Performance on Real Data

5.6.1 Throughput under limited resources

To better understand the performance of the approaches in action, we present the throughput dynamics on two real workloads, especially when the number of tasks the system may use is limited. On the Social data, we implement a band-join topology on Storm, distributing tuples to tasks to be stored and joined on keywords. On the Stock data, we implement a self-join over a sliding window, which maintains the recent tuples based on the window size across intervals. The results, shown in Fig. 16, indicate that our method A-DSP always produces higher throughput. Though Dynamic is content-insensitive, it produces lower throughput than A-DSP because the available resources cannot meet its demand. On both the Social data and the Stock data, Bi and Bi6 show little change in throughput across different task limits, because the biggest obstacle to their performance is the huge number of broadcast tuples for these queries. The throughput of Dynamic is only about one-third that of A-DSP when the task limit is 32, which leads to huge resource waste and is definitely undesirable for cloud-based stream processing applications.

Fig. 16. Performance under Limited Tasks (16, 32 and 64): (a) Social data; (b) Stock data.

Fig. 17. Performance on Real Data: (a) Task Consumption; (b) Execution Time; (c) Bandwidth Consumption; (d) Resource Consumption (task and bandwidth cost).

5.6.2 Resource utilization efficiency

To prove the usability of our algorithm, we run the band-join Social data query on the 10 GB Weibo data set with the settings introduced in Sec. 5.2. We load the 10 GB data set continuously and measure the resource consumption of the different algorithms. In Fig. 17(a), the bipartite-graph-based models Bi and Bi6 use fewer tasks, since they are mainly designed for memory optimization, as explained for Fig. 10. Our method A-DSP provides a flexible matrix scheme that applies for new tasks according to its real load, while Dynamic scales out in a generous way. Our algorithm completes all the work efficiently, while the bipartite-graph-based models are the slowest owing to their lack of CPU resources, as shown in Fig. 17(b).

Fig. 17(c) shows the bandwidth usage of each algorithm when loading data into the Storm system. Though Bi uses fewer tasks than A-DSP, as shown in Fig. 17(a), its bandwidth usage exceeds ours because it broadcasts the same amount of incoming tuples to all tasks on the other side of the bipartite graph; Dynamic costs the most bandwidth. Fig. 17(d) gives the overall resource cost (task and bandwidth) weighted by time duration, computed as $RC = \sum_{r \in R} r \cdot t_r$, where cost accrues with continued occupation of resources. It confirms that A-DSP handles the band-join more economically.


6 RELATED WORK

Since the early 21st century, considerable research has been devoted to designing efficient stream join operators in distributed environments with clusters of machines. As mentioned in Sec. 1, there are two categories of models for join operations in distributed stream processing systems, namely 1-DSP and 2-DSP.

1-DSP. Photon [3] is designed by Google to join data streams such as web search queries and user clicks on advertisements. It utilizes a central coordinator to implement fault-tolerant and scalable joining through key-value matching, but cannot handle θ-joins. D-Streams [36] adopts mini-batching over continuous data streams in a blocking manner on Spark. Though it supports θ-joins well, under the constraint of the window size some tuples may miss each other, and such batch-mode computing can only give approximate results. TimeStream [27], equipped with resilient substitution and dependency tracking mechanisms, provides both MapReduce-like batch processing and non-blocking stream processing; however, it suffers from excessive communication cost due to the maintenance of distributed join state. Join-Biclique [22], based on the bipartite-graph model, supports full-history and window-based stream joins. Compared to the join-matrix model used in Dynamic [9], it reduces data backup redundancy and improves resource utilization. In the meantime, considerable effort has been put into the elasticity of distributed stream processing systems [7], [13], [15], [20], [32], dealing with when and how to efficiently scale computing resources in or out.

2-DSP. In recent years, the join-matrix model for parallel and distributed joins has been restudied. Intuitively, it models a join between two data sets R and S as a matrix, where each side corresponds to one relation. Stamos et al. [29] proposed a symmetric fragment-and-replicate method to support parallel θ-joins. This method relies on replicating input tuples to ensure result completeness, extending the fragment-and-replicate algorithm [10], and suffers from high communication and computation cost. Okcan [26] proposed MapReduce techniques that adopt the join-matrix model and support parallel θ-joins, with two partitioning schemes: 1-Bucket and M-Bucket. The 1-Bucket scheme is content-insensitive but incurs high data replication; the M-Bucket scheme is content-sensitive in that it maps a tuple to a region based on the join key. Because of the inherent nature of MapReduce, these two algorithms are offline and require all data statistics to be available beforehand, which makes them unsuitable for stream join processing. For data stream processing, Elseidy et al. [9] presented an (n, m)-mapping scheme partitioning the matrix into J (J = n × m) equal regions, with algorithms that adjust the scheme adaptively to data characteristics in real time. However, [9] assumes that the number of partitions J must be a power of two. What is more, the matrix structure suffers from poor flexibility: when the matrix needs to scale out (or down), it must add (or remove) an entire row or column of cells.

Due to the characteristics of data streams, e.g., infiniteness and instantaneity, conventional join algorithms using blocking operations, e.g., sorting, no longer work. For stream join processing, much effort has been put into designing non-blocking algorithms. Wilschut presented the symmetric hash join SHJ [33], a special type of hash join that assumes the entire relations can be kept in main memory: for each incoming tuple, SHJ progressively builds two hash tables, one per relation, and probes them mutually to identify joining tuples. XJoin [30] and DPHJ [19] both extend SHJ by allowing parts of the hash tables to be spilled to disk for later processing, greatly enhancing the applicability of the algorithm. Dittrich [8] designed a non-blocking algorithm called PMJ that inherits the advantages of sorting-based algorithms and also supports inequality predicates. The organic combination of XJoin and PMJ led to HMJ [24], presented by Mokbel. However, all of these are centralized algorithms that rely on a central entity to do the join computation, and they cannot be applied directly in distributed computing.
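For readers unfamiliar with SHJ, the build-and-probe symmetry it introduces can be captured in a few lines. This is a minimal in-memory, equi-join sketch of the idea as described above, with class and method names of our own choosing.

# Minimal sketch of the symmetric hash join (SHJ) idea: each incoming tuple
# is inserted into its own relation's hash table and probed against the
# other relation's table, producing results without blocking.
from collections import defaultdict

class SHJ:
    def __init__(self):
        self.tables = {"R": defaultdict(list), "S": defaultdict(list)}

    def process(self, side, key, tup):
        other = "S" if side == "R" else "R"
        self.tables[side][key].append(tup)                   # build own table
        return [(tup, m) for m in self.tables[other][key]]   # probe the other

shj = SHJ()
shj.process("R", 7, "r1")
print(shj.process("S", 7, "s1"))  # [('s1', 'r1')]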

7 CONCLUSION

This paper presents a new parallel stream join mechanism for θ-joins in distributed stream processing engines. Inspired by the 1-DSP and 2-DSP models, we propose the A-DSP model, which guarantees processing correctness together with efficient resource usage. A-DSP can handle parallel stream joins in a more flexible and adaptive manner, especially to address the demand for operational cost minimization on cloud platforms. Additionally, we design algorithms of moderate complexity to generate task mapping schemes and migration plans that fulfill the objective of operational cost reduction. Empirical studies show that our proposal incurs the smallest cost when the stream join operator runs under a pay-as-you-go pricing scheme. In the future, we will explore more flexible and scalable mapping and migration strategies with the sampled matrix, in pursuit of smaller migration cost. We will also design a new mechanism to ensure the correctness of processing results when extremely dynamic keys exist.

ACKNOWLEDGMENTS

Rong Zhang and Aoying Zhou are supported by the Key Program of the National Natural Science Foundation of China (No. 2018YFB1003402) and NSFC (No. 61672233). Kai Zheng is supported by NSFC (No. 61972069, 61836007, 61832017, 61532018). Junhua Fang is supported by NSFC (No. 61802273), the Postdoctoral Science Foundation of China (No. 2017M621813), the Postdoctoral Science Foundation of Jiangsu Province of China (No. 2018K029C), and the Natural Science Foundation for Colleges and Universities in Jiangsu Province of China (No. 18KJB520044). Xiaofang Zhou is supported by NSFC (No. 61772356). This work is also partially supported by the Major Program of the Natural Science Foundation of the Educational Commission of Jiangsu Province (No. 19KJA610002).

REFERENCES

[1] Apache Storm. http://storm.apache.org/.
[2] The TPC-H Benchmark. http://www.tpc.org/tpch.
[3] R. Ananthanarayanan, V. Basker, S. Das, et al. Photon: fault-tolerant and scalable joining of continuous data streams. In SIGMOD, pages 577-588, 2013.
[4] M. Anis Uddin Nasir, G. De Francisci Morales, et al. The power of both choices: Practical load balancing for distributed stream processing engines. In ICDE, pages 137-148, 2015.
[5] L. Cheng, S. Kotoulas, T. E. Ward, and G. Theodoropoulos. Robust and skew-resistant parallel joins in shared-nothing systems. In CIKM, pages 1399-1408, 2014.
[6] G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58-75, 2005.
[7] J. Ding, T. Z. J. Fu, R. T. B. Ma, M. Winslett, Y. Yang, Z. Zhang, and H. Chao. Optimal operator state migration for elastic data stream processing. CoRR, abs/1501.03619, 2015.
[8] J.-P. Dittrich, B. Seeger, et al. Progressive merge join: A generic and non-blocking sort-based join algorithm. In VLDB, pages 299-310, 2002.
[9] M. Elseidy, A. Elguindy, A. Vitorovic, and C. Koch. Scalable and adaptive online joins. In VLDB, 2014.
[10] R. S. Epstein, M. Stonebraker, and E. Wong. Distributed query processing in a relational data base system. In SIGMOD, pages 169-180, 1978.
[11] J. Fang, R. Zhang, T. Z. Fu, Z. Zhang, A. Zhou, and X. Zhou. Distributed stream rebalance for stateful operator under workload variance. TPDS, 2018.
[12] J. Fang, R. Zhang, X. Wang, T. Z. Fu, Z. Zhang, and A. Zhou. Cost-effective stream join algorithm on cloud system. In CIKM, pages 1773-1782, 2016.
[13] T. Z. J. Fu, J. Ding, R. T. B. Ma, M. Winslett, Y. Yang, and Z. Zhang. DRS: Dynamic resource scheduling for real-time analytics over fast streams. In ICDCS, 2015.
[14] B. Gedik. Partitioning functions for stateful data parallelism in stream processing. VLDBJ, 23(4):517-539, 2014.
[15] L. Gu, M. Zhou, Z. Zhang, M.-C. Shan, A. Zhou, and M. Winslett. Chronos: An elastic parallel framework for stream benchmark generation and simulation. In ICDE, pages 101-112, 2015.
[16] B. Gufler, N. Augsten, A. Reiser, and A. Kemper. Load balancing in MapReduce based on scalable cardinality estimates. In ICDE, pages 522-533, 2012.
[17] P. Hilton and J. Pedersen. Catalan numbers, their generalization, and their uses. The Mathematical Intelligencer, 13(2):64-75, 1991.
[18] R. Huebsch, M. Garofalakis, J. Hellerstein, and I. Stoica. Advanced join strategies for large-scale distributed computation. In VLDB, pages 1484-1495, 2014.
[19] Z. Ives, D. Florescu, M. Friedman, A. Levy, and D. Weld. An adaptive query execution system for data integration. In SIGMOD, pages 299-310, 1999.
[20] C. Jin, A. Zhou, J. X. Yu, J. Z. Huang, and F. Cao. Adaptive scheduling for shared window joins over data streams. Frontiers of Computer Science in China, 1(4):468-477, 2007.
[21] Y. Kwon, M. Balazinska, et al. SkewTune: mitigating skew in MapReduce applications. In SIGMOD, 2012.
[22] Q. Lin, B. C. Ooi, Z. Wang, and C. Yu. Scalable distributed stream join processing. In SIGMOD, pages 811-825, 2015.
[23] B. Liu, Y. Zhu, M. Jbantova, et al. A dynamically adaptive distributed system for processing complex continuous queries. In VLDB, pages 1338-1341, 2005.
[24] M. Mokbel, M. Lu, and W. Aref. Hash-merge join: A non-blocking join algorithm for producing fast and early join results. In ICDE, pages 251-263, 2004.
[25] M. A. U. Nasir, M. Serafini, et al. When two choices are not enough: Balancing at scale in distributed stream processing. In ICDE, 2016.
[26] A. Okcan and M. Riedewald. Processing theta-joins using MapReduce. In SIGMOD, pages 949-960, 2011.
[27] Z. Qian, Y. He, C. Su, et al. TimeStream: reliable stream computation in the cloud. In EuroSys, pages 1-14, 2013.
[28] C.-P. Schnorr and M. Euchner. Lattice basis reduction: Improved practical algorithms and solving subset sum problems. Mathematical Programming, 66(1-3):181-199, 1994.
[29] J. W. Stamos and H. C. Young. A symmetric fragment and replicate algorithm for distributed joins. TPDS, 4(12):1345-1354, 1993.
[30] T. Urhan and M. Franklin. Dynamic pipeline scheduling for improving interactive query performance. In VLDB, 2001.
[31] A. Vitorovic, M. ElSeidy, and C. Koch. Load balancing and skew resilience for parallel joins. In ICDE, 2016.
[32] L. Wang, M. Zhou, Z. Zhang, Y. Yang, A. Zhou, and D. Bitton. Elastic pipelining in an in-memory database cluster. In SIGMOD, pages 1279-1294, 2016.
[33] A. Wilschut and P. Apers. Dataflow query execution in a parallel main-memory environment. Distributed and Parallel Databases, 1(1):103-128, 1993.
[34] Y. Xing, J. Hwang, U. Cetintemel, and S. Zdonik. Providing resiliency to load variations in distributed stream processing. In VLDB, pages 775-786, 2006.
[35] Y. Xu, P. Kostamaa, X. Zhou, and L. Chen. Handling data skew in parallel joins in shared-nothing systems. In SIGMOD, pages 1043-1052, 2008.
[36] M. Zaharia, T. Das, et al. Discretized streams: fault-tolerant streaming computation at scale. In SOSP, pages 423-438, 2013.
[37] K. Zheng, Y. Zheng, N. J. Yuan, S. Shang, and X. Zhou. Online discovery of gathering patterns over trajectories. TKDE, 26(8):1974-1988, 2014.
[38] Y. Zhou, Y. Yan, B. C. Ooi, K.-L. Tan, and A. Zhou. Optimizing continuous multijoin queries over distributed streams. In CIKM, pages 221-222, 2005.
[39] Y. Zhou, Y. Yan, F. Yu, and A. Zhou. PMJoin: Optimizing distributed multi-way stream joins by stream partitioning. In DASFAA, pages 325-341, 2006.

Junhua Fang is a lecturer in the Advanced Data Analytics Group at the School of Computer Science and Technology, Soochow University, Suzhou. Before joining Soochow University, he earned his Ph.D. degree in computer science from East China Normal University, Shanghai, in 2017. He is a member of the CCF. His research focuses on distributed databases and parallel streaming analytics.

Rong Zhang is a member of the China Computer Federation. She received her Ph.D. degree in computer science from Fudan University in 2007. She joined East China Normal University in 2011 and is currently a professor there. From 2007 to 2010, she worked as an expert researcher at NICT, Japan. Her current research interests include knowledge management, distributed data management and database benchmarking.

Yan Zhao received the Master's degree in Geographic Information Systems from the University of Chinese Academy of Sciences in 2015. She is currently a Ph.D. student at Soochow University. Her research interests include spatial databases and trajectory computing.

Kai Zheng is a Professor of Computer Science at the University of Electronic Science and Technology of China. He received his Ph.D. degree in Computer Science from The University of Queensland in 2012. He has been working in the areas of spatio-temporal databases, uncertain databases, social-media analysis, in-memory computing and blockchain technologies. He has published over 100 papers in prestigious journals and conferences in the data management field, such as SIGMOD, ICDE, VLDB Journal, ACM Transactions and IEEE Transactions. He is a member of the IEEE.

Xiaofang Zhou received the BSc and MSc degrees in computer science from Nanjing University, China, in 1984 and 1987, respectively, and the PhD degree in computer science from the University of Queensland, Australia, in 1994. He is a professor of computer science at the University of Queensland and an adjunct professor in the School of Computer Science and Technology, Soochow University, China. His research interests include spatial and multimedia databases, high performance query processing, web information systems, data mining, bioinformatics, and e-research. He is a fellow of the IEEE.

Aoying Zhou is a professor of Computer Science at East China Normal University (ECNU), where he heads the School of Data Science & Engineering. Before joining ECNU in 2008, he worked in the Computer Science Department of Fudan University for 15 years. He is a winner of the National Science Fund for Distinguished Young Scholars supported by NSFC and holds a professorship appointment under the Changjiang Scholars Program of the Ministry of Education. He is a vice-director of ACM SIGMOD China and of the Database Technology Committee of the China Computer Federation. He serves on the editorial boards of the VLDB Journal, the WWW Journal, and others. His research interests include data management, in-memory cluster computing, big data benchmarking and performance optimization.