An Adaptive Partitioning Scheme for Ad-hoc and
Time-varying Database Analytics
by
Anil Shanbhag
B.Tech. in Computer Science, Indian Institute of Technology Bombay, 2014
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2016
© Massachusetts Institute of Technology 2016. All rights reserved.
Author: Department of Electrical Engineering and Computer Science
May 19, 2016
Certified by: Samuel Madden
Professor of Electrical Engineering and Computer Science, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski
Chairman, Department Committee on Graduate Students
An Adaptive Partitioning Scheme for Ad-hoc and
Time-varying Database Analytics
by
Anil Shanbhag
Submitted to the Department of Electrical Engineering and Computer Science
on May 19, 2016, in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
Abstract
Data partitioning significantly improves query performance in distributed database systems. A large number of techniques have been proposed to efficiently partition a dataset, often focusing on finding the best partitioning for a particular query workload. However, many modern analytic applications involve ad-hoc or exploratory analysis where users do not have a representative query workload. Furthermore, workloads change over time as businesses evolve or as analysts gain a better understanding of their data. Static workload-based data partitioning techniques are therefore not suitable for such settings. In this thesis, we present Amoeba, an adaptive distributed storage system for data skipping. It does not require an upfront query workload and adapts the data partitioning according to the queries posed by users over time. We present the data structures, partitioning algorithms, and an efficient implementation on top of Apache Spark and HDFS. Our experimental results show that the Amoeba storage system provides improved query performance for ad-hoc workloads, adapts to changes in the query workloads, and converges to a steady state in case of recurring workloads. On a real-world workload, Amoeba reduces the total workload runtime by 1.8x compared to Spark with data partitioned and 3.4x compared to unmodified Spark.
Thesis Supervisor: Samuel Madden
Title: Professor of Electrical Engineering and Computer Science
Acknowledgments
I would like to thank Alekh Jindal, Qui Nguyen, Aaron Elmore, Jorge Quiane and
Divyakanth Agarwal who have contributed many ideas to this work and helped build
the system.
I would also like to thank Prof. Samuel Madden, my thesis supervisor, for being
a constant source of guidance and feedback in this project and outside.
Finally, I am always grateful to my family and friends, who encouraged me and
supported me along the way.
Contents
1 Introduction
2 Related Work
3 System Overview
4 Upfront Data Partitioning
4.1 Key Ideas
4.2 Upfront Partitioning Algorithm
5 Adaptive Repartitioning
5.1 Workload Monitor and Cost Model
5.2 Partitioning Tree Transformations
5.3 Divide-And-Conquer Repartitioning
5.4 Handling Multiple Predicates
6 Implementation
6.1 Initial Robust Partitioning
6.2 Query Execution
7 Discussion
7.1 Leveraging Replication
7.2 Handling Joins
8 Evaluation
8.1 Upfront Partitioning Performance
8.2 Micro-benchmarks
8.3 Amoeba on Real Workload
9 Conclusion
Appendices
A Fast Remote Reads
List of Figures
1-1 Example partitioning tree with 8 blocks
3-1 Amoeba Architecture
4-1 Partitioning Techniques
4-2 Upfront Partitioning Algorithm Example
5-1 Node swap in the partitioning tree
5-2 Illustrating adaptive partitioning when predicate A2 appears repeatedly
5-3 Node pushdown in partitioning tree
5-4 Node rotation in partitioning tree
7-1 Heterogeneous Replication
8-1 Ad-hoc query runtimes for different attributes of TPC-H lineitem
8-2 Comparing the upload time in Amoeba with HDFS
8-3 Comparing performance of upfront partition tree vs kd-tree
8-4 Query runtimes for changing query attributes on TPC-H lineitem
8-5 Query runtimes for changing predicates on the same attribute of TPC-H lineitem
8-6 Cumulative optimizer runtime across 100 queries
8-7 Cumulative repartitioning cost
8-8 Total runtimes of the different approaches
A-1 Response time with varying data locality (%)
List of Tables
5.1 The cost and benefit estimates for different partitioning tree transformations
Chapter 1
Introduction
Collecting data is becoming ever easier and cheaper, leading to ever-larger
datasets. This big data, ranging from sources such as sensors to server logs, has the
potential to uncover business insights and help businesses make informed decisions,
but only if they can analyze it effectively. For this reason, companies have adopted
distributed database systems as the go-to solution for storing and analyzing their
data.
Data partitioning is a well-known technique for improving the performance of
distributed database applications. For instance, when selecting subsets of the data,
having the data pre-partitioned on the selection attribute allows the system to skip
irrelevant pieces of the data rather than scan the entire dataset. Joins and aggregations
also benefit from data partitioning. Because of these performance gains, the database
research community has proposed many techniques to find a good data partitioning for
a query workload. Such workload-based data partitioning techniques assume that
the query workload is provided upfront or collected over time [1, 2, 3, 4, 5, 6, 7].
Unfortunately, in many cases a static query workload may not be known a priori.
One reason for this is that modern data analytics is data-centric and tends to in-
volve ad-hoc and exploratory analysis. For example, an analyst may look for anoma-
lies and trends in a user activity log, such as from web servers, network systems,
transportation services, or any other sensors. Such analyses are ad-hoc and a repre-
sentative set of queries is not available upfront. To illustrate, production workload
traces from a Boston-based analytics startup reveal that even after seeing 80% of the
Figure 1-1: Example partitioning tree with 8 blocks
workload, the remaining 20% of the workload still contains 57% new queries. These
workload patterns are hard to collect in advance. Furthermore, collecting the query
workload is tedious as analysts would typically like to start using data as soon as
possible, rather than having to provide a workload before obtaining acceptable per-
formance. Providing a workload upfront has the further complexity that it can overfit
the database to that workload, requiring all other queries to scan unnecessary data
partitions to compute answers.
Distributed storage systems like HDFS [8] store large files as a collection of
fixed-size blocks (for HDFS, the block size is usually 64/128MB). A block acts as the
smallest unit of storage and gets replicated across multiple machines. The key idea
is to exploit this block structure to build and maintain a partitioning tree on top
of the table. A partitioning tree is a binary tree which partitions the data into a
number of small partitions, each roughly the size of a block. Each such partition contains
a hypercube of the data. Figure 1-1 shows an example partitioning tree for a 1GB
dataset over 4 attributes with block size 128MB. The data is split into 8 blocks, the
same as what would be created by a block-based system; however, each block now has
additional metadata. For example, block 1's tuples satisfy A ≤ 4 & B ≤ 5 & D ≤ 4.
As a result, any query can be answered by reading only a subset of the partitions.
The partitioning tree is improved over time based on queries submitted by the user.
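The read path over such a tree can be sketched as follows. This is a simplified illustration, not Amoeba's actual code; the `Node` class and `blocks_to_read` helper are hypothetical, and predicates are modeled as closed ranges per attribute.

```python
# Sketch of query-time block skipping over the tree of Figure 1-1:
# internal nodes hold (attribute, cut-point), leaves hold block ids.
# A query's predicates prune whole subtrees.

class Node:
    def __init__(self, attr, cut, left, right):
        self.attr, self.cut, self.left, self.right = attr, cut, left, right

def blocks_to_read(node, preds):
    """Return ids of blocks whose hypercube may satisfy `preds`.
    `preds` maps attribute -> (lo, hi) closed range."""
    if isinstance(node, int):          # leaf: a block id
        return [node]
    lo, hi = preds.get(node.attr, (float("-inf"), float("inf")))
    out = []
    if lo <= node.cut:                 # left subtree holds attr <= cut
        out += blocks_to_read(node.left, preds)
    if hi > node.cut:                  # right subtree holds attr > cut
        out += blocks_to_read(node.right, preds)
    return out

# The tree of Figure 1-1: A4 at the root, then B5/C6, and so on.
tree = Node("A", 4,
            Node("B", 5, Node("D", 4, 1, 2), Node("D", 5, 3, 4)),
            Node("C", 6, Node("B", 3, 5, 6), Node("D", 4, 7, 8)))

# A query with predicate A <= 3 touches only blocks 1-4.
print(blocks_to_read(tree, {"A": (float("-inf"), 3)}))  # [1, 2, 3, 4]
```

With no predicates the traversal degenerates to a full scan of all 8 blocks, which is exactly the behavior of an unpartitioned block-based system.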
We implemented this idea in Amoeba. Amoeba is designed with three key
properties in mind: (1) it requires no upfront query workload, while still providing
good performance for a wide range of ad-hoc queries; (2) as users pose more queries
over certain attributes, it adaptively repartitions the data, to gradually perform better
on queries over frequent attributes and attribute ranges; and (3) it provides robust
performance that avoids over-fitting to any particular workload.
The system exposes a relational storage manager, consisting of a collection of
tables, with support for predicate-based data access, i.e., scanning a table with a set
of predicates to filter on; for example, scanning the table employee with
the predicate age ≥ 30 and 1000 ≤ salary ≤ 2000. Because the stored data is partitioned,
such a scan ends up accessing only a subset of the data. The system is self-tuning:
as users start submitting queries to the system, it specializes the partitioning to the
observed patterns over time.
Query optimizers tend to push predicates down to the scan operator, and big data
systems like Spark SQL [9] allow custom predicate-based scan operators. Amoeba
integrates as a predicate-based scan operator, and the repartitioning operations that
change the data layout are invisible to the user.
Adaptive partitioning/indexing is extensively used in modern single-node in-
memory column stores to achieve good performance. These techniques, called
cracking [10], have been used to generate an adaptive index on a column based on in-
coming queries. Partial sideways cracking [11] extends this to generate adaptive indexes
on multiple columns. Cracking happens on each query and maintains additional
structures to create the index. The reason cracking cannot be applied in a dis-
tributed setting is that the cost of re-partitioning is very high. Each round of
re-partitioning needs to be carefully planned to amortize the cost associated with it.
Our approach is complementary to many other physical storage optimizations.
For example, decomposed storage layouts (i.e., column-stores) are designed to avoid
reading columns that aren’t accessed by a query. In contrast, partitioning schemes,
including ours, aim to avoid reading entire partitions of the dataset. Although our
prototype does not use a decomposed storage model, there is nothing about our
approach that cannot work in such a setting: individual columns or column groups
could easily be separately partitioned and accessed in our approach.
In summary, we make the following major contributions:
• We describe a set of techniques to aggressively partition a dataset over several
attributes and propose an algorithm to generate a robust initial partitioning
tree. Our robust partitioning tree spreads the benefits of data partitioning
across all attributes in the schema. It does not require an upfront query work-
load, and also handles data skew and correlations (Chapter 4).
• We describe an algorithm to adaptively repartition the data based on the ob-
served workload. We piggyback on query processing to repartition only the
accessed portions of the data. We present a divide-and-conquer algorithm to
efficiently pick the best repartitioning strategy, such that the expected benefit of
repartitioning outweighs the expected cost. To the best of our knowledge, this
is the first work to propose adaptive data partitioning for analytical workloads
(Chapter 5).
• We describe an implementation of our system on top of the Hadoop Distributed
File System (HDFS)¹ and Spark. This storage system consists of: (i) an upfront
partitioning pipeline to load the dataset into Amoeba, and (ii) an adaptive query
executor used to read data out of the system (Chapter 6).
• We present a detailed evaluation of the Amoeba storage system on real and
synthetic query workloads to demonstrate three key properties: (i) robustness:
the system gives improved performance over ad-hoc queries right from the start,
(ii) adaptivity: the system adapts to the changes in the query workload, and
(iii) convergence: the system approaches the ideal performance when a partic-
ular workload repeats over and over again. We also evaluate our results on a
real query workload from a local startup.
¹Amoeba could equally well work with any other distributed file system.
Chapter 2
Related Work
Database partitioning and indexing have a rich history in the database literature.
Partitioning involves organizing the data in a structured manner while indexing cre-
ates auxiliary structures which can be used to accelerate queries. Data partitioning
could be vertical (typically used for projection and late materialization) or horizontal
(typically used for selections, group-by, and joins). Vertical partitioning is typically
useful for analytical workloads and has been studied heavily in the past, both for
static [12, 13] and dynamic [14] cases. Horizontal partitioning has been considered
both for transactional and analytical workloads. Amoeba essentially does adaptive
horizontal partitioning based on a partitioning tree. Broadly, the related work can be
grouped into three categories: workload-based partitioning tools, multi-dimensional
indexing and adaptive indexing in single node systems.
Workload-based partitioning: For transactional workloads, researchers have pro-
posed fine-grained partitioning [4], a hybrid of fine- and coarse-grained partitioning [5],
and skew-aware partitioning [6]. For analytical workloads, researchers have proposed
to leverage deep integration with the optimizer in order to make better decisions [3],
to take the interdependence of different design decisions into account [2], and even to
integrate vertical and horizontal partitioning decisions [1]. Traditional database par-
titioning, however, is still workload-driven, and requires that the workload is either
provided upfront or monitored and collected over time. MAGIC [15] aims to support
multiple kinds of queries by declustering data on multiple attributes. The data is
arranged into directories which can be distributed across processors. MAGIC also
requires the query workload as well as the resource requirements in order to come
up with the directory in the first place. As a result, the directories need to be reconfig-
ured every time the workload changes. Similar to MAGIC, both Oracle and MySQL
support sub-partitioning to create nested partitions on multiple attributes [16, 17].
However, the sub-partitions are useful only if the outer attributes appear in the group-
by or join clause.
Big data storage systems, such as HDFS, partition datasets based on size. Devel-
opers can later create attribute-based partitioning using a variety of data processing
tools, e.g., Hive [18], SCOPE [19], Shark [20], and Impala [21]. However, such a
partitioning is no different from traditional database partitioning, as (i) partitioning
is a static, one-time activity, and (ii) the partitioning keys must be known a priori and
provided by users. Apart from single-table partitioning, [7] recently proposed to
create data blocks in HDFS based on features extracted from each input tuple.
Again, the features are selected based on a workload, and the goal is to cluster tuples
with similar features in the same data block.
Multi-dimensional Indexing: Partitioning has also been considered in the con-
text of indexing. For example, researchers have proposed to partition a B+-Tree [22]
on primary keys. These indexes are typically partitioned on a single attribute. Re-
cently, Teradata proposed multi-level partitioned primary indexes [23]. However, the
partitioned attributes are still based on a query workload and they can be used only
for selection predicates.
Multidimensional indexing has been extensively investigated in the database lit-
erature. Examples include k-d trees, R-trees, and quad-trees. These index struc-
tures are typically used for 2-dimensional spatial data. The octree, which divides
space into octants, is used with 3-dimensional data, such as 3D graphics. Several
other binary search trees have been proposed in the literature, such as splay tree [24].
Recent approaches layer multidimensional index structures over distributed data in
large clusters. This includes SpatialHadoop [25], MD-HBase [26], and epiC [27], or
adapting the multidimensional index to the workload in TrajStore [28]. However, all
of these multidimensional indexing approaches typically consider data locality and
2-dimensional spatial data.
Adaptive Indexing: Adaptive indexing techniques, such as database cracking [29,
10, 30, 11, 31, 32, 33], have been successful in single-node in-memory column stores.
Cracking adaptively builds an index as queries are processed. This is done by using
the selection predicate in each incoming query as a hint to recursively split the dataset
into finer-grained partitions. As a result, cracking piggybacks on query processing to
amortize the cost of indexing over a sequence of queries. Cracking happens on each
query and maintains additional structures to create the index. The reason cracking
cannot be applied to a distributed data store is that the cost of re-partitioning is very
high. Each round of re-partitioning needs to be carefully planned to amortize the
cost associated with it.
Chapter 3
System Overview
The system exposes a relational storage manager, consisting of a collection of tables.
A query to the system is of the form <table, (filter predicates)>, for example
<employee, (age > 30, 100 ≤ salary ≤ 200)>. As the table is stored partitioned based
on the table's partitioning tree, Amoeba is able to answer the query by accessing
only the relevant data blocks. Database query optimizers [34] can push selection
predicates past group-by and join operators down to the table. The table scan along
with the filters forms the input to Amoeba.
Figure 3-1: Amoeba Architecture
Figure 3-1 shows the overall architecture of Amoeba. The three key components
of the Amoeba storage system are as follows:
(i) Upfront partitioner. The upfront partitioner partitions a dataset into blocks and
spreads them throughout a block-based file system. The blocks are created based
on attributes in the dataset, without requiring a query workload. As a result, users
immediately get improved performance on ad-hoc workloads.
(ii) Storage Engine. The storage engine builds on top of a block-based storage system
to store tables. Each table represents a dataset loaded using the upfront partitioner.
The table contains an index file which stores the partitioning tree used to partition
the dataset and the partitioned dataset as a collection of data blocks. In addition,
we also store a query log containing the most recent queries that accessed the dataset
and a sample of the dataset whose use is described later.
(iii) Adaptive Query Executor. The adaptive query executor takes queries in the
form of a predicated scan and returns the matching tuples. Since Amoeba
internally stores the data partitioned by the partitioning tree, it is able to skip many
data blocks while answering queries. The query first goes to the optimizer. When
the data blocks accessed by the query are not perfectly partitioned, the optimizer
considers repartitioning some or all of the accessed data blocks as they are accessed
by the query, using the query predicates as cut-points. We use a cost model to
evaluate the expected cost and benefit of repartitioning. Due to de-clustering, we
end up performing random I/Os for each data block. This is acceptable: large block
sizes in distributed file systems [35] combined with fast network speeds make remote
reads almost as fast as local reads (see Appendix A). We sacrifice some data locality
in order to quickly locate the relevant portions of the data on each machine in a
distributed setting.
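The shape of such a cost/benefit check can be sketched as follows. This is a purely hypothetical illustration: the function name, parameters, and the linear cost model are all assumptions, not Amoeba's actual cost model (which is presented in Chapter 5).

```python
# Hypothetical sketch of an optimizer's repartitioning decision: rewrite a
# set of touched blocks only if the expected savings on future queries
# outweigh the one-time cost of reading and rewriting those blocks.

def should_repartition(blocks_read, blocks_after, block_size_mb,
                       expected_future_queries, write_cost_factor=2.0):
    """Return True if repartitioning is expected to pay off."""
    # One-time cost: read the touched blocks, then write them back out
    # (writes weighted higher to account for replication).
    cost = blocks_read * block_size_mb * (1 + write_cost_factor)
    # Benefit: each future query over this predicate reads fewer blocks.
    saving_per_query = (blocks_read - blocks_after) * block_size_mb
    return expected_future_queries * saving_per_query > cost

# Shrinking 8 touched blocks to 2 does not pay off for a one-off query,
# but does once the predicate is expected to recur.
print(should_repartition(8, 2, 128, expected_future_queries=1))
print(should_repartition(8, 2, 128, expected_future_queries=10))
```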
The Amoeba storage system is self-tuning, lightweight (both in terms of the upfront
and repartitioning costs), and does not increase the storage space requirements.
In the rest of the thesis, we focus on building an efficient predicate-based data access
system. Future work will look at developing query processing algorithms (e.g., join
algorithms) on top of this system.
Chapter 4
Upfront Data Partitioning
A distributed storage system, such as HDFS, subdivides a dataset into smaller chunks,
called blocks. Blocks are created based on size, such that each block except the last is
of B bytes (usually 64MB), and each block gets independently replicated on R machines,
where R is the number of replicas (usually 3). The upload happens in parallel; however,
it is expensive and involves writing out R copies of the dataset to disk across the cluster.
The upfront data partitioning pipeline exploits this block structure to create blocks
based on attributes, i.e., it splits the data by attributes rather than partitioning with-
out regard to the values in each block. This is similar to content-based chunking [36]
and feature-based blocking [7], however our approach does not depend on a query
workload. The key idea is to integrate attribute-based partitioning with data block-
ing, i.e., splitting a dataset into data blocks, in the underlying storage system using a
partitioning tree. This helps Amoeba achieve improved query performance on almost
all ad-hoc queries, compared to standard full scans, without having any information
about the query workload. This partitioning also serves as a good starting point for
the adaptive data repartitioner to improve upon, as discussed in Chapter 5.
We first present the key ideas used in the upfront partitioner. Then, we describe
our partitioning algorithm to come up with a partitioning tree for a given dataset.
4.1 Key Ideas
(1) Balanced Binary Tree. We represent the partitioning tree as a balanced binary
tree, i.e., we successively partition the dataset into two until we reach the minimum
partition size. Each node in the tree is represented as A_p, where A is the attribute being
partitioned on and p is the cut-point. All tuples with A ≤ p go to the left subtree and
the rest go to the right subtree. The leaf nodes in the tree are buckets, each having a
unique identifier and a file name in the underlying file system. This file contains the
tuples that satisfy the predicates of all nodes on the path from the bucket up to
the root of the tree. Note that an attribute can appear in multiple nodes in the tree.
Having multiple occurrences of an attribute in the same branch of the tree increases
the number of ways the data is partitioned on that attribute.
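The write path implied by this rule can be sketched in a few lines. This is an assumed representation for illustration (the `route` helper is hypothetical), using the tree of Figure 1-1.

```python
# Sketch of routing a tuple down a partitioning tree: at each node A_p,
# tuples with A <= p go left and the rest go right, until a leaf bucket
# is reached. Internal nodes are (attr, cut, left, right); leaves are
# bucket ids.

def route(node, tup):
    """Walk a tuple down to its bucket id."""
    while not isinstance(node, int):
        attr, cut, left, right = node
        node = left if tup[attr] <= cut else right
    return node

# The tree of Figure 1-1.
tree = ("A", 4,
        ("B", 5, ("D", 4, 1, 2), ("D", 5, 3, 4)),
        ("C", 6, ("B", 3, 5, 6), ("D", 4, 7, 8)))

# A tuple with A=2, B=5, D=3 satisfies A <= 4, B <= 5, D <= 4: bucket 1.
print(route(tree, {"A": 2, "B": 5, "C": 9, "D": 3}))  # 1
```

By construction, every tuple in a bucket's file satisfies the predicates of all nodes on the path from that bucket to the root, which is exactly the metadata used for skipping at query time.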
(2) Heterogeneous Branching. Figure 4-1(a) shows a partitioning tree analogous to the
k-d tree. This tree can only accommodate as many attributes as the depth of the tree.
For a dataset of size D, minimum partition size P, and n-way partitioning over each
attribute, the partitioning tree contains ⌊log_n(D/P)⌋ attributes. With n = 2, D = 1TB,
and P = 64MB, we can only accommodate 14 attributes in the partitioning tree.
However, many real-world schemas have far more attributes. To accommodate more
attributes, we introduce heterogeneous branching to partition different branches of
the partitioning tree on different attributes. Hence, we sacrifice the best performance
on a few attributes to achieve improved performance over more attributes. This is
reasonable, as without a workload there is no reason to prefer one attribute over
another. Figure 4-1(b) shows a partitioning tree with heterogeneous branching. After
partitioning on attributes A and B, the left side of the tree partitions on C while
the right side partitions on D. Thus, we are now able to accommodate 4 attributes
instead of 3. However, attributes C and D are each partitioned on only 50% of the data.
As a result, ad-hoc queries gain partially, but over all four attributes,
which makes the partitioning robust.
The number of attributes in the partitioning tree, with c as the minimum fraction
of the data partitioned by each attribute, is given as (1/c) · ⌊log_n(D/P)⌋. With n = 2,
D = 1TB, P = 64MB, and c = 50%, the number of
attributes that can be partitioned is 28. Note that the number of attributes that can
Figure 4-1: Partitioning Techniques. (a) Partitioning Tree. (b) Heterogeneous Branching.
be partitioned increases with the dataset size. This shows that with larger dataset
sizes, upfront partitioning is even more useful for quickly finding the relevant portions
of the data.
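The two formulas above can be checked with a short worked calculation (a sketch; variable names are ours):

```python
import math

# A plain tree holds floor(log_n(D / P)) attributes; heterogeneous
# branching with minimum data fraction c per attribute raises this by 1/c.
n, D, P, c = 2, 2**40, 64 * 2**20, 0.5   # 2-way splits, 1TB, 64MB, c = 50%
plain = math.floor(math.log2(D // D * (D // P)))  # log base n, here n = 2
plain = math.floor(math.log2(D // P))
hetero = int((1 / c) * plain)
print(plain, hetero)  # 14 28
```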
(3) Hedging Our Bets. We define the allocation of an attribute i at each node j in the
tree as the number of ways the node partitions that attribute (n_ij) times the fraction
of the dataset this partitioning is applied to (c_ij), i.e., the total allocation of attribute
i is given as:

Allocation_i = Σ_j c_ij · n_ij

Allocation as defined above gives the average fanout of an attribute. For example, in
Figure 4-1(b), attribute B has an allocation of (2 · 0.5 + 2 · 0.5) = 2, while attribute
C has an allocation of (2 · 0.25 + 2 · 0.25) = 1. If we distribute the allocation equally
among all attributes, then the maximum per-attribute allocation for |A| attributes
and b buckets is given as b^(1/|A|). For example, if there are 8 buckets and 3 attributes,
the allocation (average fanout) per attribute is 8^(1/3) = 2.
The key intuition behind our upfront partitioning algorithm is to compute this
maximum per-attribute allocation, and then place attributes in the partitioning tree
so as to approximate this ideal allocation.
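The allocation bookkeeping is simple enough to sketch directly (the helper name is ours, for illustration):

```python
# Sketch of the allocation computation: each occurrence of attribute i at
# node j contributes c_ij * n_ij, the split fanout weighted by the
# fraction of the data reaching that node.

def allocation(occurrences):
    """occurrences: list of (fraction_of_data, fanout) for one attribute."""
    return sum(c * n for c, n in occurrences)

# Figure 4-1(b): B splits 2-way on each half of the data,
# C splits 2-way on each quarter of the data.
print(allocation([(0.5, 2), (0.5, 2)]))    # 2.0  (attribute B)
print(allocation([(0.25, 2), (0.25, 2)]))  # 1.0  (attribute C)

# Ideal per-attribute allocation for b buckets, |A| attributes: b**(1/|A|)
print(round(8 ** (1 / 3), 3))  # 2.0
```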
Figure 4-2: Upfront Partitioning Algorithm Example.

(4) Handling skew and correlations efficiently. Real-world datasets are often skewed
(e.g., recent sales, holiday season shopping, etc.) and have attributes that are cor-
related (e.g., state and zipcode). As a result, if we uniformly partition by attribute
value, some branches of the partitioning tree could have much more data than others,
resulting in unbalanced final partitions. We would then lose the benefit of partition-
ing due to either very small or very large partitions. To illustrate, consider a skewed
dataset D1 = {1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 7, 8} and two partitionings, one based on the
domain values and the other on the median value.
Pdomain(D1) = [{1, 1, 1, 2, 2, 2}, {3, 4}, {5, 6}, {7, 8}]
Pmedian(D1) = [{1, 1, 1}, {2, 2, 2}, {3, 4, 5}, {6, 7, 8}]
We can see that Pdomain(D1) is clearly unbalanced whereas Pmedian(D1) produces bal-
anced partitions. Likewise, to illustrate the effect of correlations, consider a dataset
D2 with two attributes country name and salary :
D2(country, salary) ={(X, $25), (X, $40), (X, $60), (X, $80), (Y, $600),
(Y, $700), (Y, $850), (Y, $950)}
Pdomain(D2) =[{(X, $25), (X, $40), (X, $60), (X, $80)}, {(Y, $600),
(Y, $700), (Y, $850), (Y, $950)}]
Pmedian(D2) =[{(X, $25), (X, $40)}, {(X, $60), (X, $80)}, {(Y, $600),
(Y, $700)}, {(Y, $850), (Y, $950)}]
Partitioning D2 on country followed by salary results in only two partitions when
splitting the salary over its domain (0 through 1000). This is because the salary
distributions are correlated with the country. However, we get all four partitions
when using the medians successively on each node.
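The contrast on D1 can be reproduced with a short sketch (the two split helpers are ours, for illustration; `split_domain` uses equi-width ranges over the value domain, `split_median` splits recursively at the median):

```python
# Equi-width domain splits vs. recursive median splits on the skewed D1.

def split_domain(data, parts):
    lo, hi = min(data), max(data)
    width = (hi - lo + 1) / parts
    out = [[] for _ in range(parts)]
    for v in sorted(data):
        out[min(int((v - lo) / width), parts - 1)].append(v)
    return out

def split_median(data, parts):
    data = sorted(data)
    if parts == 1:
        return [data]
    mid = len(data) // 2          # cut at the median of this branch
    return split_median(data[:mid], parts // 2) + \
           split_median(data[mid:], parts // 2)

D1 = [1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 7, 8]
print(split_domain(D1, 4))  # [[1,1,1,2,2,2], [3,4], [5,6], [7,8]]
print(split_median(D1, 4))  # [[1,1,1], [2,2,2], [3,4,5], [6,7,8]]
```

The first partitioning leaves one oversized partition, while the median-based one yields four equal-sized partitions, matching P_domain(D1) and P_median(D1) above.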
Algorithm 1: UpfrontPartitioning
Input : Attribute[] attributes, Int datasetSize, Int partitionSize
Output: Tree partitioningTree
1 numBuckets ← datasetSize / partitionSize;
2 depth ← log2(numBuckets);
3 foreach a in attributes do
4     allocation[a] ← nthroot(numBuckets, size(attributes));
5 root ← CreateNode();
6 CreateTree(root, depth, allocation);
7 return root;
Amoeba avoids this problem by performing a breadth-first traversal while con-
structing the partitioning tree. At each node, we place the attribute which has the
maximum allocation remaining. Once the attribute is chosen, we sort the data ar-
riving at the node on the attribute and choose the median as the pivot. We split
the data on the pivot and proceed to assign (attribute, pivot) for the left and right
child nodes. Finding the median in the data handles skew and correlation between
attributes, thereby ensuring that child nodes get equal portions of data. This leads
to balanced partitions in the end. In order to find the medians efficiently, we first
make one pass over the data to construct a sample and then find the median at each
node using the sample. We refer to Chapter 6 for more details on implementation.
4.2 Upfront Partitioning Algorithm
We now describe our upfront partitioning algorithm. The goal of the algorithm is
to generate a partitioning tree, which balances the benefit of partitioning across all
attributes in the dataset. This means that same selectivity predicates on any two
attributes X and Y should have similar speed-ups, compared to scanning the entire
dataset. Notice that this is different from a k-d tree [37] which typically partitions
the space by considering the attributes in a round robin fashion, until the smallest
partition size is reached. Before we describe the algorithm, let us first look at the key
ideas in our upfront partitioning algorithm.
Algorithm 1 shows the upfront partitioning algorithm, which takes in the set
Algorithm 2: CreateTree
Input : Tree tree, Int depth, Int[] allocation
1  Queue nodeQueue ← {tree.root};
2  while nodeQueue.size > 0 do
3      node ← nodeQueue.pollFirst();
4      if depth > 0 then
5          node.attr ← leastAllocated(allocation);
6          node.value ← findMedian(node.attr);
7          node.left ← CreateNode();
8          node.right ← CreateNode();
9          allocation[node.attr] -= 2.0 / 2^(maxDepth − depth);
10         nodeQueue.add(node.left);
11         nodeQueue.add(node.right);
12         depth -= 1;
13     else
14         node ← newBucket();
of attributes, the dataset size, and the smallest partition size¹, and produces the
partitioning tree. The algorithm computes the ideal allocation for each attribute and
then calls createTree on the root node. Note that we could also consider relative
weights of attributes when computing the ideal allocation for each attribute, in case
some attributes are more likely to be queried than others. Algorithm 2 shows the
createTree function. It performs a breadth-first traversal and assigns an attribute
to each node. The attribute to be assigned is given by the function leastAllocated,
which returns the attribute which has the highest allocation remaining. If two or more
attributes have the same highest allocation remaining, we randomly choose among
the ones that have occurred the least number of times in the path from the node to
the root. findMedian returns the median of the attribute assigned to this node. This
is done by finding the median in the sampled data which comes to this branch. The
algorithm starts with an allocation of 2 for the root node, since we are partitioning
the entire dataset into two partitions. Each time we go to the left or the right subtree,
we reduce the data we operate on by half. Once an attribute is assigned to a node, we
subtract from the overall allocation of the attribute (Line 9). The algorithm creates
a leaf-level bucket in case we reach the maximum depth (Line 14).
¹For HDFS, we take the block size as the smallest partition size.
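As a concrete sketch, Algorithms 1 and 2 can be approximated in a few lines of Python. This is a simplified, single-machine sketch: the `Node` class and dictionary-based sample rows are illustrative, the random tie-breaking among equally allocated attributes is omitted, and each node's depth is tracked explicitly in the queue.

```python
import math
from collections import deque

class Node:
    def __init__(self):
        self.attr = None    # attribute this internal node partitions on
        self.value = None   # median pivot (None for leaf-level buckets)
        self.left = None
        self.right = None

def create_tree(root, max_depth, allocation, sample):
    # Breadth-first traversal: each node takes the attribute with the
    # highest remaining allocation and splits on its sample median.
    queue = deque([(root, max_depth, sample)])
    while queue:
        node, depth, rows = queue.popleft()
        if depth == 0 or not rows:
            continue                       # leaf-level bucket
        node.attr = max(allocation, key=allocation.get)
        vals = sorted(r[node.attr] for r in rows)
        node.value = vals[len(vals) // 2]  # median pivot from the sample
        allocation[node.attr] -= 2.0 / 2 ** (max_depth - depth)
        node.left, node.right = Node(), Node()
        left = [r for r in rows if r[node.attr] <= node.value]
        right = [r for r in rows if r[node.attr] > node.value]
        queue.append((node.left, depth - 1, left))
        queue.append((node.right, depth - 1, right))

def upfront_partitioning(attributes, dataset_size, partition_size, sample):
    # Algorithm 1: compute the per-attribute allocation, then build the tree.
    num_buckets = dataset_size // partition_size
    depth = int(math.log2(num_buckets))
    allocation = {a: num_buckets ** (1.0 / len(attributes)) for a in attributes}
    root = Node()
    create_tree(root, depth, allocation, sample)
    return root
```

Using the sample median as the pivot at each node is what keeps the resulting buckets balanced even under skewed or correlated attribute distributions.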
Figure 4-2 illustrates the steps when assigning attributes in a partitioning tree
over 4 attributes and 8 buckets (dataset size = 8 × 64 MB = 512 MB) and the allocation
remaining per attribute at each step. The algorithm starts in Step (i) with the root
node and performs a breadth-first traversal. Once attribute A is assigned in Step (i),
it has the least allocation remaining (in fact, we have used up all the allocation for
attribute A) and it is excluded from the possible options in Step (ii). We continue
placing attributes with the minimum allocation even after all four attributes have
been placed once in Step (iv). At the end of Step (v), attributes B, C, and D have
the same allocation remaining. For the next nodes in Steps (vi) and (vii), since C
already occurs on the path from the nodes to the root, we randomly choose between
B and D.
Chapter 5
Adaptive Repartitioning
In the previous chapter, we described an algorithm to partition a dataset on several
(or all) attributes, without knowing the query workload. However, as users begin to
query the dataset, it is beneficial to improve the partitioning based on the observed
queries. As in most online algorithms, we assume that the queries seen thus far are
indicative of the queries to come. Amoeba does this by doing transformations on
the partitioning tree based on the observed data access patterns. The key features of
the adaptive repartitioning module of Amoeba are:
• Piggybacked, meaning it interleaves with query processing and only accesses
the data which is read by the input query, i.e., we do not access data that is not
read by queries during re-partitioning. This has two benefits: (i) we never
spend any effort re-partitioning data that will not be touched by any query,
and (ii) the query processing and re-partitioning modules share the scan, thereby
reducing the cost of re-partitioning.
• Transparent, as users do not have to make any repartitioning decisions
themselves, and their queries remain unchanged as the underlying data layout evolves.
• Eventually convergent, meaning it converges to a final partitioning if a fixed
workload is repeated over and over again.
• Lightweight, as it does not penalize any query with high repartitioning costs: it
distributes the costs over several queries.
• Balanced between adaptivity and robustness, meaning the system tries to stabilize
newly made repartitioning decisions while expiring older ones.
In the rest of this chapter, we first describe our workload monitor and the cost
model to estimate the cost of a query over a given partitioning tree. Then, we
introduce three basic transformations used to transform a given partitioning tree.
We describe a bottom-up algorithm to consider all possible alternatives generated
from the transformation rules for inserting a single predicate. Finally, we discuss how
to handle multi-predicate queries.
5.1 Workload Monitor and Cost Model
Amoeba maintains a history of the queries seen by the system. We call this the query
window, denoted by W. Each incoming query is added to the query window as
⟨t, q⟩, where t is the current timestamp and q is the query. We do not directly
restrict the size of the query window; instead, we restrict the window to contain only
queries that happened in the past X hours. The intuition behind this is that
older queries are stale and no longer representative of the queries to come. In all
our evaluations, we set X = 4 hours. The cost of a query q over a partitioning tree T
is given as:

    Cost(T, q) = Σ_{b ∈ lookup(T, q)} n_b

where the function lookup(T, q) gives the set of relevant buckets for query q in
T and n_b denotes the number of tuples in bucket b. The cost of the query window is
the sum of the costs of the individual queries. A query being executed may have some
of its buckets re-partitioned. The added cost of repartitioning a set of buckets B is
given as:

    RepartitioningCost(T, q) = Σ_{b ∈ B} c · n_b
The important parameter to note here is c, the write multiplier, i.e.,
how expensive writes are compared to reads. Changing c alters the properties of the
system: at one extreme, setting c = ∞ makes it imitate a system with no repartitioning;
at the other, setting c = 0 makes it re-partition the data every time it sees a benefit.
5.2 Partitioning Tree Transformations
We now describe a set of transformation rules to explore the space of possible plans
when re-partitioning the data. For now, we restrict ourselves to the problem of
exploring alternate trees for a query with a single predicate of the form A ≤ p,
denoted as Ap. Later, in Section 5.4, we discuss how to handle other predicate forms and
how to add multiple predicates into the tree.
Given a query with predicate Ap, we attempt to assign Ap to one of the nodes
(partition on A with cutpoint p) in the partitioning tree. Note that we do not have
to consider partitioning on any other attribute, i.e., predicates on other attributes
would have already been considered when there was a query on that attribute. Our
approach is to consider partitioning transformations that are local, i.e., that do not
involve rewriting the entire tree. These local transformations are cheaper and amortize
the repartitioning effort over several queries. Below, we first describe the three
kinds of partitioning transformations considered during repartitioning; we then
present our algorithm for deciding when to apply these transformations in order to
improve query performance. Amoeba considers three kinds of basic partitioning
transformations:
(1) Swap. This is the primary data-reorganization operator in Amoeba. It replaces
an existing node in the partitioning tree with the incoming query predicate Ap. As we
repartition only the accessed portions of the data, we consider swapping only those
nodes whose left and right children are accessed by the incoming query. Applying
swap on an existing node involves reading both sub-branches, and restructuring all
partitions beneath the left subtree to contain data satisfying Ap and the right sub-
tree to contain data that does not satisfy Ap. Swaps can happen between different
attributes (Figure 5-1(a)), in which case both branches are completely rewritten in
the new tree. Swaps can also happen between two predicates of the same attribute
(Figure 5-1(b)), in which case the data moves from one branch to the other.
Figure 5-1: Node swap in the partitioning tree. (a) Different Attribute Swap; (b) Same Attribute Swap.
For example, if predicate Ap′ is ≤ 10 and predicate Ap is ≤ 5, then data moves
from the left branch to the right branch in Figure 5-1(b), i.e., the left branch is
completely rewritten while the right branch just has new data appended. Swaps serve the
dual purpose of un-partitioning an existing (less accessed) attribute while refining on
another (more accessed) attribute. Since both the swap attributes as well as their
predicates are driven by the incoming queries, they reduce the access times for the
incoming query predicates. Finally, note that it is cheaper to apply swaps at lower
levels in the partitioning tree since less data is rewritten. Applying them at higher
levels of the tree results in a much higher cost.
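The physical effect of a swap on the data beneath a node can be sketched as follows. This is a simplified illustration under assumptions of our own: records are modeled as Python dictionaries rather than the actual byte-buffer tuples, and we show only the re-splitting of the accessed data.

```python
def swap(records, attr, pivot):
    """Physically re-split the data under a swapped node by the new
    predicate attr <= pivot. Both child branches are rewritten: the left
    child receives the tuples that satisfy the predicate, the right child
    the rest."""
    left = [r for r in records if r[attr] <= pivot]
    right = [r for r in records if r[attr] > pivot]
    return left, right
```

This is why a swap is only considered when both children of the node are accessed by the incoming query: the re-split needs to read all tuples under the node anyway, so the repartitioning work piggybacks on the query's scan.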
(2) Pushup. This transformation is used to push a predicate as high up the tree as
possible. This can be done when both the left and the right child of a node contain
the incoming predicate, as a result of a previous swap, as shown in Figure 5-3. Notice
that this is a logical partitioning tree transformation, i.e., it only involves rearranging
the internal nodes without any modification of the contents of leaf nodes¹.
We check for a pushup transformation every time we perform a swap transformation.
The idea is to move important predicates (ones that have recently or frequently ap-
peared in the query sequence) progressively up the partitioning tree, from the leaves
right up to the root. This makes such important predicates less likely to be swapped
immediately, i.e., the tree is still robust, because swapping a node higher in the par-
titioning tree is much more expensive. Another advantage of node pushup is that it
causes a churn of the attributes assigned to higher nodes in the upfront partitioning.
When such a dormant node is pushed down, subsequent predicates can swap them
in a more incremental fashion, affecting fewer branches. Overall, node pushup allows
Amoeba to naturally cause less important attributes to be repartitioned more fre-
quently, thereby striking a balance between adaptivity and robustness. Note that if
¹In this case, the physical transformation, i.e., the swap, must have happened in one of the child subtrees.
[Figure 4-2: Building the partitioning tree over attributes = {A, B, C, D} with 8 buckets (depth = 3). Allocation remaining per attribute after each step:

           init    (i)     (ii)    (iii)   (iv)    (v)     (vi)    (vii)
alloc[A]   1.189  -0.811  -0.811  -0.811  -0.811  -0.811  -0.811  -0.811
alloc[B]   1.189   1.189   0.189   0.189   0.189   0.189  -0.311  -0.311
alloc[C]   1.189   1.189   1.189   0.189   0.189   0.189   0.189   0.189
alloc[D]   1.189   1.189   1.189   1.189   0.689   0.189   0.189  -0.311 ]

Figure 5-2: Illustrating adaptive partitioning when predicate A2 appears repeatedly.
Figure 5-3: Node pushup in the partitioning tree.
possible, a pushup always happens as there is no cost associated with doing it.
(3) Rotate. The rotate transformation rearranges two predicates on the same
attribute such that the more important predicate (recently accessed or frequently
appearing in the query sequence) appears higher up in the partitioning tree. Figure 5-4 shows
a rotate transformation involving predicates p and p′ on attribute A. The goal here
is to churn the partitioning tree such that predicates on less important attributes are
likely to be replaced first. Similar to the pushup transformation, rotate is a logical
transformation, i.e., it only rearranges the internal nodes of the partitioning tree and
always happens if possible.
Figure 5-4: Node rotation in partitioning tree.
These three partitioning tree transformations can be further combined to capture
a fairly general set of repartitioning scenarios. Figure 5-2 shows how, starting from
an initial partitioning tree, we first swap the D4 nodes with the incoming predicate A2 at the
lower level. Then, we push A2 up one level and finally rotate it with nodes A5
and C3.
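Since pushup and rotate are purely logical, both can be expressed as a handful of pointer rearrangements, as the sketch below illustrates. The `N` node class is hypothetical; the leaves stand in for entire subtrees, whose contents are untouched.

```python
class N:
    def __init__(self, attr, value, left, right):
        self.attr, self.value, self.left, self.right = attr, value, left, right

def pushup(x):
    """Both children of x split on the same (attr, pivot): hoist that
    predicate above x. Purely logical -- leaf contents are untouched."""
    l, r = x.left, x.right
    assert (l.attr, l.value) == (r.attr, r.value)
    satisfying = N(x.attr, x.value, l.left, r.left)    # tuples satisfying the predicate
    violating = N(x.attr, x.value, l.right, r.right)   # tuples violating it
    return N(l.attr, l.value, satisfying, violating)

def rotate(node):
    """node splits on A <= p and its left child on A <= p' (p' < p):
    bring the more important predicate p' to the top."""
    l = node.left
    assert l.attr == node.attr and l.value < node.value
    return N(l.attr, l.value, l.left,
             N(node.attr, node.value, l.right, node.right))
```

In both cases every tuple still ends up in the same leaf region it belonged to before, which is exactly why no data needs to be rewritten.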
The upfront partitioning algorithm generates a partitioning tree which is fairly
balanced, i.e., all the leaf nodes get almost the same number of tuples. Swapping nodes
based on predicates from incoming queries may lead to some leaves having more tuples
than the rest. This is not a problem: our cost model ensures that such skew
arises only when it is beneficial. For example, if many queries access A ≤ 0.75, where
Transformation | Notation                  | Cost (C)                                     | Benefit (B)
Swap           | Pswap(n, n′)              | Σ_{b ∈ Tn} c · n_b                           | Σ_{i=0}^{k} [Cost(Tn, q_i) − Cost(Tn′, q_i)]
Pushup         | Ppushup(n, nleft, nright) | C(Pptop(nleft)) + C(Pptop(nright))           | B(Pptop(nleft)) + B(Pptop(nright))
Rotate         | Protate(p, p′)            | C(Pptop(nleft|right)), for p′ on nleft|right | B(Pptop(nleft|right)), for p′ on nleft|right
None           | Pnone(n)                  | C(PBest(nleft)) + C(PBest(nright))           | B(PBest(nleft)) + B(PBest(nright))

Table 5.1: The cost and benefit estimates for the different partitioning tree transformations.
A is uniformly distributed in (0, 1), it is beneficial to add this node into the tree even
though it might lead to skew. In the next section, we describe how we generate
alternate partitioning trees using these transformations.
5.3 Divide-And-Conquer Repartitioning
Given a query with predicate Ap and a partitioning tree T, there are many different
combinations of transformations that need to be considered. Consider, for example, a
simple 7-node tree, consisting of a root node X with two children Y and Z. Each of
Y and Z has two leaf nodes below it. Assuming all the leaf nodes are accessed,
the set of alternatives to be considered is: (i) Swap Y with Ap; (ii) Swap Z with
the set of alternatives to be considered are: (i) Swap Y with Ap; (ii) Swap Z with
Ap; (iii) Swap both Y and Z, followed by pushup; (iv) Swap X with Ap; and (v) Do
nothing.
We propose a bottom-up approach to explore the space of all alternate repartitioning
trees. Observe that the data access costs over a partitioning tree Tn, rooted
at node n, could be broken down into the access costs over its subtrees, i.e.,

    Cost(Tn, qi) = Cost(Tnleft, qi) + Cost(Tnright, qi)

where Tnleft and Tnright are the subtrees rooted respectively at the left and the right
are subtrees rooted respectively at the left and the right
child of n. Thus, finding the best partitioning tree can be broken down into recursively
finding the best left and right subtrees at each level, and considering parent node
transformations only on top of the best child subtrees. For each transformation, we
consider the benefit and cost of that transformation and pick the one which has the
best benefit-to-cost ratio. Table 5.1 shows the cost and benefit estimates for different
transformations. For the swap transformation, denoted as Pswap(n, n′), we need to
recalculate the query costs. However, pushup and rotate transformations, denoted
Algorithm 3: getSubtreePlan
Input : Node node, Predicate pred
Output: Plan transformPlan
1  if isLeaf(node) then
2      return Pnone(node);
3  else
4      if isLeftRelevant(node, pred) then
5          leftPlan ← getSubtreePlan(node.lChild, pred);
6      if isRightRelevant(node, pred) then
7          rightPlan ← getSubtreePlan(node.rChild, pred);
       /* consider swap */
8      if leftPlan.fullyAccessed and rightPlan.fullyAccessed then
9          currentCost ← Σ_i Cost(node, q_i);
10         whatIfNode ← clone(node);
11         whatIfNode.predicate ← newPred;
12         swapNode(node, whatIfNode);
13         newCost ← Σ_i Cost(whatIfNode, q_i);
14         benefit ← currentCost − newCost;
15         if benefit > 0 then
16             updatePlanIfBetter(node, Pswap(node, whatIfNode));
       /* consider pushup */
17     if leftPlan.ptop and rightPlan.ptop then
18         updatePlanIfBetter(node, Ppushup(node, node.lChild, node.rChild));
       /* consider rotate */
19     if node.attribute == pred.attribute then
20         if leftPlan.ptop then
21             updatePlanIfBetter(node, Protate(node, node.lChild));
22         if rightPlan.ptop then
23             updatePlanIfBetter(node, Protate(node, node.rChild));
       /* consider doing nothing */
24     updatePlanIfBetter(node, Pnone(node));
25     return node.plan;
as Ppushup(n, nleft, nright) and Protate(p, p′) respectively, inherit the costs from the
child subtrees. We also consider applying none of the transformations at a given node,
denoted as Pnone(n). This divide-and-conquer approach helps to significantly reduce
the candidate set of modified partitioning trees.
Given a query with a single predicate Ap, we call getSubtreePlan(root, Ap). The
algorithm uses a divide-and-conquer approach to find the best plan for the given
predicate by recursively finding the best plan for each subtree, until we reach the
leaf. If the node is a leaf, we return a do-nothing plan (Lines 1-2). If not, we first
check if the left subtree is accessed; if yes, we recursively call getSubtreePlan to find
the best plan for the left subtree (Lines 4-5), and similarly for the right subtree
(Lines 6-7). Once we have the best plans for the left and right subtrees, we first
consider the swap rule (Lines 8-16). We only consider swapping if both subtrees are
fully accessed; otherwise, we would need to access additional data in order to create
the new partitioning.
We perform a what-if analysis to analyze the plan produced by the swap transfor-
mation. This is done by replacing the current node with a hypothetical node having
the incoming query predicate. We then recalculate the new bucket counts at the leaf
level of this new tree using the sample. We now estimate the total query cost with the
hypothetical node present. In case the what-if node reduces the query costs, i.e., it
has benefits, we update the transformation plan of the current node. The update
method (updatePlanIfBetter) checks whether the benefit-cost ratio of the new plan
is greater than that of the best plan so far. If so, we update the best plan. The
benefit-cost ratio is used to compare the alternative plans.
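A minimal sketch of the benefit-to-cost comparison inside updatePlanIfBetter might look as follows. The `Plan` fields are illustrative assumptions; the actual implementation tracks more state per plan.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    action: str       # e.g. "swap", "pushup", "rotate", "none"
    benefit: float    # estimated reduction in query-window cost
    cost: float       # estimated repartitioning cost (c * tuples rewritten)

@dataclass
class PlanHolder:
    plan: Plan = None

def ratio(plan):
    # Zero-cost (purely logical) plans with positive benefit dominate.
    if plan.cost == 0:
        return float("inf") if plan.benefit > 0 else 0.0
    return plan.benefit / plan.cost

def update_plan_if_better(node, candidate):
    # Keep whichever plan has the higher benefit-to-cost ratio.
    if node.plan is None or ratio(candidate) > ratio(node.plan):
        node.plan = candidate
```

Treating zero-cost logical transformations as having unbounded ratio matches the text's rule that a pushup or rotate always happens when possible.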
Next, we check whether a pushup transformation is possible (Lines 17-18). plan.ptop
indicates whether the plan results in the predicate p being at the root of the subtree. A
pushup transformation is possible only when both child nodes have their root
as Ap. Since pushup is only a logical transformation, we do not need to perform a
what-if analysis. Instead, we simply check whether the pushup results in a better
benefit-cost ratio in the updatePlanIfBetter method. Then, we consider rotating
the node by bringing either the left or the right child up (Lines 19-23). Again, this is
a logical transformation and therefore we only check the benefit-cost ratio. Finally,
we check whether no transformation is needed, i.e., we simply inherit the transformations
of the child nodes (Line 24). The algorithm finally returns the best plan in
Line 25.
A plan contains the action taken at the node, ptop to indicate whether Ap is the
root after the plan is applied, fullyAccessed to indicate whether the entire subtree is
accessed, and pointers to the plans for the left and right child nodes. For the sake of brevity,
Algorithm 3 does not explicitly show the update of these attributes. The algorithm
Algorithm 4: getBestPlan
Input : Tree tree, Predicate[] predicates
1  while predicates ≠ ∅ do
2      prevPlan ← tree.plan;
3      foreach p in predicates do
4          Plan newPlan ← getSubtreePlan(tree.root, p);
5          updatePlan(tree.root, newPlan);
6      if tree.plan ≠ prevPlan then
7          remove from predicates the newly inserted predicate;
8      else
9          break;
10 return tree.plan;
has a runtime complexity of O(QN log N), where N is the number of nodes in the tree
and Q is the number of queries in the query window.
5.4 Handling Multiple Predicates
So far, we have assumed that a predicate is always of the form A ≤ p. It gets inserted
into the tree as Ap and, on insertion, only the leaf nodes on the left side of the tree are
accessed. A > p is also inserted as Ap, with the right side of the tree being accessed.
For A ≥ p and A < p, let p′ be p − δ, where δ is the smallest change for that type.
We insert Ap′ into the tree. A = p is treated as a combination of A ≤ p and A > p′.
Now let us consider queries with multiple predicates. Consider a simple query with
two predicates Ap and Ap2. The brute-force approach is to consider choosing a set
of accessed non-terminal nodes to be replaced by Ap and then, for every such choice,
choosing a subset of the remaining nodes to be replaced by Ap2. Thus, the number of choices
grows exponentially with the number of predicates. Amoeba uses a greedy approach
to work around this exponential complexity, as described in Algorithm 4. For each
predicate in the query, we try to insert the predicate into the tree. We find the best
plan for that predicate by calling getSubtreePlan(root, pi) for the ith predicate (Lines
3-5). We take the best among the best plans obtained for the different predicates and
remove the corresponding predicate from the predicate set. We then try to insert the
remaining predicates into the best plan obtained so far. The algorithm stops when
either all predicates have been inserted or the tree stops changing (Lines 1 and
10). getBestPlan adds a multiplicative complexity of O(|P|²), where P is the set of
query predicates.
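The rewriting of the other predicate forms into A ≤ p cut points can be sketched as follows. This helper is illustrative rather than the actual implementation; δ is passed in explicitly (e.g. 1 for integers), and the returned "side" records which half of the split the query must read.

```python
def normalize(attr, op, p, delta):
    """Rewrite a predicate into (attr, pivot, side) cut points of the
    form attr <= pivot. delta is the smallest step for the attribute's
    type (e.g. 1 for integers)."""
    if op == "<=":
        return [(attr, p, "left")]
    if op == ">":
        return [(attr, p, "right")]
    if op == ">=":
        return [(attr, p - delta, "right")]   # A >= p  ==  A > p - delta
    if op == "<":
        return [(attr, p - delta, "left")]    # A < p   ==  A <= p - delta
    if op == "==":
        # A == p is the combination of A <= p and A > p - delta.
        return [(attr, p, "left"), (attr, p - delta, "right")]
    raise ValueError("unsupported operator: " + op)
```

Every predicate thus maps onto one or two candidate tree nodes of the canonical Ap form, which is what the insertion algorithms above operate on.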
Chapter 6
Implementation
We now describe our implementation of Amoeba on top of HDFS and Apache Spark.
Notice that our ideas could be implemented on any other block-based distributed
storage system. The Amoeba storage system has more than 12,000 lines of code
and comprises two modules: (i) a robust partitioning module that parses data
and writes out the initial partitions; and (ii) a query executor module that performs the
distributed adaptive repartitioning.
6.1 Initial Robust Partitioning
This module takes in the raw input files (e.g., CSV) and partitions them across all
attributes. For this, it first builds the robust partitioning tree and then creates the
data blocks based on the partitioning tree.
Tree Construction. Recall that our robust partitioning algorithm needs to find
the median value of the partitioning attribute at each node of the tree. As finding
successive medians on the entire dataset is expensive, we instead employ a sample of
the input to estimate the median. We use block-level sampling to generate a random
sample of the input.
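A block-level sample can be sketched as a two-level random selection: pick a random subset of blocks, then a random subset of records inside each chosen block. This is an illustrative sketch (blocks are in-memory lists here; the real implementation reads HDFS blocks), and the fractions and seed are assumptions.

```python
import random

def block_sample(blocks, block_fraction, record_fraction, seed=42):
    """Two-level sample: choose a random subset of blocks, then a random
    subset of records inside each chosen block. Cheaper than a uniform
    record-level sample because only the chosen blocks are read."""
    rng = random.Random(seed)
    chosen = [b for b in blocks if rng.random() < block_fraction]
    sample = []
    for block in chosen:
        sample.extend(r for r in block if rng.random() < record_fraction)
    return sample
```

The trade-off is the usual one for block-level sampling: it saves I/O but can bias the sample when values are clustered within blocks, which is acceptable here since the sample is only used to estimate medians.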
Amoeba uses a lazy parser for the input sample data. The parser
detects the record boundaries, the attribute boundaries inside a record, and the data
types. Each value in a record is actually parsed only when it is accessed, i.e., lazily.
While Amoeba partitions on all attributes by default, developers can also specify
particular subsets of the attributes to partition on, in case they have some knowledge
of the query workload. In such a situation, only the partitioned attributes need to be
parsed. Finally, the lazy parser avoids copying costs by returning tuples as views on
the input byte buffer. The only time copying happens is when a tuple is written to
an output byte buffer.
The sampled records and set of attributes are fed to the tree builder (Algorithm 2),
which produces the partitioning tree as the output. The tree builder successively sorts
the sample on different attributes in order to find the median at different nodes in the
tree. However, as each sort happens on a different portion of the sample, we sort on
different views of the same underlying data, i.e., the samples are not copied each time
they are sorted. When constructing the partitioning tree on a cluster of machines, we
collect the samples on each machine independently and in parallel. Later, we combine
the samples (via HDFS) and run the tree builder on a single machine to produce a
single partitioning tree across the entire dataset. The index is serialized and stored
as a file on HDFS.
Data Blocking. The second phase takes the partitioning tree and the input files
as input and creates the data blocks. During this phase, we scan the input files and, for
each tuple, use the partitioning tree to find the leaf node it lands in. We use a
buffered blocker to collect the tuples belonging to each partition (leaf node) separately
and buffer them before flushing to the underlying file system, i.e., HDFS in our case.
Our current implementation creates a different HDFS file for each partition in the
dataset. However, future work could also integrate Amoeba deeply within HDFS.
Given that we do not assume any workload knowledge, Amoeba simply de-clusters
the partitioned blocks randomly across the cluster of machines, i.e., we use the default
random data placement policy of HDFS. This is reasonable because fetching relevant
data across the network is fine if we can skip more expensive disk reads, as noted in
the introduction.
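The buffered blocker can be sketched as follows. This is a simplified in-memory illustration: `route` stands in for the partitioning-tree lookup, `flush` for the HDFS write path, and the tiny buffer size is chosen only for the example.

```python
class BufferedBlocker:
    """Routes each tuple to its leaf bucket and buffers tuples per
    partition, flushing a buffer to the file system once it fills up."""

    def __init__(self, route, flush, buffer_size=3):
        self.route = route            # tuple -> partition id (tree lookup)
        self.flush = flush            # (partition id, tuples) -> write out
        self.buffer_size = buffer_size
        self.buffers = {}

    def add(self, tup):
        pid = self.route(tup)
        buf = self.buffers.setdefault(pid, [])
        buf.append(tup)
        if len(buf) >= self.buffer_size:
            self.flush(pid, buf)
            self.buffers[pid] = []

    def close(self):
        # Flush any partially filled buffers at the end of the scan.
        for pid, buf in self.buffers.items():
            if buf:
                self.flush(pid, buf)
        self.buffers.clear()
```

Buffering per partition turns many small per-tuple writes into a few large sequential ones, which is what makes writing one file per partition affordable.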
Amoeba runs the loading and partitioning in parallel on each machine, using
the same partitioning tree (made available to all machines via HDFS). We employ
partition-level distributed exclusive locks (via Zookeeper) to ensure that buffered
writers on different machines don’t write to the same file at the same time. With
the current main memory and CPU capacities, having a file per partition (with the
number of partitions on the order of ten thousand) does not lead to any observable
slowdown [38].
6.2 Query Execution
There are two main parts involved in query execution: (i) creating an execution plan,
which may involve re-partitioning some or all of the data accessed by the query,
and (ii) actually executing the plan.
Optimizer. Queries submitted to Amoeba first go to the optimizer. The optimizer
is responsible for generating an execution plan for the given query. It reads the current
tree file from HDFS and uses Algorithm 4 to check whether it is feasible to improve the
current partitioning tree. Note that while creating the plan, we also end up filtering
out partitions which do not match any of the query predicates. For example, if there
is a node A5 in the tree and one of the predicates in the query is A ≤ 4, then we do not
have to scan any of the partitions in the right subtree of that node. The plan returned is
used to create a new index tree, which is written out to HDFS. From the plan, we now get
two sets of buckets: 1) buckets that will just be scanned, and 2) buckets that will be
re-partitioned to generate a new set of buckets.
Plan Executor. Amoeba uses Apache Spark for executing the queries. We construct
a Spark job from the plan returned by the optimizer. We split each of the
two sets of buckets into smaller sets called tasks. A task contains a set of buckets
such that the sum of the sizes of the buckets is not more than 4 GB. Each task reads
the blocks from HDFS in bulk and iterates over the tuples in main-memory. Tasks
created from set 1 run with a scan iterator which simply reads a tuple at a time from
the buckets and returns the tuple if it matches the predicates in the query.
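The grouping of buckets into tasks can be sketched as a greedy packing pass. This is an illustrative sketch: bucket names and sizes (in bytes) are hypothetical, and the real executor also distinguishes scan tasks from repartitioning tasks.

```python
def make_tasks(bucket_sizes, max_task_bytes=4 << 30):
    """Greedily pack (bucket, size) pairs into tasks so that the total
    size of the buckets in a task stays within the limit (4 GB here).
    A single oversized bucket still gets a task of its own."""
    tasks, current, current_size = [], [], 0
    for bucket, size in bucket_sizes:
        if current and current_size + size > max_task_bytes:
            tasks.append(current)
            current, current_size = [], 0
        current.append(bucket)
        current_size += size
    if current:
        tasks.append(current)
    return tasks
```

Capping the per-task input keeps each task's working set comfortably in a worker's main memory while still giving the scheduler enough tasks to balance across the cluster.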
Tasks created from set 2 run with a distributed repartitioning iterator. The iterator
reads the tree from HDFS. For each tuple, the iterator looks up the new
partitioning tree to find its new partition id, in addition to checking whether it matches
the query predicates. It then re-clusters the data in main memory according to the new
partitioning tree. Once the buffers are filled, the repartitioner flushes the new parti-
tions into physical files on HDFS. When the optimizer decides to repartition a large
subtree, the repartitioning work may end up being distributed across several tasks.
As a result, the workers need to coordinate while flushing the new partitions, i.e., so
that writes are done atomically. Again, we employ partition-level distributed exclu-
sive locks (via Zookeeper) for this synchronization. As a result of this synchronized
writing, each partition resides in a single file across the cluster.
Tasks are executed independently by the Spark job manager across the cluster of
machines and the result is exposed to the user as a Spark RDD. The user can use these
RDDs to do more analysis using the standard Spark APIs, e.g., run an aggregation.
Chapter 7
Discussion
In this chapter, we briefly discuss ideas for improving the performance of the system.
7.1 Leveraging Replication
Distributed storage systems replicate data for fault-tolerance, e.g., 3x replication in
HDFS. Such replication mechanisms first partition the dataset into blocks and then
replicate each block multiple times. Instead, we can first replicate the entire dataset
and then partition each replica using a different partitioning tree.
Figure 7-1: Heterogeneous Replication.
For example, the two replicas could be partitioned on attributes {A, C, D} and {B, E, F},
as in Figure 7-1. While the system is still fault-tolerant (same replication), recovery becomes
slower because we need to read several or all replica blocks in case of a block failure.
Essentially, we sacrifice fast recovery time for improved ad-hoc query performance.
To recap, the number of attributes in the partitioning tree, for a dataset of size D,
minimum partition size P, n-way partitioning over each attribute, and c as the minimum
fraction of the data partitioned by each attribute, is given as:

    a = (1/c) · ⌊log_n(D/P)⌋

Having r replicas allows us to have a · r attributes in total, with a attributes per replica,
or to increase n for each attribute. Both of these lead to improved query performance
due to greater partition pruning. Currently, the system simply splits the attribute set
into disjoint, equal-sized sets of attributes and builds a partitioning tree independently on
each set. There are interesting open questions, such as whether we can group attributes in a
non-random way so as to cluster attributes that are accessed together, and how to adapt
across partitioning trees; we plan to explore these as future work.
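Plugging example values into the formula gives a feel for the numbers involved; the dataset and parameter values below are hypothetical, not taken from the thesis.

```python
import math

def max_attributes(D, P, n, c):
    # a = (1/c) * floor(log_n(D / P)); the small epsilon guards against
    # floating-point results landing just below an integer.
    return (1.0 / c) * math.floor(math.log(D / P, n) + 1e-9)

# Example: a 512 GB dataset, 64 MB minimum partitions, binary splits (n = 2),
# and each attribute partitioning at least half the data (c = 0.5):
a = max_attributes(D=512 * 2**30, P=64 * 2**20, n=2, c=0.5)
```

With these values, D/P = 8192 = 2^13, so a single tree can cover 26 attributes, and r replicas would multiply that budget by r.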
7.2 Handling Joins
In order to accelerate join performance, distributed database systems tend to
co-partition data. Hadoop++ [39] and CoHadoop [40] proposed this scheme of
co-partitioning datasets in HDFS to speed up joins.
Co-partitioning can be achieved in Amoeba as well. The user can reserve the
d top levels in the partitioning tree for the join attribute on which he/she wants to
co-partition. This creates 2d partitions of the join attribute’s domain. The adaptive
query executor would incrementally re-partition to improve query performance based
on the filter predicates but it would not touch the topmost d levels which have been
reserved for co-partitioning.
Finally, the dataset may have multiple join attributes, each of which joins to a
different dataset. Two tables can be co-partitioned; however, a table cannot be
co-partitioned against multiple tables on different attributes. The default approach used
by systems today is the shuffle join, which is much more expensive than a
co-partitioned join as it requires a full network shuffle of the dataset. In Amoeba,
since the dataset is partially partitioned (i.e., not fully co-partitioned), it would be
possible to treat joins as first-class queries, and we could consider adaptively improving
the partitioning on the join attribute incrementally as well. It would converge to a
fully co-partitioned layout if only one join is frequently accessed. We plan to explore
this in the future.
Chapter 8
Evaluation
In this chapter, we report experimental results on the Amoeba storage system.
The experiments are divided into three parts: (i) we examine the benefits
and overheads of upfront data partitioning and compare it against a standard space-
partitioning tree, (ii) we study the behaviour of adaptive repartitioning under
different workload patterns via micro-benchmarks, and (iii) we finally validate the
Amoeba system on a real-world workload from a local startup company.
Setup. Our testbed consists of a cluster of 10 nodes. Each node has 32 2.07 GHz
Xeon cores, running on Ubuntu 12.04, 256 GB main-memory, and 11 TB of disk
storage. We generate the dataset in parallel on each node, so all data loading into
HDFS happens in parallel. The Amoeba storage system runs on top of current
stable Hadoop 2.6.0 and uses Zookeeper 3.4.6 for synchronization. We run queries
using Spark 1.3.1, with Spark programs running on Java 7. All experiments are run
with cold caches.
8.1 Upfront Partitioning Performance
We now study the impact of upfront partitioning. We analyze three aspects:
the benefit of doing upfront data partitioning, the overhead it adds, and finally
how it compares against the k-d tree [37], a standard space-partitioning tree.
We use the lineitem table from the TPC-H benchmark with scale factor 1000.
The table contains approximately 6 billion rows and is 760 GB in size. The table has
16 columns. We use the TPC-H data generator to generate 1/10th of the data on
each machine, hence the data is uniformly distributed across all the machines. After
the upfront partitioning is completed, the data resides in HDFS 3-way replicated,
occupying 2.3 TB of space.
Figure 8-1: Ad-hoc query runtimes for different attributes of TPC-H lineitem.
Ad-hoc Query Processing. We first study ad-hoc query performance. We run
range queries of the following form on all the attributes: SELECT * FROM lineitem
WHERE start < A ≤ end; start and end are chosen randomly in the domain while
ensuring a 5% selectivity. Given that Amoeba distributes the partitioning effort
over all attributes, ad-hoc queries are expected to show a benefit right from the
beginning. Figure 8-1 shows the results. Two observations stand out: (i) Amoeba
(labeled “Robust Partitioning”) bridges the gap between the standard and the ideal
runtimes, and (ii) all attributes benefit similarly with Amoeba. Overall, upfront
partitioning gives a 44% average improvement over a full scan with no partitioning.
Figure 8-1 also shows the runtimes for per-replica robust
partitioning, i.e., when we use a different partitioning tree for each replica. We can
see that per-replica robust partitioning improves the runtimes even further with an
average improvement of 65% over full scan. However, note that attributes such as
returnflag and linestatus still have the same runtime. This is because these are
very low cardinality attributes that cannot be partitioned further. Thus, Amoeba
indeed provides improved query performance over ad hoc workloads without being
given a workload upfront.
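The range predicates used above can be generated as follows; the uniform-domain assumption and the attribute name are ours, purely for illustration.

```python
import random

def range_predicate(domain_min, domain_max, selectivity=0.05, rng=random):
    """Pick a (start, end] range covering `selectivity` of a uniform
    attribute domain. Real lineitem attribute domains are not all
    uniform, so this only approximates the 5% selectivity target."""
    width = (domain_max - domain_min) * selectivity
    start = rng.uniform(domain_min, domain_max - width)
    return start, start + width

start, end = range_predicate(1.0, 51.0)  # e.g. a quantity-like domain
query = (f"SELECT * FROM lineitem "
         f"WHERE {start:.2f} < l_quantity AND l_quantity <= {end:.2f}")
```

Since start is drawn uniformly, repeated queries probe different regions of the domain, which is what makes the workload genuinely ad-hoc.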
Partitioning overheads. Given that partitioning is an expensive operation in
database analytics, we now study the partitioning overheads in Amoeba. We first
Figure 8-2: Comparing the upload time in Amoeba with HDFS
look at the overhead of upfront partitioning, i.e., we compare the data upload time
with Amoeba against the standard data upload time in HDFS. For a fair compar-
ison, Amoeba preserves the same format (row, uncompressed) and layout (text) as
in standard HDFS, i.e., it only differs in how the data is partitioned.
Figure 8-2 shows the data upload costs of Amoeba and standard HDFS for TPC-
H lineitem table (scale factor 1000). From the figure, we can see that the upload time
of Amoeba is 2.6 times that of HDFS. It rises to 3.5 times with per-replica
robust partitioning. This is reasonable given that the entire dataset needs to be clustered
along all 16 dimensions of the lineitem table. Furthermore, the overhead is similar to
other workload-specific data preparation, such as indexing and co-partitioning [41, 39].
Comparison with k-d trees.
Figure 8-3: Comparing performance of the upfront partitioning tree vs. k-d tree
We analyze our upfront partitioning tree in comparison to k-d trees. Specifically,
we implemented a k-d tree which partitions on attributes in a round robin fashion, one
attribute at a time, until the partition size falls below the minimum size. This emu-
lates the standard way of performing data placement in a conventional k-d tree [37].
In contrast, recall that our robust partitioning algorithm places the attributes such
that all attributes have similar partitioning benefits. To measure the robustness of
the tree, we look at the variation in the values of the allocation metric for different
attributes. For our purposes, a tree having less variation in allocation is more robust.
The top of Figure 8-3 shows the allocation for each attribute of TPC-H lineitem. We
can see that the k-d tree has higher allocation for the first seven attributes, while
the remaining nine attributes are not partitioned at all. Robust partitioning, on the
other hand, distributes the allocation more evenly across all attributes. The bottom
of Figure 8-3 shows the coefficient of variation in allocation over different TPC-H scale
factors. The gap between the k-d tree and our approach widens with increasing scale
factor. As expected, we did observe that the k-d tree performs slightly better on the
attributes it is partitioned on, but falls back to full scans for the ones it is not partitioned on.
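The round-robin k-d placement we compare against can be sketched as follows; this is our own in-memory illustration, with partition sizes measured in tuples rather than bytes.

```python
def kd_round_robin(rows, attrs, min_size, depth=0):
    """Recursively split `rows` (a list of tuples) at the median of one
    attribute, cycling through `attrs` in round-robin order, until a
    partition falls below `min_size`. Low-cardinality attributes can
    yield a degenerate split, in which case the partition is kept whole."""
    if len(rows) <= min_size:
        return [rows]
    a = attrs[depth % len(attrs)]
    cut = sorted(r[a] for r in rows)[len(rows) // 2]
    left = [r for r in rows if r[a] <= cut]
    right = [r for r in rows if r[a] > cut]
    if not left or not right:
        return [rows]
    return (kd_round_robin(left, attrs, min_size, depth + 1)
            + kd_round_robin(right, attrs, min_size, depth + 1))
```

Because deep trees exhaust the size budget before the attribute list wraps around, attributes late in the round-robin order never receive a split, which is exactly the skewed allocation seen in Figure 8-3.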
8.2 Micro-benchmarks
We now study the benefits of Amoeba over different workload patterns on the
lineitem dataset (same as in the previous subsection). We run range queries of the fol-
lowing form on all the attributes: SELECT * FROM lineitem WHERE start < A ≤ end;.
We choose start and end such that the selectivity of each query is 5% and select
A uniformly and at random from all attributes. We compare against standard (full
scan) and the ideal (exactly 5% of the data is accessed) runtimes. Notice that this
ideal runtime could only be achieved for a single attribute if we had partitioned the
dataset perfectly on that attribute.
To evaluate how well Amoeba adapts the data to changes in the workload, we
consider two types of workload changes: (i) a shifting workload, which gradually
transitions from one attribute to another, and (ii) a switching workload, which
switches immediately from one workload to another.
Figure 8-4 reports the results for two sets of 20 queries each. The first query set
starts with predicates on discount and shifts towards predicates on shipdate, i.e.,
Figure 8-4: Query runtimes for changing query attributes on TPC-H lineitem.
the probability of predicating on the first (second) attribute decreases (increases) by
1/20th after each query. We can see that the first query set involves a repartitioning
first at query 2 (on discount) and again at query 10 (on shipdate, once the workload
has shifted sufficiently). The second query set starts with 100% of queries predicating
on shipdate and shifts towards 100% of queries predicating on receiptdate. Attribute
shipdate is further refined at query 21, and finally the system repartitions on
receiptdate at query 33, when the workload has again changed sufficiently. The spikes
annotated in the graph involve major reorganization; each of them represents almost
50% of the data being re-partitioned. Some queries also trigger re-partitioning of
small fractions of the data, which shows up as small spikes. The query runtime
approaches the ideal runtime, demonstrating the ability of our system to adapt to
changes in the workload.
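The shifting workload described above can be sketched as a simple generator; the function and attribute names are hypothetical, used only to illustrate the 1/20th-per-query probability shift.

```python
import random

def shifting_workload(attr_a, attr_b, n=20, rng=random):
    """Query i predicates on attr_a with probability 1 - i/n and on
    attr_b otherwise, so the workload gradually shifts from a to b."""
    return [attr_a if rng.random() < 1.0 - i / n else attr_b
            for i in range(n)]

seq = shifting_workload("discount", "shipdate")
```

Early in the sequence almost every query hits the first attribute, so the optimizer refines it; only once the mix tips toward the second attribute does a repartitioning on it pay off, matching the two repartitioning points seen in Figure 8-4(a).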
Figure 8-4(b) shows a switching workload. Here, we start with an attribute
set (quantity, extendedprice, discount) then switch to a second attribute set
(shipdate, receiptdate, shipped) after 20 queries. We again switch back to the
first attribute set after 40 queries. Two interesting things stand out in this experiment:
(i) the system adapts to the workload by quickly repartitioning in each query
set, and (ii) the repartitioning effort in subsequent query sets is lower because of the
larger query history. Also, note that the first attribute set appears twice in the sequence,
and the repartitioning effort is significantly lower the second time. Thus, the Amoeba
storage system can efficiently adapt to changing workloads across sets of attributes.
We now study how Amoeba adapts and eventually converges when a particular
workload is seen more often. In this experiment, we start from an initial robust
partitioning and run queries on a given attribute over and over again. We consider
three workload patterns: (i) random, ad-hoc query predicates with fixed selectivity,
Figure 8-5: Query runtimes for changing predicates on the same attribute of TPC-H lineitem.
(ii) cyclic, with a set of query predicates repeating over and over again, and (iii) drill-
down, where successive predicates narrow down the same data.
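The three access patterns can be sketched as simple predicate generators; these are our own illustrations, reusing the 5% selectivity from the earlier experiments, and the exact narrowing rule for drill-down is an assumption.

```python
import random

def random_pattern(lo, hi, n, sel=0.05, rng=random):
    """Ad-hoc predicates: an independent random 5%-wide range each time."""
    w = (hi - lo) * sel
    return [(s, s + w) for s in (rng.uniform(lo, hi - w) for _ in range(n))]

def cyclic_pattern(lo, hi, n, cycle=5, sel=0.05):
    """A fixed set of `cycle` predicates repeating over and over again."""
    w = (hi - lo) * sel
    base = [(lo + i * w, lo + (i + 1) * w) for i in range(cycle)]
    return [base[i % cycle] for i in range(n)]

def drill_down_pattern(lo, hi, n):
    """Each successive predicate halves the previous range."""
    out = []
    for _ in range(n):
        out.append((lo, hi))
        hi = (lo + hi) / 2.0
    return out
```

Cyclic and drill-down patterns revisit the same regions of the domain, so a few repartitioning steps pay off over the whole sequence, while the random pattern keeps touching fresh regions.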
Figure 8-5 shows the results. We see that while the random query sequence is
slower to converge (it would need many more queries), the cyclic and drill-down query
sequences converge quite fast. In fact, the query times are almost 4 times faster than
a full scan after 13 queries in the cyclic workload, and 10 times faster than a full scan
after 9 queries in the drill-down workload. Furthermore, we see that the system stops
repartitioning the data in these workloads (except in the random case) and
reaches a stable state, confirming that Amoeba converges when a fixed
workload is observed over and over again.
Figure 8-6: Cumulative Optimizer Runtime Across 100 queries
(a) Comparison with k-d Tree (b) Comparison with OptimalFigure 8-7: Cumulative Repartitioning Cost
Apart from the upload overhead, Amoeba also incurs optimization and repartitioning overhead. Figure 8-6 shows the optimizer runtime over a sequence of 100 queries. We show two bars: one when the repartitioning decisions are actually carried out, and one when the data is never actually repartitioned. The optimization time per query is very small (about 5s) compared to the actual runtime of the queries. Finally, Figure 8-7 shows the accumulated repartitioning cost in terms of the number of tuples that are repartitioned. We observe that the repartitioning is largely incremental after the first few queries.
8.3 Amoeba on Real Workload
We obtained data from a Boston-based company that captures analytics about users' driving trips. Each tuple in the dataset represents a trip taken by a user, with the start time, end time, and a number of statistics about the journey, covering various attributes of the driver's speed and driving style. The data consists of a single large fact table with 148 columns. To protect users' privacy, we used statistics provided by the company about the data distributions to generate a synthetic version of the data according to the actual schema. The total size of the data is 705GB. We also obtained a trace of ad-hoc analytics queries from the company (these queries were generated by data analysts performing one-off exploratory queries on the data). The trace consists of 105 queries, run between 04/19/2015 and 04/21/2015, on the trip data. The queries sub-select different portions of the data, e.g., based on trip time range, distances, and customer ids, before producing aggregates and other statistics.
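To make the shape of these trace queries concrete, the following sketch shows the sub-select-then-aggregate pattern they follow. All field names and values here are purely illustrative, not the company's actual schema (the real fact table has 148 columns).

```python
# Illustrative sketch of an ad-hoc trace query: filter trips by a start-time
# range and a minimum distance, then aggregate. Field names and values are
# hypothetical, not drawn from the real dataset.
trips = [
    {"start_time": 1429400000, "distance_km": 12.5, "max_speed_kmh": 88.0},
    {"start_time": 1429500000, "distance_km": 3.2,  "max_speed_kmh": 54.0},
    {"start_time": 1429600000, "distance_km": 47.1, "max_speed_kmh": 112.0},
]

# Sub-select a time range and long trips, then compute an aggregate.
selected = [t for t in trips
            if 1429450000 <= t["start_time"] < 1429650000
            and t["distance_km"] > 10.0]
avg_max_speed = sum(t["max_speed_kmh"] for t in selected) / len(selected)
print(len(selected), avg_max_speed)  # 1 112.0
```

Queries like this touch only a thin slice of the data, which is exactly the access pattern that adaptive partitioning can exploit for data skipping.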
Total time taken (in hrs):

    Amoeba (c = 2)                 3.1
    Amoeba (c = 4)                 2.79
    Spark (with range partition)   4.94
    Spark                          9.4

Figure 8-8: Total runtimes of the different approaches
We compared the performance of Amoeba against two baselines: 1) Spark without any modifications, and 2) Spark with the data already partitioned on upload time, which is the most frequently accessed attribute (accessed in 78% of the queries). The data was split equally across the machines and loaded in parallel using the upfront data partitioner. The queries collected were then run in order.
Figure 8-8 shows the total query runtime for running the 105 queries using the different approaches. To remind the reader, c is the write multiplier (described in Section 5.1). c can be calibrated by measuring the runtime increase due to re-partitioning. For our setup, c is 4, i.e., writing out data (while re-partitioning) is four times more expensive than just scanning it. We observe that range partitioning on upload time reduces the total query runtime by 1.9x. Amoeba initially does worse because it has no knowledge of the workload; by the end, however, it is 1.8x faster than Spark with partitioned data and 3.4x faster than unmodified Spark. The c = 2 setting is more reactive in the sense that it adapts to the query workload faster. Nevertheless, it does slightly worse than the c = 4 setting because it introduces changes to the tree too eagerly: since the workload is ad-hoc, some patterns are one-off, and re-partitioning done to improve them ends up being wasted effort. Finally, we observed that the total runtime of the last 60 of the 105 queries on Amoeba is 19x lower than a full scan and 11x lower than using the range-partitioned data.
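The headline speedups follow directly from the total runtimes in Figure 8-8; a quick sanity check of the arithmetic:

```python
# Total runtimes in hours, read off Figure 8-8.
runtimes = {
    "Spark": 9.4,
    "Spark (range partition)": 4.94,
    "Amoeba (c = 4)": 2.79,
}

def speedup(baseline, system):
    """How many times faster `system` is than `baseline`."""
    return runtimes[baseline] / runtimes[system]

print(round(speedup("Spark", "Spark (range partition)"), 1))       # 1.9
print(round(speedup("Spark (range partition)", "Amoeba (c = 4)"), 1))  # 1.8
print(round(speedup("Spark", "Amoeba (c = 4)"), 1))                # 3.4
```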
Thus, we see that Amoeba is useful for real-world ad-hoc querying workloads,
where we need to quickly access different subsets of data for further analysis.
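The write multiplier c used above can be calibrated with a simple measurement: time a plain scan of the data, time a scan that also rewrites the data, and take the ratio of the write portion to the scan. A minimal sketch of this calibration; the timings below are hypothetical, not measurements from our cluster:

```python
def write_multiplier(scan_seconds, rewrite_seconds):
    """c = cost of writing data out during re-partitioning,
    relative to the cost of simply scanning it."""
    return rewrite_seconds / scan_seconds

# Hypothetical calibration run: a plain scan takes 120 s; the same scan
# with re-partitioning (scan + write) takes 600 s, so writing alone
# accounts for 480 s.
scan_s = 120.0
scan_and_write_s = 600.0
c = write_multiplier(scan_s, scan_and_write_s - scan_s)
print(c)  # 4.0
```

With these illustrative numbers the calibration yields c = 4, the setting that performed best on this workload.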
Chapter 9
Conclusion
In this thesis, we presented Amoeba, an adaptive distributed storage manager targeted at ad-hoc query workloads. Amoeba does not require developers to provide an upfront query workload, which is often not known a priori in modern exploratory data analyses. Rather, Amoeba distributes the partitioning effort over all attributes
in the dataset and later refines the partitioning based on how the data is actually
being used. We presented techniques to partition a dataset over several attributes, de-
scribed the robust partitioning algorithm to create the upfront partitioning, showed
transformations to adaptively change the upfront partitioning tree, and detailed a
divide-and-conquer algorithm to pick the best repartitioned tree. We also described
the Amoeba storage system built on top of HDFS and showed experimental results
with Spark. Our results on both real and synthetic workloads show that Amoeba
provides improved up-front query performance, improves the query performance even
further as the queries arrive, and eventually converges to a steady state when a par-
ticular workload repeats over and over again.
Bibliography
[1] Sanjay Agrawal, Vivek Narasayya, and Beverly Yang. Integrating Vertical and Horizontal Partitioning into Automated Physical Database Design. SIGMOD, 2004.
[2] Daniel C. Zilio, Jun Rao, Sam Lightstone, Guy Lohman, Adam Storm, Christian Garcia-Arellano, and Scott Fadden. DB2 Design Advisor: Integrated Automatic Physical Database Design. VLDB, 2004.
[3] Rimma Nehme and Nicolas Bruno. Automated Partitioning Design in Parallel Database Systems. SIGMOD, 2011.
[4] Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. Schism: a Workload-Driven Approach to Database Replication and Partitioning. VLDB, 2010.
[5] Abdul Quamar, K. Ashwin Kumar, and Amol Deshpande. SWORD: Scalable Workload-Aware Data Placement for Transactional Workloads. EDBT, 2013.
[6] Andrew Pavlo, Carlo Curino, and Stan Zdonik. Skew-Aware Automatic Database Partitioning in Shared-Nothing, Parallel OLTP Systems. SIGMOD, 2012.
[7] Liwen Sun, Michael J. Franklin, Sanjay Krishnan, and Reynold S. Xin. Fine-grained Partitioning for Aggressive Data Skipping. SIGMOD, 2014.
[8] Hadoop Apache Project, http://hadoop.apache.org.
[9] Apache Spark, https://spark.apache.org.
[10] Stratos Idreos, Martin Kersten, and Stefan Manegold. Database Cracking. In CIDR, 2007.
[11] Stratos Idreos, Martin Kersten, and Stefan Manegold. Self-organizing Tuple Reconstruction In Column-stores. In SIGMOD, 2009.
[12] Martin Grund, Jens Kruger, Hasso Plattner, Alexander Zeier, Philippe Cudre-Mauroux, and Samuel Madden. HYRISE: A Main Memory Hybrid Storage Engine. PVLDB, 4(2):105–116, 2010.
[13] Alekh Jindal, Jorge-Arnulfo Quiane-Ruiz, and Jens Dittrich. Trojan Data Layouts: Right Shoes for a Running Elephant. In ACM SOCC, pages 21:1–21:14, 2011.
[14] Alekh Jindal and Jens Dittrich. Relax and let the database do the partitioning online. In BIRTE, pages 65–80, 2011.
[15] S. Ghandeharizadeh and D. J. DeWitt. MAGIC: A Multiattribute Declustering Mechanism for Multiprocessor Database Machines. IEEE Trans. Parallel Distrib. Syst., 1994.
[16] Oracle Subpartitioning, http://docs.oracle.com/cd/E17952_01/refman-5.5-en/partitioning-subpartitions.html.
[17] MySQL Subpartitioning, http://dev.mysql.com/doc/refman/5.1/en/partitioning-subpartitions.html.
[18] Apache Hive, https://hive.apache.org.
[19] Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. SCOPE: parallel databases meet MapReduce. The VLDB Journal, 21(5):611–636, 2012.
[20] Shark, http://shark.cs.berkeley.edu.
[21] Impala, http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html.
[22] Goetz Graefe. Sorting And Indexing With Partitioned B-Trees. CIDR, 2003.
[23] Young-Kyoon Suh, Ahmad Ghazal, Alain Crolotte, and Pekka Kostamaa. A New Tool for Multi-level Partitioning in Teradata. CIKM, 2012.
[24] Daniel Dominic Sleator and Robert Endre Tarjan. Self-adjusting Binary Search Trees. Journal of the ACM, 32(3):652–686, 1985.
[25] Ahmed Eldawy and Mohamed F. Mokbel. A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data. In VLDB, 2013.
[26] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, and Amr El Abbadi. MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. In MDM, 2011.
[27] Jinbao Wang, Sai Wu, Hong Gao, Jianzhong Li, and Beng Chin Ooi. Indexing Multi-dimensional Data in a Cloud System. In SIGMOD, 2010.
[28] Philippe Cudre-Mauroux, Eugene Wu, and Samuel Madden. TrajStore: An Adaptive Storage System for Very Large Trajectory Data Sets. In ICDE, 2010.
[29] Martin Kersten and Stefan Manegold. Cracking the Database Store. In CIDR, 2005.
[30] Stratos Idreos, Martin Kersten, and Stefan Manegold. Updating a Cracked Database. In SIGMOD, 2007.
[31] Stratos Idreos, Stefan Manegold, Harumi Kuno, and Goetz Graefe. Merging What's Cracked, Cracking What's Merged: Adaptive Indexing in Main-Memory Column-Stores. In PVLDB, 2011.
[32] Goetz Graefe, Felix Halim, Stratos Idreos, Harumi Kuno, and Stefan Manegold. Concurrency Control for Adaptive Indexing. In PVLDB, 2012.
[33] Felix Halim, Stratos Idreos, Panagiotis Karras, and Roland H. C. Yap. Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores. In PVLDB, 2012.
[34] M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. Gray, P. P. Griffiths, W. F. King, R. A. Lorie, P. R. McJones, J. W. Mehl, G. R. Putzolu, I. L. Traiger, B. W. Wade, and V. Watson. System R: Relational Approach to Database Management. ACM Trans. Database Syst., 1976.
[35] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. SIGOPS Oper. Syst. Rev., 37(5):29–43, 2003.
[36] Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, and Rafael Pasquin. Incoop: MapReduce for Incremental Computations. In SoCC, 2011.
[37] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 1975.
[38] Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, and Jorg Schad. Only Aggressive Elephants are Fast Elephants. PVLDB, 5(11):1591–1602, 2012.
[39] Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, and Jorg Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). PVLDB, 2010.
[40] Mohamed Y. Eltabakh, Yuanyuan Tian, Fatma Ozcan, Rainer Gemulla, Aljoscha Krettek, and John McPherson. CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop. VLDB, 2011.
[41] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD, 2009.
[42] Carsten Binnig, Ugur Cetintemel, Andrew Crotty, Alex Galakatos, Tim Kraska, Erfan Zamanian, and Stan Zdonik. The End of Slow Networks: It's Time for a Redesign. arXiv:1504.01048 [cs.DB], April 2015.
[43] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Disk-locality in Datacenter Computing Considered Irrelevant. In HotOS, 2013.
Appendices
Appendix A
Fast Remote Reads
One effect of Amoeba is to de-cluster the data, as values from a specific range of a
single attribute may not be clustered on a specific node. This means that answering
a query may involve reading many partitions from different nodes in the network. For
example, to find products with price = $125, we might need to read both partitions
1 and 2, which will be on different nodes. This is contrary to the traditional wisdom
in the database community, which argues for data locality by moving computation
to data, rather than moving data to answer queries. However, recent improvements
in datacenter network design have resulted in designs that provide full cross-section
bandwidth of 1 Gbit/sec or more between all pairs of nodes [42], such that network
throughput is no longer a bottleneck. Recent research has shown that accessing a remote disk in a computing cluster delivers only about 8% lower throughput than reading from a local disk [43].
Table 1:

    Data Locality (%)     100   71    46    27
    Response Time (sec)   442   500   512   524

Figure A-1: Response time with varying data locality (%)
To verify this in a real system, we ran a micro-benchmark on Hadoop MapReduce,
in which we measured the runtime of a map-only job (performing simple aggregation
in the combiner) while varying the locality of data blocks on HDFS. Figure A-1 shows the results from a 4-node cluster with a full duplex 1 Gbit/sec network. Note that even with locality as low as 27%, the job is just 18% slower than with 100% data locality.
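The 18% figure follows directly from the response times in Figure A-1; a quick check of the arithmetic:

```python
# Response times (seconds) from Figure A-1, keyed by data locality (%).
response_time = {100: 442, 71: 500, 46: 512, 27: 524}

def slowdown_vs_full_locality(locality_pct):
    """Fractional slowdown relative to the 100% data-locality run."""
    return response_time[locality_pct] / response_time[100] - 1.0

print(round(slowdown_vs_full_locality(27) * 100, 1))  # 18.6 -- "just 18%" in round terms
```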
Amoeba leverages this new hardware trend to aggressively de-cluster the data over
several (or all) attributes in the dataset.