An Adaptive Partitioning Scheme for Ad-hoc and
Time-varying Database Analytics
by
Anil Shanbhag
B.Tech. in Computer Science, Indian Institute of Technology Bombay, 2014
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2016
© Massachusetts Institute of Technology 2016. All rights reserved.
Author: Department of Electrical Engineering and Computer Science
May 19, 2016
Certified by: Samuel Madden
Professor of Electrical Engineering and Computer Science, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski
Chairman, Department Committee on Graduate Students
An Adaptive Partitioning Scheme for Ad-hoc and
Time-varying Database Analytics
by
Anil Shanbhag
Submitted to the Department of Electrical Engineering and Computer Science
on May 19, 2016, in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
Abstract
Data partitioning significantly improves query performance in distributed database systems. A large number of techniques have been proposed to efficiently partition a dataset, often focusing on finding the best partitioning for a particular query workload. However, many modern analytic applications involve ad-hoc or exploratory analysis where users do not have a representative query workload. Furthermore, workloads change over time as businesses evolve or as analysts gain a better understanding of their data. Static workload-based data partitioning techniques are therefore not suitable for such settings. In this thesis, we present Amoeba, an adaptive distributed storage system for data skipping. It does not require an upfront query workload and adapts the data partitioning according to the queries posed by users over time. We present the data structures, partitioning algorithms, and an efficient implementation on top of Apache Spark and HDFS. Our experimental results show that the Amoeba storage system provides improved query performance for ad-hoc workloads, adapts to changes in the query workloads, and converges to a steady state in case of recurring workloads. On a real-world workload, Amoeba reduces the total workload runtime by 1.8x compared to Spark with data partitioned and 3.4x compared to unmodified Spark.
Thesis Supervisor: Samuel Madden
Title: Professor of Electrical Engineering and Computer Science
Acknowledgments
I would like to thank Alekh Jindal, Qui Nguyen, Aaron Elmore, Jorge Quiane and
Divyakanth Agarwal who have contributed many ideas to this work and helped build
the system.
I would also like to thank Prof. Samuel Madden, my thesis supervisor, for being
a constant source of guidance and feedback in this project and outside.
Finally, I am always grateful to my family and friends, who encouraged me and
supported me along the way.
Contents
1 Introduction
2 Related Work
3 System Overview
4 Upfront Data Partitioning
4.1 Key Ideas
4.2 Upfront Partitioning Algorithm
5 Adaptive Repartitioning
5.1 Workload Monitor and Cost Model
5.2 Partitioning Tree Transformations
5.3 Divide-And-Conquer Repartitioning
5.4 Handling Multiple Predicates
6 Implementation
6.1 Initial Robust Partitioning
6.2 Query Execution
7 Discussion
7.1 Leveraging Replication
7.2 Handling Joins
8 Evaluation
8.1 Upfront Partitioning Performance
8.2 Micro-benchmarks
8.3 Amoeba on Real Workload
9 Conclusion
Appendices
A Fast Remote Reads
List of Figures
1-1 Example partitioning tree with 8 blocks
3-1 Amoeba Architecture
4-1 Partitioning Techniques
4-2 Upfront Partitioning Algorithm Example
5-1 Node swap in the partitioning tree
5-2 Illustrating adaptive partitioning when predicate A2 appears repeatedly
5-3 Node pushdown in partitioning tree
5-4 Node rotation in partitioning tree
7-1 Heterogeneous Replication
8-1 Ad-hoc query runtimes for different attributes of TPC-H lineitem
8-2 Comparing the upload time in Amoeba with HDFS
8-3 Comparing performance of upfront partition tree vs kd-tree
8-4 Query runtimes for changing query attributes on TPC-H lineitem
8-5 Query runtimes for changing predicates on the same attribute of TPC-H lineitem
8-6 Cumulative optimizer runtime across 100 queries
8-7 Cumulative repartitioning cost
8-8 Total runtimes of the different approaches
A-1 Response time with varying data locality (%)
List of Tables
5.1 The cost and benefit estimates for different partitioning tree transformations
Chapter 1
Introduction
Collecting data is becoming ever easier and cheaper, leading to ever-larger
datasets. This big data, ranging from sources such as sensors to server logs, has the
potential to uncover business insights and help businesses make informed decisions,
but only if they can analyze it effectively. For this reason, companies have adopted
distributed database systems as the go-to solution for storing and analyzing their
data.
Data partitioning is a well-known technique for improving the performance of
distributed database applications. For instance, when selecting subsets of the data,
having the data pre-partitioned on the selection attribute allows the system to skip
irrelevant pieces of the data rather than scan the entire dataset. Joins and aggregations
also benefit from data partitioning. Because of these performance gains, the database
research community has proposed many techniques to find a good data partitioning for
a query workload. Such workload-based data partitioning techniques assume that
the query workload is provided upfront or collected over time [1, 2, 3, 4, 5, 6, 7].
Unfortunately, in many cases a static query workload may not be known a priori.
One reason for this is that modern data analytics is data-centric and tends to in-
volve ad-hoc and exploratory analysis. For example, an analyst may look for anoma-
lies and trends in a user activity log, such as from web servers, network systems,
transportation services, or any other sensors. Such analyses are ad-hoc and a repre-
sentative set of queries is not available upfront. To illustrate, production workload
traces from a Boston-based analytics startup reveal that even after seeing 80% of the
Figure 1-1: Example partitioning tree with 8 blocks
workload, the remaining 20% of the workload still contains 57% new queries. These
workload patterns are hard to collect in advance. Furthermore, collecting the query
workload is tedious as analysts would typically like to start using data as soon as
possible, rather than having to provide a workload before obtaining acceptable per-
formance. Providing a workload upfront has the further complexity that it can overfit
the database to that workload, requiring all other queries to scan unnecessary data
partitions to compute answers.
Distributed storage systems like HDFS [8] store large files as a collection of
fixed-size blocks (for HDFS, the block size is usually 64/128MB). A block acts as the
smallest unit of storage and gets replicated across multiple machines. The key idea
is to exploit this block structure to build and maintain a partitioning tree on top
of the table. A partitioning tree is a binary tree which partitions the data into a
number of small partitions, each roughly the size of a block. Each such partition contains
a hypercube of the data. Figure 1-1 shows an example partitioning tree for a 1GB
dataset over 4 attributes with block size 128MB. The data is split into 8 blocks, the
same as what would be created by a block-based system; however, each block now has
additional metadata. For example, block 1's tuples satisfy A ≤ 4 & B ≤ 5 & D ≤ 4.
As a result, any query can be answered by reading only a subset of the partitions.
The partitioning tree is improved over time based on queries submitted by the user.
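The read path over such a tree can be sketched as follows. This is a simplified illustration, not Amoeba's actual code; the `Node` class and `blocks_to_read` helper are hypothetical, and predicates are modeled as closed ranges per attribute.

```python
# Sketch of query-time block skipping over the tree of Figure 1-1:
# internal nodes hold (attribute, cut-point), leaves hold block ids.
# A query's predicates prune whole subtrees.

class Node:
    def __init__(self, attr, cut, left, right):
        self.attr, self.cut, self.left, self.right = attr, cut, left, right

def blocks_to_read(node, preds):
    """Return ids of blocks whose hypercube may satisfy `preds`.
    `preds` maps attribute -> (lo, hi) closed range."""
    if isinstance(node, int):          # leaf: a block id
        return [node]
    lo, hi = preds.get(node.attr, (float("-inf"), float("inf")))
    out = []
    if lo <= node.cut:                 # left subtree holds attr <= cut
        out += blocks_to_read(node.left, preds)
    if hi > node.cut:                  # right subtree holds attr > cut
        out += blocks_to_read(node.right, preds)
    return out

# The tree of Figure 1-1: A4 at the root, then B5/C6, and so on.
tree = Node("A", 4,
            Node("B", 5, Node("D", 4, 1, 2), Node("D", 5, 3, 4)),
            Node("C", 6, Node("B", 3, 5, 6), Node("D", 4, 7, 8)))

# A query with predicate A <= 3 touches only blocks 1-4.
print(blocks_to_read(tree, {"A": (float("-inf"), 3)}))  # [1, 2, 3, 4]
```

With no predicates the traversal degenerates to a full scan of all 8 blocks, which is exactly the behavior of an unpartitioned block-based system.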
We implemented this idea in Amoeba. Amoeba is designed with three key
properties in mind: (1) it requires no upfront query workload, while still providing
good performance for a wide range of ad-hoc queries; (2) as users pose more queries
over certain attributes, it adaptively repartitions the data, to gradually perform better
on queries over frequent attributes and attribute ranges; and (3) it provides robust
performance that avoids over-fitting to any particular workload.
The system exposes a relational storage manager, consisting of a collection of
tables, with support for predicate-based data access, i.e., scanning a table with a set
of predicates to filter on; for example, scanning the table employee with
the predicate age ≥ 30 and 1000 ≤ salary ≤ 2000. Because the stored data is partitioned,
such a scan ends up accessing only a subset of the data. The system is self-tuning:
as users start submitting queries to the system, it specializes the partitioning to the
observed patterns over time.
Query optimizers tend to push predicates down to the scan operator, and big data
systems like Spark SQL [9] allow custom predicate-based scan operators. Amoeba
integrates as a predicate-based scan operator, and the repartitioning operations that
change the data layout are invisible to the user.
Adaptive partitioning/indexing is extensively used in modern single-node in-
memory column stores to achieve good performance. These techniques, called
cracking [10], have been used to generate an adaptive index on a column based on in-
coming queries. Partial sideways cracking [11] extends this to generate adaptive indexes
on multiple columns. Cracking happens on each query and maintains additional
structures to create the index. The reason cracking cannot be applied in a dis-
tributed setting is that the cost of re-partitioning is very high. Each round of
re-partitioning needs to be carefully planned to amortize the cost associated with it.
Our approach is complementary to many other physical storage optimizations.
For example, decomposed storage layouts (i.e., column-stores) are designed to avoid
reading columns that aren’t accessed by a query. In contrast, partitioning schemes,
including ours, aim to avoid reading entire partitions of the dataset. Although our
prototype does not use a decomposed storage model, there is nothing about our
approach that cannot work in such a setting: individual columns or column groups
could easily be separately partitioned and accessed in our approach.
In summary, we make the following major contributions:
• We describe a set of techniques to aggressively partition a dataset over several
attributes and propose an algorithm to generate a robust initial partitioning
tree. Our robust partitioning tree spreads the benefits of data partitioning
across all attributes in the schema. It does not require an upfront query work-
load, and also handles data skew and correlations (Chapter 4).
• We describe an algorithm to adaptively repartition the data based on the ob-
served workload. We piggyback on query processing to repartition only the
accessed portions of the data. We present a divide-and-conquer algorithm to
efficiently pick the best repartitioning strategy, such that the expected benefit of
repartitioning outweighs the expected cost. To the best of our knowledge, this
is the first work to propose adaptive data partitioning for analytical workloads
(Chapter 5).
• We describe an implementation of our system on top of the Hadoop Distributed
File System (HDFS)¹ and Spark. This storage system consists of: (i) an upfront
partitioning pipeline to load the dataset into Amoeba, and (ii) an adaptive query
executor used to read data out of the system (Chapter 6).
• We present a detailed evaluation of the Amoeba storage system on real and
synthetic query workloads to demonstrate three key properties: (i) robustness:
the system gives improved performance over ad-hoc queries right from the start,
(ii) adaptivity: the system adapts to the changes in the query workload, and
(iii) convergence: the system approaches the ideal performance when a partic-
ular workload repeats over and over again. We also evaluate our results on a
real query workload from a local startup.
¹Amoeba could equally well work with any other distributed file system.
Chapter 2
Related Work
Database partitioning and indexing have a rich history in the database literature.
Partitioning involves organizing the data in a structured manner while indexing cre-
ates auxiliary structures which can be used to accelerate queries. Data partitioning
could be vertical (typically used for projection and late materialization) or horizontal
(typically used for selections, group-by, and joins). Vertical partitioning is typically
useful for analytical workloads and has been studied heavily in the past, both for
static [12, 13] and dynamic [14] cases. Horizontal partitioning has been considered
both for transactional and analytical workloads. Amoeba essentially does adaptive
horizontal partitioning based on a partitioning tree. Broadly, the related work can be
grouped into three categories: workload-based partitioning tools, multi-dimensional
indexing and adaptive indexing in single node systems.
Workload-based partitioning: For transactional workloads, researchers have pro-
posed fine-grained partitioning [4], a hybrid of fine- and coarse-grained partitioning [5],
and skew-aware partitioning [6]. For analytical workloads, researchers have proposed
to leverage deep integration with the optimizer in order to make better decisions [3],
to take the interdependence of different design decisions into account [2], and even to
integrate vertical and horizontal partitioning decisions [1]. Traditional database par-
titioning, however, is still workload-driven, and requires that the workload is either
provided upfront or monitored and collected over time. MAGIC [15] aims to support
multiple kinds of queries by declustering data on multiple attributes. The data is
arranged into directories which can be distributed across processors. MAGIC also
requires the query workload as well as the resource requirements in order to come
up with the directory in the first place. As a result, the directories need to be reconfig-
ured every time the workload changes. Similar to MAGIC, both Oracle and MySQL
support sub-partitioning to create nested partitions on multiple attributes [16, 17].
However, the sub-partitions are useful only if the outer attributes appear in the group-
by or join clause.
Big data storage systems, such as HDFS, partition datasets based on size. Devel-
opers can later create attribute-based partitioning using a variety of data processing
tools, e.g., Hive [18], SCOPE [19], Shark [20], and Impala [21]. However, such a
partitioning is no different from traditional database partitioning, as (i) partitioning
is a static, one-time activity, and (ii) the partitioning keys must be known a priori and
provided by users. Apart from single-table partitioning, [7] recently proposed to
create data blocks in HDFS based on features extracted from each input tuple.
Again, the features are selected based on a workload, and the goal is to cluster tuples
with similar features in the same data block.
Multi-dimensional Indexing: Partitioning has also been considered in the con-
text of indexing. For example, researchers have proposed to partition a B+-Tree [22]
on primary keys. These indexes are typically partitioned on a single attribute. Re-
cently, Teradata proposed multi-level partitioned primary indexes [23]. However, the
partitioned attributes are still based on a query workload and they can be used only
for selection predicates.
Multidimensional indexing has been extensively investigated in the database lit-
erature. Examples include k-d trees, R-trees, and quad-trees. These index struc-
tures are typically used for 2-dimensional spatial data. The octree, which divides
space into octants, is used with 3-dimensional data, such as 3D graphics. Several
other binary search trees have been proposed in the literature, such as splay tree [24].
Recent approaches layer multidimensional index structures over distributed data in
large clusters. This includes SpatialHadoop [25], MD-HBase [26], and epiC [27], or
adapting the multidimensional index to the workload in TrajStore [28]. However, all
of these multidimensional indexing approaches typically consider data locality and
2-dimensional spatial data.
Adaptive Indexing: Adaptive indexing techniques, such as database cracking [29,
10, 30, 11, 31, 32, 33], have been successful in single-node in-memory column stores.
Cracking adaptively builds an index as queries are processed. This is done by using
the selection predicate in each incoming query as a hint to recursively split the dataset
into finer-grained partitions. As a result, cracking piggybacks on query processing to
amortize the cost of indexing over a sequence of queries. Cracking happens on each
query and maintains additional structures to create the index. The reason cracking
cannot be applied to a distributed data store is that the cost of re-partitioning is very
high. Each round of re-partitioning needs to be carefully planned to amortize the
cost associated with it.
Chapter 3
System Overview
The system exposes a relational storage manager, consisting of a collection of tables.
A query to the system is of the form <table, (filter predicates)>, for example
<employee, (age > 30, 100 ≤ salary ≤ 200)>. As the table is stored partitioned based
on the table's partitioning tree, Amoeba is able to answer the query by accessing
only the relevant data blocks. Database query optimizers [34] can push selection
predicates past group-by and join operators down to the table. The table scan along
with the filters forms the input to Amoeba.
Figure 3-1: Amoeba Architecture
Figure 3-1 shows the overall architecture of Amoeba. The three key components
of the Amoeba storage system are as follows:
(i) Upfront partitioner. The upfront partitioner partitions a dataset into blocks and
spreads them throughout a block-based file system. The blocks are created based
on attributes in the dataset, without requiring a query workload. As a result, users
immediately get improved performance on ad-hoc workloads.
(ii) Storage Engine. The storage engine builds on top of a block-based storage system
to store tables. Each table represents a dataset loaded using the upfront partitioner.
The table contains an index file which stores the partitioning tree used to partition
the dataset and the partitioned dataset as a collection of data blocks. In addition,
we also store a query log containing the most recent queries that accessed the dataset
and a sample of the dataset whose use is described later.
(iii) Adaptive Query Executor. The adaptive query executor takes queries in the
form of a predicated scan and returns the matching tuples. Since Amoeba
internally stores the data partitioned by the partitioning tree, it is able to skip many
data blocks while answering queries. The query first goes to the optimizer. When
the data blocks accessed by the query are not perfectly partitioned, the optimizer
considers repartitioning some or all of the accessed data blocks as they are accessed
by the query, using the query predicates as cut-points. We use a cost model to
evaluate the expected cost and benefit of repartitioning. Due to de-clustering, we
end up performing random I/Os for each data block. This is acceptable: large block
sizes in distributed file systems [35] combined with fast network speeds make remote
reads almost as fast as local reads (see Appendix A). We sacrifice some data locality
in order to quickly locate the relevant portions of the data on each machine in a
distributed setting.
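The shape of such a cost/benefit check can be sketched as follows. This is a purely hypothetical illustration: the function name, parameters, and the linear cost model are all assumptions, not Amoeba's actual cost model (which is presented in Chapter 5).

```python
# Hypothetical sketch of an optimizer's repartitioning decision: rewrite a
# set of touched blocks only if the expected savings on future queries
# outweigh the one-time cost of reading and rewriting those blocks.

def should_repartition(blocks_read, blocks_after, block_size_mb,
                       expected_future_queries, write_cost_factor=2.0):
    """Return True if repartitioning is expected to pay off."""
    # One-time cost: read the touched blocks, then write them back out
    # (writes weighted higher to account for replication).
    cost = blocks_read * block_size_mb * (1 + write_cost_factor)
    # Benefit: each future query over this predicate reads fewer blocks.
    saving_per_query = (blocks_read - blocks_after) * block_size_mb
    return expected_future_queries * saving_per_query > cost

# Shrinking 8 touched blocks to 2 does not pay off for a one-off query,
# but does once the predicate is expected to recur.
print(should_repartition(8, 2, 128, expected_future_queries=1))
print(should_repartition(8, 2, 128, expected_future_queries=10))
```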
The Amoeba storage system is self-tuning, lightweight (both in terms of the upfront
and repartitioning costs), and does not increase the storage space requirements.
In the rest of the thesis, we focus on building an efficient predicate-based data access
system. Future work will look at developing query processing algorithms (e.g., join
algorithms) on top of this system.
Chapter 4
Upfront Data Partitioning
A distributed storage system, such as HDFS, subdivides a dataset into smaller chunks,
called blocks. Blocks are created based on size, such that each block except the last is
of B bytes (usually 64MB), and each block gets independently replicated on R machines,
where R is the number of replicas (usually 3). The upload happens in parallel; however,
it is expensive and involves writing out R copies of the dataset to disk across the cluster.
The upfront data partitioning pipeline exploits this block structure to create blocks
based on attributes, i.e., it splits the data by attributes rather than partitioning with-
out regard to the values in each block. This is similar to content-based chunking [36]
and feature-based blocking [7], however our approach does not depend on a query
workload. The key idea is to integrate attribute-based partitioning with data block-
ing, i.e., splitting a dataset into data blocks, in the underlying storage system using a
partitioning tree. This helps Amoeba achieve improved query performance on almost
all ad-hoc queries, compared to standard full scans, without having any information
about the query workload. This partitioning also serves as a good starting point for
the adaptive data repartitioner to improve upon, as discussed in Chapter 5.
We first present the key ideas used in the upfront partitioner. Then, we describe
our partitioning algorithm to come up with a partitioning tree for a given dataset.
4.1 Key Ideas
(1) Balanced Binary Tree. We represent the partitioning tree as a balanced binary
tree, i.e., we successively partition the dataset into two until we reach the minimum
partition size. Each node in the tree is represented as A_p, where A is the attribute being
partitioned on and p is the cut-point. All tuples with A ≤ p go to the left subtree and
the rest go to the right subtree. The leaf nodes in the tree are buckets, each having a
unique identifier and a file name in the underlying file system. This file contains the
tuples that satisfy the predicates of all nodes on the path from the bucket up to
the root of the tree. Note that an attribute can appear in multiple nodes in the tree.
Having multiple occurrences of an attribute in the same branch of the tree increases
the number of ways the data is partitioned on that attribute.
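The write path implied by this rule can be sketched in a few lines. This is an assumed representation for illustration (the `route` helper is hypothetical), using the tree of Figure 1-1.

```python
# Sketch of routing a tuple down a partitioning tree: at each node A_p,
# tuples with A <= p go left and the rest go right, until a leaf bucket
# is reached. Internal nodes are (attr, cut, left, right); leaves are
# bucket ids.

def route(node, tup):
    """Walk a tuple down to its bucket id."""
    while not isinstance(node, int):
        attr, cut, left, right = node
        node = left if tup[attr] <= cut else right
    return node

# The tree of Figure 1-1.
tree = ("A", 4,
        ("B", 5, ("D", 4, 1, 2), ("D", 5, 3, 4)),
        ("C", 6, ("B", 3, 5, 6), ("D", 4, 7, 8)))

# A tuple with A=2, B=5, D=3 satisfies A <= 4, B <= 5, D <= 4: bucket 1.
print(route(tree, {"A": 2, "B": 5, "C": 9, "D": 3}))  # 1
```

By construction, every tuple in a bucket's file satisfies the predicates of all nodes on the path from that bucket to the root, which is exactly the metadata used for skipping at query time.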
(2) Heterogeneous Branching. Figure 4-1(a) shows a partitioning tree analogous to the
k-d tree. This tree can only accommodate as many attributes as the depth of the tree.
For a dataset of size D, minimum partition size P, and n-way partitioning over each
attribute, the partitioning tree contains ⌊log_n(D/P)⌋ attributes. With n = 2, D = 1TB,
and P = 64MB, we can only accommodate 14 attributes in the partitioning tree.
However, many real-world schemas have far more attributes. To accommodate more
attributes, we introduce heterogeneous branching to partition different branches of
the partitioning tree on different attributes. Hence, we sacrifice the best performance
on a few attributes to achieve improved performance over more attributes. This is
reasonable, as without a workload there is no reason to prefer one attribute over
another. Figure 4-1(b) shows a partitioning tree with heterogeneous branching. After
partitioning on attributes A and B, the left side of the tree partitions on C while
the right side partitions on D. Thus, we are now able to accommodate 4 attributes
instead of 3. However, attributes C and D are each partitioned on only 50% of the data.
As a result, ad-hoc queries gain partially, but over all four attributes,
which makes the partitioning robust.
The number of attributes in the partitioning tree, with c as the minimum fraction
of the data partitioned by each attribute, is given as (1/c) · ⌊log_n(D/P)⌋. With n = 2,
D = 1TB, P = 64MB, and c = 50%, the number of
attributes that can be partitioned is 28. Note that the number of attributes that can
Figure 4-1: Partitioning Techniques. (a) Partitioning Tree. (b) Heterogeneous Branching.
be partitioned increases with the dataset size. This shows that with larger dataset
sizes, upfront partitioning is even more useful for quickly finding the relevant portions
of the data.
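The two formulas above can be checked with a short worked calculation (a sketch; variable names are ours):

```python
import math

# A plain tree holds floor(log_n(D / P)) attributes; heterogeneous
# branching with minimum data fraction c per attribute raises this by 1/c.
n, D, P, c = 2, 2**40, 64 * 2**20, 0.5   # 2-way splits, 1TB, 64MB, c = 50%
plain = math.floor(math.log2(D // D * (D // P)))  # log base n, here n = 2
plain = math.floor(math.log2(D // P))
hetero = int((1 / c) * plain)
print(plain, hetero)  # 14 28
```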
(3) Hedging Our Bets. We define the allocation of an attribute i at each node j in the
tree as the number of ways the node partitions that attribute (n_ij) times the fraction
of the dataset this partitioning is applied to (c_ij), i.e., the total allocation of attribute
i is given as:

Allocation_i = Σ_j c_ij · n_ij

Allocation as defined above gives the average fanout of an attribute. For example, in
Figure 4-1(b), attribute B has an allocation of (2 · 0.5 + 2 · 0.5) = 2, while attribute
C has an allocation of (2 · 0.25 + 2 · 0.25) = 1. If we distribute the allocation equally
among all attributes, then the maximum per-attribute allocation for |A| attributes
and b buckets is given as b^(1/|A|). For example, if there are 8 buckets and 3 attributes,
the allocation (average fanout) per attribute is 8^(1/3) = 2.
The key intuition behind our upfront partitioning algorithm is to compute this
maximum per-attribute allocation, and then place attributes in the partitioning tree
so as to approximate this ideal allocation.
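The allocation bookkeeping is simple enough to sketch directly (the helper name is ours, for illustration):

```python
# Sketch of the allocation computation: each occurrence of attribute i at
# node j contributes c_ij * n_ij, the split fanout weighted by the
# fraction of the data reaching that node.

def allocation(occurrences):
    """occurrences: list of (fraction_of_data, fanout) for one attribute."""
    return sum(c * n for c, n in occurrences)

# Figure 4-1(b): B splits 2-way on each half of the data,
# C splits 2-way on each quarter of the data.
print(allocation([(0.5, 2), (0.5, 2)]))    # 2.0  (attribute B)
print(allocation([(0.25, 2), (0.25, 2)]))  # 1.0  (attribute C)

# Ideal per-attribute allocation for b buckets, |A| attributes: b**(1/|A|)
print(round(8 ** (1 / 3), 3))  # 2.0
```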
Figure 4-2: Upfront Partitioning Algorithm Example.

(4) Handling skew and correlations efficiently. Real-world datasets are often skewed
(e.g., recent sales, holiday season shopping, etc.) and have attributes that are cor-
related (e.g., state and zipcode). As a result, if we uniformly partition by attribute
value, some branches of the partitioning tree could have much more data than others,
resulting in unbalanced final partitions. We would then lose the benefit of partition-
ing due to either very small or very large partitions. To illustrate, consider a skewed
dataset D1 = {1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 7, 8} and two partitionings, one based on the
domain values and the other on the median value.
Pdomain(D1) = [{1, 1, 1, 2, 2, 2}, {3, 4}, {5, 6}, {7, 8}]
Pmedian(D1) = [{1, 1, 1}, {2, 2, 2}, {3, 4, 5}, {6, 7, 8}]
We can see that Pdomain(D1) is clearly unbalanced whereas Pmedian(D1) produces bal-
anced partitions. Likewise, to illustrate the effect of correlations, consider a dataset
D2 with two attributes country name and salary :
D2(country, salary) ={(X, $25), (X, $40), (X, $60), (X, $80), (Y, $600),
(Y, $700), (Y, $850), (Y, $950)}
Pdomain(D2) =[{(X, $25), (X, $40), (X, $60), (X, $80)}, {(Y, $600),
(Y, $700), (Y, $850), (Y, $950)}]
Pmedian(D2) =[{(X, $25), (X, $40)}, {(X, $60), (X, $80)}, {(Y, $600),
(Y, $700)}, {(Y, $850), (Y, $950)}]
Partitioning D2 on country followed by salary results in only two partitions when
splitting the salary over its domain (0 through 1000). This is because the salary
distributions are correlated with the country. However, we get all four partitions
when using the medians successively on each node.
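The contrast on D1 can be reproduced with a short sketch (the two split helpers are ours, for illustration; `split_domain` uses equi-width ranges over the value domain, `split_median` splits recursively at the median):

```python
# Equi-width domain splits vs. recursive median splits on the skewed D1.

def split_domain(data, parts):
    lo, hi = min(data), max(data)
    width = (hi - lo + 1) / parts
    out = [[] for _ in range(parts)]
    for v in sorted(data):
        out[min(int((v - lo) / width), parts - 1)].append(v)
    return out

def split_median(data, parts):
    data = sorted(data)
    if parts == 1:
        return [data]
    mid = len(data) // 2          # cut at the median of this branch
    return split_median(data[:mid], parts // 2) + \
           split_median(data[mid:], parts // 2)

D1 = [1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 7, 8]
print(split_domain(D1, 4))  # [[1,1,1,2,2,2], [3,4], [5,6], [7,8]]
print(split_median(D1, 4))  # [[1,1,1], [2,2,2], [3,4,5], [6,7,8]]
```

The first partitioning leaves one oversized partition, while the median-based one yields four equal-sized partitions, matching P_domain(D1) and P_median(D1) above.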
Algorithm 1: UpfrontPartitioning
Input : Attribute[] attributes, Int datasetSize, Int partitionSize
Output: Tree partitioningTree
1 numBuckets ← datasetSize / partitionSize;
2 depth ← log2(numBuckets);
3 foreach a in attributes do
4     allocation[a] ← nthroot(numBuckets, size(attributes));
5 root ← CreateNode();
6 CreateTree(root, depth, allocation);
7 return root;
Amoeba avoids this problem by performing a breadth-first traversal while con-
structing the partitioning tree. At each node, we place the attribute which has the
maximum allocation remaining. Once the attribute is chosen, we sort the data ar-
riving at the node on the attribute and choose the median as the pivot. We split
the data on the pivot and proceed to assign (attribute, pivot) for the left and right
child nodes. Finding the median in the data handles skew and correlation between
attributes, thereby ensuring that child nodes get equal portions of data. This leads
to balanced partitions in the end. In order to find the medians efficiently, we first
make one pass over the data to construct a sample and then find the median at each
node using the sample. We refer to Chapter 6 for more details on implementation.
4.2 Upfront Partitioning Algorithm
We now describe our upfront partitioning algorithm. The goal of the algorithm is
to generate a partitioning tree, which balances the benefit of partitioning across all
attributes in the dataset. This means that same selectivity predicates on any two
attributes X and Y should have similar speed-ups, compared to scanning the entire
dataset. Notice that this is different from a k-d tree [37] which typically partitions
the space by considering the attributes in a round robin fashion, until the smallest
partition size is reached. Before we describe the algorithm, let us first look at the key
ideas in our upfront partitioning algorithm.
Algorithm 1 shows the upfront partitioning algorithm, which takes in the set
Algorithm 2: CreateTree
Input : Tree tree, Int depth, Int[] allocation
1  Queue nodeQueue ← {tree.root};
2  while nodeQueue.size > 0 do
3      node ← nodeQueue.pollFirst();
4      if depth > 0 then
5          node.attr ← leastAllocated(allocation);
6          node.value ← findMedian(node.attr);
7          node.left ← CreateNode();
8          node.right ← CreateNode();
9          allocation[node.attr] -= 2.0 / 2^(maxDepth − depth);
10         nodeQueue.add(node.left);
11         nodeQueue.add(node.right);
12         depth -= 1;
13     else
14         node ← newBucket();
of attributes, the dataset size, and the smallest partition size¹, and produces the
partitioning tree. The algorithm computes the ideal allocation for each attribute and
then calls createTree on the root node. Note that we could also consider relative
weights of attributes when computing the ideal allocation for each attribute, in case
some attributes are more likely to be queried than others. Algorithm 2 shows the
createTree function. It performs a breadth-first traversal and assigns an attribute
to each node. The attribute to be assigned is given by the function leastAllocated,
which returns the attribute which has the highest allocation remaining. If two or more
attributes have the same highest allocation remaining, we randomly choose among
the ones that have occurred the least number of times in the path from the node to
the root. findMedian returns the median of the attribute assigned to this node. This
is done by finding the median in the sampled data which comes to this branch. The
algorithm starts with an allocation of 2 for the root node, since we are partitioning
the entire dataset into two partitions. Each time we go to the left or the right subtree,
we reduce the data we operate on by half. Once an attribute is assigned to a node, we
subtract from the overall allocation of the attribute (Line 9). The algorithm creates
a leaf-level bucket in case we reach the maximum depth (Line 14).
¹For HDFS, we take the block size as the smallest partition size.
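As a concrete sketch, Algorithms 1 and 2 can be approximated in a few lines of Python. This is a simplified, single-machine sketch: the `Node` class and dictionary-based sample rows are illustrative, the random tie-breaking among equally allocated attributes is omitted, and each node's depth is tracked explicitly in the queue.

```python
import math
from collections import deque

class Node:
    def __init__(self):
        self.attr = None    # attribute this internal node partitions on
        self.value = None   # median pivot (None for leaf-level buckets)
        self.left = None
        self.right = None

def create_tree(root, max_depth, allocation, sample):
    # Breadth-first traversal: each node takes the attribute with the
    # highest remaining allocation and splits on its sample median.
    queue = deque([(root, max_depth, sample)])
    while queue:
        node, depth, rows = queue.popleft()
        if depth == 0 or not rows:
            continue                       # leaf-level bucket
        node.attr = max(allocation, key=allocation.get)
        vals = sorted(r[node.attr] for r in rows)
        node.value = vals[len(vals) // 2]  # median pivot from the sample
        allocation[node.attr] -= 2.0 / 2 ** (max_depth - depth)
        node.left, node.right = Node(), Node()
        left = [r for r in rows if r[node.attr] <= node.value]
        right = [r for r in rows if r[node.attr] > node.value]
        queue.append((node.left, depth - 1, left))
        queue.append((node.right, depth - 1, right))

def upfront_partitioning(attributes, dataset_size, partition_size, sample):
    # Algorithm 1: compute the per-attribute allocation, then build the tree.
    num_buckets = dataset_size // partition_size
    depth = int(math.log2(num_buckets))
    allocation = {a: num_buckets ** (1.0 / len(attributes)) for a in attributes}
    root = Node()
    create_tree(root, depth, allocation, sample)
    return root
```

Using the sample median as the pivot at each node is what keeps the resulting buckets balanced even under skewed or correlated attribute distributions.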
Figure 4-2 illustrates the steps when assigning attributes in a partitioning tree
over 4 attributes and 8 buckets (dataset size = 8 × 64 MB = 512 MB) and the allocation
remaining per attribute at each step. The algorithm starts in Step (i) with the root
node and performs a breadth-first traversal. Once attribute A is assigned in Step (i),
it has the least allocation remaining (in fact, we have used up all the allocation for
attribute A) and it is excluded from the possible options in Step (ii). We continue
placing attributes with the minimum allocation even after all four attributes have
been placed once in Step (iv). At the end of Step (v), attributes B, C, and D have
the same allocation remaining. For the next nodes in Steps (vi) and (vii), since C
already occurs on the path from the nodes to the root, we randomly choose between
B and D.
Chapter 5
Adaptive Repartitioning
In the previous chapter, we described an algorithm to partition a dataset on several
(or all) attributes, without knowing the query workload. However, as users begin to
query the dataset, it is beneficial to improve the partitioning based on the observed
queries. As in most online algorithms, we assume that the queries seen thus far are
indicative of the queries to come. Amoeba does this by doing transformations on
the partitioning tree based on the observed data access patterns. The key features of
the adaptive repartitioning module of Amoeba are:
• Piggybacked, meaning it interleaves with query processing and only accesses
the data which is read by the input query, i.e., we do not access data that is not
read by queries during re-partitioning. This has two benefits: (i) we never
spend any effort re-partitioning data that will not be touched by any query,
and (ii) the query processing and re-partitioning modules share the scan, thereby
reducing the cost of re-partitioning.
• Transparent, as users do not have to make any repartitioning decisions
themselves, and their queries remain unchanged as the underlying data layout evolves.
• Eventually convergent, meaning it converges to a final partitioning if a fixed
workload is repeated over and over again.
• Lightweight, as it does not penalize any query with high repartitioning costs: it
distributes the costs over several queries.
• Balanced between adaptivity and robustness, meaning the system tries to stabilize
newly made repartitioning decisions while expiring older ones.
In the rest of this chapter, we first describe our workload monitor and the cost
model to estimate the cost of a query over a given partitioning tree. Then, we
introduce three basic transformations used to transform a given partitioning tree.
We describe a bottom-up algorithm to consider all possible alternatives generated
from the transformation rules for inserting a single predicate. Finally, we discuss how
to handle multi-predicate queries.
5.1 Workload Monitor and Cost Model
Amoeba maintains a history of the queries seen by the system. We call this the query
window, denoted by W. Each incoming query is added to the query window as
⟨t, q⟩, where t is the current timestamp and q is the query. We do not directly
restrict the size of the query window; instead, we restrict the window to contain only
queries that happened in the past X hours. The intuition behind this is that
older queries are stale and no longer representative of the queries to come. In all
our evaluations, we set X = 4 hours. The cost of a query q over a partitioning tree T
is given as:

    Cost(T, q) = Σ_{b ∈ lookup(T, q)} n_b

where the function lookup(T, q) gives the set of relevant buckets for query q in
T and n_b denotes the number of tuples in bucket b. The cost of the query window is
the sum of the costs of the individual queries. A query being executed may have some
of its buckets re-partitioned. The added cost of repartitioning a set of buckets B is
given as:

    RepartitioningCost(T, q) = Σ_{b ∈ B} c · n_b
The important parameter to note here is c, the write multiplier, i.e.,
how expensive writes are compared to reads. Changing c alters the properties of the
system: at one extreme, setting c = ∞ makes it imitate a system with no repartitioning;
at the other, setting c = 0 makes it re-partition the data every time it sees a benefit.
5.2 Partitioning Tree Transformations
We now describe a set of transformation rules to explore the space of possible plans
when re-partitioning the data. For now, we restrict ourselves to the problem of
exploring alternate trees for a query with a single predicate of the form A ≤ p,
denoted as Ap. Later, in Section 5.4, we discuss how to handle other predicate forms and
how to add multiple predicates into the tree.
Given a query with predicate Ap, we attempt to assign Ap to one of the nodes
(partition on A with cutpoint p) in the partitioning tree. Note that we do not have
to consider partitioning on any other attribute, i.e., predicates on other attributes
would have already been considered when there was a query on that attribute. Our
approach is to consider partitioning transformations that are local, i.e., that do not
involve rewriting the entire tree. These local transformations are cheaper and amortize
the repartitioning effort over several queries. Below, we first describe the three
kinds of partitioning transformations considered during repartitioning; we then
present our algorithm for deciding when to apply these transformations in order to
improve query performance. Amoeba considers three kinds of basic partitioning
transformations:
(1) Swap. This is the primary data-reorganization operator in Amoeba. It replaces
an existing node in the partitioning tree with the incoming query predicate Ap. As we
repartition only the accessed portions of the data, we consider swapping only those
nodes whose left and right children are accessed by the incoming query. Applying
swap on an existing node involves reading both sub-branches, and restructuring all
partitions beneath the left subtree to contain data satisfying Ap and the right sub-
tree to contain data that does not satisfy Ap. Swaps can happen between different
attributes (Figure 5-1(a)), in which case both branches are completely rewritten in
the new tree. Swaps can also happen between two predicates of the same attribute
(Figure 5-1(b)), in which case the data moves from one branch to the other.
Figure 5-1: Node swap in the partitioning tree. (a) Different Attribute Swap; (b) Same Attribute Swap.
For example, if predicate Ap′ is ≤ 10 and predicate Ap is ≤ 5, then data moves
from the left branch to the right branch in Figure 5-1(b), i.e., the left branch is
completely rewritten while the right branch just has new data appended. Swaps serve the
dual purpose of un-partitioning an existing (less accessed) attribute while refining on
another (more accessed) attribute. Since both the swap attributes as well as their
predicates are driven by the incoming queries, they reduce the access times for the
incoming query predicates. Finally, note that it is cheaper to apply swaps at lower
levels in the partitioning tree since less data is rewritten. Applying them at higher
levels of the tree results in a much higher cost.
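The physical effect of a swap on the data beneath a node can be sketched as follows. This is a simplified illustration under assumptions of our own: records are modeled as Python dictionaries rather than the actual byte-buffer tuples, and we show only the re-splitting of the accessed data.

```python
def swap(records, attr, pivot):
    """Physically re-split the data under a swapped node by the new
    predicate attr <= pivot. Both child branches are rewritten: the left
    child receives the tuples that satisfy the predicate, the right child
    the rest."""
    left = [r for r in records if r[attr] <= pivot]
    right = [r for r in records if r[attr] > pivot]
    return left, right
```

This is why a swap is only considered when both children of the node are accessed by the incoming query: the re-split needs to read all tuples under the node anyway, so the repartitioning work piggybacks on the query's scan.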
(2) Pushup. This transformation is used to push a predicate as high up the tree as
possible. This can be done when both the left and the right child of a node contain
the incoming predicate, as a result of a previous swap, as shown in Figure 5-3. Notice
that this is a logical partitioning tree transformation, i.e., it only involves rearranging
the internal nodes without any modification of the contents of leaf nodes¹.
We check for a pushup transformation every time we perform a swap transformation.
The idea is to move important predicates (ones that have recently or frequently ap-
peared in the query sequence) progressively up the partitioning tree, from the leaves
right up to the root. This makes such important predicates less likely to be swapped
immediately, i.e., the tree is still robust, because swapping a node higher in the par-
titioning tree is much more expensive. Another advantage of node pushup is that it
causes a churn of the attributes assigned to higher nodes in the upfront partitioning.
When such a dormant node is pushed down, subsequent predicates can swap them
in a more incremental fashion, affecting fewer branches. Overall, node pushup allows
Amoeba to naturally cause less important attributes to be repartitioned more fre-
quently, thereby striking a balance between adaptivity and robustness. Note that if
¹In this case, the physical transformation, i.e., the swap, must have happened in one of the child subtrees.
[Figure 4-2: Building the partitioning tree over attributes = {A, B, C, D} with 8 buckets (depth = 3). Allocation remaining per attribute after each step:

           init    (i)     (ii)    (iii)   (iv)    (v)     (vi)    (vii)
alloc[A]   1.189  -0.811  -0.811  -0.811  -0.811  -0.811  -0.811  -0.811
alloc[B]   1.189   1.189   0.189   0.189   0.189   0.189  -0.311  -0.311
alloc[C]   1.189   1.189   1.189   0.189   0.189   0.189   0.189   0.189
alloc[D]   1.189   1.189   1.189   1.189   0.689   0.189   0.189  -0.311 ]

Figure 5-2: Illustrating adaptive partitioning when predicate A2 appears repeatedly.
Figure 5-3: Node pushup in the partitioning tree.
possible, a pushup always happens as there is no cost associated with doing it.
(3) Rotate. The rotate transformation rearranges two predicates on the same
attribute such that the more important predicate (recently accessed or frequently
appearing in the query sequence) appears higher up in the partitioning tree. Figure 5-4 shows
a rotate transformation involving predicates p and p′ on attribute A. The goal here
is to churn the partitioning tree such that predicates on less important attributes are
likely to be replaced first. Similar to the pushup transformation, rotate is a logical
transformation, i.e., it only rearranges the internal nodes of the partitioning tree and
always happens if possible.
Figure 5-4: Node rotation in partitioning tree.
These three partitioning tree transformations can be further combined to capture
a fairly general set of repartitioning scenarios. Figure 5-2 shows how, starting from
an initial partitioning tree, we first swap the D4 nodes with the incoming predicate A2 at the
lower level. Then, we push A2 up one level and finally rotate it with nodes A5
and C3.
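Since pushup and rotate are purely logical, both can be expressed as a handful of pointer rearrangements, as the sketch below illustrates. The `N` node class is hypothetical; the leaves stand in for entire subtrees, whose contents are untouched.

```python
class N:
    def __init__(self, attr, value, left, right):
        self.attr, self.value, self.left, self.right = attr, value, left, right

def pushup(x):
    """Both children of x split on the same (attr, pivot): hoist that
    predicate above x. Purely logical -- leaf contents are untouched."""
    l, r = x.left, x.right
    assert (l.attr, l.value) == (r.attr, r.value)
    satisfying = N(x.attr, x.value, l.left, r.left)    # tuples satisfying the predicate
    violating = N(x.attr, x.value, l.right, r.right)   # tuples violating it
    return N(l.attr, l.value, satisfying, violating)

def rotate(node):
    """node splits on A <= p and its left child on A <= p' (p' < p):
    bring the more important predicate p' to the top."""
    l = node.left
    assert l.attr == node.attr and l.value < node.value
    return N(l.attr, l.value, l.left,
             N(node.attr, node.value, l.right, node.right))
```

In both cases every tuple still ends up in the same leaf region it belonged to before, which is exactly why no data needs to be rewritten.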
The upfront partitioning algorithm generates a partitioning tree which is fairly
balanced, i.e., all the leaf nodes get almost the same number of tuples. Swapping nodes
based on predicates from incoming queries may lead to some leaves having more tuples
than the rest. This is not a problem: our cost model ensures that such skew
arises only when it is beneficial. For example, if many queries access A ≤ 0.75, where
Transformation | Notation                  | Cost (C)                                     | Benefit (B)
Swap           | Pswap(n, n′)              | Σ_{b ∈ Tn} c · n_b                           | Σ_{i=0}^{k} [Cost(Tn, q_i) − Cost(Tn′, q_i)]
Pushup         | Ppushup(n, nleft, nright) | C(Pptop(nleft)) + C(Pptop(nright))           | B(Pptop(nleft)) + B(Pptop(nright))
Rotate         | Protate(p, p′)            | C(Pptop(nleft|right)), for p′ on nleft|right | B(Pptop(nleft|right)), for p′ on nleft|right
None           | Pnone(n)                  | C(PBest(nleft)) + C(PBest(nright))           | B(PBest(nleft)) + B(PBest(nright))

Table 5.1: The cost and benefit estimates for the different partitioning tree transformations.
A is uniformly distributed in (0, 1), it is beneficial to add this node into the tree even
though it might lead to skew. In the next section, we describe how we generate
alternate partitioning trees using these transformations.
5.3 Divide-And-Conquer Repartitioning
Given a query with predicate Ap and a partitioning tree T, there are many different
combinations of transformations that need to be considered. Consider, for example, a
simple 7-node tree, consisting of a root node X with two children Y and Z. Each of
Y and Z has two leaf nodes below it. Assuming all the leaf nodes are accessed,
the set of alternatives to be considered is: (i) Swap Y with Ap; (ii) Swap Z with
the set of alternatives to be considered are: (i) Swap Y with Ap; (ii) Swap Z with
Ap; (iii) Swap both Y and Z, followed by pushup; (iv) Swap X with Ap; and (v) Do
nothing.
We propose a bottom-up approach to explore the space of all alternate repartitioning
trees. Observe that the data access costs over a partitioning tree Tn, rooted
at node n, could be broken down into the access costs over its subtrees, i.e.,

    Cost(Tn, qi) = Cost(Tnleft, qi) + Cost(Tnright, qi)

where Tnleft and Tnright are the subtrees rooted respectively at the left and the right
are subtrees rooted respectively at the left and the right
child of n. Thus, finding the best partitioning tree can be broken down into recursively
finding the best left and right subtrees at each level, and considering parent node
transformations only on top of the best child subtrees. For each transformation, we
consider the benefit and cost of that transformation and pick the one which has the
best benefit-to-cost ratio. Table 5.1 shows the cost and benefit estimates for different
transformations. For the swap transformation, denoted as Pswap(n, n′), we need to
recalculate the query costs. However, pushup and rotate transformations, denoted
Algorithm 3: getSubtreePlan
Input : Node node, Predicate pred
Output: Plan transformPlan
1  if isLeaf(node) then
2      return Pnone(node);
3  else
4      if isLeftRelevant(node, pred) then
5          leftPlan ← getSubtreePlan(node.lChild, pred);
6      if isRightRelevant(node, pred) then
7          rightPlan ← getSubtreePlan(node.rChild, pred);
       /* consider swap */
8      if leftPlan.fullyAccessed and rightPlan.fullyAccessed then
9          currentCost ← Σ_i Cost(node, q_i);
10         whatIfNode ← clone(node);
11         whatIfNode.predicate ← newPred;
12         swapNode(node, whatIfNode);
13         newCost ← Σ_i Cost(whatIfNode, q_i);
14         benefit ← currentCost − newCost;
15         if benefit > 0 then
16             updatePlanIfBetter(node, Pswap(node, whatIfNode));
       /* consider pushup */
17     if leftPlan.ptop and rightPlan.ptop then
18         updatePlanIfBetter(node, Ppushup(node, node.lChild, node.rChild));
       /* consider rotate */
19     if node.attribute == pred.attribute then
20         if leftPlan.ptop then
21             updatePlanIfBetter(node, Protate(node, node.lChild));
22         if rightPlan.ptop then
23             updatePlanIfBetter(node, Protate(node, node.rChild));
       /* consider doing nothing */
24     updatePlanIfBetter(node, Pnone(node));
25     return node.plan;
as Ppushup(n, nleft, nright) and Protate(p, p′) respectively, inherit the costs from the
child subtrees. We also consider applying none of the transformations at a given node,
denoted as Pnone(n). This divide-and-conquer approach helps to significantly reduce
the candidate set of modified partitioning trees.
Given a query with a single predicate Ap, we call getSubtreePlan(root, Ap). The
algorithm uses a divide-and-conquer approach to find the best plan for the given
predicate by recursively finding the best plan for each subtree, until we reach the
leaf. If the node is a leaf, we return a do-nothing plan (Lines 1-2). If not, we first
check if the left subtree is accessed; if yes, we recursively call getSubtreePlan to find
the best plan for the left subtree (Lines 4-5), and similarly for the right subtree
(Lines 6-7). Once we have the best plans for the left and right subtrees, we first
consider the swap rule (Lines 8-16). We only consider swapping if both subtrees are
fully accessed; otherwise, we would need to access additional data in order to create
the new partitioning.
We perform a what-if analysis to analyze the plan produced by the swap transfor-
mation. This is done by replacing the current node with a hypothetical node having
the incoming query predicate. We then recalculate the new bucket counts at the leaf
level of this new tree using the sample. We now estimate the total query cost with the
hypothetical node present. In case the what-if node reduces the query costs, i.e., it
has benefits, we update the transformation plan of the current node. The update
method (updatePlanIfBetter) checks whether the benefit-cost ratio of the new plan
is greater than that of the best plan so far. If so, we update the best plan. The
benefit-cost ratio is used to compare the alternative plans.
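A minimal sketch of the benefit-to-cost comparison inside updatePlanIfBetter might look as follows. The `Plan` fields are illustrative assumptions; the actual implementation tracks more state per plan.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    action: str       # e.g. "swap", "pushup", "rotate", "none"
    benefit: float    # estimated reduction in query-window cost
    cost: float       # estimated repartitioning cost (c * tuples rewritten)

@dataclass
class PlanHolder:
    plan: Plan = None

def ratio(plan):
    # Zero-cost (purely logical) plans with positive benefit dominate.
    if plan.cost == 0:
        return float("inf") if plan.benefit > 0 else 0.0
    return plan.benefit / plan.cost

def update_plan_if_better(node, candidate):
    # Keep whichever plan has the higher benefit-to-cost ratio.
    if node.plan is None or ratio(candidate) > ratio(node.plan):
        node.plan = candidate
```

Treating zero-cost logical transformations as having unbounded ratio matches the text's rule that a pushup or rotate always happens when possible.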
Next, we check whether a pushup transformation is possible (Lines 17-18). plan.ptop
indicates whether the plan results in the predicate p being at the root of the subtree. A
pushup transformation is possible only when both child nodes have their root
as Ap. Since pushup is only a logical transformation, we do not need to perform a
what-if analysis. Instead, we simply check whether the pushup results in a better
benefit-cost ratio in the updatePlanIfBetter method. Then, we consider rotating
the node by bringing either the left or the right child up (Lines 19-23). Again, this is
a logical transformation and therefore we only check the benefit-cost ratio. Finally,
we check whether no transformation is needed, i.e., we simply inherit the transformations
of the child nodes (Line 24). The algorithm finally returns the best plan in
Line 25.
A plan contains the action taken at the node, ptop to indicate whether Ap is the
root after the plan is applied, fullyAccessed to indicate whether the entire subtree is
accessed, and pointers to the plans for the left and right child nodes. For the sake of brevity,
Algorithm 3 does not explicitly show the update of these attributes. The algorithm
Algorithm 4: getBestPlan
Input : Tree tree, Predicate[] predicates
1  while predicates ≠ ∅ do
2      prevPlan ← tree.plan;
3      foreach p in predicates do
4          Plan newPlan ← getSubtreePlan(tree.root, p);
5          updatePlan(tree.root, newPlan);
6      if tree.plan ≠ prevPlan then
7          remove from predicates the newly inserted predicate;
8      else
9          break;
10 return tree.plan;
has a runtime complexity of O(QN log N), where N is the number of nodes in the tree
and Q is the number of queries in the query window.
5.4 Handling Multiple Predicates
So far, we have assumed that a predicate is always of the form A ≤ p. It gets inserted
into the tree as Ap and, on insertion, only the leaf nodes on the left side of the tree are
accessed. A > p is also inserted as Ap, with the right side of the tree being accessed.
For A ≥ p and A < p, let p′ be p − δ, where δ is the smallest change for that type.
We insert Ap′ into the tree. A = p is treated as a combination of A ≤ p and A > p′.
Now let us consider queries with multiple predicates. Consider a simple query with
two predicates Ap and Ap2. The brute-force approach is to consider choosing a set
of accessed non-terminal nodes to be replaced by Ap and then, for every such choice,
choosing a subset of the remaining nodes to be replaced by Ap2. Thus, the number of choices
grows exponentially with the number of predicates. Amoeba uses a greedy approach
to work around this exponential complexity, as described in Algorithm 4. For each
predicate in the query, we try to insert the predicate into the tree. We find the best
plan for that predicate by calling getSubtreePlan(root, pi) for the ith predicate (Lines
3-5). We take the best among the best plans obtained for the different predicates and
remove the corresponding predicate from the predicate set. We then try to insert the
remaining predicates into the best plan obtained so far. The algorithm stops when
either all predicates have been inserted or the tree stops changing (Lines 1 and
10). getBestPlan adds a multiplicative complexity of O(|P|²), where P is the set of
query predicates.
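The rewriting of the other predicate forms into A ≤ p cut points can be sketched as follows. This helper is illustrative rather than the actual implementation; δ is passed in explicitly (e.g. 1 for integers), and the returned "side" records which half of the split the query must read.

```python
def normalize(attr, op, p, delta):
    """Rewrite a predicate into (attr, pivot, side) cut points of the
    form attr <= pivot. delta is the smallest step for the attribute's
    type (e.g. 1 for integers)."""
    if op == "<=":
        return [(attr, p, "left")]
    if op == ">":
        return [(attr, p, "right")]
    if op == ">=":
        return [(attr, p - delta, "right")]   # A >= p  ==  A > p - delta
    if op == "<":
        return [(attr, p - delta, "left")]    # A < p   ==  A <= p - delta
    if op == "==":
        # A == p is the combination of A <= p and A > p - delta.
        return [(attr, p, "left"), (attr, p - delta, "right")]
    raise ValueError("unsupported operator: " + op)
```

Every predicate thus maps onto one or two candidate tree nodes of the canonical Ap form, which is what the insertion algorithms above operate on.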
Chapter 6
Implementation
We now describe our implementation of Amoeba on top of HDFS and Apache Spark.
Notice that our ideas could be implemented on any other block-based distributed
storage system. The Amoeba storage system has more than 12,000 lines of code
and comprises two modules: (i) a robust partitioning module that parses data
and writes out the initial partitions; and (ii) a query executor module that performs the
distributed adaptive repartitioning.
6.1 Initial Robust Partitioning
This module takes in the raw input files (e.g., CSV) and partitions them across all
attributes. For this, it first builds the robust partitioning tree and then creates the
data blocks based on the partitioning tree.
Tree Construction. Recall that our robust partitioning algorithm needs to find
the median value of the partitioning attribute at each node of the tree. As finding
successive medians on the entire dataset is expensive, we instead employ a sample of
the input to estimate the median. We use block-level sampling to generate a random
sample of the input.
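A block-level sample can be sketched as a two-level random selection: pick a random subset of blocks, then a random subset of records inside each chosen block. This is an illustrative sketch (blocks are in-memory lists here; the real implementation reads HDFS blocks), and the fractions and seed are assumptions.

```python
import random

def block_sample(blocks, block_fraction, record_fraction, seed=42):
    """Two-level sample: choose a random subset of blocks, then a random
    subset of records inside each chosen block. Cheaper than a uniform
    record-level sample because only the chosen blocks are read."""
    rng = random.Random(seed)
    chosen = [b for b in blocks if rng.random() < block_fraction]
    sample = []
    for block in chosen:
        sample.extend(r for r in block if rng.random() < record_fraction)
    return sample
```

The trade-off is the usual one for block-level sampling: it saves I/O but can bias the sample when values are clustered within blocks, which is acceptable here since the sample is only used to estimate medians.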
Amoeba uses a lazy parser for the input sample data. The parser
detects the record boundaries, the attribute boundaries inside a record, and the data
types. Each value in a record is actually parsed only when it is accessed, i.e., lazily.
While Amoeba partitions on all attributes by default, developers can also specify
particular subsets of the attributes to partition on, in case they have some knowledge
of the query workload. In such a situation, only the partitioned attributes need to be
parsed. Finally, the lazy parser avoids copying costs by returning tuples as views on
the input byte buffer. The only time copying happens is when a tuple is written to
an output byte buffer.
The sampled records and set of attributes are fed to the tree builder (Algorithm 2),
which produces the partitioning tree as the output. The tree builder successively sorts
the sample on different attributes in order to find the median at different nodes in the
tree. However, as each sort happens on a different portion of the sample, we sort on
different views of the same underlying data, i.e., the samples are not copied each time
they are sorted. When constructing the partitioning tree on a cluster of machines, we
collect the samples on each machine independently and in parallel. Later, we combine
the samples (via HDFS) and run the tree builder on a single machine to produce a
single partitioning tree across the entire dataset. The index is serialized and stored
as a file on HDFS.
Data Blocking. The second phase takes the partitioning tree and the input files
as input and creates the data blocks. During this phase, we scan the input files and, for
each tuple, use the partitioning tree to find the leaf node it lands in. We use a
buffered blocker to collect the tuples belonging to each partition (leaf node) separately
and buffer them before flushing to the underlying file system, i.e., HDFS in our case.
Our current implementation creates a different HDFS file for each partition in the
dataset. However, future work could also integrate Amoeba deeply within HDFS.
Given that we do not assume any workload knowledge, Amoeba simply de-clusters
the partitioned blocks randomly across the cluster of machines, i.e., we use the default
random data placement policy of HDFS. This is reasonable because fetching relevant
data across the network is fine if we can skip more expensive disk reads, as noted in
the introduction.
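The buffered blocker can be sketched as follows. This is a simplified in-memory illustration: `route` stands in for the partitioning-tree lookup, `flush` for the HDFS write path, and the tiny buffer size is chosen only for the example.

```python
class BufferedBlocker:
    """Routes each tuple to its leaf bucket and buffers tuples per
    partition, flushing a buffer to the file system once it fills up."""

    def __init__(self, route, flush, buffer_size=3):
        self.route = route            # tuple -> partition id (tree lookup)
        self.flush = flush            # (partition id, tuples) -> write out
        self.buffer_size = buffer_size
        self.buffers = {}

    def add(self, tup):
        pid = self.route(tup)
        buf = self.buffers.setdefault(pid, [])
        buf.append(tup)
        if len(buf) >= self.buffer_size:
            self.flush(pid, buf)
            self.buffers[pid] = []

    def close(self):
        # Flush any partially filled buffers at the end of the scan.
        for pid, buf in self.buffers.items():
            if buf:
                self.flush(pid, buf)
        self.buffers.clear()
```

Buffering per partition turns many small per-tuple writes into a few large sequential ones, which is what makes writing one file per partition affordable.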
Amoeba runs the loading and partitioning in parallel on each machine, using
the same partitioning tree (made available to all machines via HDFS). We employ
partition-level distributed exclusive locks (via Zookeeper) to ensure that buffered
writers on different machines don’t write to the same file at the same time. With
the current main memory and CPU capacities, having a file per partition (with the
number of partitions on the order of ten thousand) does not lead to any observable
slowdown [38].
6.2 Query Execution
There are two main parts involved in query execution: (i) creating an execution plan,
which may involve re-partitioning some or all of the data accessed by the query,
and (ii) actually executing the plan.
Optimizer. Queries submitted to Amoeba first go to the optimizer. The optimizer
is responsible for generating an execution plan for the given query. It reads the current
tree file from HDFS and uses Algorithm 4 to check whether it is feasible to improve the
current partitioning tree. Note that while creating the plan, we also end up filtering
out partitions which do not match any of the query predicates. For example, if there
is a node A5 in the tree and one of the predicates in the query is A ≤ 4, then we do not
have to scan any of the partitions in the right subtree of that node. The plan returned is
used to create a new index tree, which is written out to HDFS. From the plan, we now get
two sets of buckets: 1) buckets that will just be scanned, and 2) buckets that will be
re-partitioned to generate a new set of buckets.
Plan Executor. Amoeba uses Apache Spark for executing the queries. We construct
a Spark job from the plan returned by the optimizer. We split each of the
two sets of buckets into smaller sets called tasks. A task contains a set of buckets
such that the sum of the sizes of the buckets is not more than 4 GB. Each task reads
the blocks from HDFS in bulk and iterates over the tuples in main-memory. Tasks
created from set 1 run with a scan iterator which simply reads a tuple at a time from
the buckets and returns the tuple if it matches the predicates in the query.
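The grouping of buckets into tasks can be sketched as a greedy packing pass. This is an illustrative sketch: bucket names and sizes (in bytes) are hypothetical, and the real executor also distinguishes scan tasks from repartitioning tasks.

```python
def make_tasks(bucket_sizes, max_task_bytes=4 << 30):
    """Greedily pack (bucket, size) pairs into tasks so that the total
    size of the buckets in a task stays within the limit (4 GB here).
    A single oversized bucket still gets a task of its own."""
    tasks, current, current_size = [], [], 0
    for bucket, size in bucket_sizes:
        if current and current_size + size > max_task_bytes:
            tasks.append(current)
            current, current_size = [], 0
        current.append(bucket)
        current_size += size
    if current:
        tasks.append(current)
    return tasks
```

Capping the per-task input keeps each task's working set comfortably in a worker's main memory while still giving the scheduler enough tasks to balance across the cluster.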
Tasks created from set 2 run with a distributed repartitioning iterator. The iterator
reads the tree from HDFS. For each tuple, the iterator looks up the new
partitioning tree to find its new partition id, in addition to checking whether it matches
the query predicates. It then re-clusters the data in main memory according to the new
partitioning tree. Once the buffers are filled, the repartitioner flushes the new parti-
tions into physical files on HDFS. When the optimizer decides to repartition a large
subtree, the repartitioning work may end up being distributed across several tasks.
As a result, the workers need to coordinate while flushing the new partitions, i.e., so
that writes are done atomically. Again, we employ partition-level distributed exclu-
sive locks (via Zookeeper) for this synchronization. As a result of this synchronized
writing, each partition resides in a single file across the cluster.
Tasks are executed independently by the Spark job manager across the cluster of
machines and the result is exposed to the user as a Spark RDD. The user can use these
RDDs to do more analysis using the standard Spark APIs, e.g., run an aggregation.
Chapter 7
Discussion
In this chapter, we briefly discuss ideas for improving the performance of the system.
7.1 Leveraging Replication
Distributed storage systems replicate data for fault-tolerance, e.g., 3x replication in
HDFS. Such replication mechanisms first partition the dataset into blocks and then
replicate each block multiple times. Instead, we can first replicate the entire dataset
and then partition each replica using a different partitioning tree.
Figure 7-1: Heterogeneous Replication.
For example, the two replicas could be partitioned on attributes {A, C, D} and {B, E, F},
as in Figure 7-1. While the system is still fault-tolerant (same replication), recovery becomes
slower because we need to read several or all replica blocks in case of a block failure.
Essentially, we sacrifice fast recovery time for improved ad-hoc query performance.
To recap, the number of attributes in the partitioning tree, for a dataset of size D,
minimum partition size P, n-way partitioning over each attribute, and c as the minimum
fraction of the data partitioned by each attribute, is given as:

    a = (1/c) · ⌊log_n(D/P)⌋

Having r replicas allows us to have a · r attributes in total, with a attributes per replica,
or to increase n for each attribute. Both of these lead to improved query performance
due to greater partition pruning. Currently, the system simply splits the attribute set
into disjoint, equal-sized sets of attributes and builds a partitioning tree independently on
each set. There are interesting open questions, such as whether we can group attributes in a
non-random way so as to cluster attributes that are accessed together, and how to adapt
across partitioning trees; we plan to explore these as future work.
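Plugging example values into the formula gives a feel for the numbers involved; the dataset and parameter values below are hypothetical, not taken from the thesis.

```python
import math

def max_attributes(D, P, n, c):
    # a = (1/c) * floor(log_n(D / P)); the small epsilon guards against
    # floating-point results landing just below an integer.
    return (1.0 / c) * math.floor(math.log(D / P, n) + 1e-9)

# Example: a 512 GB dataset, 64 MB minimum partitions, binary splits (n = 2),
# and each attribute partitioning at least half the data (c = 0.5):
a = max_attributes(D=512 * 2**30, P=64 * 2**20, n=2, c=0.5)
```

With these values, D/P = 8192 = 2^13, so a single tree can cover 26 attributes, and r replicas would multiply that budget by r.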
7.2 Handling Joins
In order to accelerate join performance, distributed database systems tend to
co-partition data. Hadoop++ [39] and CoHadoop [40] proposed this scheme of
co-partitioning datasets in HDFS to speed up joins.
Co-partitioning can be achieved in Amoeba as well. The user can reserve the
d top levels in the partitioning tree for the join attribute on which he/she wants to
co-partition. This creates 2d partitions of the join attribute’s domain. The adaptive
query executor would incrementally re-partition to improve query performance based
on the filter predicates but it would not touch the topmost d levels which have been
reserved for co-partitioning.
Finally, the dataset may have multiple join attributes, each of which joins to a
different dataset. Two tables can be co-partitioned; however, a table cannot be
co-partitioned against multiple tables on different attributes. The default approach used
by systems today is the shuffle join, which is much more expensive than a
co-partitioned join as it requires a full network shuffle of the dataset. In Amoeba,
since the dataset is partially partitioned (i.e., not fully co-partitioned), it would be
possible to treat joins as first-class queries, and we could consider adaptively improving
the partitioning on the join attribute incrementally as well. It would converge to a
fully co-partitioned layout if only one join is frequently accessed. We plan to explore
this in the future.
Chapter 8
Evaluation
In this chapter, we report experimental results on the Amoeba storage system.
The experiments are divided into three parts: (i) we examine the benefits
and overheads of upfront data partitioning and compare it against a standard space-
partitioning tree, (ii) we study the behaviour of adaptive repartitioning under
different workload patterns via micro-benchmarks, and (iii) we finally validate the
Amoeba system on a real-world workload from a local startup company.
Setup. Our testbed consists of a cluster of 10 nodes. Each node has 32 2.07 GHz
Xeon cores, running on Ubuntu 12.04, 256 GB main-memory, and 11 TB of disk
storage. We generate the dataset in parallel on each node, so all data loading into
HDFS happens in parallel. The Amoeba storage system runs on top of current
stable Hadoop 2.6.0 and uses Zookeeper 3.4.6 for synchronization. We run queries
using Spark 1.3.1, with Spark programs running on Java 7. All experiments are run
with cold caches.
8.1 Upfront Partitioning Performance
We now study the impact of upfront partitioning. We analyze three aspects:
the benefit of doing upfront data partitioning, the overhead it adds, and finally
how it compares against the k-d tree [37], a standard space-partitioning tree.
We use the lineitem table from the TPC-H benchmark with scale factor 1000.
The table contains approximately 6 billion rows and is 760 GB in size. The table has
16 columns. We use the TPC-H data generator to generate 1/10th of the data on
each machine, hence the data is uniformly distributed across all the machines. After
the upfront partitioning is completed, the data resides in HDFS 3-way replicated,
occupying 2.3 TB of space.
Figure 8-1: Ad-hoc query runtimes for different attributes of TPC-H lineitem.
Ad-hoc Query Processing. We first study ad-hoc query performance. We run
range queries of the following form on all the attributes: SELECT * FROM lineitem
WHERE start < A ≤ end; start and end are chosen randomly in the domain while
ensuring a 5% selectivity. Given that Amoeba distributes the partitioning effort
over all attributes, ad-hoc queries are expected to show a benefit right from the
beginning. Figure 8-1 shows the results. Two observations stand out: (i) Amoeba
(labeled “Robust Partitioning”) bridges the gap between the standard and the ideal
runtimes, and (ii) all attributes benefit similarly with Amoeba. Overall, upfront
partitioning gives a 44% average improvement over a full scan with no partitioning.
Figure 8-1 also shows the runtimes for per-replica robust
partitioning, i.e., when we use a different partitioning tree for each replica. We can
see that per-replica robust partitioning improves the runtimes even further with an
average improvement of 65% over full scan. However, note that attributes such as
returnflag and linestatus still have the same runtime. This is because these are
very low cardinality attributes that cannot be partitioned further. Thus, Amoeba
indeed provides improved query performance over ad hoc workloads without being
given a workload upfront.
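The range predicates used above can be generated as follows; the uniform-domain assumption and the attribute name are ours, purely for illustration.

```python
import random

def range_predicate(domain_min, domain_max, selectivity=0.05, rng=random):
    """Pick a (start, end] range covering `selectivity` of a uniform
    attribute domain. Real lineitem attribute domains are not all
    uniform, so this only approximates the 5% selectivity target."""
    width = (domain_max - domain_min) * selectivity
    start = rng.uniform(domain_min, domain_max - width)
    return start, start + width

start, end = range_predicate(1.0, 51.0)  # e.g. a quantity-like domain
query = (f"SELECT * FROM lineitem "
         f"WHERE {start:.2f} < l_quantity AND l_quantity <= {end:.2f}")
```

Since start is drawn uniformly, repeated queries probe different regions of the domain, which is what makes the workload genuinely ad-hoc.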
Partitioning overheads. Given that partitioning is an expensive operation in
database analytics, we now study the partitioning overheads in Amoeba. We first
Figure 8-2: Comparing the upload time in Amoeba with HDFS
look at the overhead of upfront partitioning, i.e., we compare the data upload time
with Amoeba against the standard data upload time in HDFS. For a fair compar-
ison, Amoeba preserves the same format (row, uncompressed) and layout (text) as
in standard HDFS, i.e., it only differs in how the data is partitioned.
Figure 8-2 shows the data upload costs of Amoeba and standard HDFS for TPC-
H lineitem table (scale factor 1000). From the figure, we can see that the upload time
of Amoeba is 2.6 times that of HDFS. It rises to 3.5 times with per-replica
robust partitioning. This is reasonable given that the entire dataset needs to be clustered
along all 16 dimensions of the lineitem table. Furthermore, the overhead is similar to
other workload-specific data preparation, such as indexing and co-partitioning [41, 39].
Comparison with k-d trees.
Figure 8-3: Comparing performance of the upfront partitioning tree vs. k-d tree
We analyze our upfront partitioning tree in comparison to k-d trees. Specifically,
we implemented a k-d tree which partitions on attributes in a round robin fashion, one
attribute at a time, until the partition size falls below the minimum size. This emu-
lates the standard way of performing data placement in a conventional k-d tree [37].
In contrast, recall that our robust partitioning algorithm places the attributes such
that all attributes have similar partitioning benefits. To measure the robustness of
the tree, we look at the variation in the values of the allocation metric for different
attributes. For our purposes, a tree having less variation in allocation is more robust.
The top of Figure 8-3 shows the allocation for each attribute of TPC-H lineitem. We
can see that the k-d tree has higher allocation for the first seven attributes, while
the remaining nine attributes are not partitioned at all. Robust partitioning, on the
other hand, distributes the allocation more evenly across all attributes. The bottom
of Figure 8-3 shows the coefficient of variation in allocation over different TPC-H scale
factors. The gap between the k-d tree and our approach widens with increasing scale
factor. As expected, we did observe that the k-d tree performs slightly better on the
attributes it is partitioned on, but falls back to full scans for the ones it is not partitioned on.
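The round-robin k-d placement we compare against can be sketched as follows; this is our own in-memory illustration, with partition sizes measured in tuples rather than bytes.

```python
def kd_round_robin(rows, attrs, min_size, depth=0):
    """Recursively split `rows` (a list of tuples) at the median of one
    attribute, cycling through `attrs` in round-robin order, until a
    partition falls below `min_size`. Low-cardinality attributes can
    yield a degenerate split, in which case the partition is kept whole."""
    if len(rows) <= min_size:
        return [rows]
    a = attrs[depth % len(attrs)]
    cut = sorted(r[a] for r in rows)[len(rows) // 2]
    left = [r for r in rows if r[a] <= cut]
    right = [r for r in rows if r[a] > cut]
    if not left or not right:
        return [rows]
    return (kd_round_robin(left, attrs, min_size, depth + 1)
            + kd_round_robin(right, attrs, min_size, depth + 1))
```

Because deep trees exhaust the size budget before the attribute list wraps around, attributes late in the round-robin order never receive a split, which is exactly the skewed allocation seen in Figure 8-3.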
8.2 Micro-benchmarks
We now study the benefits of Amoeba over different workload patterns on the
lineitem dataset (same as in the previous subsection). We run range queries of the fol-
lowing form on all the attributes: SELECT * FROM lineitem WHERE start < A ≤ end;.
We choose start and end such that the selectivity of each query is 5% and select
A uniformly and at random from all attributes. We compare against standard (full
scan) and the ideal (exactly 5% of the data is accessed) runtimes. Notice that this
ideal runtime could only be achieved for a single attribute if we had partitioned the
dataset perfectly on that attribute.
To evaluate how well Amoeba adapts the data to changes in the workload, we
consider two types of workload changes: (i) a shifting workload, which gradually
transitions from one attribute to another, and (ii) a switching workload, which
switches immediately from one workload to another.
Figure 8-4 reports the results for two sets of 20 queries each. The first query set
starts with predicates on discount and shifts towards predicates on shipdate, i.e.,
Figure 8-4: Query runtimes for changing query attributes on TPC-H lineitem.
the probability of predicating on the first (second) attribute decreases (increases) by
1/20th after each query. We can see that the first query set involves a repartitioning
first at query 2 (on discount) and again at query 10 (on shipdate, once the workload
has shifted sufficiently). The second query set starts with 100% of queries predicating
on shipdate and shifts towards 100% of queries predicating on receiptdate. Attribute
shipdate is further refined at query 21, and finally the system repartitions on
receiptdate at query 33, when the workload has again changed sufficiently. The spikes
annotated in the graph involve major reorganization; each of them represents almost
50% of the data being re-partitioned. Some queries also trigger re-partitioning of
small fractions of the data, which shows up as small spikes. The query runtime
approaches the ideal runtime, demonstrating the ability of our system to adapt to
changes in the workload.
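The shifting workload described above can be sketched as a simple generator; the function and attribute names are hypothetical, used only to illustrate the 1/20th-per-query probability shift.

```python
import random

def shifting_workload(attr_a, attr_b, n=20, rng=random):
    """Query i predicates on attr_a with probability 1 - i/n and on
    attr_b otherwise, so the workload gradually shifts from a to b."""
    return [attr_a if rng.random() < 1.0 - i / n else attr_b
            for i in range(n)]

seq = shifting_workload("discount", "shipdate")
```

Early in the sequence almost every query hits the first attribute, so the optimizer refines it; only once the mix tips toward the second attribute does a repartitioning on it pay off, matching the two repartitioning points seen in Figure 8-4(a).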
Figure 8-4(b) shows a switching workload. Here, we start with an attribute
set (quantity, extendedprice, discount) then switch to a second attribute set
(shipdate, receiptdate, shipped) after 20 queries. We again switch back to the
first attribute set after 40 queries. Two interesting things stand out in this experiment:
(i) the system adapts to the workload by quickly repartitioning in each query
set, and (ii) the repartitioning effort in subsequent query sets is lower because of the
larger query history. Also, note that the first attribute set appears twice in the sequence,
and the repartitioning effort is significantly lower the second time. Thus, the Amoeba
storage system can efficiently adapt to changing workloads across sets of attributes.
We now study how Amoeba adapts and eventually converges when a particular
workload is seen more often. In this experiment, we start from an initial robust
partitioning and run queries on a given attribute over and over again. We consider
three workload patterns: (i) random, ad-hoc query predicates with fixed selectivity,
Figure 8-5: Query runtimes for changing predicates on the same attribute of TPC-H lineitem.
(ii) cyclic, with a set of query predicates repeating over and over again, and (iii) drill-
down, where successive predicates narrow down the same data.
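The three access patterns can be sketched as simple predicate generators; these are our own illustrations, reusing the 5% selectivity from the earlier experiments, and the exact narrowing rule for drill-down is an assumption.

```python
import random

def random_pattern(lo, hi, n, sel=0.05, rng=random):
    """Ad-hoc predicates: an independent random 5%-wide range each time."""
    w = (hi - lo) * sel
    return [(s, s + w) for s in (rng.uniform(lo, hi - w) for _ in range(n))]

def cyclic_pattern(lo, hi, n, cycle=5, sel=0.05):
    """A fixed set of `cycle` predicates repeating over and over again."""
    w = (hi - lo) * sel
    base = [(lo + i * w, lo + (i + 1) * w) for i in range(cycle)]
    return [base[i % cycle] for i in range(n)]

def drill_down_pattern(lo, hi, n):
    """Each successive predicate halves the previous range."""
    out = []
    for _ in range(n):
        out.append((lo, hi))
        hi = (lo + hi) / 2.0
    return out
```

Cyclic and drill-down patterns revisit the same regions of the domain, so a few repartitioning steps pay off over the whole sequence, while the random pattern keeps touching fresh regions.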
Figure 8-5 shows the results. We see that while the random query sequence is
slower to converge (it would need many more queries), the cyclic and drill-down query
sequences converge quite fast. In fact, the query times are almost 4 times faster than
a full scan after 13 queries in the cyclic workload, and 10 times faster than a full scan
after 9 queries in the drill-down workload. Furthermore, we see that the system stops
repartitioning the data in these workloads (except in the random case) and
reaches a stable state, confirming that Amoeba converges when a fixed
workload is observed over and over again.
Figure 8-6: Cumulative Optimizer Runtime Across 100 queries
(a) Comparison with k-d Tree (b) Comparison with OptimalFigure 8-7: Cumulative Repartitioning Cost
Apart from the upload overhead, Amoeba also incurs optimization and repartitioning overhead. Figure 8-6 shows the optimizer runtime over a sequence of 100 queries. We show two bars: one when the repartitioning decisions are actually carried out, and one when the data is never actually repartitioned. The optimization time per query is very small (about 5s) compared to the actual runtime of the queries. Finally, Figure 8-7 shows the accumulated repartitioning cost in terms of the number of tuples that are repartitioned. We observe that the repartitioning is largely incremental after the first few queries.
8.3 Amoeba on Real Workload
We obtained data from a Boston-based company that captures analytics about users' driving trips. Each tuple in the dataset represents a trip taken by a user, with the start time, end time, and a number of statistics about the journey, covering various attributes of the driver's speed and driving style. The data consists of a single large fact table with 148 columns. To protect users' privacy, we used statistics provided by the company about the data distributions to generate a synthetic version of the data according to the actual schema. The total size of the data is 705GB. We also obtained a trace of ad-hoc analytics queries from the company (these queries were generated by data analysts performing one-off exploratory queries on the data). The trace consists of 105 queries, run between 04/19/2015 and 04/21/2015, on the trip data. The queries sub-select different portions of the data, e.g., based on trip time range, distances, and customer ids, before producing aggregates and other statistics.
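To make the shape of these trace queries concrete, the following sketch shows the sub-select-then-aggregate pattern they follow. All field names and values here are purely illustrative, not the company's actual schema (the real fact table has 148 columns).

```python
# Illustrative sketch of an ad-hoc trace query: filter trips by a start-time
# range and a minimum distance, then aggregate. Field names and values are
# hypothetical, not drawn from the real dataset.
trips = [
    {"start_time": 1429400000, "distance_km": 12.5, "max_speed_kmh": 88.0},
    {"start_time": 1429500000, "distance_km": 3.2,  "max_speed_kmh": 54.0},
    {"start_time": 1429600000, "distance_km": 47.1, "max_speed_kmh": 112.0},
]

# Sub-select a time range and long trips, then compute an aggregate.
selected = [t for t in trips
            if 1429450000 <= t["start_time"] < 1429650000
            and t["distance_km"] > 10.0]
avg_max_speed = sum(t["max_speed_kmh"] for t in selected) / len(selected)
print(len(selected), avg_max_speed)  # 1 112.0
```

Queries like this touch only a thin slice of the data, which is exactly the access pattern that adaptive partitioning can exploit for data skipping.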
Total time taken (in hrs):

    Amoeba (c = 2)                 3.1
    Amoeba (c = 4)                 2.79
    Spark (with range partition)   4.94
    Spark                          9.4

Figure 8-8: Total runtimes of the different approaches
We compared the performance of Amoeba against two baselines: 1) Spark without any modifications, and 2) Spark with the data already partitioned on upload time, which is the most frequently accessed attribute (accessed in 78% of the queries). The data was split equally across the machines and loaded in parallel using the upfront data partitioner. The queries collected were then run in order.
Figure 8-8 shows the total query runtime for running the 105 queries using the different approaches. To remind the reader, c is the write multiplier (described in Section 5.1). c can be calibrated by measuring the runtime increase due to re-partitioning. For our setup, c is 4, i.e., writing out data (while re-partitioning) is four times more expensive than just scanning it. We observe that range partitioning on upload time reduces the total query runtime by 1.9x. Amoeba initially does worse because it has no knowledge of the workload; by the end, however, it is 1.8x faster than Spark with partitioned data and 3.4x faster than unmodified Spark. The c = 2 setting is more reactive in the sense that it adapts to the query workload faster. Nevertheless, it does slightly worse than the c = 4 setting because it introduces changes to the tree too eagerly: since the workload is ad-hoc, some patterns are one-off, and re-partitioning done to improve them ends up being wasted effort. Finally, we observed that the total runtime of the last 60 of the 105 queries on Amoeba is 19x lower than a full scan and 11x lower than using the range-partitioned data.
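The headline speedups follow directly from the total runtimes in Figure 8-8; a quick sanity check of the arithmetic:

```python
# Total runtimes in hours, read off Figure 8-8.
runtimes = {
    "Spark": 9.4,
    "Spark (range partition)": 4.94,
    "Amoeba (c = 4)": 2.79,
}

def speedup(baseline, system):
    """How many times faster `system` is than `baseline`."""
    return runtimes[baseline] / runtimes[system]

print(round(speedup("Spark", "Spark (range partition)"), 1))       # 1.9
print(round(speedup("Spark (range partition)", "Amoeba (c = 4)"), 1))  # 1.8
print(round(speedup("Spark", "Amoeba (c = 4)"), 1))                # 3.4
```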
Thus, we see that Amoeba is useful for real-world ad-hoc querying workloads,
where we need to quickly access different subsets of data for further analysis.
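The write multiplier c used above can be calibrated with a simple measurement: time a plain scan of the data, time a scan that also rewrites the data, and take the ratio of the write portion to the scan. A minimal sketch of this calibration; the timings below are hypothetical, not measurements from our cluster:

```python
def write_multiplier(scan_seconds, rewrite_seconds):
    """c = cost of writing data out during re-partitioning,
    relative to the cost of simply scanning it."""
    return rewrite_seconds / scan_seconds

# Hypothetical calibration run: a plain scan takes 120 s; the same scan
# with re-partitioning (scan + write) takes 600 s, so writing alone
# accounts for 480 s.
scan_s = 120.0
scan_and_write_s = 600.0
c = write_multiplier(scan_s, scan_and_write_s - scan_s)
print(c)  # 4.0
```

With these illustrative numbers the calibration yields c = 4, the setting that performed best on this workload.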
Chapter 9
Conclusion
In this thesis, we presented Amoeba, an adaptive distributed storage manager targeted at ad-hoc query workloads. Amoeba does not require developers to provide an upfront query workload, which is often not known a priori in modern exploratory data analyses. Rather, Amoeba distributes the partitioning effort over all attributes
in the dataset and later refines the partitioning based on how the data is actually
being used. We presented techniques to partition a dataset over several attributes, de-
scribed the robust partitioning algorithm to create the upfront partitioning, showed
transformations to adaptively change the upfront partitioning tree, and detailed a
divide-and-conquer algorithm to pick the best repartitioned tree. We also described
the Amoeba storage system built on top of HDFS and showed experimental results
with Spark. Our results on both real and synthetic workloads show that Amoeba
provides improved up-front query performance, improves the query performance even
further as the queries arrive, and eventually converges to a steady state when a par-
ticular workload repeats over and over again.
Bibliography
[1] Sanjay Agrawal, Vivek Narasayya, and Beverly Yang. Integrating Vertical and Horizontal Partitioning into Automated Physical Database Design. SIGMOD, 2004.
[2] Daniel C. Zilio, Jun Rao, Sam Lightstone, Guy Lohman, Adam Storm, Christian Garcia-Arellano, and Scott Fadden. DB2 Design Advisor: Integrated Automatic Physical Database Design. VLDB, 2004.
[3] Rimma Nehme and Nicolas Bruno. Automated Partitioning Design in Parallel Database Systems. SIGMOD, 2011.
[4] Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. Schism: a Workload-Driven Approach to Database Replication and Partitioning. VLDB, 2010.
[5] Abdul Quamar, K. Ashwin Kumar, and Amol Deshpande. SWORD: Scalable Workload-Aware Data Placement for Transactional Workloads. EDBT, 2013.
[6] Andrew Pavlo, Carlo Curino, and Stan Zdonik. Skew-Aware Automatic Database Partitioning in Shared-Nothing, Parallel OLTP Systems. SIGMOD, 2012.
[7] Liwen Sun, Michael J. Franklin, Sanjay Krishnan, and Reynold S. Xin. Fine-grained Partitioning for Aggressive Data Skipping. SIGMOD, 2014.
[8] Hadoop Apache Project, http://hadoop.apache.org.
[9] Apache Spark, https://spark.apache.org.
[10] Stratos Idreos, Martin Kersten, and Stefan Manegold. Database Cracking. In CIDR, 2007.
[11] Stratos Idreos, Martin Kersten, and Stefan Manegold. Self-organizing Tuple Reconstruction In Column-stores. In SIGMOD, 2009.
[12] Martin Grund, Jens Kruger, Hasso Plattner, Alexander Zeier, Philippe Cudre-Mauroux, and Samuel Madden. HYRISE: A Main Memory Hybrid Storage Engine. PVLDB, 4(2):105–116, 2010.
[13] Alekh Jindal, Jorge-Arnulfo Quiane-Ruiz, and Jens Dittrich. Trojan Data Layouts: Right Shoes for a Running Elephant. In ACM SOCC, pages 21:1–21:14, 2011.
[14] Alekh Jindal and Jens Dittrich. Relax and let the database do the partitioning online. In BIRTE, pages 65–80, 2011.
[15] S. Ghandeharizadeh and D. J. DeWitt. MAGIC: A Multiattribute Declustering Mechanism for Multiprocessor Database Machines. IEEE Trans. Parallel Distrib. Syst., 1994.
[16] Oracle Subpartitioning, http://docs.oracle.com/cd/E17952_01/refman-5.5-en/partitioning-subpartitions.html.
[17] MySQL Subpartitioning, http://dev.mysql.com/doc/refman/5.1/en/partitioning-subpartitions.html.
[18] Apache Hive, https://hive.apache.org.
[19] Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. SCOPE: parallel databases meet MapReduce. The VLDB Journal, 21(5):611–636, 2012.
[20] Shark, http://shark.cs.berkeley.edu.
[21] Impala, http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html.
[22] Goetz Graefe. Sorting And Indexing With Partitioned B-Trees. CIDR, 2003.
[23] Young-Kyoon Suh, Ahmad Ghazal, Alain Crolotte, and Pekka Kostamaa. A New Tool for Multi-level Partitioning in Teradata. CIKM, 2012.
[24] Daniel Dominic Sleator and Robert Endre Tarjan. Self-adjusting Binary Search Trees. Journal of the ACM, 32(3):652–686, 1985.
[25] Ahmed Eldawy and Mohamed F. Mokbel. A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data. In VLDB, 2013.
[26] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, and Amr El Abbadi. MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. In MDM, 2011.
[27] Jinbao Wang, Sai Wu, Hong Gao, Jianzhong Li, and Beng Chin Ooi. Indexing Multi-dimensional Data in a Cloud System. In SIGMOD, 2010.
[28] Philippe Cudre-Mauroux, Eugene Wu, and Samuel Madden. TrajStore: An Adaptive Storage System for Very Large Trajectory Data Sets. In ICDE, 2010.
[29] Martin Kersten and Stefan Manegold. Cracking the Database Store. In CIDR, 2005.
[30] Stratos Idreos, Martin Kersten, and Stefan Manegold. Updating a Cracked Database. In SIGMOD, 2007.
[31] Stratos Idreos, Stefan Manegold, Harumi Kuno, and Goetz Graefe. Merging What's Cracked, Cracking What's Merged: Adaptive Indexing in Main-Memory Column-Stores. In PVLDB, 2011.
[32] Goetz Graefe, Felix Halim, Stratos Idreos, Harumi Kuno, and Stefan Manegold. Concurrency Control for Adaptive Indexing. In PVLDB, 2012.
[33] Felix Halim, Stratos Idreos, Panagiotis Karras, and Roland H. C. Yap. Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores. In PVLDB, 2012.
[34] M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. Gray, P. P. Griffiths, W. F. King, R. A. Lorie, P. R. McJones, J. W. Mehl, G. R. Putzolu, I. L. Traiger, B. W. Wade, and V. Watson. System R: Relational Approach to Database Management. ACM Trans. Database Syst., 1976.
[35] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. SIGOPS Oper. Syst. Rev., 37(5):29–43, 2003.
[36] Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, and Rafael Pasquin. Incoop: MapReduce for Incremental Computations. In SoCC, 2011.
[37] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 1975.
[38] Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, and Jorg Schad. Only Aggressive Elephants are Fast Elephants. PVLDB, 5(11):1591–1602, 2012.
[39] Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, and Jorg Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). PVLDB, 2010.
[40] Mohamed Y. Eltabakh, Yuanyuan Tian, Fatma Ozcan, Rainer Gemulla, Aljoscha Krettek, and John McPherson. CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop. VLDB, 2011.
[41] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD, 2009.
[42] Carsten Binnig, Ugur Cetintemel, Andrew Crotty, Alex Galakatos, Tim Kraska, Erfan Zamanian, and Stan Zdonik. The End of Slow Networks: It's Time for a Redesign. arXiv:1504.01048 [cs.DB], April 2015.
[43] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Disk-locality in Datacenter Computing Considered Irrelevant. In HotOS, 2013.
Appendices
Appendix A
Fast Remote Reads
One effect of Amoeba is to de-cluster the data, as values from a specific range of a
single attribute may not be clustered on a specific node. This means that answering
a query may involve reading many partitions from different nodes in the network. For
example, to find products with price = $125, we might need to read both partitions
1 and 2, which will be on different nodes. This is contrary to the traditional wisdom
in the database community, which argues for data locality by moving computation
to data, rather than moving data to answer queries. However, recent improvements
in datacenter network design have resulted in designs that provide full cross-section
bandwidth of 1 Gbit/sec or more between all pairs of nodes [42], such that network
throughput is no longer a bottleneck. Recent research has shown that accessing a remote disk in a computing cluster delivers only about 8% lower throughput than reading from a local disk [43].
Table 1:

    Data Locality (%)     100   71    46    27
    Response Time (sec)   442   500   512   524

Figure A-1: Response time with varying data locality (%)
To verify this in a real system, we ran a micro-benchmark on Hadoop MapReduce,
in which we measured the runtime of a map-only job (performing simple aggregation
in the combiner) while varying the locality of data blocks on HDFS. Figure A-1 shows the results from a 4-node cluster with a full duplex 1 Gbit/sec network. Note that even with locality as low as 27%, the job is just 18% slower than with 100% data locality.
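The 18% figure follows directly from the response times in Figure A-1; a quick check of the arithmetic:

```python
# Response times (seconds) from Figure A-1, keyed by data locality (%).
response_time = {100: 442, 71: 500, 46: 512, 27: 524}

def slowdown_vs_full_locality(locality_pct):
    """Fractional slowdown relative to the 100% data-locality run."""
    return response_time[locality_pct] / response_time[100] - 1.0

print(round(slowdown_vs_full_locality(27) * 100, 1))  # 18.6 -- "just 18%" in round terms
```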
Amoeba leverages this new hardware trend to aggressively de-cluster the data over
several (or all) attributes in the dataset.