
Parallel SECONDO: Processing Moving Objects Data At Large Scale

Dissertation submitted to the Faculty of Mathematics and Computer Science of the FernUniversität in Hagen

for the academic degree of

Doctor of Natural Sciences

(Dr. rer. nat.)

by Jiamin Lu

Place of birth: Nantong, Jiangsu, China

Hagen, 2014



Submitted on: 27.03.2014

Date of oral examination: 09.07.2014

First reviewer: Prof. Dr. Ralf Hartmut Güting

Second reviewer: Prof. Dr. Mohamed F. Mokbel


Contents

List of Tables

List of Algorithms

List of Figures

Abstract

1 Introduction
  1.1 Background and Motivations
  1.2 Thesis Contributions
    1.2.1 System Infrastructure
    1.2.2 Declarative Query Language
    1.2.3 Cloud Evaluation
    1.2.4 Optimization Technologies
  1.3 Thesis Organization

2 Related Work
  2.1 Early Parallel Database Systems
    2.1.1 Parallel Query Processing
    2.1.2 Parallelism Metrics
  2.2 MapReduce
    2.2.1 The MapReduce Paradigm
    2.2.2 Pros and Cons
  2.3 Hadoop Extensions
  2.4 SECONDO Database System
  2.5 Parallel Processing on Specialized Data

3 System Infrastructure
  3.1 System Components
  3.2 Parallel SECONDO File System
    3.2.1 PSFS Operators
    3.2.2 PSFS vs. HDFS
  3.3 System Management Components
    3.3.1 Parallel SECONDO Preferences
    3.3.2 Graphical Preference Editor

4 Declarative Language Support
  4.1 Parallel Data Model
    4.1.1 SECONDO Executable Language
    4.1.2 Representing Distributed Objects
    4.1.3 Flow Operators
    4.1.4 Hadoop Operators
  4.2 PBSM: Partition Based Spatial Merge
    4.2.1 PBSM in Parallel SECONDO
    4.2.2 PBSM with In-memory Index
    4.2.3 PBSM for Moving Objects Data
    4.2.4 Represent PBSM in Executable Language
    4.2.5 Evaluations
  4.3 Parallel BerlinMOD
    4.3.1 Parallel Data Generation
    4.3.2 Parallel Range Queries
    4.3.3 Evaluations on Benchmark Queries

5 Cloud Evaluation
  5.1 Amazon EC2 Services
    5.1.1 Hardware Configuration
    5.1.2 Software Configuration
    5.1.3 EC2 Instance Performance
  5.2 Set up Parallel SECONDO on EC2 Clusters
  5.3 Evaluations in EC2 Clusters

6 Optimization
  6.1 Pipeline File Transfer
  6.2 FLOB Accessing in PSFS
  6.3 Distributed Filter and Refinement
  6.4 Evaluation

7 Conclusions
  7.1 Summary
  7.2 Future Work

Appendices

A Install Parallel SECONDO
  A.1 Installation Instructions
  A.2 Deploy Parallel SECONDO with Virtual Images
    A.2.1 VMWare Image
    A.2.2 Amazon Machine Image

B Evaluation Queries
  B.1 Join on Standard Data Types
  B.2 Join on Spatial Data Types
  B.3 Join on Spatio-Temporal Data Types

C Parallel BerlinMOD Benchmark
  C.1 Data Generation
  C.2 Parallel Range Queries

Bibliography


List of Tables

3.1 Extended Operators for Basic PSFS Access
3.2 Auxiliary Tools for Parallel SECONDO

4.1 The Classification of Parallel SECONDO Distributed Objects
4.2 DELIVERABLE Data Types
4.3 Parallel Operators
4.4 Extensive SECONDO Operators for PBSM
4.5 Operators Extended for Hadoop Distributed Join

5.1 Auxiliary Tools for Parallel SECONDO on EC2

6.1 Extended Operators for Optimized PSFS Access

A.1 Cluster Preparation Arguments


List of Algorithms

1 Generic Hadoop-based Parallel Join
2 SECONDO Distributed Join (SDJ) with PBSM
3 Hadoop Distributed Join (HDJ) with PBSM
4 Pipeline File Feed (pffeed)


List of Figures

1.1 Multi-dimensional Objects

2.1 SECONDO Components (left), Architecture of Kernel (right)
2.2 The Graphical User Interface for SECONDO and Parallel SECONDO

3.1 The Infrastructure of Parallel SECONDO
3.2 Two Region Objects
3.3 Data Structure in PSFS Files
3.4 The SQL Statement of the 12th TPC-H Query
3.5 Evaluations on the 12th TPC-H Query
3.6 The Main Frame with Error Information
3.7 The Frame for Single-Computer Installation
3.8 The Frame for Simple Cluster Installation
3.9 Advanced Setting Frame

4.1 Query in SQL-like Language
4.2 Query in SECONDO Executable Language
4.3 PS-Matrix
4.4 The Map/Reduce Procedures Described by Hadoop Operators
4.5 The Boundary Crossing Objects in PBSM
4.6 The Common Smallest Cell
4.7 The Partition Method in PBSM
4.8 The Sliced Representation of Moving Point Values
4.9 The SQL Query of the 6th BerlinMOD Query
4.10 The Sequential Query for the 6th BerlinMOD Query
4.11 The Parallel Query (SDJ) for the 6th BerlinMOD Query
4.12 The Parallel Query (HDJ) for the 6th BerlinMOD Query
4.13 Query Converting Process for PBSM
4.14 HDJ and SDJ Performances on Spatial Data
4.15 SDJ Performance on Spatial Data with In-memory Index
4.16 HDJ and SDJ Performances on Spatio-Temporal Data
4.17 SDJ Performance on Spatio-Temporal Data with In-memory Index
4.18 The 1st BerlinMOD Example Query
4.19 The 10th BerlinMOD Example Query
4.20 Data Generation in Parallel BerlinMOD
4.21 The 10th Example Query in Parallel BerlinMOD

5.1 EC2 Storage Architecture
5.2 The Amazon EC2 Web-based Console
5.3 Evaluation on Spatio-Temporal Data in Cloud
5.4 Average Step Overhead for the Cloud Speed-up Evaluation

6.1 Shuffling Overhead on the Cluster
6.2 The Tuple Storage in SECONDO
6.3 SDJ with the Distributed Filter Refinement Mechanism
6.4 Parallel Spatial Join on Lands and Buildings in SDJ-Index'
6.5 The Performance Comparison on Different PSFS Modes


Abstract

In recent years, along with the popularization of portable positioning devices such as smartphones and vehicle navigators, it has become ever simpler to generate and collect end-users' continuous position information (termed moving objects data) in order to support various location-based services. Against this background, our group's SECONDO system was developed. It is designed as an extensible database system, providing a large number of data types and algorithms to represent and efficiently process moving objects based on constant geographical information (termed spatial data).

However, like many other standalone databases, SECONDO faces challenges from Big Data: it was developed as a single-computer system, and its capability is restricted by the underlying computer resources. Many parallel processing platforms, such as Hadoop, have been developed for analyzing massive data on computer clusters. However, they usually put more weight on improving efficiency and scalability than on processing specialized data types. In order to scale up SECONDO's capability to a cluster of computers, this Ph.D. project proposes a hybrid system combining the Hadoop platform and SECONDO databases, taking the best technologies from both sides. This new system is named Parallel SECONDO.

In this dissertation, the following issues about this novel system are studied. (1) A hybrid structure is established that combines Hadoop and SECONDO to achieve the most effective performance. Specifically, a native store mechanism is developed to reduce the data migration overhead between them to a minimum. (2) A parallel data model is proposed to help end-users state their queries in the SECONDO executable language, getting rid of the low-level and rigid programming model of Hadoop. Moreover, it enables Parallel SECONDO to inherit most existing SECONDO data types and operations, so any heavy sequential query can easily be converted into the corresponding parallel statements. As an example, a join method named PBSM is used extensively in this thesis. It can process join operations on both spatial and moving objects data. In addition, several variants of it are proposed, using different distributed file systems to shuffle the intermediate results, in order to achieve the best performance.


All these approaches can be represented as SECONDO queries with slight adjustments, fully demonstrating the parallel data model's flexibility. (3) Parallel SECONDO is evaluated not only on our small private cluster, but also on large clusters consisting of hundreds of virtual computers provided by AWS (Amazon Web Services). In these environments of different scales, Parallel SECONDO maintains stable speed-up and scale-up, demonstrating the remarkable scalability gained by being built on the Hadoop platform. (4) Regarding the special storage requirements of spatial and moving objects data, a set of optimization technologies is also developed to improve data access in the cluster environment.

Furthermore, we intend to develop Parallel SECONDO as a user-friendly system. A set of auxiliary tools is provided to easily deploy and manage the system on large-scale clusters. Two virtual machine images are also provided, so end-users can become familiar with the system immediately and use it to address their own problems. The graphical user interface of SECONDO is also inherited, so query results can be displayed with vivid animations.


Chapter 1

Introduction

1.1 Background and Motivations

In recent years, Big Data and how to process it efficiently has become an increasingly hot topic in both the commercial and academic communities. Along with the popularity of various Internet services, massive amounts of customer information are continuously collected and stored, in order to extract valuable knowledge and regularities from them. For example, by the end of 2012, 618 million users were active on Facebook each day, creating more than 500 TB of new data daily. Every 30 minutes, 105 TB of data are scanned over the company's clusters, which preserve more than 100 PB of data in total.

Apparently, facing the challenges of such exploding data, conventional RDBMS systems are bound to fall behind. They are usually installed on single computers with limited resources, which makes it impossible to use them as repositories for such massive data. Certainly their performance can be improved by migrating them to more advanced computers, but these usually charge a considerable price and will soon be out of date. Moreover, even parallel databases like Vertica [43], which are able to run on multiple computers, cannot hold these large data sets. On one hand, these systems usually have a high homogeneity demand on the underlying clusters, which is nearly impossible to achieve at scale. On the other hand, parallel databases are often designed with the assumption that failures, caused by either software or hardware problems, happen rarely in the cluster. However, this probability rapidly increases as more computers are added to the cluster. Therefore, the scalability of these parallel databases is restricted; they usually cannot be set up on clusters with more than one hundred computers [1], while the exploding data often need the computing resources of hundreds or even thousands of computers.


All the above issues promote the study of a more scalable parallel processing mechanism, in order to process massive data on much larger clusters. Consequently, novel platforms like MapReduce [9], Dryad [36], SCOPE [7], etc. have been proposed successively. They all intend to provide a flexible infrastructure over the network, storing data distributively without following the relational data model. In addition, they usually give more consideration to keeping a high fault tolerance than to efficient performance on large clusters. Although these platforms are often criticized as brute-force approaches, causing many controversies when compared with parallel databases [43, 48], they have become very attractive to both industry and research communities because of their outstanding performance on large-scale data analysis.

Among these novel approaches, MapReduce gains the most attention. It was proposed by Google in 2003 and has been used by the company itself for a decade. It helps to process their PB-size data distributed over thousands of computers across multiple data centers, with more than 10,000 distinct programs, including algorithms for large-scale graph processing, text processing, machine learning, and statistical machine translation [10]. In addition, its programming model (also named MapReduce) is relatively simple compared with others. End-users first implement a Map function that processes a 〈key, value〉 pair to generate a set of intermediate 〈key, value〉 pairs; then a Reduce function is specified to process all values associated with the same intermediate key. Google's MapReduce platform is kept private, but the programming model is implemented in the open-source project called Hadoop, provided by the Apache Software Foundation. Therefore, the MapReduce paradigm and the Hadoop platform are widely studied in much research on parallel processing. Consequently, the work in this thesis is also built upon the Hadoop platform.

Besides the pressure from the exploding data, it is notable that the variety of these data has also grown considerably. Along with the popularity of positioning devices integrated into smartphones and navigators, it is common for end-users to obtain their real-time locations (spatial data), by which they can easily find nearby interesting targets or the best path to their destinations. In the meantime, by collecting these locations along the time axis, end-users' trajectories (moving objects data) can also be recorded. In this thesis, spatial and moving objects data are collectively called multi-dimensional data, since they are distributed in spaces with more than one dimension. If authorized by end-users, these multi-dimensional data can also be collected as big data, in order to find their hidden regularities. For example, airlines can derive certain patterns by analyzing the trajectories of their pilots [44], in order to improve the training system or direct the future construction of their airports.

Multi-dimensional data often require specialized data models and algorithms [18, 26, 33] to be processed, which are normally not provided in common database systems. Moreover, big collections of such data have not only a large quantity but also a great size, since each multi-dimensional object contains plenty of detail. For example, the region shown in Figure 1.1a consists of fifteen line segments, and each segment needs at least four real numbers to denote the coordinates of its two endpoints. Therefore, this object requires at least 60 real numbers to store all its details. Similarly, Figure 1.1b depicts two moving objects with three segments each. Each segment keeps not only the geographic coordinates of its endpoints, but also the timestamps of its time interval. All this detailed information increases the size of multi-dimensional data, causing more network and disk overhead to transfer and preserve them in clusters.

[Figure 1.1: Multi-dimensional Objects. (a) Spatial Data: a region whose boundary consists of labeled vertices a through o. (b) Moving Objects Data: trajectories in the (x, y, t) space.]

In order to process such multi-dimensional data, a "generic" database system named SECONDO [27] has been developed. It has been under development for over a decade and is able to handle many kinds of data models, including relational, spatial, spatio-temporal, etc. At the current stage, nearly one hundred data types and thousands of related operators have already been implemented in SECONDO, creating a powerful tool for processing these specialized data types. More details about SECONDO are introduced in Section 2.4.

Nevertheless, SECONDO is implemented as a centralized system: it can only be deployed on a single computer, and its capability is restricted by the underlying hardware when facing the challenges of big data. Therefore, in this PhD project we intend to scale up the capability of SECONDO to a cluster of computers, in order to process specialized data models at large scale. The new system is named Parallel SECONDO.

For this purpose, we decided to combine SECONDO with the Hadoop platform because of its good extensibility and popularity. In total, there exist the following three possible combinations of these two systems:

1. Integrate the MapReduce programming model into the database system, enabling it to schedule and assign the parallel tasks over the cluster [20]. However, it is difficult to learn all the details of the MapReduce programming model and the Hadoop implementation, hence this solution is too complicated to put into practice.

2. Extend the Hadoop platform with all necessary SECONDO data types and operators, as SpatialHadoop does in [17]. For us, this solution is not cost-effective, since it requires reimplementing almost all SECONDO functions based on the MapReduce programming model. Besides, whenever SECONDO is extended with a new data type or operator, it would have to be implemented again in Parallel SECONDO, causing a significant workload.

3. Build Parallel SECONDO as a hybrid system, using one Hadoop platform to couple a set of SECONDO databases that are distributed over the cluster computers. Here Hadoop is used only as the communication level, assigning MapReduce tasks to the cluster computers and keeping the workload balanced on the cluster. In the meantime, the single-computer query work is embedded inside the parallel tasks and pushed into the SECONDO databases for processing, in order to achieve the best performance.

Apparently, the third solution maintains the independence of both components and also uses their best technologies. On one hand, it uses Hadoop to achieve the best scalability, hence it can be deployed on large-scale clusters. On the other hand, it inherits SECONDO's capability to process the specialized data types efficiently. Therefore, in this thesis, Parallel SECONDO is constructed in this way. In the following, we roughly illustrate the main challenges that we met in coupling these two systems and present our basic solutions.

1.2 Thesis Contributions

This thesis mainly covers the following four issues arising during the construction of Parallel SECONDO: (1) system infrastructure, (2) declarative query language, (3) cloud evaluation and (4) optimization technologies. We demonstrate the coupling mechanism in the first topic, in order to achieve the most scalable and efficient combination of Hadoop and SECONDO. Then a parallel data model is presented, which enables end-users to state parallel queries in a well-understandable declarative language. In addition, Parallel SECONDO is fully evaluated on large-scale clusters consisting of more than one hundred virtual computers. During this process, problems in managing multi-dimensional data at large scale are revealed, and their solutions are discussed in the last topic. Details about all these issues are elaborated in the following subsections.

1.2.1 System Infrastructure

Essentially, Hadoop is implemented as a data-driven platform. It stores the massive 〈key, value〉 pair data in its HDFS (Hadoop Distributed File System) and then allocates them to the parallel tasks for processing, according to the MapReduce paradigm. Consequently, most Hadoop-based hybrid systems like HadoopDB [1] or Hadoop-GIS [2] keep using HDFS as the communication level for both data and tasks. In these systems, database records are transformed and shuffled via HDFS, and then loaded into the single-computer databases when they are required. Nevertheless, since multi-dimensional data are much larger than standard data, this additional and frequent data migration causes considerable overhead.

Regarding this issue, Parallel SECONDO adopts the idea of a so-called native store. It keeps the data either in the distributed databases or in a simple self-made distributed file system named PSFS (Parallel SECONDO File System), without following the format of the MapReduce model. At the same time, HDFS stores only a small amount of light-weight synopsis data, used by Hadoop to schedule the MapReduce tasks. Therefore, Parallel SECONDO uses HDFS only as the task communication level, in order to avoid useless data migration overhead as much as possible.

Based on PSFS, Parallel SECONDO sets the DS (Data Server) as the basic processing unit of the system. Every DS contains a SECONDO database and a PSFS node, and each MapReduce task is assigned to one DS for processing. During the procedure, the required data are first collected from the remote PSFS nodes into the local one, then loaded directly into the SECONDO database, without any unnecessary transformation cost.

Furthermore, a series of operators is proposed in Parallel SECONDO for accessing data in PSFS. They can either export SECONDO relations to several PSFS nodes as disk files, or read files from several DSs and import the data back into a SECONDO database as a relation. Considering the high probability of encountering failures in large-scale clusters, files in PSFS can also be duplicated on several adjacent DSs, in order to avoid data loss when a few PSFS nodes become inaccessible.
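
As an illustration of such duplication, the sketch below assumes a simple ring-style notion of "adjacent" data servers, where a file produced on DS i is also copied to the next few servers in the ring; the concrete placement policy of PSFS may differ in detail, and all names here are ours.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * A minimal sketch of "adjacent" replica placement, assuming the data
 * servers form a logical ring (an illustrative assumption, not the
 * documented PSFS policy).
 */
public class ReplicaPlacement {

    /** Returns the indices of the data servers that hold a copy of the file. */
    static List<Integer> targetServers(int sourceDs, int dsCount, int replicaCount) {
        List<Integer> targets = new ArrayList<>();
        for (int r = 0; r < replicaCount; r++) {
            targets.add((sourceDs + r) % dsCount);   // wrap around the ring
        }
        return targets;
    }

    public static void main(String[] args) {
        // A file produced on DS 5 in a 6-server cluster with 2 replicas
        // is kept on DS 5 and duplicated on DS 0.
        System.out.println(targetServers(5, 6, 2));   // prints [5, 0]
    }
}
```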

Nowadays, even a low-end PC is commonly equipped with a multi-core processor, large memory and multiple hard drives. This feature is not considered by Hadoop, which views the computer as its basic processing unit. Therefore, multiple tasks assigned to the same computer are forced to access the same disk drive, decreasing system performance because of their disk interference. In contrast, Parallel SECONDO is able to set several DSs on different disks within the same computer. Therefore, each computer can further distribute its allocated tasks to its multiple DSs, in order to reduce unnecessary disk I/O overhead. Although Hadoop itself can also relieve this problem by using more computers, Parallel SECONDO is much more economical, as it can fully use the existing resources of the current cluster.

Finally, in order to assist end-users in using Parallel SECONDO, a set of auxiliary tools is provided, which can quickly install, start, stop and remove the system on clusters of different scales. Besides, both the SECONDO text and graphical interfaces are fully compatible with the new system, through which end-users can monitor the results intuitively. Lastly, once a new function is added to SECONDO, it can immediately be used in Parallel SECONDO by distributing the new SECONDO system to all DSs.

1.2.2 Declarative Query Language

Like many other parallel platforms, Hadoop provides no declarative language. It forces end-users to write their algorithms in a low-level language like C++ or Java based on the MapReduce programming model, and such programs are difficult to maintain and reuse. In addition, MapReduce imposes no schema on the stored records, so end-users can structure their data in any manner. Although this mechanism improves the system's flexibility, it creates barriers between coworkers, since they must reach an agreement on the data format and parse the data explicitly in their respective algorithms.

In contrast, SECONDO provides two levels of query languages: executable and SQL. At the first level, queries are stated with database objects and operators, representing the query plans precisely, hence they can be processed by the SECONDO database directly. Alternatively, end-users can state their queries in SQL; these are then converted and optimized into query plans declared in the executable language.

In order to help end-users construct their queries easily, a parallel data model is proposed, so that queries can be formulated in the first-level SECONDO query language. It includes a new data type named flist to indicate the schema and the distribution status of parallel objects, in order to separate the schema from the various queries. Besides, it provides several operators to describe MapReduce jobs. The single-computer task query is encapsulated within these operations as a UDF (User Defined Function), hence most existing SECONDO data types and operators can be used in Parallel SECONDO as usual.

Most often, we use this data model to present the parallel join on different data types, including standard, spatial and moving objects data. These join queries are then used to evaluate the performance of Parallel SECONDO, due to the fact that the join is widely considered the most costly operation on the Hadoop platform and has been deeply studied in much other research. Considering the speciality of multi-dimensional data, the PBSM (Partition Based Spatial Merge) method [42] is often used to process joins on them, and several ad-hoc operators are also proposed in Parallel SECONDO for this method.
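
To sketch the partitioning idea behind PBSM (an illustration under simplifying assumptions, not the actual Parallel SECONDO operators): the space is divided into a uniform grid, every object is assigned to each cell that its MBR overlaps, and each cell can then be joined independently by a parallel task. A pair found in several cells because of boundary-crossing objects (cf. Figure 4.5) can be reported only in the common smallest cell of the two MBRs (cf. Figure 4.6), avoiding duplicate results.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * A minimal sketch of PBSM-style grid partitioning, assuming a known
 * bounding area divided into a uniform nx-by-ny grid. Class and method
 * names are illustrative only.
 */
public class PbsmPartitioner {

    static class Rect {
        final double x1, y1, x2, y2;
        Rect(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    /** Returns the ids of all grid cells overlapped by the given MBR. */
    static List<Integer> cellsFor(Rect mbr, Rect space, int nx, int ny) {
        double cw = (space.x2 - space.x1) / nx, ch = (space.y2 - space.y1) / ny;
        int cx1 = (int) ((mbr.x1 - space.x1) / cw), cx2 = (int) ((mbr.x2 - space.x1) / cw);
        int cy1 = (int) ((mbr.y1 - space.y1) / ch), cy2 = (int) ((mbr.y2 - space.y1) / ch);
        List<Integer> cells = new ArrayList<>();
        for (int cy = Math.max(0, cy1); cy <= Math.min(ny - 1, cy2); cy++)
            for (int cx = Math.max(0, cx1); cx <= Math.min(nx - 1, cx2); cx++)
                cells.add(cy * nx + cx);   // boundary-crossing MBRs land in several cells
        return cells;
    }

    public static void main(String[] args) {
        Rect space = new Rect(0, 0, 100, 100);
        // An MBR crossing the vertical boundary between cells 0 and 1 of a 2x2 grid:
        System.out.println(cellsFor(new Rect(40, 10, 60, 20), space, 2, 2)); // [0, 1]
    }
}
```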

Finally, we convert all sequential queries of the BerlinMOD benchmark [16] into their corresponding parallel queries, which cover all possible range queries on moving objects data. This fully demonstrates the compatibility and flexibility of the parallel data model.

1.2.3 Cloud Evaluation

MapReduce and Hadoop gain a lot of attention from the database community mainly because of their outstanding scalability. For example, Google's native MapReduce platform was evaluated on a cluster consisting of 1800 computers [9]. So far, the largest Hadoop cluster that we know of was built by Yahoo! [47], containing 3500 machines.

Consequently, as a Hadoop extension, Parallel SECONDO should also be evaluated on large-scale clusters. However, purchasing so many computers ourselves is obviously not economical. Therefore, we decided to rent them for a reasonable price from AWS (Amazon Web Services), where computing resources can be leased as virtual computers. Moreover, we are grateful to have received a considerable grant from AWS in Education, which enabled us to evaluate the system on hundreds of virtual computers with limited investment.

In the end, we were able to deploy Parallel SECONDO on clusters containing up to 150 computers. Its performance was fully evaluated with the parallel join on moving objects data, showing that Parallel SECONDO can still achieve satisfactory performance at this scale.

Apart from the evaluations on AWS, we also built a public AMI (Amazon Machine Image) which contains the system's basic components. Thereby, large clusters with deployed Parallel SECONDO systems can be set up within a few minutes, saving time for end-users to use these resources for their own purposes.

Nevertheless, the evaluation on AWS was not continued with more than 150 computers, due to some new problems found in managing multi-dimensional data at large scale. For example, since data in PSFS are shuffled via a standard file transfer protocol, the transfer cost increases rapidly with the growth of the cluster size, as more computers need to be accessed. Besides, a MapReduce join query usually shuffles the complete data sets over the network, although much of the objects' detail data is never used. The high cost of transferring a large amount of useless data limits the performance of Parallel SECONDO on large clusters. Therefore, optimization mechanisms are studied regarding these issues.

1.2.4 Optimization Technologies

Some optimization technologies are studied and a set of operators is proposed for this topic, in order to further improve the performance of Parallel SECONDO on large clusters regarding the problems mentioned above.

Normally, files in PSFS are transferred and loaded into SECONDO databases one after another, and each file requires a certain overhead to prepare the connection between the computers. However, during the parallel procedures, it often happens that each task needs to access a set of files, and considerable network resources are wasted during the sequential transfer. Addressing this issue, a pipeline mechanism is built into PSFS to deliver the files concurrently, in order to reduce the overall elapsed time and make full use of the network resources.
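
The following is a minimal sketch of the underlying idea, assuming a task's input files are reachable under plain URLs (the hosts and paths below are hypothetical): instead of downloading the files one after another, all transfers are started at once on a thread pool, so the per-connection overheads overlap. It only illustrates the principle behind the pipelined transfer (cf. the pffeed operator in the List of Algorithms); the actual PSFS implementation additionally overlaps transfer with tuple processing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.*;
import java.util.List;
import java.util.concurrent.*;

/** A sketch of concurrent instead of sequential file fetching. */
public class ConcurrentFetch {

    public static void main(String[] args) throws Exception {
        List<String> urls = List.of(
                "http://ds1.example.org/psfs/part_1",   // hypothetical file locations
                "http://ds2.example.org/psfs/part_2",
                "http://ds3.example.org/psfs/part_3");

        ExecutorService pool = Executors.newFixedThreadPool(urls.size());
        List<Future<Path>> pending = urls.stream()
                .map(u -> pool.submit(() -> download(u)))   // start all transfers at once
                .toList();

        for (Future<Path> f : pending)
            System.out.println("fetched " + f.get());       // wait, then load into the DB
        pool.shutdown();
    }

    static Path download(String url) throws IOException {
        Path target = Files.createTempFile("psfs_", ".bin");
        try (InputStream in = new URL(url).openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```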

When loading multi-dimensional objects from PSFS into SECONDO databases, each object's complete information is read and cached in the memory buffer, which has a relatively small size limit. With growing input data, it becomes impossible to cache all these data in memory, so part of them has to be flushed to disk, causing additional disk I/O overhead. However, not all of this detailed information is actually needed during the query procedures. Therefore, a novel approach is proposed that reads the detailed information only when it is really needed.

Generally, the join procedure on multi-dimensional objects contains two stages: filter and refinement [38]. In the first stage, candidates are generated based on the objects' approximate information, like the MBR (Minimum Bounding Rectangle) of a region with a complicated shape. This stage is usually used to eliminate tuple pairs that cannot be part of the final result. For example, in a query finding intersecting regions, if two regions' MBRs are disjoint, their tuple pair is of course removed from the candidates. In the second stage, each candidate is examined with the objects' detailed information, to further check whether it satisfies the join predicate.
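
As a minimal illustration of the filter stage (classes and names are ours, not SECONDO operators), the sketch below keeps only those pairs whose MBRs intersect; everything else is discarded without ever touching the exact geometries:

```java
/** A sketch of the MBR-based filter step of a spatial join. */
public class FilterStage {

    static class Mbr {
        final double x1, y1, x2, y2;
        Mbr(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    /** Filter predicate: cheap intersection test on the approximate data. */
    static boolean mbrsIntersect(Mbr a, Mbr b) {
        return a.x1 <= b.x2 && b.x1 <= a.x2    // overlap on the x axis
            && a.y1 <= b.y2 && b.y1 <= a.y2;   // overlap on the y axis
    }

    public static void main(String[] args) {
        Mbr r = new Mbr(0, 0, 10, 10), s = new Mbr(5, 5, 20, 20), t = new Mbr(30, 30, 40, 40);
        System.out.println(mbrsIntersect(r, s));  // true  -> candidate for refinement
        System.out.println(mbrsIntersect(r, t));  // false -> safely discarded
    }
}
```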

Apparently, not all multi-dimensional objects' large detail data are needed in the refinement step. However, following the MapReduce paradigm, both of the above stages are normally processed in the Reduce stage, hence all objects' complete data have to be shuffled over the network. Regarding this issue, one more optimization mechanism is prepared. It first separates a multi-dimensional object's information into two parts, one holding the approximate data and the other the detailed data. During the shuffle stage, only the approximate data are transferred over the network, in order to process the filter step and generate the candidate results. Afterward, only the candidate objects' detailed information is collected in a second shuffle. With this mechanism, we can further reduce the network traffic, making Parallel SECONDO more efficient on large-scale clusters.

1.3 Thesis Organization

The rest of this thesis is organized as follows. First, related work is reviewed in Chapter 2. Then Chapter 3 presents the infrastructure of the system, especially PSFS, which realizes the native store mechanism. A simple comparison is evaluated there to demonstrate the benefit we gain by using PSFS instead of HDFS to shuffle the intermediate data. Next, the parallel data model is introduced in Chapter 4, where examples of using Parallel SECONDO to process joins on spatial and moving objects data are also explained and evaluated. Then Chapter 5 presents the evaluation on large-scale clusters on the AWS platform. The optimization mechanisms are introduced in Chapter 6, in order to further improve the system's performance on multi-dimensional data at large scale. Finally, a short conclusion is given in Chapter 7, summarizing our contributions and pointing out possible future directions.


Chapter 2

Related Work

2.1 Early Parallel Database Systems

Parallel processing is not a novel topic in the database community. As early as the late 1970s, the "I/O bottleneck" caused by the bandwidth differences among processors, memory and disk drives had already encouraged the study of parallel processing technologies.

In the early stage, database machines were proposed in projects like GRACE [21] and GAMMA [12], dispersing the I/O overhead over multiple disks by constructing special-purpose hardware, e.g., introducing data filtering devices within the disk heads. These high-end machines were often equipped with large memory, multiple processors and disk drives, so the source data could be partitioned into pieces and stored on those disks separately. In the relational data model, queries are processed in streams: the output of one operator is streamed into the input of another operator. Within parallel database machines, two operators can work in series, giving pipelined parallelism, and each can be processed by several independent processors simultaneously. Although these custom-made machines achieved high performance, they did not succeed, because of their poor cost/performance ratio, which made it difficult to further study or commercially develop them. Naturally, software solutions like parallel database systems were accepted by more and more researchers [11].

A parallel database is set up on a cluster, containing a number of computers composed of standard hardware elements, i.e., processors, memory and disks. The computers are connected inside the cluster through an interconnection network. Note that here the term processor is used only in the general sense of a central processing unit (CPU). It is common that the processor itself is made up of several cores, like a multi-core processor, performing instruction-level multithreaded parallelism. Nonetheless, in this thesis, parallelism is usually discussed at a higher level, hence the processors are simply considered black boxes accessing other resources.

Within the cluster, depending on how much of the resources, especially memory and disks, is shared during database procedures, three basic architectures are adopted by most parallel database systems [52, 11]:

shared-memory: All processors share direct access to a common global memory and to all disks. Such a system can achieve high performance more easily, since all processors can communicate via the main memory. However, in order to match the high throughput of the memory, it places extremely high requirements on the underlying interconnect, like a high-speed bus or a crossbar switch, thus limiting the system's scalability to a few tens of processors.

shared-disks: Each processor has a private memory but direct access to all disks. Compared with shared-memory systems, this has better scalability, since the memory is distributed. Besides, it does not require special data placement technology, as the data on the disks need not be reorganized. However, this architecture has to keep the caches coherent on all involved computers for the sake of data consistency, requiring some form of difficult and complex distributed lock management.

shared-nothing: Here the resources are fully distributed: each computer is viewed as a local site, in charge of only its own resources and processing the assigned data independently. This architecture minimizes the interference caused by resource sharing and discards problems like locking overhead and cache coherency, achieving considerable scalability. Despite a few studies claiming that shared-memory is still recommendable for limited degrees of parallelism [24] because of its high efficiency, the shared-nothing architecture is widely accepted by most researchers and adopted in projects like Volcano [22, 23] and Vertica [43]. Novel parallel processing platforms like MapReduce [9] and Hadoop are also established on this architecture.

2.1.1 Parallel Query Processing

Queries in parallel databases are usually expressed in high-level declarative languages like SQL; they are then transformed into execution plans that can be efficiently processed in parallel [40]. The generated execution plan is normally represented as a tree of relational operators. Each operator takes its input (either the source or intermediate data) and generates intermediate data; the final query result is produced by the root operator. These operators are processed in two basic forms: inter-operator and intra-operator parallelism. The former processes different operators on several computers in a pipelined fashion, which is beneficial for complex queries containing many operators. The latter processes the same operator on several computers, each working on a different partition of the data. It is better for processing heavy operations, like sequentially scanning a large number of tuples.

During execution, most parallel databases transfer the intermediate data between two operators following the push model, i.e., the producer operator sends its output data directly to the subsequent consumer operator over the interconnection network. It is opposed to the pull model, where the producer operator materializes the output data on the local disks as split files, and the consumer operator then fetches its input from the producer by copying the files. The pull model behaves better with respect to fault tolerance: when the producer operator is carried out as parallel tasks on several computers and one of them fails, whether due to hardware or software problems, the push model needs to re-execute all these tasks, while the pull model only needs to process the failed one again. Nevertheless, the push model is more efficient, since the intermediate data need not be materialized as many small files, which cost considerable overhead to create and transfer; hence it is often adopted in parallel databases [10, 43].

2.1.2 Parallelism Metrics

Parallel database systems are evaluated with two main properties: speed-up and scale-up [11]. The speed-up measures the improvement gained by using a larger parallel system to process a fixed job:

    Speed-up = small system elapsed time / large system elapsed time

In an ideal system the speed-up stays linear, i.e., an N-times larger system should yield a speed-up of N. In the meantime, the scale-up measures the system when both the cluster and the job size grow. The scale-up is ideal (linear) if its value is always 1:

    Scale-up = small system elapsed time on small problem / large system elapsed time on large problem

In addition, instead of using only the elapsed time to evaluate a system's speed-up, this thesis introduces a new metric called parallel improvement (PI). It measures the ratio of the time for a sequential query over the time for processing it in a parallel way [25]:

    PI = sequential system elapsed time / parallel system elapsed time

Here the sequential system stands for a database system that can only run on a single computer. With this new metric, it is more intuitive to tell the improvement brought by the parallel system.
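
For illustration, consider some hypothetical timings (invented numbers, not measurements from this thesis): a join takes 600 s on a sequential system, 80 s on a 10-computer parallel system, and 45 s when the same fixed job runs on 20 computers.

```latex
% Hypothetical timings, for illustration only.
\mathrm{PI}_{10} = \frac{600\,\mathrm{s}}{80\,\mathrm{s}} = 7.5
\qquad
\text{Speed-up}_{10 \to 20} = \frac{80\,\mathrm{s}}{45\,\mathrm{s}} \approx 1.78
```

Doubling the cluster would ideally yield a speed-up of 2; the measured 1.78 reflects coordination and communication overhead.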

2.2 MapReduce

Although parallel databases perform outstandingly in processing relatively large-scale data, their scalability is often questioned [43, 10]. First, they have a strong requirement for homogeneity of the cluster, which is difficult to achieve at large scale. Second, parallel databases are very sensitive to failures at running time, as the intermediate results are transferred by the push model: any failure on a few computers can cause the re-execution of the complete query. Third, data have to be loaded before being processed. The loading cost is considerable for a large amount of data, especially when the data are used only a small number of times. All these issues restrict the further development of parallel databases.

To the best of our knowledge, there is no conventional parallel database system that can be deployed on clusters consisting of more than one hundred computers [1]. Consequently, this blocks emerging Internet companies like Google and Yahoo! from processing their massive amounts of data with parallel databases. These data include web pages captured by crawlers, search and trading log files, etc. They increase daily by petabytes, but are only stored temporarily and analyzed with complicated algorithms that are difficult to represent as SQL queries. Nevertheless, in order to respond to end-users' requests as quickly as possible, they need to be processed with the computing capability of hundreds or even thousands of computers. This kind of requirement is impossible to meet with parallel databases. Therefore, novel parallel platforms like MapReduce [9], Dryad [36] and System S [6] have been proposed.

Among these platforms, MapReduce attracts the most attention from both academic and industrial institutions. On one hand, this is credited to the MapReduce paradigm's simplicity and flexibility, which can represent various parallel procedures with only two simple primitives: Map and Reduce. On the other hand, although Google did not publish their internal MapReduce platform, its open-source implementation Hadoop^1, provided by Apache™, is available to the public. Therefore, the work in this thesis is mainly built upon the Hadoop framework.

^1 http://hadoop.apache.org/


2.2.1 The MapReduce Paradigm

The MapReduce paradigm itself is a programming model, while a MapReduce platform like Hadoop is a framework that supports this model. It is inspired by the functional language Lisp, enabling end-users to express all kinds of parallel procedures with Map and Reduce functions, without considering the messy details of parallelism like fault tolerance, data distribution, load balancing, etc., which are handled automatically by the underlying platform.

Data in MapReduce are stored in a distributed file system, like GFS (Google File System) or HDFS (Hadoop Distributed File System). The storage layer is basically independent of the processing system on top, simply keeping data as 〈key, value〉 pairs in a row layout. Moreover, MapReduce does not require the data to adhere to a schema defined using the relational data model. For the same record, the key and value parts can be set differently in different Map and Reduce functions, since end-users are free to structure their data in any manner. Accordingly, no reorganization is needed for loading the data into the distributed file system. For example, in HDFS, data are loaded by simply dividing the data files into fixed-size blocks and copying each block to several HDFS instances, i.e., computers on which the HDFS platform is deployed. Therefore, the loading overhead in MapReduce is much lower than in parallel databases.

Conceptually, arbitrary parallel procedures can be easily decomposed into sev-eral MapReduce procedures, each containing one Map and one Reduce functionwith the following types:

Map: 〈k1, v1〉 → list〈k2, v2〉
Reduce: 〈k2, list(v2)〉 → list〈k3, v3〉

Both the Map and Reduce functions are carried out with intra-operator parallelism, being processed as parallel tasks running on all cluster computers. Each task processes either the Map or the Reduce function on a part of the input. The Reduce tasks start only after all Map tasks have finished, hence the MapReduce procedure can be divided into two disjoint stages.

A Map task processes each input 〈k1, v1〉 pair and generates intermediate 〈k2, v2〉 pairs. Afterward, each Reduce task collects all pairs with the same key value k2 as its input, then produces result pairs 〈k3, v3〉 that can be used in a subsequent MapReduce procedure.
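
To make the paradigm concrete, the following is a minimal sketch of the classic word-count job against the Hadoop MapReduce API (an illustrative example, not code from this thesis): the Mapper emits a 〈word, 1〉 pair per token, and the Reducer sums all counts that share the same word.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: <k1, v1> = <byte offset, text line>  ->  list<k2, v2> = list<word, 1>
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(line.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            ctx.write(word, ONE);            // emit an intermediate pair
        }
    }
}

// Reduce: <k2, list(v2)> = <word, list(1)>  ->  list<k3, v3> = list<word, count>
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();   // all values share the key
        ctx.write(word, new IntWritable(sum));
    }
}
```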

At the end of the Map stage, all the intermediate 〈k2, v2〉 pairs are materialized as split files and stored dispersedly on the computers where the tasks were carried out. Then each Reduce task uses a file-transfer protocol to "pull" its required input files from the whole cluster [43]. This period is called the Shuffle stage, since all the intermediate data are moved between the Map and Reduce tasks over the network. In order to avoid creating and moving a lot of small files during this stage, special implementation tricks like batching, sorting, and grouping of intermediate data and smart scheduling of reads can be used [10]. However, end-users do not need to care about these details, since they are all hidden inside the Hadoop framework. Because the pull model is used to transfer the intermediate data, when one Map task fails for whatever reason, only the failed one needs to be re-executed. This mechanism improves the fault-tolerance property, making it possible to deploy the system on large-scale clusters.

To sum up, the MapReduce paradigm enables end-users to decompose a complex and heavy computation into several MapReduce procedures, in order to speed it up by processing it on a large-scale cluster. Details of the underlying parallelism mechanism are abstracted away by the platform behind only two primitive functions, simplifying follow-up development. Furthermore, using the pull model to transfer the intermediate data during the Shuffle stage vastly improves the system's scalability compared with conventional parallel databases.

2.2.2 Pros and Cons

MapReduce performs outstandingly with respect to scalability. In 2009, Hadoop won first place in the GraySort benchmark by sorting 100 TB (one trillion 100-byte records) on over 3800 computers [39]. Nevertheless, it is still criticized as a "major step backwards" in comparison with conventional parallel databases [8, 43, 3] and viewed as a brute-force solution that wastes vast amounts of energy.

In [43], a comparison is performed between Hadoop and two parallel databases: Vertica^1 and DBMS-X (a parallel SQL DBMS from a major relational database vendor), with benchmarks including the Grep task used in the original MapReduce paper [9]. Three typical relational operations (selection, aggregation and join), as well as a UDF aggregation query, are evaluated on these systems on the same cluster containing 100 computers. In the end, it reveals that Hadoop is 2-50 times slower than the two parallel databases, except for data loading.

These studies clearly show that it is crucial to achieve a good tradeoff between efficiency and scalability in parallel systems. On one hand, parallel databases often use the push model to pipeline the intermediate data between the query operators, in order to achieve better performance. However, this also introduces the potential danger that many operations need to be re-executed when a slight failure happens. On the other hand, besides keeping its scalability, Hadoop should also adopt more mature database technologies, like various index structures, novel storage mechanisms [49, 14], sophisticated parallel algorithms, etc., to optimize its efficiency.

^1 http://www.vertica.com/

Apart from the efficiency comparison, MapReduce is also often criticized for lacking support for high-level declarative languages like SQL. End-users need to express their procedures by programming in procedural languages, such as C/C++ or Java, based on the MapReduce paradigm. These programs process the distributed data on the record level, creating custom parsers to derive the appropriate semantics from the input data, since the data are stored without a relational model. In a word, the MapReduce paradigm is low-level and rigid, creating barriers for end-users to maintain and reuse each other's existing work.

2.3 Hadoop Extensions

Regarding these imperfections of MapReduce, many extensions have been proposed in recent years. Since most of them are actually built upon the Hadoop platform, which is widely perceived as a synonym for MapReduce, they can also be called Hadoop extensions. Roughly, these extensions can be divided into the following four kinds.

The first kind of extensions attempts to improve data access on the Hadoop platform. Natively, Hadoop processes data on the record level without using any kind of index structure, hence its efficiency on many conventional database operations is much lower than that of parallel databases [43]. Addressing this issue, Hadoop++ [14] proposes to inject a Trojan index at the end of each data block in HDFS, so as to reduce the I/O cost without changing the MapReduce paradigm. However, pre-generating the index structure is time-consuming, and it is difficult to index a data set on all perspectives, hence it is challenging to use this approach in a generic system. In addition, the Map and Reduce stages block each other: no Reduce task can start before the last Map task finishes. This strategy guarantees the system’s fault-tolerance property, but degrades the processing performance as well [4, 31, 34]. Therefore, some studies intend to introduce the push model into the Shuffle stage in a controlled manner. In [34], as soon as each Map task finishes, its intermediate results are hashed and pushed to hash tables held by the Reduce tasks. Thereby, the Reduce tasks can start to perform the aggregation within each bucket on the fly, even before all Map tasks have completed. Nevertheless, it remains doubtful whether this kind of approach impairs the system’s scalability.

Secondly, it has gradually become the consensus of the whole database community that MapReduce procedures should be described in high-level declarative languages, especially SQL-like ones, instead of being programmed in low-level procedural languages. Therefore, many extensions are proposed to improve the expressiveness for describing Hadoop jobs. In the early stage, the Pig project [37] developed a language named Pig Latin. It provides a nested data model and corresponding operators, which can be used to describe MapReduce operations in an SQL-like style, in a so-called Pig Latin program. Such a program can be transformed into Hadoop jobs automatically by a corresponding compiler in the Pig system. Besides, the Hive project [50], which is used in production at Facebook, proposed a language named HiveQL. A HiveQL statement is similar to a SQL query and is also converted into several jobs that are carried out on the Hadoop platform. In particular, Hive views HDFS as a data warehouse and offers a meta-store keeping the schemas of all involved data, hence it is able to optimize the generated Hadoop jobs based on this information. Nevertheless, the compilers of both Pig and Hive are naïve. For example, Hive assumes that all tables in the system are independently distributed on the cluster, hence queries like joins that involve multiple tables always push most procedures into the Reduce stage and cause additional shuffle overhead. More recently, the distributed database F1 [46], proposed by Google itself, also includes a fully functional SQL engine. Although its implementation details are not deeply described in the published paper, it clearly exhibits the importance of a high-level declarative language for future parallel systems.

The third kind of Hadoop extensions enhances Hadoop’s efficiency on particular database operations, especially the join operation, which shows the biggest performance difference between Hadoop and the parallel database systems in [43]. [4] studies several Hadoop-based algorithms and preprocessing methods for processing the parallel equi-join operation on vast amounts of log records. On the one hand, if there is a large size difference between the two data sets, the smaller one is either delivered to the Reduce tasks before the other one, in order to be fully buffered in memory, or broadcast over the cluster without repartitioning the large table, as sketched below. On the other hand, two large data sets can be processed with the semi-join method: one set is first filtered by the aggregated join attribute values of the other set, so that the communication overhead is reduced as much as possible. Besides these optimized parallel algorithms, other researchers propose MapReduce variants with more flexible data flows. Map-Reduce-Merge [57] improves Hadoop’s processing of heterogeneous operations like the one-round join query. It adds an extra primitive function Merge to the MapReduce paradigm, in order to avoid homogenizing both inputs. Thereby, two heterogeneous relations can be processed independently and then be joined in the final Merge stage.
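The broadcast variant mentioned above can be sketched as a Hadoop map-side join. In this illustrative fragment (our own; the file name small.tbl is hypothetical and is assumed to have been shipped to every node beforehand, e.g. via Hadoop’s DistributedCache), the small relation is loaded into a hash table once per task, so the large relation is never repartitioned and no Reduce stage is needed:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side equi-join: the small relation fits into memory, so only the
// large relation flows through the Map tasks.
class BroadcastJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> small = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumption: small.tbl was distributed to every node in advance.
    BufferedReader in = new BufferedReader(new FileReader("small.tbl"));
    for (String line = in.readLine(); line != null; line = in.readLine()) {
      String[] f = line.split("\\|", 2);      // join key | rest of tuple
      if (f.length == 2) small.put(f[0], f[1]);
    }
    in.close();
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] f = line.toString().split("\\|", 2);
    if (f.length < 2) return;
    String match = small.get(f[0]);           // probe the in-memory table
    if (match != null) {
      context.write(new Text(f[0]), new Text(f[1] + "|" + match));
    }
  }
}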

Most of the above Hadoop extensions tend to import existing database technologies, which have been developed for decades and are widely supported by many DBMSs, into the Hadoop platform. Nevertheless, they need to reimplement these methods on the Hadoop platform, causing potential side effects. Therefore, hybrid systems are proposed as the last kind of Hadoop extensions, in order to take the best features from both sides. They are typified by HadoopDB [1], which combines the Hadoop framework with the single-node database PostgreSQL. It depends on Hive to compile SQL-style queries into Hadoop jobs, while each task pushes its assigned sub-queries into the slave’s high-performance database to process the relational database operations. The Hadoop platform is then used as the task coordinator and the communication layer. There are also some other Hadoop-based hybrid systems. [55] presents the integration of Hadoop with the parallel DBMS Teradata EDW. It builds a data tunnel between the two systems, so that either system can be used to process different kinds of queries. DEDUCE [32] combines Hadoop and System S, since the latter has an advantage in processing stream data. It embeds the Hadoop workflow into its SPADE data flow, in order to use analysis results, computed by Hadoop from large amounts of historical data, to assist real-time processing. At last, SQL/MapReduce [20] achieves the hybrid the other way round, by introducing the MapReduce paradigm into parallel databases: in their own parallel database nCluster, MapReduce procedures are declared and processed as UDFs in SQL queries.

2.4 SECONDO Database System

[Figure: the SECONDO components GUI, Optimizer and SECONDO Kernel (left); the kernel consists of the Command Manager, the Query Processor & Catalog, the Storage Manager & Tools, and the algebra modules Alg1, Alg2, ..., Algn (right).]

Figure 2.1: SECONDO Components (left), Architecture of Kernel (right)

We intend to build Parallel SECONDO as a hybrid system as well, by combining the Hadoop framework with a set of SECONDO databases. SECONDO [27, 28] is a “generic” single-computer database system, aiming to be filled with implementations of various DBMS data models, like relational, spatial and temporal ones. It represents different data types like spatial and moving objects and processes them with efficient algorithms [26, 19, 33].


Roughly, SECONDO consists of three modules: the kernel, the optimizer and the user interface, as shown in Figure 2.1. The user interface provides the front-end to end-users, with both a text and a graphical interface. The former is a command-line shell where both the input query and the result are represented as text. On the contrary, the latter can represent different types of query results graphically. E.g., a moving object’s trajectory can be illustrated on a real-world urban road network, while its movement is also animated vividly, as shown in Figure 2.2.

SECONDO accepts queries in two languages: the executable language and SQL. In the latter case, the SQL-like queries are first processed by the optimizer and transformed into optimal query plans, which are expressed in the SECONDO executable language. The optimization of the query plan is based on cost estimates for the involved predicates (operators), obtained with small sample relations. However, its details cannot be further discussed in this thesis, since they are far beyond the scope of our main topic.

At last, the SECONDO kernel module provides the command manager, the query processor and the storage manager, in order to represent and process different data types. Queries expressed in the executable language are first parsed into operator trees by the command manager, where each node is a SECONDO operator. The execution of the operator trees is controlled by the query processor. The leaf operators access the source data from the storage manager, while the root operator generates the final query result. Each internal operator gets its input from its predecessor, processes the tuples iteratively, and passes its output to the successive operator. In addition, the kernel’s capability is extended by algebra modules. An algebra module generally offers a set of type constructors to represent a kind of data objects [13], and also a set of operators to process them with efficient algorithms [33].

A Client/Server mechanism is also provided in SECONDO. It uses a server daemon named Monitor to listen for and process all requests from remote clients, based on the local database. It is possible to run several SECONDO Monitors on the same computer, where each Monitor is combined with a different database and listens on a unique TCP/IP port.

Parallel SECONDO keeps the SECONDO user interface untouched, using it as the front-end of the new system as well. Therefore, end-users are able to access both systems with the same interface, selecting either one to solve problems of different scales. Besides, Parallel SECONDO uses Hadoop to connect a set of SECONDO kernel modules. The Hadoop platform is only used as the communication layer and the task coordinator, just like in HadoopDB, while all data are processed within the SECONDO systems, in order to achieve the best performance. In the meantime, all functions like invoking the Hadoop platform and exchanging data between SECONDO databases over the network are implemented as two SECONDO algebras. More details about the Parallel SECONDO infrastructure are introduced in Chapter 3.

Figure 2.2: The Graphical User Interface for SECONDO and Parallel SECONDO

2.5 Parallel Processing on Specialized Data

SECONDO is mainly designed to process specialized data types that are normally not well supported in ordinary database systems, especially moving objects data. Likewise, Parallel SECONDO needs to inherit this capability, in order to handle these special types of data at large scale.

To the best of our knowledge, there do not exist many parallel systems that can systematically process these special data types. SpatialHadoop [17] proposes such a system by extending the Hadoop platform with spatial data types and functions. It uses a two-level index structure to store large amounts of spatial objects over the cluster, in order to efficiently perform spatial operations like range queries, kNN queries and spatial join. However, it forces end-users to extend the system with new data types and operations by programming in the MapReduce paradigm, building barriers for future developments. HadoopGIS [2] also attempts to combine Hadoop with the spatial processing engine RESQUE. It is integrated with Hive, hence queries can be expressed in SQL-like statements. Nevertheless, it implements the combination simply on the record level, since data have to be parsed at run time, causing considerable overhead as the data size grows. Besides, neither of them supports any technology for moving objects data.

There are also some studies proposed for processing certain spatial queries in parallel. In [42], the PBSM method is proposed to process the spatial join operation in parallel. It evenly divides spatial objects into disjoint partitions, so that each cluster node can process the join within one partition independently. SJMR [58] implements this method with the MapReduce paradigm and improves it by removing duplicated results in-stream. This method is also adopted in our study for processing the spatial and spatio-temporal join in Parallel SECONDO. BRACE [54] uses MapReduce to simulate agents’ behaviors in parallel. It abstracts behavioral simulations in the state-effect programming pattern, which can be processed as iterated parallel spatial join operations, with two consecutive MapReduce jobs at each time step.

At last, some studies process special data types with other parallel platforms instead of MapReduce or Hadoop. [6] specifically studies the grid-based map-matching procedure in parallel, which can be viewed as an application of the spatial join operation; it is implemented on IBM’s System S. Paradise [41] is a parallel geo-spatial DBMS based on a hierarchical infrastructure including both shared-memory and shared-nothing architectures. It is particularly prepared for dealing with large numbers of satellite images, by declustering them across the cluster based on their spatial attributes. Besides using the push model, it particularly imports the pull model to fetch large image data over the network only when they are required.


Chapter 3

System Infrastructure

This chapter mainly explains the construction of Parallel SECONDO. We start in Section 3.1 by introducing all system components prepared for coupling the Hadoop framework with our extensible data processing engine SECONDO. Many other Hadoop-based systems rely on the default HDFS (Hadoop Distributed File System) to shuffle intermediate data, thereby inheriting the Hadoop framework’s essential features, namely the balanced workload assignment and the large scalability. Parallel SECONDO instead provides a similar but much simpler module called PSFS (Parallel SECONDO File System), via which all intermediate data can be exchanged among distributed SECONDO databases directly, in order to achieve the best network transfer performance. Detailed explanations of PSFS follow in Section 3.2, together with evaluations comparing the performance of PSFS and HDFS. At last, Section 3.3 introduces the auxiliary tools that are provided to help end-users easily deploy and manage Parallel SECONDO on large-scale clusters.

To get the most out of this chapter, we recommend the reader to download and try Parallel SECONDO before reading the coming details. Parallel SECONDO has already been freely published along with SECONDO 3.3.2 and can be downloaded from our website 1. The reader can download its source code and easily set it up on either a single computer or a private cluster with the installation script. More specific deployment steps can be found in Appendix A.

Besides, we also provide virtual machine images for Parallel SECONDO, so that end-users can get familiar with the system and use it to process their practical data as quickly as possible. Firstly, the VMWare image contains a Parallel SECONDO system that is installed and configured on a single-computer Ubuntu system, with which Parallel SECONDO can be set up immediately by loading it into a VMWare virtual computer.

1 http://dna.fernuni-hagen.de/secondo/ParallelSecondo


[Figure: the master node hosts the master Data Server with Mini-SECONDO, the master database, the DS Catalog and the Hadoop operators, together with Hadoop’s JobTracker and NameNode; each slave node hosts slave Data Servers with Mini-SECONDO, a slave database and a DS Catalog, together with a TaskTracker and a DataNode; parallel queries (UDFs) flow through HDFS, while data are exchanged via PSFS.]

Figure 3.1: The Infrastructure of Parallel Secondo

Secondly, an AMI (Amazon Machine Image) with Parallel SECONDO is also provided. Based on this image, it is possible to create a large-scale cluster consisting of Amazon EC2 (Elastic Compute Cloud) instances on which Parallel SECONDO has already been deployed automatically. Both images are freely published on our website, and their specific usage is introduced in more detail in Appendix A.2.

3.1 System Components

Essentially, Parallel SECONDO is built as a hybrid system by efficiently coupling the Hadoop framework with a set of SECONDO databases. The infrastructure is shown in Figure 3.1. The system is inspired by HadoopDB [1]: the Hadoop framework mainly takes charge of assigning and scheduling parallel tasks running on a number of computers, while the tasks’ embedded procedures are pushed as much as possible into the distributed local database engines, in order to improve the system’s overall efficiency. Beyond this, Parallel SECONDO contains several special modules and mechanisms for achieving a better performance.

First, SECONDO provides many techniques for processing spatial and moving objects (spatio-temporal) data, which both require heavy geometric computations. Unlike common database procedures, geometric computations are CPU- and I/O-intensive. For example, as shown in Figure 3.2, the two region objects A and B do not intersect although their MBRs (Minimum Bounding Rectangles) overlap. Therefore, intersection detection requires comparing not only their MBRs but also their coordinates.


In SECONDO, detailed data like a spatial object’s precise coordinates are kept in a structure named FLOB (Faked Large OBject), which is not read until the data are really needed by the query. This mechanism helps to reduce useless I/O access as much as possible. As in the above example, if A’s and B’s MBRs do not intersect, then the comparison of their precise coordinates is not needed at all. This filter and refinement procedure [38] is generically used for processing joins on multi-dimensional objects.

In Parallel SECONDO, it is normal to process several tasks on the same computer. If all tasks need to read data from the same database, or from several databases located on the same disk, then the disk interference among these simultaneous tasks may considerably encumber the performance.

Figure 3.2: Two Region Objects

Regarding this issue, unlike Hadoop, which views every cluster computer as its basic processing unit, Parallel SECONDO uses the DS (Data Server) as its basic processing unit, as shown in Figure 3.1. Each DS contains at least a compact SECONDO system called Mini-SECONDO and its affiliated database. Nowadays it is common that even a low-end computer is equipped with a multi-core processor, large memory and several hard disks. Therefore, on computers with multiple hard disks we can set up several DSs, and the simultaneous tasks on such a computer involving different Mini-SECONDO databases can read their data independently, hence the disk interference is reduced to the bare minimum. Since several DSs may be set on the same computer, we designate one of them as the MS (Main Server), which contains the configuration information and management scripts that work for all DSs on the same computer.

In addition, Hadoop maintains two lists, masters and slaves, to indicate all computers of the current cluster that take part in the system. Parallel SECONDO provides a similar structure called the DS Catalog, since we use the DS as the basic processing unit. For every cluster node, the first of its DSs listed in the DS Catalog is denoted as the MS. The DS Catalog is duplicated on every MS, hence any DS can find the other DSs by scanning it. Each DS in the catalog is represented as one line with the format:

IP:PSFSNode:SecPort

The IP denotes the computer where the DS is located. At runtime, each Mini-SECONDO provides its service through the Client/Server mechanism with the Monitor daemon, as introduced in Section 2.4. The listening port of the Monitor is denoted by SecPort.

As mentioned before, Parallel SECONDO builds up a simple distributed file system, PSFS, in order to exchange intermediate data among Mini-SECONDO databases directly. Similar to HDFS, which needs to materialize impermanent data as disk files, PSFS caches the data as files on every DS, in a directory called PSFSNode that is usually located inside the DS. Note that there is a dependency between the PSFSNode and the SecPort, since different Mini-SECONDO instances are not allowed to fetch data from the same PSFSNode, and of course the same Mini-SECONDO cannot keep data in different PSFSNodes.
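For illustration, the following small Java sketch (our own, not part of Parallel SECONDO) splits such a catalog line into its three fields:

// Parses one DS Catalog line of the form IP:PSFSNode:SecPort, e.g.
// "192.168.0.1:/home/user/dataServer1/PSFSNode:11234" (example values ours).
class DataServerEntry {
  final String ip;
  final String psfsNode;
  final int secPort;

  DataServerEntry(String line) {
    int first = line.indexOf(':');            // ends the IP field
    int last = line.lastIndexOf(':');         // starts the port field
    this.ip = line.substring(0, first);
    this.psfsNode = line.substring(first + 1, last);
    this.secPort = Integer.parseInt(line.substring(last + 1));
  }
}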

Just as Hadoop characterizes its cluster computers as masters and slaves, Parallel SECONDO designates one DS as the mDS (master Data Server) and all the others as sDSs (slave Data Servers). Usually the mDS is set on the MS of the Hadoop master node. Besides, it is possible to use the mDS as an sDS at the same time.

Along with the distinction between mDS and sDSs, the Mini-SECONDO database in the mDS becomes the master database, while all the others are slave databases. The mDS contains some global data like the scale of the cluster, global index structures and also the meta data of distributed objects, which will be further explained in the next chapter. In contrast, slave databases contain only the local objects belonging to their corresponding sDSs. Therefore, the master database becomes the only entrance of the whole system, while all the other components are hidden from end-users, giving them the impression of still handling a standalone SECONDO database.

Consequently, both the text and the graphical interfaces provided for the master database are used directly as the interfaces for Parallel SECONDO. When a query containing certain parallel operators, which will be introduced in Section 4.1, is given, it is first transformed into Hadoop jobs by the mDS and then further divided into Map and Reduce tasks by the Hadoop framework, where each task contains certain SECONDO queries. During the Map or Reduce stage, tasks run in parallel on all sDSs, assigned and scheduled by the Hadoop framework, in order to achieve a balanced workload on the cluster. The encapsulated SECONDO queries, which can be processed sequentially, are pushed down to and executed by Mini-SECONDO. Every task fetches its required data either from the local Mini-SECONDO databases or remotely from other computers via PSFS.

In principle, the master database can still be used as a normal SECONDO database, where all sequential queries can be processed as usual. Considering that there exists considerable overhead for processing Hadoop jobs, mainly spent on network communication and task assignment [14], Parallel SECONDO is better suited to queries involving massive amounts of data, while small queries are more efficiently processed in a standalone SECONDO. In addition, for conventional SECONDO users who have stored much customized data in their personal databases, it is possible to set the existing database as the master database directly during the deployment, without migrating any data. Therefore, via the master database, end-users can choose the appropriate system, either standalone or parallel, to process various queries according to the size of the data at hand, making Parallel SECONDO flexible for all kinds of queries.

At last, during the installation of Parallel SECONDO, the Hadoop framework is also deployed by unpacking the software and setting up HDFS nodes on all MSs. Its configuration parameters are set automatically, according to the Parallel SECONDO preferences. Nevertheless, we keep the Hadoop framework independent from the other Parallel SECONDO components. No extension is made to the Hadoop core functionality, and the framework works entirely by its own mechanisms, without the participation of any Parallel SECONDO component. The reason to keep Hadoop and Parallel SECONDO separated from each other is to keep the system compatible with any possible update from either side. Especially for the SECONDO system, whenever new features are added, like a new type constructor or operator, all Mini-SECONDO systems can be updated with these new features immediately, without affecting the Hadoop framework at all.

3.2 Parallel SECONDO File System

Compared with parallel databases like Vertica, MapReduce and Hadoop provide a simple and flexible mechanism that enables them to achieve an impressive scalability without careful tuning. To the best of our knowledge, there exist no publications showing that such parallel databases can be deployed on clusters with more than one hundred computers [1, 43], while many systems built upon Hadoop, like Hive and Pig, can easily be set up on hundreds or even thousands of computers covering several data centers in different geographic regions.

The large-scale scalability is mainly owed to the MapReduce fault-tolerance mechanism. Data in Hadoop are kept in HDFS (Hadoop Distributed File System) as 〈key, value〉 pairs. All pairs with the same key are grouped into a split, and each split is processed by one Map/Reduce task. The splits are stored in constant-size blocks, 64 MB by default, and each block is duplicated on several computers in case some HDFS nodes crash and become inaccessible. The tasks are assigned to the slaves by the JobTracker and the TaskTrackers of the framework. For various reasons, certain tasks may hang or fail, and then straggle the whole job. In this case, they are replicated to other idle slaves, which fetch their input splits from the duplicated blocks. As soon as the first of the replicated tasks finishes, all its other task attempts are aborted. Thereby, with this brute-force approach, although certain computing and network resources are wasted for assigning and processing duplicated tasks, the performance of the complete job is improved.

In order to inherit the impressive fault tolerance and scalability of Hadoop, many of the Hadoop extensions introduced in Chapter 2 depend on HDFS to shuffle their intermediate data. Even hybrid systems like HadoopDB [1] and HadoopGIS [2], which basically use distributed standalone databases to process most database queries, still rely on HDFS for this purpose. Within these systems, data are first loaded into HDFS as 〈key, value〉 pairs, and can then be duplicated and shuffled by the Hadoop framework. Nevertheless, such a mechanism causes a lot of problems when we attempt to integrate SECONDO into the Hadoop framework.

Algorithm 1: Generic Hadoop-based Parallel Join

 1  function Map (k, v):
 2    for tag in (0, 1) do
 3      if tag == 0 then
 4        relation = SECONDO.Read(R);
 5      else if tag == 1 then
 6        relation = SECONDO.Read(S);
 7      foreach tuple in relation do
 8        ik = SECONDO.Project(tuple, JoinAttr);
 9        iv = concat(tag, tuple);
10        emit(ik, iv);

11  function Reduce (k, v_arr):
12    IR = SECONDO.CreateRelation();
13    IS = SECONDO.CreateRelation();
14    foreach v in v_arr do
15      tag = extractTag(v);
16      tuple = extractTuple(v);
17      if tag == 0 then
18        SECONDO.Append(IR, tuple);
19      else if tag == 1 then
20        SECONDO.Append(IS, tuple);
21    SECONDO.Join(IR, IS);


To clearly explain the problems caused by shuffling data via HDFS in Parallel SECONDO, we use the generic Hadoop-based parallel join operation as an example; its pseudocode is shown in Algorithm 1. It is a typical Hadoop reduce-side join of two relations R and S that have already been distributed over the slave databases. In the Map stage, each task exports the relations from its assigned slave database and sets them apart with the variable tag, which is used to distinguish tuples from the two source relations. It is set to 0 when the tuple comes from the relation R, and to 1 otherwise. Afterward, for each tuple the value of its join attribute is extracted as the key of the intermediate result, while the value contains the tag and the tuple itself. The produced 〈key, value〉 pairs are then emitted into HDFS and shuffled by the Hadoop framework. Tuples from both input relations that have the same key, i.e. the same join attribute value, are grouped into one Reduce task for the upcoming join processing.

Later in the Reduce stage, each task first creates two temporary relations IR and IS in the Mini-SECONDO database. Subsequently, the tag and tuple are extracted from every value, and the tuple is inserted into one of the temporary relations based on its tag value. At last, the task invokes its assigned Mini-SECONDO to process the join operation.

During the parallel join procedure, all intermediate tuples are uploaded to HDFS from the slave databases in the Map stage, as shown in lines 4 and 6. Later in the Reduce stage, these tuples are read from HDFS and loaded back into the slave databases in lines 18 and 20. The data have to be migrated between the two systems in order to allocate them evenly to the Reduce tasks and thereby keep a balanced workload over the cluster. However, this also generates considerable migration overhead, which increases as the cluster scale grows.

Apart from the unnecessary transfer overhead, the overhead of parsing keys from the given tuples is another obstacle to Parallel SECONDO using HDFS as its main data communication layer. As shown in line 8 of Algorithm 1, the Map task needs to extract the key, i.e. the join attribute value, from the slave database. Since this parsing overhead arises for every tuple, it also increases with the sizes of the given data sets.

Besides, the keys are required by HDFS to shuffle all emitted 〈key, value〉 pairs. This means that Hadoop would have to be aware of all possible data types that can be used as keys, in order to parse them from the Mini-SECONDO databases and group them by their values. However, as an extensible database system, SECONDO contains a lot of special data types, like spatial and moving objects, that usually are not supported in common data processing systems. Therefore, we would need to reimplement almost all of these data types in Hadoop if they were to be used as keys. This also sets barriers for the subsequent development of Parallel SECONDO, since we would have to implement not only all existing, nearly one hundred, SECONDO special data types, but also all future data types with which SECONDO will be extended.

Regarding these issues, a simple distributed file system is developed in Parallel SECONDO, in order to achieve a native storage mechanism. It is named PSFS (Parallel SECONDO File System), working similarly to HDFS but with a much simpler mechanism. Via PSFS, all intermediate data are exchanged directly among Mini-SECONDO databases without being uploaded to HDFS. They are partitioned and delivered by the Mini-SECONDO instances, in order to completely avoid the unnecessary transfer and parsing overhead.

The fault tolerance of HDFS is maintained in PSFS. According to the various customized queries, intermediate data in PSFS are partitioned into tiles, and each tile is materialized as a binary file. On one hand, each tile can be duplicated several times, based on the chained declustering technique [56, 30]. It is kept on several consecutive sDSs, while the order of the duplication candidates is decided by the DS Catalog. On the other hand, each tile generates a corresponding meta object, named FIPair (File Information Pair), following the 〈key, value〉 format. Each FIPair contains some basic information about its corresponding tile file, like on which sDSs the file is generated, stored and duplicated. These FIPairs are shuffled in HDFS, in order to assign and schedule the Map/Reduce tasks. When one task fails, Hadoop can duplicate or restart it based on its FIPair.
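A minimal sketch of this placement rule, assuming the sDSs are numbered consecutively by their order in the DS Catalog (the function and names are ours, not part of Parallel SECONDO):

// Chained declustering placement: the copies of a tile are kept on
// consecutive sDSs, in DS Catalog order, starting from the target sDS.
class PsfsPlacement {
  static int[] duplicateTargets(int target, int duplicateTimes, int numSDS) {
    int[] holders = new int[duplicateTimes];
    for (int i = 0; i < duplicateTimes; i++) {
      holders[i] = (target + i) % numSDS;   // wrap around the catalog
    }
    return holders;
  }
}
// E.g. duplicateTargets(4, 3, 6) yields {4, 5, 0}.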

In addition, Mini-SECONDO takes over the responsibility of partitioning SECONDO objects into pieces based on their various data types, and all data types used as FIPair keys are naturally supported by Hadoop. Hence all existing and future special data types of SECONDO are supported straightforwardly in Parallel SECONDO, without being reimplemented. This feature speeds up the development progress, making sure that all new SECONDO features can be used in Parallel SECONDO immediately.

3.2.1 PSFS Operators

In order to partition SECONDO data into tiles and then shuffle them among Mini-SECONDO databases directly, a set of basic PSFS operators is developed, listed in Table 3.1. They are divided into two kinds, export and import. Since most SECONDO objects are stored as tuple attributes, the export operators read a stream of SECONDO tuples and export them as binary blocks into disk files. In contrast, the import operators read the files created by the export operators and load the tuples back into the database as a stream of tuples. Note that the tuples are encapsulated as binary blocks, hence files created on a 64-bit computer cannot be imported into a database running on a 32-bit computer, and vice versa.

These operators can read/write data from/into files, which are kept either on the local file system or in PSFS. Consequently, their arguments are mainly divided into two groups, FileInfo and PSFSInfo.


Kind     Name          Signature
Export   fconsume      stream(tuple(T)) × FileInfo(T) × PSFSInfo → bool
         fdistribute   stream(tuple(T)) × Key × FileInfo(T) × PSFSInfo
                       → stream(tuple(Suffix:int, Num:int))
Import   ffeed         FileInfo(T) × PSFSInfo → stream(tuple(T))

Table 3.1: Extended Operators for Basic PSFS Access

Although the specific arguments in these groups vary with the operators, the FileInfo parameters are always mandatory, denoting the file position on the local file system, while the PSFSInfo parameters are optional, being used only for exchanging data in PSFS. Therefore, these operators can be used not only in PSFS but also on a single computer, to migrate data between databases. In the following, we introduce them under these two circumstances.

Standalone Access

On a standalone SECONDO system, all PSFS operators can work with only the FileInfo parameters, which include:

fileName:string × filePath:text × [suffix1:int] × [suffix2:int]

The fileName denotes a prefix that all target files are named after, while the filePath indicates the local disk path where the files are stored. The other two arguments are optional, defining possible file name suffixes.

The export operator fconsume creates two files as its result. One is the binary data file that contains all input tuples. The other is the text type file, containing the schema of the input tuple stream. The type file is named fileName_type, while the data file name is set to fileName[_suffix1][_suffix2]. If both optional suffix parameters are omitted, the data file name is exactly the given fileName. Both files are stored in the directory indicated by filePath, which should be an absolute disk path if it is not empty; otherwise it denotes the PSFSNode set in the DS Catalog. The operator returns a boolean result, indicating whether both files were successfully created.

Consequently, the import operator ffeed locates both the type and the data file based on its given FileInfo parameters, then reads the tuples from the data file back into the database.

For every SECONDO data type that can be used as a tuple attribute, its storage is divided into three parts: root, extension and FLOB. It is organized in this way in order to reduce the disk I/O as much as possible, as discussed for Figure 3.2. Take the region type again as the example: its basic information like the MBR and its length are kept in the root record, since they are frequently used by various operations. On the contrary, its precise coordinates are needed less often, hence they are kept in the extension or the FLOB record, depending on whether the data are larger than a certain length threshold. During query processing, the root and extension records are always read into memory, while the FLOB is left untouched on disk until it is really needed.

[Figure 3.3: (a) Tuple Block Structure: BS | TS | Root | Extension | FLOB. (b) Data File Structure: Tuple | Tuple | Tuple | ... | checksum.]

Figure 3.3: Data Structure in PSFS files

Consequently, each tuple in the data file is serialized in the same way, with the structure shown in Figure 3.3a. All attributes’ root records are saved in the Root block, followed by the Extension block with all attributes’ extension records. At last, the FLOB records of all attributes are stored in the FLOB block. Two size fields are kept at the beginning of each tuple block. The first, BS, stands for block size; it is a 4-byte unsigned integer giving the size of the whole block. The second, TS, means tuple size; it is a 2-byte unsigned integer holding the size of only the Root and Extension blocks. In the data file, tuple blocks are stored contiguously, in the order they are fetched from the database, as shown in Figure 3.3b. A checksum record is stored at the end of the file, in order to verify the file’s correctness in case the data were not completely transferred.
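As an illustration of this layout, the following Java fragment (our own sketch; in particular, whether BS counts its own header bytes is our assumption) serializes one tuple block with java.nio.ByteBuffer:

import java.nio.ByteBuffer;

// Serializes one tuple block as BS (4 bytes) | TS (2 bytes) | Root |
// Extension | FLOB, following Figure 3.3a.
class TupleBlockWriter {
  static byte[] writeTupleBlock(byte[] root, byte[] ext, byte[] flob) {
    int ts = root.length + ext.length;        // Root and Extension only
    int bs = 4 + 2 + ts + flob.length;        // size of the whole block
    ByteBuffer buf = ByteBuffer.allocate(bs);
    buf.putInt(bs);                           // BS: 4-byte block size
    buf.putShort((short) ts);                 // TS: 2-byte tuple size
    buf.put(root).put(ext).put(flob);
    return buf.array();
  }
}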

The type file is kept separate from the data file to prevent meaningless data transfers. A SECONDO procedure first needs the schemas of all involved objects to verify the correctness of the query and to generate the operator tree, before processing it. In this way, if the query turns out to be incorrectly stated after checking the schemas in the type files, it can be aborted immediately, and the data file does not need to be read or transferred at all.

The other export operator, fdistribute, is actually prepared particularly for PSFS, although it can also be used on a standalone computer. It divides the given stream of tuples into pieces based on the values of a certain key attribute and generates a data file for each piece. Without considering the PSFSInfo arguments, its signature is:

stream(tuple(...)) × fileName × path × [Suffix1]
  × keyAttribute × [nBuckets] × [KKA]
→ stream(tuple(Suffix2, Num))

In particular, this operator needs a set of Key arguments, comprising:

keyAttribute × [nBuckets:int] × [KKA:bool]

The keyAttribute indicates one attribute of the given stream, by which the tuples are divided into pieces. We use its hash value to partition the tuples, while the optional nBuckets sets the size of the hash table. The last argument, KKA, stands for Keep Key Attribute; it decides whether the key attribute is kept after the partitioning. By default it is set to false.

Its FileInfo parameters are almost the same as for fconsume, except that it needs only one optional suffix parameter, since the second suffix is set to the key attribute’s hash value.

As its result, fdistribute creates one type file and a set of data files. For each data file, its second suffix and the number of tuples it contains are returned as one output tuple of the operator.
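The partitioning itself amounts to a simple hash assignment; the sketch below is our own illustration, assuming that the bucket number becomes the data file’s second suffix:

// Maps a tuple to a bucket: the hash value of the key attribute, reduced
// to the hash table size, determines the data file the tuple goes to.
class HashPartitioner {
  static int bucketFor(Object keyAttributeValue, int nBuckets) {
    int h = keyAttributeValue.hashCode();
    return (h & 0x7fffffff) % nBuckets;   // keep the bucket non-negative
  }
}
// E.g. with fileName "R" and nBuckets = 12, a tuple falling into bucket 5
// would be appended to the data file "R_5".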

Cluster Access

In Parallel SECONDO, the PSFS operators exchange data among Mini-SECONDO databases via PSFS when their PSFSInfo parameters are set. These parameters do not need to be set explicitly by end-users, since the Hadoop operators, which will be introduced in the next chapter, all use the PSFS operators implicitly within their Map/Reduce stages.

For example, the complete fconsume signature is:

stream(tuple(T))
  × fileName × filePath × [suffix1] × [suffix2]
  × [typeDS1] × [typeDS2]
  × [targetDS × duplicateTimes]
→ bool

The PSFSInfo parameters indicate the target files’ PSFS locations, through arguments like typeDS and targetDS that denote specific sDSs in the DS Catalog. The operator first creates the type and the data file in its local PSFS node. If the typeDS parameters are set, the type file is duplicated to at most two other sDSs. Besides, the targetDS parameter decides on which sDS the data file is duplicated. If duplicateTimes is set larger than one, the data file is duplicated on the PSFS nodes of duplicateTimes adjacent sDSs, starting from the target sDS.

The PSFSInfo arguments of the fdistribute operator are the same as those of fconsume, except that it duplicates all its created data files to the target sDS and its consecutive sDSs when duplicateTimes is larger than one.

The complete signature of ffeed is:

fileName × filePath × [rowNum] × [colNum]
  × [typeDS]
  × [producerDS × targetDS × duplicateTimes]
→ stream(tuple(T))

The ffeed operator looks for the type file on only one remote sDS. Besides, its PSFSInfo arguments contain the producerDS. It is needed in order to distinguish data files that were produced by different sDSs but duplicated on the same sDS. In the end, the optional parameters rowNum and colNum are mainly used to identify the target file’s two integer suffixes, which were introduced above for the standalone access.

3.2.2 PSFS vs. HDFS

In order to compare the performance of HDFS and PSFS for shuffling intermediate data, a basic evaluation is prepared. It performs a join operation in Parallel SECONDO. The first method is named HDJ (Hadoop Distributed Join), as it uses HDFS to shuffle the intermediate data, while the second is named SDJ (SECONDO Distributed Join), since the major part of the intermediate data is exchanged among Mini-SECONDO databases directly via PSFS. Since both methods use many operators not mentioned above, we will introduce them in more detail in the next chapter. Here we only use them to demonstrate the performance difference between HDFS and PSFS.

We choose the 12th query of the TPC-H benchmark 1 to test our methods. The benchmark data set contains a suite of business-oriented data. When the scale factor is set to one, the two example relations, lineitem and orders, have 6,001,215 and 1,500,000 tuples respectively, taking in total about 1.3 GB of disk space.

The SQL expression of the example query is shown in Figure 3.4. It counts delayed deliveries caused by choosing cheaper shipping modes. The two time conditions on the receipt date (lines 23-24) are removed in our evaluation, in order to reduce the query’s selectivity and make it suitable for evaluating parallel systems. Parallel SECONDO does not provide SQL statements yet, but the queries can be stated in the SECONDO executable language, which will be introduced in more detail in Chapter 4. The evaluation queries are therefore listed in Appendix B.1.

1 The TPC Benchmark(TM) H: http://www.tpc.org/hspec.html

 1  SELECT
 2    l_shipmode,
 3    sum(case
 4      when o_orderpriority = '1-URGENT'
 5        or o_orderpriority = '2-HIGH'
 6      then 1
 7      else 0
 8    end) as high_line_count,
 9    sum(case
10      when o_orderpriority <> '1-URGENT'
11        and o_orderpriority <> '2-HIGH'
12      then 1
13      else 0
14    end) as low_line_count
15  FROM
16    orders,
17    lineitem
18  WHERE
19    o_orderkey = l_orderkey
20    and l_shipmode in ('MAIL', 'SHIP')
21    and l_commitdate < l_receiptdate
22    and l_shipdate < l_commitdate
23    and l_receiptdate >= date '[DATE]'
24    and l_receiptdate < date '[DATE]' + interval '1' year
25  GROUP BY
26    l_shipmode
27  ORDER BY
28    l_shipmode;

Figure 3.4: The SQL Statement of 12th TPC-H Query

In evaluations of parallel systems, the coefficients speed-up and scale-up [11] are often used. There are slightly different interpretations of speed-up in the literature, namely the ratio of the time required on a small system over the time required on a large system, or the ratio of the time for the sequential query over the time for the parallel query [25]. We use the latter interpretation in cases where we are able to compare a query running sequentially in standard SECONDO with the parallel version. To make this clear, we also call it Parallel Improvement (PI).
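Written as a formula (the notation is ours), for a cluster of n slave Data Servers:

    PI(n) = T_seq / T_par(n)

where T_seq is the elapsed time of the query processed sequentially in standard SECONDO, and T_par(n) is the elapsed time of the parallel version on n slave Data Servers.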

The evaluations here are made on our small-scale six-computer cluster.

[Figure 3.5 shows two plots comparing HDJ and SDJ: (a) Speed-up, plotting the Parallel Improvement (times) against the cluster scale from 1 to 6; (b) Scale-up, plotting the elapsed time (sec) against the cluster scale from 1 to 6.]

Figure 3.5: Evaluations on 12th TPC-H Query

Each computer has an AMD Phenom(tm) II X6 1055T processor with six cores, 8 GB memory and two 500 GB hard disks, and runs Ubuntu 10.04.2-LTS-Desktop-64 bit as the operating system. A Hadoop 0.20.2 framework is deployed on the cluster, taking one computer as the master node. Since there are two hard disks in every computer, each computer hosts two Data Servers, one on each hard disk; the master DS is also used as a slave DS. Hence in total, this Parallel SECONDO testbed has 6 machines, 6 processors, 36 cores, 48 GB memory, 12 disks and 12 slave Data Servers.

The evaluation of both methods is shown in Figure 3.5, with the scale of the cluster increasing from 1 to 6. In the speed-up evaluation, shown in Figure 3.5a, a constant-size TPC-H data set with a scale factor of 5 is used. Clearly, both methods achieve a linear speed-up, and SDJ always performs better than HDJ. The scale-up evaluation is shown in Figure 3.5b. Here the scale factor of the data set also increases from 1 to 6. The result shows that both methods keep a remarkably stable scale-up when more computers are used, and again SDJ always performs better than HDJ. The elapsed time of both increases slowly when larger data sets are processed. This is mainly caused by the transfer overhead growing with the data sets, since the network resource is shared by all slaves; it does not grow, no matter how many slaves are used.

3.3 System Management Components

SECONDO is developed as cross-platform software, supporting Linux, Mac and Windows systems. Besides, it provides a detailed installation guide to help end-users install the system easily on different platforms. Although Hadoop is already quite easy to install compared with other parallel databases, end-users first need to get familiar with the underlying cluster management, and there is no installation tool prepared. Moreover, when managing a cluster consisting of tens or even hundreds of computers, even routine work like installing, updating, starting and stopping the system becomes overburdened with trivial details and sets barriers for end-users.

Kind       Name                            Function
Setup      ps-cluster-format               Initialize the system
           ps-cluster-uninstall            Uninstall the system
Update     ps-secondo-buildMini            Update Mini-SECONDOs
Control    ps-startMonitors                Start local SECONDO monitors
           ps-stopMonitors                 Stop local SECONDO monitors
           ps-start-AllMonitors            Start all SECONDO monitors
           ps-stop-AllMonitors             Stop all SECONDO monitors
Monitor    ps-cluster-queryMonitorStatus   Check Mini-SECONDO status
Interface  ps-startTTY                     Start a local text interface
           ps-startTTYCS                   Start a text interface accessing
                                           any Mini-SECONDO monitor

Table 3.2: Auxiliary Tools for Parallel SECONDO


Regarding this problem, Parallel SECONDO provides a set of auxiliary scripts to help end-users get familiar with Parallel SECONDO quickly, and hence to use it to solve their own problems as soon as possible. All these tools are written as bash scripts, hence at present Parallel SECONDO supports only Unix-based systems.

All auxiliary tools use ps as the prefix of their names, as shown in Table 3.2. The mDS contains all these tools, while a part of them is duplicated on every MS in order to manage the local sDSs. They are roughly divided into five kinds, shown in the first column of the table.

The Setup kind takes charge of the deployment of Parallel SECONDO with two tools. The first, named ps-cluster-format, initializes Parallel SECONDO on a private cluster by setting up all Data Servers and the Hadoop framework. In contrast, ps-cluster-uninstall performs the opposite function, removing Hadoop and all DSs safely. A parameter file is needed to guide the setup; it will be explained in more detail in Section 3.3.1. We also provide a graphical preference editor, which creates the parameter file after first inspecting the cluster.

Secondly, SECONDO is designed as an extensible database platform, where new data models are added as algebras. In order to keep this extensibility in Parallel SECONDO as well, ps-secondo-buildMini is implemented. It extracts the necessary components from the SECONDO installation on the master node and distributes them to all Data Servers as Mini-SECONDO. Thereby, all Data Servers can support new features immediately. Updating Mini-SECONDO does not affect the setup of the Hadoop platform, hence the independence of both systems is preserved.

Thirdly, two steps are required to start Parallel SECONDO. First, the Hadoop platform is started with start-all.sh, which is provided by Hadoop itself. Then ps-start-AllMonitors is used to start all Mini-SECONDO Monitors on the whole cluster; they can later be turned off with ps-stop-AllMonitors. These two scripts rely on ps-startMonitors and ps-stopMonitors, respectively, to turn on and off the Mini-SECONDO Monitors on each single computer.

Fourthly, it happens that certain Monitors fail to start up for various reasons. Therefore, ps-cluster-queryMonitorStatus is prepared to list all running Monitors on the cluster.

At last, Parallel SECONDO inherits both the text and the graphical interface of SECONDO. The ps-startTTYCS script starts a text interface to any running Monitor in the cluster. The graphical interface can be started as usual, connecting to Parallel SECONDO by setting the IP address and port number of the Monitor on the mDS.

We also provide tools to deploy Parallel SECONDO on clusters composed of virtual instances rented from AWS EC2. They will be explained in more detail in Section 5.2.

3.3.1 Parallel SECONDO Preferences

Parameters in Hadoop are defined in several XML files, in a format different from the parameter file used in SECONDO. In order to integrate both systems’ parameters, two Parallel SECONDO preference files are prepared.

Both files follow the SECONDO configuration format, being divided into several sections. Each section sets the parameters for one particular feature of the system. For example, since SECONDO uses Berkeley DB as the underlying storage system, all its parameters are defined in the section named “BerkeleyDB” of the SECONDO parameter file. In addition, example files with common parameter settings are prepared for both preference files. With them, a single-computer Parallel SECONDO can be set up immediately without additional preparation.

The first file provides the runtime parameters for all Mini-SECONDO databases. Apart from an additional section prepared for Parallel SECONDO, it is almost the same as a common SECONDO configuration file. It is created during the installation of the system by copying the current SECONDO configuration file and adding the Parallel SECONDO section; it is then duplicated into the Mini-SECONDO of every sDS.

The second file is used only for the installation of Parallel SECONDO and includes all Hadoop parameters. Through this mechanism, the Hadoop framework can be installed automatically along with the setup of Parallel SECONDO. Therefore, end-users can prepare a ready-to-use system within several minutes on a large-scale private cluster, making the deployment as easy as possible.

In the second configuration file, the parameters are divided into three sections: Hadoop, Cluster and Options. The Hadoop section sets all parameters needed by the Hadoop framework. They are read by the installation script of Parallel SECONDO and then written to the corresponding Hadoop XML configuration files. The parameters are listed line by line, following the format below.

[fileName]:[title] = [value]

The fileName indicates a specific Hadoop configuration file, while the title sets the name and the value sets the content of the parameter. For example, a line like

mapred-site.xml:mapred.tasktracker.map.tasks.maximum = 6

defines the mapred.tasktracker.map.tasks.maximum parameter in Hadoop’s mapred-site.xml file, allowing at most six simultaneous Map tasks on each slave during the runtime of Hadoop jobs.
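Such a line could be translated into a Hadoop configuration property roughly as follows (a hypothetical Java sketch of what the installation script does; the real tool is a bash script):

import org.apache.hadoop.conf.Configuration;

// Applies one preference line "[fileName]:[title] = [value]" to the
// configuration destined for the named Hadoop XML file.
class PreferenceLoader {
  static void applyPreference(String line, Configuration conf) {
    int colon = line.indexOf(':');
    String fileName = line.substring(0, colon).trim();  // e.g. mapred-site.xml,
                                                        // selects the target XML file
    String[] kv = line.substring(colon + 1).split("=", 2);
    conf.set(kv[0].trim(), kv[1].trim());               // written back into fileName
  }
}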

All Data Servers are defined in the Cluster section, again line by line; each line stands for one DS and follows the format:

{"Master=","Slaves+="}IP:DS_Path:MSPort

Each DS is set as either the mDS or an sDS. Exactly one mDS is needed, while it is possible to use one sDS as the mDS at the same time. Therefore, we use “+=” as the connector for an sDS and “=” for the mDS. Besides, every DS is defined by three elements: IP, DS_Path and MSPort. The IP indicates the IP address of the computer where the DS is set. The DS_Path indicates the directory where the DS is stored on that computer, and the MSPort defines the port number on which its Mini-SECONDO listens. For example, the following Cluster setting

[Cluster]
Master = 192.168.0.1:/Home/dataServer1:11234
Slaves += 192.168.0.1:/Home/dataServer1:11234
Slaves += 192.168.0.1:/Home/dataServer2:12234

describes a Parallel SECONDO installation on a single computer with the IP address 192.168.0.1. It contains two Data Servers, /Home/dataServer1 and /Home/dataServer2. The Mini-SECONDO of the first DS listens on port 11234, while the second Mini-SECONDO listens on port 12234. The first DS is used as the mDS and an sDS at the same time.


The last section Options defines parameters that are specific to Parallel SECONDO. At present, there is only one option parameter, named NS4Master (Normal Secondo For Master database). It is prepared for end-users who have an existing SECONDO database. If this parameter is set to "true", then Parallel SECONDO uses this database as its master database directly, without migrating any data.

The above two parameter files simplify the installation of Parallel SECONDO, enabling end-users to deploy the system on large-scale clusters without learning too many details about the underlying Hadoop framework and the cluster construction. However, according to practical observations, we noticed that it is still difficult to set up the system for end-users who are completely unfamiliar with computer cluster management. For example, both Hadoop and Parallel SECONDO need the SSH protocol as the communication level of the cluster, which is not activated by default in common operating systems. In order to help end-users detect the underlying cluster environment and prepare all required parameters, we provide a graphical preference editor for Parallel SECONDO.

3.3.2 Graphical Preference Editor

The graphical editor is used to further simplify the deployment of Parallel SECONDO, helping end-users to prepare the Parallel SECONDO configuration files and to check that the underlying cluster is ready for the installation. Its main frame is shown in Figure 3.6a, listing some basic information about the local computer. Before the installation, the master computer of the cluster must have a common SECONDO installed and the needed Apache Hadoop downloaded. If all checks pass, the frame looks like Figure 3.6a; otherwise it looks like Figure 3.6b, indicating the missing parts with red warning messages.

Figure 3.6: The Main Frame. (a) Main Frame; (b) Main Frame with Error Information


When the complete system check succeeds, the four buttons on the main frame are activated, allowing end-users to prepare the preference files for different environments. The button Single Node prepares the files for installing the system on a single computer. The second button Simple Cluster asks end-users to describe a private cluster, while all other parameters are set to default values. The Advanced button allows end-users to set special values regarding their particular cluster setup. At last, the button Import can read existing Parallel SECONDO preference files, so that end-users can make detailed settings based on the current environment.

Single Node

The "Single Node" button creates the preference files for deploying Parallel SECONDO on a single computer, in order to give end-users a trial of the system. The frame prompted by this button is very simple, as shown in Figure 3.7, asking end-users to set the IP address of the computer where they would like to install the system and also the "NS4M" feature. The "NS4M" option determines whether the current SECONDO database, if it exists, is used as the master database. By default, the target IP address is filled in with the address of the current computer.

Figure 3.7: The Frame for Single-Computer Installation

After clicking the "Create" button, the preference files are created automatically with all default parameters. The system contains one DS, which is used as the mDS and also the sDS. In the meantime, if the files are prepared for the current computer, a self-detection program is started to check whether all parameters are available.

The self-detection includes the following procedures:

1. Check the availability of the SSH and Screen services. Both are required to make sure that Parallel SECONDO runs correctly.

2. Check that all Hadoop and SECONDO port numbers are available on the current computer, without being taken by other processes.


3. Check that the directory prepared for the DS is available to the end-user, i.e., the current user should have read and write access to that disk partition.

If all the checks succeed, end-users can use the generated preference files to deploy the system directly with the auxiliary tool ps-cluster-format.

Simple Cluster

Figure 3.8: The Frame for Simple Cluster Installation. (a) Main Frame; (b) Cluster Definition

The "Simple Cluster" setting, as shown in Figure 3.8a, prepares the preference files by defining the cluster only, while all the other parameters are set to default values. The frame mainly contains a table listing all Data Servers, where each row defines one DS with its IP address, DS directory and Mini-SECONDO port. Each DS can be set as either Master, Slave or both.

At the right side of the DS table, we provide three ways to insert Data Servers into the table. First, end-users can simply add one more DS by clicking the "Add" button. Second, end-users can prepare DSs in a text file in comma-separated format and then use the "Import" button to read all Data Servers. At last, end-users can use the "Describe" button to set up a cluster with the frame shown in Figure 3.8b.

The cluster description frame allows end-users to define a set of DSs on computers with consecutive IP addresses. Besides, the DS directory can be set for every included computer. At last, all described Data Servers are inserted into the table of the cluster frame.

After adding all DSs, end-users can check the availability of the cluster by clicking the "Check" button. They can also use the "Create" button to check the availability and create the preference files at the same time. The checking procedure verifies not only all features required for the single-node check, but also ensures that all involved computers are accessible without needing a password.

Advanced and Import Setting

Figure 3.9: Advanced Setting Frame. (a) SECONDO Setting; (b) Hadoop Setting; (c) Cluster Setting

For users who wish to have more flexible and specific settings in Parallel SECONDO, like changing Hadoop and Mini-SECONDO parameters, we provide the "Advanced" setting frame, as shown in Figure 3.9. The frame contains three tabs, adjusting the parameters of Mini-SECONDO, Hadoop and the cluster, respectively. The user-set parameters are automatically compared with the default values and are highlighted if they differ. The parameters of the tables in Figures 3.9a and 3.9b are divided by sections, whose names are shown in blue rows. Some parameters, marked in uneditable grey rows in the Cluster setting table, are constant settings of Parallel SECONDO and hence cannot be changed by any user. The cluster setting tab is almost the same as the simple cluster setting, where end-users can add a large number of DSs by importing a CSV file. In this frame, end-users can check and create SECONDO and Parallel SECONDO preference files either separately or together, adding more flexibility to the system.

Considering that end-users may have already generated their own Parallel SECONDO preference files, we provide the "Import" button to read the existing files and let end-users make further adjustments in the system. It uses the same frame as the "Advanced" setting.


Chapter 4

Declarative Language Support

Due to their simplicity and scalability, the MapReduce paradigm and its "official" open-source implementation Hadoop are widely accepted by many institutions, from both industry and academia, to process their large-scale data. However, since they put more weight on system scalability, many conventional database technologies are absent [43, 48], especially the support of high-level declarative languages like SQL. There exist some explanations [10] that the functions for MapReduce tasks are too complicated to be expressed in SQL, but the low-level and rigid MapReduce paradigm still creates barriers for coworkers and causes custom user code to be very difficult to maintain and reuse.

Regarding this issue, early Hadoop studies proposed declarative languages like PigLatin [37] and Hive-QL [50]. Both enable end-users to state their queries in an SQL-like manner, then compile the queries into Hadoop jobs. However, these compilers are basically built upon simple, rule-based optimizers. No cost-based optimization techniques are used, hence the generated Hadoop jobs are not always efficient, especially for queries involving complicated join and aggregate operations [1].

After many years of development, using declarative languages to express queries in Hadoop has gradually become the mainstream trend. In [46], Google proposes F1, a distributed relational database system, which includes a fully functional distributed SQL query engine. Facebook also proposes its Presto engine, with which end-users can analyze gigabytes to petabytes of data by stating SQL queries. All these indicate that MapReduce and Hadoop have turned from the "NoSQL" to the "No, SQL" era.

Naturally, Parallel SECONDO also follows this trend and provides its own declarative language. Instead of being designed from scratch, Parallel SECONDO inherits the language support from SECONDO, comprising two levels:

1. An SQL-like language, with which queries are stated with SQL notations like SELECT, FROM and WHERE. Special SECONDO operators can be stated within the queries directly, although they are not included in the SQL standard. At last, the optimizer generates the query plans, which are formulated in the second-level language.

2. An executable language. In this language, the queries are formulated with database objects and operators of the active algebras in the system. They represent the query plans precisely and hence can be carried out by the SECONDO query processor directly.

Compared with the abstract SQL-like expression, the second-level language is more complex and requires end-users to learn specialized operations and their syntax. However, it is also more flexible, since end-users can build up their queries with arbitrary operators in order to achieve the best performance. Besides, extending the second level for parallel processing is the necessary first step for the subsequent work. Therefore, at the present stage, Parallel SECONDO provides the declarative language on the second level. Consequently, a parallel data model is proposed in Section 4.1, including the representation of distributed objects for various SECONDO data and also the special operators for stating Map/Reduce procedures.

Afterward, we introduce the PBSM (Partition-Based Spatial Merge) method in Section 4.2. It is a spatial join method that is widely used for processing various join-related problems in Parallel SECONDO. Different kinds of approaches are introduced to process spatio-temporal (moving objects) data, and also to improve the performance by building up index structures at runtime. Then we evaluate this method on processing both spatial and spatio-temporal joins in parallel.

Apart from the join queries, Parallel SECONDO can also express all kinds of moving-objects queries listed in the benchmark BerlinMOD [16], in order to improve the performance of generating large-scale benchmark data sets and of processing certain costly example queries. Two examples are introduced in more detail in Section 4.3.

4.1 Parallel Data Model

4.1.1 SECONDO Executable Language

Representing a query in the SECONDO executable language is intuitive, since end-users can specify the complete procedure with step-by-step operators. For example, Figure 4.1 depicts an SQL-like query, while its corresponding executable query is shown in Figure 4.2. Note that in the real SECONDO system, the words in the SQL-like query must not start with capital letters; we state the example in this way only to keep all involved object names the same.

SELECT [SName, Bev]
FROM Staedte
WHERE [Bev > 270000, SName starts "S"]

Figure 4.1: Query in SQL-like Language

query Staedte feed
filter[(.Bev > 270000) and (.SName starts "S")]
project[SName, Bev]
consume

Figure 4.2: Query in SECONDO Executable Language

This example query selects tuples from Staedte, which is a relation of the SECONDO sample database berlintest, including the following attributes:

{SName:string, Bev:int, PLZ:int, Vorwahl:int, Kennzeichen:int}

For historical reasons, all names here are in German. Staedte means cities; the relation contains information about some German cities. The attribute SName indicates the city name, while Bev gives its population. PLZ denotes the city's postal code, Vorwahl the area code, and Kennzeichen the prefix of all vehicle licenses within that city.

This SQL-like query selects the tuples in which the Bev value is larger than 270,000 and the SName starts with "S". Here starts is a SECONDO operator, returning true if the given string starts with the condition text.

Consequently, in the executable language, the example query simply formulates this procedure with database objects and operators, as shown in Figure 4.2. It starts with query to indicate that the following lines belong to a SECONDO query. The feed operator is a postfix operator that scans a relation sequentially and loads its tuples into a memory stream. Each tuple is then fed into the operator filter with two condition clauses. Within each clause, the "." takes the attribute value from the input tuple and delivers it to the upcoming operator. Afterward, the project operator extracts only the mentioned attributes from the input tuples. At last, all result tuples are displayed as a relation on the terminal via the consume operator.

Apparently, executable queries are basically composed of data objects and operators. In order to formulate parallel queries in the executable language as well, a proper representation for all distributed objects, together with operators describing various Map/Reduce procedures, is required.

4.1.2 Representing Distributed Objects

In SECONDO, data like a rectangle, a moving object's historical trajectory or a time interval can be stored as individual objects and invoked by their names. Such data can also be stored together with other feature information and saved as a tuple in a relation. For example, in a relation about city roads, each tuple contains a street's shape (as a line) and also further information like name, length, speed limit, etc. Nevertheless, not all objects can be kept in relations; e.g., index structures like a B-Tree or R-Tree can only be saved as individual objects.

Unlike the standalone SECONDO system that keeps all data in the local database, Parallel SECONDO needs to store the data distributively and transfer them among the sDSs during query procedures. In order to uniformly represent these distributed data and enable end-users to access them through the master database only, the following issues are considered.

• Till now, nearly one hundred data types have been proposed in SECONDO [26, 27] and more may be added. Therefore, the distributed data representation should be designed as a generic structure, being available for all, or at least most, existing and future SECONDO data types.

• The sizes of different objects vary a lot. For example, a rectangle is small, containing only the positions of its bottom-left and top-right vertices. However, a moving object like a vehicle's complete trajectory over a long period may contain hundreds or even thousands of vertices, being as large as tens of megabytes.

• In different circumstances, a relation can be either duplicated or partitioned on the sDSs. Take the example of matching vehicle trajectories to a road network. It is possible to duplicate the road network to all sDSs and then evenly distribute the trajectories by an arbitrary non-geometric feature in a round-robin manner. It is also possible to partition both the network and the trajectories based on their geometric positions. In addition, different partition strategies can be used for the same relation. Therefore, the distributed data representation should be flexible enough to support all these circumstances.

• As mentioned in Section 3.2, only SECONDO relations can be materialized as disk files, then be stored and shuffled in PSFS. However, it is also necessary to represent the distributed data that can only be stored in slave databases.

          Small Size    Large Size
Equal     DELIVERABLE   Duplicated PS-Matrix
Unequal   PS-Matrix     PS-Matrix

Table 4.1: The Classification of Parallel SECONDO Distributed Objects

Regarding all the above issues, we divide the distributed data into four kinds based on two aspects, size and equality, as shown in Table 4.1. The size is usually decided by the data type. As said above, a rectangle is small since it contains only two vertices, but a polygon is large since it may contain hundreds of vertices. Normally the data types containing no FLOB structure are classified as small distributed objects. The equality depends on whether the data is duplicated or partitioned on the sDSs. In the former case, every sDS has the same data as the others, while in the latter case, each sDS contains a disjoint part of the same data set.

The DELIVERABLE data represent the equal small objects, which are often used as global parameters. For example, in a parallel containment query for spatial objects, the query window is small and should be broadcast to all slave databases. Such data are not worth loading into all slave databases in advance, since they vary frequently between queries. In this case, we define them as DELIVERABLE data.

Not all SECONDO objects can be used as DELIVERABLE data, since the parsing and duplicating overhead for large objects is considerable. For this reason, only 34 SECONDO data types are set as DELIVERABLE, all listed in Table 4.2. Among them, only two are allowed to contain FLOB data.

DELIVERABLE objects are stored only on the master database as common SECONDO objects; thus they can be easily created and updated. They are invoked by their names and broadcast to every slave database by embedding the values inside the task queries as constant objects. If a non-DELIVERABLE object is invoked, it is detected by the master database and the query is aborted immediately.

Apart from DELIVERABLE objects, all other SECONDO objects are distributed to sDSs with a structure named PS-Matrix, shown in Figure 4.3. A PS-Matrix is derived by partitioning a SECONDO object o with two distribution functions, Row(o) and Column(o).


No.  Type          Algebra                    Flob Number
1    date          DateAlgebra                0
2    instant       DateTimeAlgebra            0
3    duration      DateTimeAlgebra            0
4    edge          GraphAlgebra               0
5    vertex        GraphAlgebra               0
6    gpoint        NetworkAlgebra             0
7    rect4         RectangleAlgebra           0
8    rect          RectangleAlgebra           0
9    rect3         RectangleAlgebra           0
10   rect8         RectangleAlgebra           0
11   point         SpatialAlgebra             0
12   spatiallabel  SpatialAlgebra             0
13   geoid         SpatialAlgebra             0
14   int           StandardAlgebra            0
15   real          StandardAlgebra            0
16   bool          StandardAlgebra            0
17   string        StandardAlgebra            0
18   longint       StandardAlgebra            0
19   ulabel        SymbolicTrajectoryAlgebra  0
20   ubool         TemporalAlgebra            0
21   ibool         TemporalAlgebra            0
22   upoint        TemporalAlgebra            0
23   ipoint        TemporalAlgebra            0
24   ireal         TemporalAlgebra            0
25   uint          TemporalAlgebra            0
26   ureal         TemporalAlgebra            0
27   cellgrid3d    TemporalAlgebra            0
28   cellgrid2d    TemporalAlgebra            0
29   iint          TemporalAlgebra            0
30   istring       TemporalExtAlgebra         0
31   ustring       TemporalExtAlgebra         0
32   interval      TemporalUnitAlgebra        0
33   filepath      BinaryFileAlgebra          1
34   text          FtextAlgebra               1

Table 4.2: DELIVERABLE Data Types


Row(o) divides the object into R parts, each viewed as one row of the matrix. Besides, each row can be further divided into C pieces based on the other distribution function Column(o), building up an R × C matrix at last. All data pieces belonging to the same matrix row must be kept on the same sDS. It is possible that R is larger than the cluster scale N, i.e., the number of sDSs; hence one sDS may contain multiple rows. Due to different partition strategies, it may happen that certain pieces of the matrix are empty, like the white blocks illustrated in Figure 4.3.

When a large object really needs to be broadcast on the cluster, it can first be duplicated on every sDS; a PS-Matrix is then made by setting each duplicate as one row of the matrix. However, we usually do not recommend this, since it may cause a serious performance decline.

Figure 4.3: PS-Matrix

For different SECONDO data types, various approaches can be used to generate the PS-Matrixes, in order to achieve a balanced workload assignment on the cluster. Since SECONDO objects can be stored as relation attribute values, they can be distributed along with the relation based on its other features. For example, a set of vehicle trajectories can be distributed by their licence numbers in a round-robin manner. It can also be partitioned based on the trajectories' geometric features, like the method that will be introduced in Section 4.2.

The data pieces of a PS-Matrix are stored distributively on the sDSs. In the meantime, its structure is kept in the master database in a so-called flist object. The flist is prepared as a wrapper structure in order to describe all kinds of distributed SECONDO objects with the following elements:


flist := name:string × type:text × ClusterInfo × matrix
         × status:bool × kind:{DLO|DLF} × UMQ:text

The element name stands for the prefix name of all data pieces that are distributed on the sDSs. It can be indicated by end-users, but may also be generated implicitly. The type contains the schema of the distributed object, so that the correctness of parallel queries involving flist objects can be verified on the master database. The term ClusterInfo describes the structure of the underlying cluster, prepared for migrating the flist objects later. The PS-Matrix is stored in matrix, indicating the location of each data piece. Afterward, the boolean status denotes whether the data has been distributed. If it is false, then all parallel queries with this object cannot start. The other two elements, kind and UMQ, are introduced later.

On sDSs, the PS-Matrix data pieces can be saved in PSFS nodes as disk files, in order to be transferred among sDSs over the network. However, not all kinds of data can be stored in PSFS nodes: only relations can be exported by using PSFS operators, while other kinds of objects like index structures can only be stored as individual database objects. Consequently, two kinds of flist objects are proposed, according to the different storage of the data pieces. For an flist object, its kind is denoted by the kind element.

1. Distributed Local Files (DLF): Each data piece of an R × C PS-Matrix is exported to a disk file in PSFS, called a sub-file. Taking fault-tolerance into consideration, one sub-file can be kept on several adjacent sDSs, as we mentioned in Section 3.2. Correspondingly, a slave database can read a set of sub-files from the other sDSs and load them into its own database. Therefore, sub-files are prepared to exchange data among the slave databases during the parallel procedures. A PS-Matrix containing sub-files is represented as a DLF flist. At present, this kind of flist is only prepared for relations, which can be exported as sub-files. However, it widely applies to most SECONDO data types, since they can be stored in relations as attribute values.

2. Distributed Local Objects (DLO): A DLO flist only stands for an N × 1 PS-Matrix, where each row is saved as a normal SECONDO object, called a sub-object. The sub-objects belonging to the same flist also have the same name. A DLO flist can wrap all available SECONDO data types. However, since the sub-objects cannot be transferred among the Data Servers, DLO is not as flexible as DLF; it is mainly prepared for generating parallel index structures.

Via DELIVERABLE and flist objects, it is possible to represent and invoke distributed data in the master database, hence enabling end-users to access Parallel SECONDO only on the mDS. Next, in order to partition objects that are kept in the master database into flist objects and vice versa, the following Flow and Hadoop operators are proposed.

4.1.3 Flow Operators

Essentially, the Flow operators are prepared to concatenate sequential queries with parallel queries. The former are able to process small-scale data efficiently on a standalone SECONDO database. The latter are often used to process large-scale data in parallel, since such data are very time-consuming to process on a single computer.

So far, two Flow operators, spread and collect, have been developed, shown in Table 4.3. The first partitions a stream of tuples onto the sDSs as sub-files, then returns their locations as a DLF flist. In contrast, collect gathers the sub-files from the sDSs based on the given DLF flist and imports their tuples into one stream.

The Flow operators improve the flexibility of Parallel SECONDO, since the master database can be used as both a common single-computer SECONDO system and the entrance to Parallel SECONDO. Therefore, end-users can choose either system to process their queries, based on the queries' complexities.

For instance, in [51] we demonstrate the procedure of matching a set of trajectories based on their semantic patterns. The pattern-matching operator matches is remarkably efficient in SECONDO. In our evaluations, it needs only 6.6 seconds to match a certain pattern by sequentially scanning 293,000 symbolic trajectories. However, creating symbolic trajectories is very time-consuming. It needs to match each geographic trajectory to a given road network, and the whole procedure may last days on one computer. To improve the performance of the data preparation, we set up Parallel SECONDO on our six-computer cluster and generate all required symbolic trajectories in less than three hours.
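As a minimal sketch of this workflow (the object name Trajectories is hypothetical, and the bracketed parameter lists, indicated by "...", are deliberately omitted; only the operator signatures of Table 4.3 are taken from the system):

let TrajFlist = Trajectories feed spread[...];
query TrajFlist collect[...] consume;

The first statement partitions the tuple stream into sub-files on the sDSs and keeps the resulting DLF flist in the master database; the second gathers the distributed sub-files back into one tuple stream.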

Kind       Name           Signature
Flow       spread         stream(T) → flist(T)
           collect        flist(T) → stream(T)
Hadoop     hadoopMap      flist × fun(Map) × bool → flist
           hadoopReduce   flist × fun(Reduce) → flist
           hadoopReduce2  flist × flist × fun(Reduce) → flist
Assistant  para           flist(T) → T

Table 4.3: Parallel Operators


4.1.4 Hadoop Operators

Figure 4.4: The Map/Reduce Procedures Described by Hadoop Operators

The Flow operators can only access the sub-files in PSFS; hence DLO flist objects can only be created with the Hadoop operators. They are prepared as the second kind of parallel operators in Table 4.3.

The Hadoop operators describe the Map/Reduce procedures in Parallel SECONDO. Basically, they accept one or two flist objects and a UDF (User Defined Function) as the task query, then return the distributed result, again as an flist object. Both DLF and DLO flist objects can be used as the input for the Hadoop operations, and the result can also be of either kind.

Within each Hadoop operator, a template Hadoop job is prepared. It parses the input flist objects, then generates an executable Hadoop job based on the given UDF parameter. Next, the executable job is carried out by the Hadoop framework, while the UDF is processed by the slave databases simultaneously as Map or Reduce tasks. Each task reads its input data from PSFS or the slave databases, depending on what kind of flist object is used. Besides, PSFS operators are implicitly used inside the tasks, in order to shuffle the intermediate data through PSFS.

Figure 4.4 depicts the basic procedure that the Hadoop operators provide. At first, hadoopMap accepts an R × C flist and a Map UDF. Its executable job therefore generates R Map tasks, each processing the Map query in a slave database with one row of data belonging to the input flist object. At the end of the Map stage, each task partitions the intermediate result into C′ pieces and materializes each piece into PSFS as a sub-file. As a result, hadoopMap generates an R × C′ flist object.

Afterward, the intermediate flist is delivered to a hadoopReduce operator containing a Reduce UDF. Its executable job creates C′ Reduce tasks; each reads one column of data of the input flist from multiple sDSs over PSFS, then processes the Reduce query in a slave database. The hadoopReduce2 operator works similarly, except that it receives two intermediate R × C′ flist objects and its Reduce UDF states a binary query. Besides, in its Reduce stage, every task reads one column of data from each input flist object. At last, the Reduce operator, either hadoopReduce or hadoopReduce2, also generates an R × C′ flist object as the result. Since each Reduce task can only be finished on one slave database, the result can at last be transposed to a C′ × 1 flist and then be used as the input for another hadoopMap operation.

As implied by their names, hadoopMap describes the procedure happening in the Map stage of a Hadoop job, while the other two describe the Reduce procedure only. Until now, the Map and Reduce UDFs cannot be stated together within one Hadoop operator. They are developed in such a way since, historically, SECONDO operators do not accept two contextual functions as their parameters. Therefore, in queries where both Map and Reduce operators are used, several Hadoop jobs are processed and additional scheduling overhead is incurred.

Regarding this issue, we propose a bool argument in the hadoopMap operator, named Executable. By default this parameter is set to true. However, in the above case, it can be set to false. Then the hadoopMap operation does not generate an executable Hadoop job; instead it generates an unexecuted flist object, while the Map UDF is set inside the result's UMQ (Unexecuted Map Query) element. Afterward, the unexecuted flist object is set as the input for the next Reduce operator, which casts the Map UDF inside its Map stage and its own Reduce UDF in the Reduce stage.
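As a sketch of this mechanism (the object names are hypothetical and the bracketed parameter lists, indicated by "...", are schematic; only the operator signatures of Table 4.3 are given by the system):

let Result = TrajFlist
  hadoopMap[...; <Map UDF>; FALSE]
  hadoopReduce[...; <Reduce UDF>];

With Executable set to FALSE, hadoopMap only produces an unexecuted flist carrying the Map UDF in its UMQ element; the subsequent hadoopReduce then runs a single Hadoop job that evaluates the Map UDF in its Map stage and its own Reduce UDF in the Reduce stage, avoiding the overhead of a second job.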

It is also possible to run the Map or Reduce operators alone. In this case, hadoopMap prepares only one Reduce task, since its Map UDF has been carried out by the Map tasks, to return the intermediate R × C′ flist object as the final output. For the Reduce operators, R Map tasks are processed to re-distribute the input data, since the input flist objects may be partitioned with distribution functions different from those of the result flist object.

The MapReduce paradigm essentially represents a filter-aggregation procedure; hence we prepare hadoopMap as a unary operator to express all possible containment queries. Afterward, we use the unary hadoopReduce to represent all possible aggregate operations and the binary hadoopReduce2 for join operations. However, it happens that certain operations need more than two distributed data objects within the UDF, like a filter condition or a range query using an index structure.

On one hand, we use the DELIVERABLE objects to denote all small-size broadcast data. On the other hand, we propose an assistant operator named para to indicate all these auxiliary flist objects. If para includes a DLO flist object, then each Hadoop task reads one sub-object from the assigned slave database. If the flist is of kind DLF, then the task reads all its data from the whole system.
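For instance, an index-based containment query might combine DLO flist objects with a DELIVERABLE query window, sketched as follows (all object names are hypothetical and the parameter notation is schematic; windowintersects is an existing SECONDO operator):

let Candidates = RoadsRel hadoopMap[...;
  para(RoadsRtree) para(RoadsRel) windowintersects[QueryBox] consume];

Here RoadsRtree and RoadsRel are DLO flist objects, so each Map task reads its sub-object from the assigned slave database, while QueryBox is a small DELIVERABLE rectangle that the master database embeds into the task queries as a constant.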

4.2 PBSM: Partition Based Spatial Merge

The PBSM method [42, 58] is widely used in many studies for processing spatial joins based on the MapReduce paradigm, since it is especially effective when neither input has an index prepared on the joining attribute beforehand. In Parallel SECONDO, in order to cope with various complex queries, geometric data are not always distributed based on their spatial features; hence it is difficult to prepare a global index structure for all possible queries. Besides, such data may be generated and redistributed based on various properties, making index structures even more difficult to prepare in advance. Therefore, Parallel SECONDO often adopts PBSM to process join queries on both spatial and moving objects (spatio-temporal) data.

As mentioned in Section 3.1, a spatial join includes two steps [38]: filter and refinement. The first step creates candidate results containing objects that possibly fulfill the join condition, based on their approximate information like the MBR (Minimum Bounding Rectangle). The second step then further examines each candidate to check whether the objects fit the join condition with their precise shapes. This strategy is also applied in PBSM.

In the PBSM filter step, tuples are partitioned into chunks based on the MBRs of their join attribute values. Both relations are partitioned with the same approach in order to guarantee the query correctness. Afterward, tuples belonging to the same chunk are joined to generate the candidate tuple-pairs. Next, in the refinement step, candidate pairs are traversed and checked with the objects' detailed information.

Several important issues are often studied for PBSM. One is the approach for handling the boundary objects, which have MBRs overlapping several chunks, like the red rectangles shown in Figure 4.5. To address this problem, two main strategies are proposed: multiple assignment [58] and multiple matching [35, 59]. The former method partitions the object to every chunk that its MBR covers, causing duplicated results in different chunks and additional storage overhead.


Figure 4.5: The Boundary Crossing Objects in PBSM

In contrast, the latter method assigns each object to only one chunk, which is then matched several times in the follow-up join procedure. Although the multiple matching approach reduces the storage overhead, it requires more computation and I/O, since it needs to access remote data frequently. Therefore, Parallel SECONDO adopts multiple assignment for all boundary objects.

Besides, within the filter step, it is possible that one tuple pair is generated in several chunks, eventually causing duplicated results. In order to eliminate the duplicates, [42] sorts all filter results by tuple identifiers. However, this blocks the procedure, since no duplicates can be found before the sorting finishes. Besides, in the distributed environment, candidate pairs are generated distributively on the sDSs; performing a parallel sort on them requires considerable computation and network overhead.

Regarding this issue, Parallel SECONDO uses the method proposed in [15, 58] to eliminate the duplicated results, by leaving the candidate pairs in their common smallest cell only. For example, the common smallest cell of the two rectangles A and B in Figure 4.6 is 4. It is exactly the cell where the bottom-left vertex of the two rectangles' overlap area lies. Therefore, these two rectangles' candidate pair is created only in cell 4, while the pair in cell 5 is ignored.
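Stated as a predicate (our formulation of the rule above), a candidate pair (A, B) is reported in cell c if and only if c contains the bottom-left vertex of the overlap of the two MBRs:

report(A, B, c) ⟺ c = cellnumber(bl(MBR(A) ∩ MBR(B)))

where bl(r) denotes the bottom-left vertex of a rectangle r. Since this vertex lies in exactly one cell, every candidate pair is reported exactly once.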

The other main issue in PBSM is the partition method, which should assign each chunk an even number of tuples as far as possible. However, in the presence of a non-uniform distribution, a simple partition method may produce chunks with large size differences. For example, in Figure 4.5, the maximum chunk0 contains two times more objects than the minimum chunk2. Regarding this issue, like many other studies, Parallel SECONDO first divides the data universe into a fine-grained cell-grid, then maps the cells to chunks in a round-robin or hashing manner. The number of cells in the grid is much greater than the number of chunks.


Figure 4.6: The Common Smallest Cell

In this way, as shown in Figure 4.7, the maximum chunk3 contains eight objects, while the minimum chunk0 contains six objects, 25% less than chunk3. Therefore, the size differences between the chunks are reduced.

All the above approaches are implemented in Parallel SECONDO as operators, which are elaborated in the following subsections.

Figure 4.7: The Partition Method in PBSM

4.2.1 PBSM in Parallel SECONDO

Most MapReduce-based joins are processed in the Reduce stage, called reduce-side joins. A reduce-side join is more generic than a map-side join, since it does not require the participating data sets to be distributed by their joining attributes at first.

Consequently, PBSM in Parallel SECONDO is also implemented as a reduce-side join operation, by dividing the PBSM filter step into two procedures: partition and join. The first is processed in the Map stage, partitioning objects into chunks. Afterward, objects are emitted to PSFS as the intermediate result; each slave database partitions its Map result based on the chunk number. In the Reduce stage, each task collects all objects belonging to the same chunk, then processes the join and the refinement procedures. Besides, the duplicated candidates are also removed within each cell.

In order to partition geometric objects into chunks, a cellgrid structure is proposed in SECONDO. To better explain it, we introduce the structure based on the 2D space, although it works for both two- and three-dimensional objects.

A 2D cell-grid is defined with a quadruple:

cellgrid2d := (x0, y0):point × xs:real × ys:real × nx:int

The (x0, y0) defines the bottom-left point of the grid, while xs and ys define the cell sizes on the X-axis and the Y-axis, respectively. The last element nx limits the number of cells on the X-axis, but the grid can be extended unlimitedly on the Y-axis. Each cell has a unique number; the numbering starts with 1 at the bottom-left cell, then increases from left to right and from bottom to top. The 3D cell-grid is defined in the same way, except that it needs a 3D bottom-left point and three cell edge sizes. It is restricted in both the X and Y dimensions, but boundless along the Z-axis.
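Following this numbering scheme, the number of the cell covering a point (x, y) can be derived as (our reading of the definition above):

cellnumber(x, y) = ⌊(x − x0) / xs⌋ + nx · ⌊(y − y0) / ys⌋ + 1

For example, with x0 = y0 = 0, xs = ys = 1 and nx = 3, the point (1.5, 1.5) falls into cell 1 + 3 · 1 + 1 = 5, the center cell of the 3 × 3 grid in Figure 4.6.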

Based on the cell-grid definition, a set of operators is added for processing the PBSM method, shown in Table 4.4.

Name            Signature
cellnumber      rect × cellgrid2d → stream(int)
                rect3 × cellgrid3d → stream(int)
gridintersects  cellgrid2d × rect × rect × cell:int → bool
                cellgrid3d × rect3 × rect3 × cell:int → bool
parajoin2       stream(tuple(C1, T1)) × stream(tuple(C2, T2)) × C1 × C2
                × fun(stream(tuple(T1)) × stream(tuple(T2)) → stream(tuple(T1, T2)))
                → stream(tuple(T1, T2))
itSpatialJoin   stream(tuple(T1)) × stream(tuple(T2)) → stream(tuple(T1, T2))

Table 4.4: SECONDO Operators Extended for PBSM

The cellnumber operator finds all cells that are covered by the given rectangle and returns their numbers. The operator gridintersects returns true only if the given cell is the two rectangles' common smallest cell. It works with the filter operator, eliminating the duplicated results without blocking the procedure.

The parajoin2 operator is specifically developed for processing parallel join procedures. Both input streams have to be sorted by their C1 and C2 attributes first, in order to process tuple pairs by group. Within each group, different join approaches can be used as the internal parameter function. At last, the join results from every group are returned as the final output. Essentially, this operator divides a large join operation into small tasks in order to process them independently.

SECONDO Distributed Join (SDJ) for PBSM

As studied in Section 3.2, shuffling intermediate data via PSFS is more efficient than via HDFS. Therefore, in Parallel SECONDO, PBSM is implemented by using PSFS to transfer the intermediate data. Since the data is partitioned and exchanged in slave databases, this implementation is named SDJ. Its pseudo code is shown in Algorithm 2, processing the spatial join on two relations R and S, whose joining attributes are r and s, respectively.

In the Map stage, both operand relations are extended with an attribute cell (lines 4-5, 8-9). It is extended with the operator extendstream; hence if the bounding box covers several cells, the tuple is duplicated for each cell. The cell numbers are calculated with the operator cellnumber, denoting the cells that each joining object's bounding box overlaps in the Grid. The Grid is prepared in advance, by analyzing small-scale samples from both operand relations. After the extension, both result relations are partitioned into ChunkNum files with the PSFS operator fdistribute (lines 6, 10), based on the cell value. For each created file, its chunk number and some other meta information fileInfo are extracted, making up a 〈key, value〉 pair called FIPair (lines 12-14). The FIPairs are emitted into HDFS in order to schedule the Reduce tasks, while the files are actually exchanged completely within PSFS.

In the Reduce stage, each task receives a set of FIPairs belonging to the same chunk. Based on these, it first extracts the file lists filesR and filesS for both operand relations (lines 18-19); the data are then read from PSFS with the operator ffeed. Next, the parajoin2 operation is used to process the PBSM join for each cell within the chunk, and the duplicated results are removed with the gridintersects operator. At last, the refinement is processed with the intersects operator, by checking the detailed shapes of the joining attribute values.

Hadoop Distributed Join (HDJ) for PBSM

In contrast to SDJ, Parallel SECONDO is also able to distribute the intermediate data with HDFS only, i.e., using solely the Hadoop framework to shuffle the intermediate data, as introduced in Section 3.2.2. Such an approach is named Hadoop Distributed Join (HDJ) and its pseudo code is shown in Algorithm 3. Several operators are especially provided for HDJ, listed in Table 4.5.

At first, the doubleexport operation (line 2) homogenizes the tuples from the two different relations into 〈key, value〉 pairs, so as to shuffle them within HDFS.


Algorithm 2: SECONDO Distributed Join (SDJ) with PBSM

 1 function Map (k, v):
 2   for tag in (0, 1) do
 3     if tag == 0 then
 4       files = SECONDO.extendstream (R,
 5                cell: cellnumber (bbox (r), Grid)).
 6               fdistribute (cell, ChunkNum) ;
 7     else if tag == 1 then
 8       files = SECONDO.extendstream (S,
 9                cell: cellnumber (bbox (s), Grid)).
10               fdistribute (cell, ChunkNum) ;
11   foreach file in files do
12     chunk = extractChunk (file) ;
13     fileInfo = extractFileInfo (file) ;
14     FIPair = (chunk, fileInfo) ;
15     emit (FIPair) ;

16 function Reduce (k, v_arr):
17   chunk = k ;
18   filesR = extractFiles (v_arr, 0) ;
19   filesS = extractFiles (v_arr, 1) ;
20   SECONDO.parajoin2 (ffeed (filesR).sort (CellR),
21     ffeed (filesS).sort (CellS), CellR, CellS).
22     filter (gridintersects (bbox (r), bbox (s))).
23     filter (intersects (r, s)) ;

Each result tuple contains one input tuple with two attributes: KEY and VALUE. For each input tuple, its indicated key attribute value is extracted, converted into the nested-list expression, and set as the KEY of the result. In HDJ, the key attribute is the newly extended cell (line 5). Besides, the tuple's complete binary value is encoded into a text string with Base 64 [5]; this string is set into the VALUE attribute of the result tuple, together with an integer tag. The tag is assigned 1 or 2, indicating from which relation the included tuple comes.

Afterward, the slave databases emit the homogenized 〈key, value〉 pairs to HDFS with the operator send, via binary sockets. In particular, the send operator sets KEY as its optional Head attribute (line 5); hence the KEY is placed in front of every tuple block during the transfer process, and thus the parsing overhead for SECONDO tuples in Map tasks can be avoided.


Algorithm 3: Hadoop Distributed Join (HDJ) with PBSM

 1 function Map (k, v):
 2   tuple_arr = SECONDO.send ( doubleexport (
 3     extendstream (R, cell: cellnumber (bbox (r), Grid)),
 4     extendstream (S, cell: cellnumber (bbox (s), Grid)),
 5     cellR, cellS), KEY ) ;
 6   foreach tuple in tuple_arr do
 7     MRPair = tuple ;
 8     emit (MRPair) ;

 9 function Reduce (k, v_arr):
10   Rel = SECONDO.CreateRelation () ;
11   foreach group in v_arr do
12     foreach v in group do
13       MRPair = v ;
14       SECONDO.Append (Rel, receive (MRPair)) ;
15     SECONDO.Append (Rel, zTuple) ;
16   SECONDO.parajoin (Rel).
17     filter (gridintersects (bbox (r), bbox (s))).
18     filter (intersects (r, s)) ;

At last, each tuple sent from the slave database is directly converted into a 〈key, value〉 pair, named MRPair, then emitted into HDFS.

Each Reduce task processes the data of one chunk in PBSM, and there are far more cells than chunks; hence it receives MRPairs belonging to many cells but within the same chunk. The MRPairs have been sorted by their cell values during the Shuffle stage by Hadoop, since we set the cell as the key of each MRPair. Therefore, each Reduce task first divides all input MRPairs into groups based on their key values (line 11), then uses the operator receive to append them into the temporary relation Rel created in the slave database (line 14). The receive operator works similarly to send, importing data from HDFS back into the SECONDO database. The groups of MRPairs are separated with a special object named zTuple, which is actually an empty MRPair with the tag set to 0.

At the end of the Reduce task, the operator parajoin is used to finish the PBSM procedure.


Name          Signature
doubleexport  stream(tuple(T1)) × stream(tuple(T2)) × Key1 × Key2
              → stream(tuple(KEY:string, VALUE:text))
send          stream(tuple(T)) × port:int × [Head] → number:int
receive       IP:string × port:int → stream(tuple(T))
parajoin      stream(tuple(KEY:string, VALUE:text))
              × fun(stream(tuple(T1)) × stream(tuple(T2)) → stream(tuple(T1, T2)))
              → stream(tuple(T1, T2))

Table 4.5: Operators Extended for Hadoop Distributed Join

It first divides the MRPairs into groups by the zTuples, then parses and transforms them back into their original tuples. In the meantime, the tuples are separated into two sides based on the internal tag value; the parameter join function within the parajoin operator then processes the tuples belonging to the same cell and generates the candidate tuple-pairs. At last, the gridintersects operator eliminates the duplicated results and the intersects operator carries out the final refinement step.

4.2.2 PBSM with In-memory Index

Compared with HDJ, SDJ shuffles the intermediate data via PSFS in order to reduce unnecessary transfer and parsing overhead, achieving a better performance. However, during the evaluations in Section 4.2.5, it is found that SDJ performs considerably worse than HDJ on relatively small clusters, due to two problems.

On one hand, as mentioned in Section 3.2.1, an object's detailed geometric data is far larger than its other information. It is stored in an independent FLOB structure and kept on disk, then read into memory only when really required. However, PBSM in Parallel SECONDO is implemented as a reduce-side procedure, and both its filter and refinement steps are processed in the Reduce stage. Therefore, the large FLOB data are also shuffled as intermediate data from the Map to the Reduce stage. Afterward, in each SDJ Reduce task, although the FLOB data is completely useless in the filter step and only part of it is needed in the refinement step, it is completely read by the ffeed operator and cached in the memory buffer.

On the other hand, each Reduce task needs to process the join procedure on many cells of the grid. Since the SDJ approach uses the parajoin2 operation to process the join by cells, all input tuples have to be sorted on the Cell attribute first, as shown in lines 20-21 of Algorithm 2. During the sorting procedure, all tuples including their FLOB data are cached in the memory buffer, which is normally set with a size threshold. On small clusters, the data assigned to each Reduce task is much larger than on big clusters; hence the memory buffer overflows and causes high disk-flush overhead.

In contrast, HDJ does not have this problem, since it uses Hadoop to sort the intermediate results automatically based on their key values, which are set by the Cell attribute in PBSM. Although the FLOB data is also transferred in HDJ, it takes no part in the sorting procedure. Therefore, the memory buffer does not overflow at all, making HDJ outperform SDJ on small clusters.

Regarding this issue in SDJ, a new operator is proposed in Parallel SECONDO, in order to process the Reduce tasks by chunks instead of cells. Thereby the sorting procedure can be avoided, and the FLOB data does not need to be cached in the memory buffer all the time.

This new operator is named itSpatialJoin and its signature is also listed in Table 4.4. It basically performs an index-based nested-loop spatial join on all tuples belonging to the same chunk, by building a memory index on the fly. All left-side tuples are read into memory first and indexed in a main-memory R-Tree (MMR-Tree). Then all tuples from the other side are scanned; for each of them the MMR-Tree is probed to find the matching tuples. If the left-side relation is too large to be completely indexed within the memory, it is divided into several parts, making each part fit the given buffer size. Afterward, the nested-loop join is performed for each part.

With the help of the itSpatialJoin operator, the sorting procedure in SDJ is no longer needed and the additional disk I/O overhead can be avoided. This approach is named SDJ-Index and will later be compared with SDJ in the experimental evaluations.
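In the SDJ-Index variant, lines 20-23 of Algorithm 2 are thus replaced by a chunk-wise join of roughly the following form (our sketch, not the literal implementation):

20 SECONDO.itSpatialJoin (ffeed (filesR), ffeed (filesS)).
21   filter (gridintersects (bbox (r), bbox (s))).
22   filter (intersects (r, s)) ;

The sortby calls disappear; the MMR-Tree is built on the fly over the left input, while the gridintersects and intersects filters still perform the duplicate elimination and the refinement as before.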

Normally, the cell-grid prepared for the SDJ-Index approach is not as fine as the one prepared for the SDJ and HDJ approaches. Since SDJ-Index processes the join only within chunks, enlarging the cell sizes can reduce the duplication ratio caused by fine-grained grids.

4.2.3 PBSM for Moving Objects Data

PBSM supports both spatial and spatio-temporal join queries; hence in Parallel SECONDO, we can use it to process not only spatial but also moving objects (spatio-temporal) data.

In SECONDO, "moving" (temporal) types are represented with the sliced representation method [19]. It decomposes the temporal development of a value into fragments called "slices", such that the development within each slice can be described by "simple" data types.

Figure 4.8: The Sliced Representation of Moving Point Values

Figure 4.8 illustrates the representation of mpoint (moving point) values in SECONDO. In fact, they are 3-dimensional trajectories which are cut into slices along the T-axis. Each slice contains the object's motion along a straight line within a time interval. Here a slice is represented by the unit type (i, v), where i is a time interval and v is a pair of points indicating the departure and the destination of the movement.

Based on the sliced representation, PBSM can also be adjusted to process moving objects, by using a 3D cell-grid to partition the moving objects into disjoint chunks. However, during such a procedure, several new problems arise.

Firstly, the space scale and the time scale are very different. The moving objects data are collected with various measurement systems; hence the scales on the space and time dimensions can differ widely. For example, suppose one vehicle moves along the X-axis at a speed of 40 km/h, i.e., about 11 m per second, and its one-second movement is partitioned into a cell-grid where each cell represents the motion of one second within a 1 m² area. Then this trajectory's bounding box overlaps 11 cells, and it is duplicated 11 times in the partition step of PBSM. Apparently this causes a large duplication and many useless computations.

Regarding this issue, the moving objects data should be preprocessed according to some scale policy before using the PBSM method. Normally, two scale policies are used: WORLD and AVGUNIT. The first scales up the data in order to make the whole data universe a cube, while the second makes the average unit a cube. In our evaluations, we usually use the former policy.
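The following Python fragment sketches one plausible reading of the WORLD policy: the time axis is rescaled so that the temporal extent of the data universe matches its spatial extent. Both the helper name and the averaging of the two spatial extents are our own simplifications; the policy actually implemented in Parallel SECONDO may differ in detail.

def world_time_scale(bbox3d):
    # bbox3d = (x1, y1, t1, x2, y2, t2): the universe of the whole data set
    dx = bbox3d[3] - bbox3d[0]
    dy = bbox3d[4] - bbox3d[1]
    dt = bbox3d[5] - bbox3d[2]
    side = (dx + dy) / 2.0        # assumed target side, taken from the spatial extents
    return side / dt              # multiply every time coordinate by this factor

# e.g. a universe of 20 km x 20 km observed over 28 days (in seconds):
factor = world_time_scale((0.0, 0.0, 0.0, 20000.0, 20000.0, 28 * 86400.0))
print(factor)                     # ~0.00827: one second becomes ~8.3 mm on the T-axis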

Secondly, one large moving object may cause disastrous duplication. It often happens that an mpoint records a long-term movement, like a vehicle’s commutes within one month. Such a movement is spatially restricted, as it always moves around a certain area of the city, but it covers a large interval on the time axis. This phenomenon causes a large duplication of moving objects, since they often cover hundreds or even thousands of cells of the grid. Due to this problem, we normally first decompose the moving objects into their units, then partition only the units instead of the complete trajectories; the small example below illustrates the difference.
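The following toy computation contrasts both options with made-up numbers: a trajectory wandering inside a 5 × 5 area over 100 time units, against its decomposition into 100 short units.

def overlapped_cells_3d(box, cell=1.0):
    # box = (x1, y1, t1, x2, y2, t2); counts the grid cells its MBR overlaps
    n = 1
    for lo, hi in ((box[0], box[3]), (box[1], box[4]), (box[2], box[5])):
        n *= int(hi // cell) - int(lo // cell) + 1
    return n

# a trajectory wandering inside a 5 x 5 area over 100 time units ...
whole = (0.0, 0.0, 0.0, 5.0, 5.0, 100.0)
# ... versus its decomposition into 100 short units, each inside one cell
units = [(t % 5 + 0.1, t % 5 + 0.1, t + 0.1, t % 5 + 0.6, t % 5 + 0.6, t + 0.6)
         for t in range(100)]

print(overlapped_cells_3d(whole))                  # 6 * 6 * 101 = 3636 copies
print(sum(overlapped_cells_3d(u) for u in units))  # 100 copies in total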

4.2.4 Representing PBSM in the Executable Language

SELECT V1.Licence AS Licence1,
       V2.Licence AS Licence2
FROM Vehicles V1, Vehicles V2
WHERE V1.Licence < V2.Licence
  and V1.Type = "truck"
  and V2.Type = "truck"
  and sometimes(distance(V1.Journey, V2.Journey) <= 10.0)

Figure 4.9: The SQL Query of the 6th BerlinMOD Query

Both SDJ and HDJ can easily be represented in the SECONDO executable language, with the parallel data model proposed in Section 4.1. Here we demonstrate this procedure by converting an example spatio-temporal join into parallel queries with the different approaches. The example is the 6th range query from BerlinMOD [16], a benchmark prepared for evaluating moving objects databases. It generates a relation Vehicles with the schema:

relation{Moid: int, Licence: string, Model: string, Type: string, Journey: mpoint}

Each tuple of this relation contains a vehicle’s complete trajectory over the whole simulation period, stored as an mpoint object. The vehicle’s other information, like the licence number, model, etc., is also stored. A scale factor SF is used to control the relation size, by setting the simulation’s duration and the number of vehicles:

$Days = \sqrt{SF} \cdot 28$

$VehicleNum = \sqrt{SF} \cdot 2000$
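As a quick check of these formulas, a short Python snippet (assuming rounding to the nearest integer, which the generator may handle differently) reproduces the data set dimensions quoted later in Chapter 5:

import math

def berlinmod_size(sf):
    # days and number of vehicles for a given scale factor (rounding assumed)
    return round(math.sqrt(sf) * 28), round(math.sqrt(sf) * 2000)

print(berlinmod_size(1.0))    # (28, 2000)
print(berlinmod_size(30.0))   # (153, 10954), cf. the data sets in Chapter 5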


In addition, BerlinMOD provides a set of example queries in order to enumerate common moving objects range queries from different perspectives. In this subsection, we use the 6th range query as the example; its SQL statement is listed in Figure 4.9. It performs a self-join on the Vehicles relation, finding the vehicle-pairs which are both trucks and have ever been closer to each other than 10 meters. Both distance and sometimes are SECONDO operators: the former computes the distance between two vehicles, which varies with time, while the latter checks whether there exists a moment at which their distance is less than 10 meters.

1   let OBACRres060 =
2     Vehicles feed filter[.Type = "truck"]
3     projectextendstream[Moid, Licence
4       ; UTrip: units(.Journey)]
5     extend[Box: enlargeRect(
6       WORLD_SCALE_BOX(bbox(.UTrip)), 5.0, 5.0, 0.0)]
7     extendstream[Cell: cellnumber(.Box, CELLGRID)]
8     sortby[Cell] {V1}
9     Vehicles feed filter[.Type = "truck"]
10    projectextendstream[Moid, Licence
11      ; UTrip: units(.Journey)]
12    extend[Box: enlargeRect(
13      WORLD_SCALE_BOX(bbox(.UTrip)), 5.0, 5.0, 0.0)]
14    extendstream[Cell: cellnumber(.Box, CELLGRID)]
15    sortby[Cell] {V2}
16    parajoin2[Cell_V1, Cell_V2; . ..
17      symmjoin[(.Licence_V1 < ..Licence_V2)
18        and (gridintersects(WORLD_CELLGRID,
19          .Box_V1, ..Box_V2, .Cell_V1))
20        and sometimes(distance(.UTrip_V1, ..UTrip_V2) <= 10.0)]]
21    project[Moid_V1, Licence_V1, Moid_V2, Licence_V2]
22    sortby[Moid_V1, Moid_V2] rdup
23    consume;

Figure 4.10: The Sequential Query for the 6th BerlinMOD Query

Figure 4.10 shows the executable query for processing the example query sequentially in a SECONDO database. It is longer than the SQL query, since it needs to describe the query plan of the PBSM method precisely.

First, it reads the tuples from Vehicles as a stream with the feed operator, then keeps only the truck trajectories by applying the filter operator to the attribute Type. Afterward, the projectextendstream operator decomposes each trajectory into units and duplicates the vehicle’s Moid and Licence for each unit.

1   let Vehicles_Moid_dlf = Vehicles
2     feed spread[;Moid, CLUSTER_SIZE, TRUE;]
3
4   let OBACRres060 =
5     Vehicles_Moid_dlf
6     hadoopMap[DLF, FALSE; . feed filter[.Type = "truck"]
7       extendstream[UTrip: units(.Journey)]
8       extend[Box: enlargeRect(
9         WORLD_SCALE_BOX(bbox(.UTrip)), 5.0, 5.0, 0.0)]
10      projectextendstream[Licence, Box, UTrip
11        ;Cell: cellnumber(.Box, CELLGRID) ] ]
12    Vehicles_Moid_dlf
13    hadoopMap[DLF, FALSE; . feed filter[.Type = "truck"]
14      extendstream[UTrip: units(.Journey)]
15      extend[Box: enlargeRect(
16        WORLD_SCALE_BOX(bbox(.UTrip)), 5.0, 5.0, 0.0)]
17      projectextendstream[Licence, Box, UTrip
18        ;Cell: cellnumber(.Box, CELLGRID) ] ]
19    hadoopReduce2[Cell, Cell, DLF, PS_SCALE, FALSE
20      ; . sortby[Cell] {V1} .. sortby[Cell] {V2}
21      parajoin2[ Cell_V1, Cell_V2
22        ; . .. symmjoin[ (.Licence_V1 < ..Licence_V2)
23          and gridintersects(
24            WORLD_CELLGRID, .Box_V1, ..Box_V2, .Cell_V1)
25          and sometimes(
26            distance(.UTrip_V1, ..UTrip_V2) <= 10.0)]]
27      project[Moid_V1, Licence_V1, Moid_V2, Licence_V2]
28      sortby[Moid_V1, Moid_V2] rdup ]
29    collect[]
30    sortby[Moid_V1, Moid_V2] rdup
31    consume;

Figure 4.11: The Parallel Query (SDJ) for the 6th BerlinMOD Query

Next, each unit is extended with a bounding rectangle Box. It is calculated by first scaling up the unit’s MBR with the function WORLD_SCALE_BOX, since the whole data universe is scaled according to the WORLD policy, as mentioned above. Later, the box is enlarged by 5 meters on both the X- and the Y-axis with the enlargeRect operator, in order to ensure that all final results are included among the candidate results after the join processing. At last, the units are partitioned into the CELLGRID, which is prepared beforehand, with the cellnumber and extendstream operators. All the above steps are stated in lines 2-7 of the query.

The same procedure is applied again for the other side relation (lines 9-14). Then the left-side input is renamed with the alias V1 and the right side with V2. Both are sorted on the Cell attribute, so that the parajoin2 operator can process the join cell by cell (lines 16-20). In each cell, the Cartesian product of both sides’ tuples is first built with the symmjoin operator; then gridintersects is used to eliminate the duplicated results, while the distance and sometimes operators refine the candidates. At last, duplicated results are eliminated globally with the rdup operator, in case two trucks meet more than once during their trips (lines 21-22). The final result is stored in the relation OBACRres060 with the consume operator.
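The duplicate avoidance behind gridintersects rests on a simple convention: a pair whose boxes intersect in several cells is reported only in the one cell that contains a designated reference point of the intersection. The Python sketch below illustrates this idea; the Grid class and its cell_of method are hypothetical stand-ins, not SECONDO's actual interface.

class Grid:
    # hypothetical 2D cell-grid: origin (x0, y0), quadratic cells, nx columns
    def __init__(self, x0, y0, cell, nx):
        self.x0, self.y0, self.cell, self.nx = x0, y0, cell, nx

    def cell_of(self, x, y):
        return int((y - self.y0) // self.cell) * self.nx + int((x - self.x0) // self.cell)

def report_in_this_cell(box1, box2, current_cell, grid):
    # report an intersecting pair only in the cell holding the lower-left
    # corner of the boxes' common intersection -- the duplicate-avoidance
    # idea behind gridintersects
    ix, iy = max(box1[0], box2[0]), max(box1[1], box2[1])
    return grid.cell_of(ix, iy) == current_cell

grid = Grid(0.0, 0.0, 10.0, 100)
b1, b2 = (8.0, 8.0, 22.0, 12.0), (9.0, 7.0, 25.0, 11.0)   # overlap in several cells
cells = {grid.cell_of(x, y) for x in (9.0, 15.0, 22.0) for y in (8.0, 11.0)}
hits = [c for c in cells if report_in_this_cell(b1, b2, c, grid)]
print(hits)    # exactly one cell reports the pair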

1   let OBACRres060 =
2     Vehicles_Moid_dlf
3     hadoopMap[DLF, FALSE; . feed filter[.Type = "truck"]
4       extendstream[UTrip: units(.Journey)]
5       extend[Box: enlargeRect(
6         WORLD_SCALE_BOX(bbox(.UTrip)), 5.0, 5.0, 0.0)]
7       projectextendstream[Licence, Box, UTrip
8         ;Cell: cellnumber(.Box, CELLGRID) ] ]
9     Vehicles_Moid_dlf
10    hadoopMap[DLF, FALSE; . feed filter[.Type = "truck"]
11      extendstream[UTrip: units(.Journey)]
12      extend[Box: enlargeRect(
13        WORLD_SCALE_BOX(bbox(.UTrip)), 5.0, 5.0, 0.0)]
14      projectextendstream[Licence, Box, UTrip
15        ;Cell: cellnumber(.Box, CELLGRID) ] ]
16    hadoopReduce2[Cell, Cell, DLF, PS_SCALE, TRUE
17      ; . {V1} .. {V2} symmjoin[ (.Licence_V1 < ..Licence_V2)
18        and gridintersects(
19          WORLD_CELLGRID, .Box_V1, ..Box_V2, .Cell_V1)
20        and sometimes(
21          distance(.UTrip_V1, ..UTrip_V2) <= 10.0)]
22      project[Moid_V1, Licence_V1, Moid_V2, Licence_V2]
23      sortby[Moid_V1, Moid_V2] rdup ]
24    collect[]
25    sortby[Moid_V1, Moid_V2] rdup
26    consume;

Figure 4.12: The Parallel Query (HDJ) for the 6th BerlinMOD Query

Apparently, this example query can be divided into two parts according to the MapReduce paradigm. The partition step (lines 2-13) is processed in the Map stage, while the join and refinement steps (lines 14-18) are embedded into the Reduce stage. Therefore, it can easily be converted into a parallel query for Parallel SECONDO, shown in Figure 4.11.

Before processing the parallel query, the relation Vehicles is distributed over the cluster with the spread operator, generating a DLF flist object Vehicles_Moid_dlf (lines 1-2). The relation is partitioned by the Moid attribute, based on the number of sDSs CLUSTER_SIZE, so that each sDS receives one row of the flist object. In the remainder of this thesis, all flist objects are named by such triples: the original relation name, the partition attribute and the distributed flist kind.

In the parallel query, the parameter functions in the hadoopMap operations (lines 6-11, 13-18) are exactly the same as in the sequential query (lines 2-7, 9-14). Note that the executable parameter in both Map operations is false, hence their Map tasks are merged into the Map stage of the upcoming hadoopReduce2 operation.

Similarly, in the hadoopReduce2 operation, the Reduce UDF (lines 20-28) is the same as the join part of the sequential query (lines 16-22). Besides, it shuffles the intermediate data by the Cell attribute, while the number of parallel Reduce tasks is set to PS_SCALE. Therefore, the join procedure is divided into PS_SCALE disjoint chunks, and the data of all cells belonging to the same chunk are shuffled into the same Reduce task.
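The exact hash by which cells are grouped into chunks is not spelled out in this text; a natural assignment, sketched below in Python as an assumption, maps each cell number to a Reduce task by a simple modulo operation, yielding exactly PS_SCALE disjoint chunks.

PS_SCALE = 8   # number of parallel Reduce tasks, as in the query above

def reduce_task_of(cell_no, ps_scale=PS_SCALE):
    # assumed assignment: cells congruent modulo ps_scale form one chunk
    return cell_no % ps_scale

# the cells handled by Reduce task 3, out of a grid with 100 cells:
chunk3 = [c for c in range(100) if reduce_task_of(c) == 3]
print(chunk3[:5])   # [3, 11, 19, 27, 35]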

In addition, hadoopReduce2 takes a special argument isHDJ, indicating how to shuffle the intermediate results. Here it is false, hence the query is processed with the SDJ approach. At last, the parallel result is returned as a DLF flist object and can be gathered to the master database with the collect operator. The rdup operation is used twice, once in the Reduce UDF and once at the end, in order to reduce the parallel result size as much as possible.

The generic process of converting SECONDO sequential queries into corresponding parallel queries is illustrated in Figure 4.13. Basically, end-users only need to divide the sequential query into pieces based on the MapReduce paradigm, then embed them into the Hadoop operations as Map and Reduce UDFs, respectively. All internal data exchanges, either via PSFS or HDFS, are completely hidden inside the Hadoop operators.

Therefore, transforming the example query into the HDJ approach is also very simple; the transformed query is listed in Figure 4.12. It looks very similar to the SDJ query, and their hadoopMap operations are exactly the same (lines 2-15). However, within the hadoopReduce2 operation, the isHDJ parameter is set to true (line 16), indicating that the HDJ approach is used to shuffle the intermediate data. Besides, it does not need the sorting procedure, since the intermediate MRPairs are sorted automatically during the Shuffle stage. Furthermore, the Reduce UDF describes the joining procedure for each cell instead of for each task, since the parajoin operation is implicitly invoked inside the template Hadoop job (lines 17-21).

Figure 4.13: Query Converting Process for PBSM. (a) SECONDO Distributed Join (SDJ): Vehicles is spread to the slave databases; the Map tasks of the two hadoopMap operations export their partitions as FIPairs with fdistribute[Cell] into PSFS; the Reduce tasks of hadoopReduce2 read them back with ffeed, sort them by Cell and join them, and the result is gathered with collect. (b) Hadoop Distributed Join (HDJ): the Map tasks export MRPairs with doubleexport[Cell] and send them through HDFS; the Reduce tasks receive them and join with parajoin[Cell], before the result is collected.

Furthermore, the SDJ-Index approach introduced in Section 4.2.2 can also easily be represented in the executable language. Since it is very similar to the SDJ query, its statement is listed later in Appendix B.3. As an SDJ variant, it changes only the Reduce UDF of the hadoopReduce2 operation. Basically, it uses itSpatialJoin in place of parajoin2 (lines 17-24), avoiding the sortby operations that cause the performance decline when processing large-scale data on small clusters. itSpatialJoin generates the candidate results of the spatial join by comparing only the units’ extended bounding boxes. Therefore, a filter operation is used afterward to process the refinement. Among these filter conditions, both sides’ units must be located in the same cell (line 20), since the gridintersects operation works by cells.


4.2.5 Evaluations

Here the performance of Parallel SECONDO is evaluated again by processing the PBSM method with the different approaches. HDJ and SDJ are compared in order to illustrate the effect of the native storage mechanism. Besides, SDJ-Index is also compared with SDJ, exhibiting outstanding performance by introducing the on-the-fly index structure.

All these evaluations are performed on both spatial and spatio-temporal objects, showing that Parallel SECONDO has well inherited SECONDO’s capability of processing specialized data types. The testbed is still our private cluster, introduced in Section 3.2.2. All data sets used in the evaluations are provided by public benchmarks or organizations:

• The spatial data set is provided by the OpenStreetMap project [29], describing the road network of the federal state North Rhine-Westphalia in Germany. We convert the data set into a SECONDO relation named ROADS and use it to perform a spatial self-join finding all intersecting streets. The evaluation queries are listed separately in Appendix B.2, since they are too long to be exhibited here.

The ROADS relation contains 732,054 tuples, taking about 927 MB of disk space. Each tuple contains one road, represented as a line object [19], together with its other attributes. On average, each road consists of six segments, which are straight lines defined by two end-points. In the evaluations, we usually need to process the same query on data sets of different scales. Therefore, we enlarge the data set by duplicating it and use the number of duplications as the scale factor SF. The coordinates of the duplicated roads are translated accordingly, so that they are disjoint from the original data set.

• The spatio-temporal data set is generated by the BerlinMOD benchmark [16], which has been introduced above. Its 6th range query is used to perform the evaluations, while the data set sizes can be adjusted with the SF values. When SF is 1.0, the generated data set contains 2,000 vehicles’ history trajectories during 28 days, taking about 11 GB of disk space. Each trajectory is composed of 27,759 units on average.

Both spatial and spatio-temporal join queries are processed with the PBSM method, partitioning the data into chunks with a 2D and a 3D cell-grid, respectively. The cell-grid is generated beforehand, based on small samples of the data. As in the last evaluation in Section 3.2.2, the Parallel Improvement (PI) is used to illustrate the speed-up of Parallel SECONDO. In contrast, the elapsed time is used in the scale-up evaluations, since it is very difficult for a standalone SECONDO system to process data sets requiring several computers’ resources.


Figure 4.14: HDJ and SDJ Performances On Spatial Data. (a) Speed-up: Parallel Improvement (times) of HDJ and SDJ for cluster scales 1 to 6. (b) Scale-up: Elapsed Time (sec) of HDJ and SDJ for cluster scales 1 to 6.

The performance of the spatial join is shown in Figures 4.14 and 4.15. In the speed-up evaluation, the cluster grows from 1 to 6 nodes, while the data scale factor is set to 2. Along with the growth of the cluster, more Reduce tasks run in parallel and each processes less data, hence the PI of both methods increases linearly.

At first, SDJ performs worse than HDJ. Especially when there are fewer than three nodes, SDJ cannot even finish the procedure. This is caused by caching a large amount of useless FLOB data during the sorting procedure of the Reduce tasks in SDJ, as explained in Section 4.2.2. However, when the cluster scale is larger than 3, SDJ performs much better than HDJ, as we observed in the evaluation on standard data.

Figure 4.15: SDJ Performance On Spatial Data With In-memory Index. (a) Speed-up: Parallel Improvement (times) of SDJ and SDJ-Index for cluster scales 1 to 6. (b) Scale-up: Elapsed Time (sec) of SDJ and SDJ-Index for cluster scales 1 to 6.

Additionally, the same speed-up evaluation is made between SDJ and SDJ-Index, shown in Figure 4.15. By building in-memory index structures to avoid the sorting procedure, SDJ-Index achieves a linear speed-up and is much more efficient than SDJ. Of course, it also makes it possible to process the join on small clusters with only one or two computers, gaining considerable improvements by parallel processing.

The evaluation results on the scale-up of these approaches are presented in Figures 4.14b and 4.15b. There, the scale factors of both the relation and the cluster increase from 1 to 6, in steps of 1. Based on the results, it is clear that all three methods keep a comparatively stable performance, except for scale factor 1: when the parallel procedure happens on only one computer, the intermediate results are not shuffled over the network, hence the traffic overhead is reduced. Among them, SDJ-Index always keeps the best performance.

Figure 4.16: HDJ and SDJ Performances On Spatio-Temporal Data. (a) Speed-up: Parallel Improvement (times) of HDJ and SDJ for cluster scales 1 to 6. (b) Scale-up: Elapsed Time (sec) of HDJ and SDJ for cluster scales 1 to 6.

Figure 4.17: SDJ Performance On Spatio-Temporal Data With In-memory Index. (a) Speed-up: Parallel Improvement (times) of SDJ and SDJ-Index for cluster scales 1 to 6. (b) Scale-up: Elapsed Time (sec) of SDJ and SDJ-Index for cluster scales 1 to 6.

The performance of the spatio-temporal join is shown in Figures 4.16 and 4.17. In the speed-up evaluation, the BerlinMOD SF is set to 1 and we process the query on growing clusters of 1 to 6 computers. As usual, SDJ performs better than HDJ, owing to the native storage mechanism. Although the differences between SDJ and HDJ seem insignificant, note that the PIs here are much higher than the ones obtained in the parallel spatial join, for both the HDJ and the SDJ approach. With all six computers in the cluster, SDJ achieves a PI of 13 for the spatial join operation, while it reaches 20 here.

In the comparison between SDJ and SDJ-Index, shown in Figure 4.17, the trend is similar to the spatial join. For smaller clusters, SDJ-Index is much more efficient than SDJ. Generally, the advantage of SDJ-Index is that it does not need to sort, hence it avoids the overhead of caching useless FLOB data. However, with a growing cluster, the same amount of data is partitioned into more chunks, hence more but smaller MMR-Tree structures are built and the advantage becomes less remarkable.

In the scale-up evaluation, the cluster again grows from 1 to 6 nodes. Differently, since the query is quite expensive, the data scale factor increases from 0.5 to 3 in steps of 0.5. The results of the scale-up evaluation are presented in Figures 4.16b and 4.17b. Apparently, SDJ always keeps a better performance than HDJ, but neither maintains a stable query cost; both change significantly along with the scale factors. This is because the 6th query contains a selection predicate, by which only the “truck” trajectories take part in the join operation. Since BerlinMOD does not provide a constant selectivity for the different vehicle types, the performance in this experiment is not stable. Compared with the other two approaches, the scale-up of SDJ-Index is more stable, although it is still slightly affected by the selectivity of the data set.

4.3 Parallel BerlinMOD

Besides PBSM, many other queries can also be declared and processed in Parallel SECONDO. Here, all seventeen range queries of the BerlinMOD benchmark are converted into their corresponding parallel queries in order to demonstrate the compatibility of Parallel SECONDO, since these queries cover the management of moving objects from all aspects.

4.3.1 Parallel Data Generation

BerlinMOD provides scripts to generate the data sets in a SECONDO database, while end-users only need to set the SF value. However, this procedure is very time-consuming: generating the data set with SF 1 needs at least two hours on a common commodity computer [16]. In order to accelerate this step, a parallel data generation for this benchmark was developed.

It consists of a set of SECONDO scripts and a Hadoop program, hence the data generation is partitioned into independent Map and Reduce tasks, each processed within one slave database. Besides, it provides an auxiliary tool with which end-users can generate data sets of any scale in parallel, still by simply setting the SF value. With these tools, generating an SF 1 data set on our own cluster costs only 15 minutes. All these generation tools are published on our website1.

The data generated in parallel is not identical to the sequentially generated data. However, it remains repeatable, i.e., the same data set is generated whenever the SF is set the same.

The BerlinMOD data generation simulates a number of vehicles’ trips to and from work during the week, plus some additional trips in the evenings or at weekends. It is mainly divided into the following steps:

1. First, a simple Berlin road network is generated by loading several external files. Then, based on the number of vehicles, some positions are chosen as HomeNodes, while some others are chosen as WorkNodes.

2. Second, each vehicle’s trips over the complete simulation period are created with a series of SECONDO queries. A global random seed is used during this procedure, in order to make the simulation as close to reality as possible and also to ensure the repeatability of the generator.

3. Third, several sample relations are generated, which are used in the range queries.

4. Finally, the generated trips are adjusted in their spatial and temporal scales with the WORLD policy, as mentioned in Section 4.2.3.

More details about the simulation are given in [16]; here we only discuss several challenges in transforming this procedure into a parallel one.

Based on the MapReduce paradigm, the first three steps are assigned to the Map stage, since they can be processed independently on all slave databases in parallel. Each slave database simulates an equal number of vehicles’ trajectories. Unlike the sequential generation, which sets a unique random seed for all vehicles, here a seed is set for each vehicle in order to guarantee the repeatability of the data set. The random seed also differs from vehicle to vehicle, keeping the simulation result close to reality.
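The following Python fragment sketches such a per-vehicle seeding; GLOBAL_SEED and the mixing constant are invented for illustration, since the generator's actual seed derivation is not shown here.

import random

GLOBAL_SEED = 42                      # hypothetical benchmark-wide seed

def vehicle_rng(vehicle_id, global_seed=GLOBAL_SEED):
    # derive a deterministic per-vehicle generator, so every Map task can
    # simulate its vehicles independently while the data set stays repeatable
    return random.Random(global_seed * 1_000_003 + vehicle_id)

# two runs (or two different Map tasks) draw identical values per vehicle ...
assert vehicle_rng(7).random() == vehicle_rng(7).random()
# ... while different vehicles still follow different random trips
assert vehicle_rng(7).random() != vehicle_rng(8).random()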

In addition, adjusting the data by the WORLD policy requires a global scan over all created moving objects. Therefore, the last generation step is assigned to the Reduce stage. Besides, since most join-related operations in the range queries are processed with the PBSM method, a global 3D cell-grid is also created within the Reduce stage.

1 http://dna.fernuni-hagen.de/secondo/ParallelSecondo


According to this division, the BerlinMOD generator script is divided into map and reduce scripts. A special Hadoop program is prepared, so that the scripts are processed on all slave databases within the corresponding stages. Besides, two other SECONDO scripts are prepared; they are processed on the master database only, to prepare some global objects. The first is carried out before the Hadoop program, partitioning the vehicles into Map tasks. The second is executed at the end, generating a set of flist objects for accessing the data that are distributively generated in the slave databases.

In the BerlinMOD benchmark, two approaches are used to store the simulated trips: Object-Based (OBA) and Trip-Based (TBA). The first contains each vehicle’s full description, including its complete trajectory over the simulation period. The second breaks these long trajectories into disjoint pieces, each describing a vehicle’s continuous motion between two long idle periods. Both approaches are supported in the parallel generated data sets, but only the OBA trips are used in our evaluations. They are distributively stored as an flist object named Vehicles_Moid_dlo. It is set as a DLO flist, since we need to build various index structures on it. The schema of this object is:

flist(relation{MoId: int, Licence: string, Model: string, Type: string, Journey: mpoint})

In the following, two BerlinMOD example queries are introduced and converted into parallel queries. They use two flist objects that are also created during the above parallel generation. First, Vehicles_Licence_btree_Moid_dlo is a distributed B-Tree. It is set up on every sub-relation of dataScar_Moid_dlo, with Licence as the key attribute. Second, QueryLicences_Dup_dlf is a DLF flist prepared for the sample relation QueryLicences. It duplicates this sample relation, containing 100 vehicle licences, on every sDS.

4.3.2 Parallel Range Queries

BerlinMOD provides seventeen range queries in total, covering all possible aspects of moving objects databases. They can all be converted into parallel queries, and the transformation procedure is quite similar in each case. Therefore, only Q1 and Q10 are explained in this subsection in more detail, in order to demonstrate the compatibility of Parallel SECONDO. All statements, including these two examples, are listed in Appendix C.2, since it is not necessary to introduce them all in the main text.


Q1

The first range query finds a set of query vehicles’ models by probing a B-Tree built on all vehicles’ licences. The sequential query is listed in Figure 4.18. It traverses the sample relation QueryLicences with the loopsel operator, using each sample’s licence to probe the index Vehicles_Licence_btree. Afterward, the results are extracted from the source Vehicles relation with the exactmatch operator. The result vehicles’ licences and models are projected and stored in the relation OBACRres001.

let OBACRres001 = QueryLicences feed {O}
  loopsel[
    Vehicles_Licence_btree Vehicles
    exactmatch[.Licence_O]]
  project[Licence, Model]
  consume;

Figure 4.18: The 1st BerlinMOD Example Query

Accordingly, the parallel query for this example is very simple. It uses one hadoopMap operator, hence processing the query in the Map stage only. The Map UDF looks very similar to the sequential query: each task traverses the same sample relation and probes its respective B-Tree index. Both the distributed relation dataScar_Moid_dlo and the index Vehicles_Licence_btree_Moid_dlo are denoted with the para operator, hence their sub-objects are used in the parallel Map tasks. At last, the distributed result is returned as a DLF flist, which can then be gathered to the master database with the collect operator and saved as a normal SECONDO relation by using the consume operation.

This sequential query performs efficiently on a single computer. By comparison, its parallel query is much slower, due to the fact that running a Hadoop job incurs a certain overhead, like communicating with DSs, assigning tasks, etc. For efficient queries like the above one, this overhead is more expensive than the query itself, thus it is not worthwhile to process them in parallel. In fact, fourteen of all the range queries in BerlinMOD are efficient enough when running in a standalone SECONDO database, but perform worse after being transformed into parallel queries. Therefore, the Flow operators are proposed to concatenate sequential and parallel queries, in order to select the proper system for problems of different scales.

Nevertheless, the cost of those efficient queries takes only 4% of the benchmark’s total elapsed time when they are processed on a single computer. Therefore, it is important to convert and process the other three queries, the sixth, ninth and tenth, in parallel. The sixth query has been explained in detail in the last subsection, while the ninth can also be processed within the Map stage only. Therefore, we introduce the tenth query in the following.

Q10

This query finds when and where the top ten sample vehicles meet other vehicles, i.e., two vehicles at least once have a distance of less than three meters from each other. Its sequential query is listed in Figure 4.19.

1   let OBACRres010 =
2     QueryLicences feed head[10]
3     loopsel[ Vehicles_Licence_btree Vehicles
4       exactmatch[.Licence]
5       project[Licence, Journey]] {V1}
6     Vehicles feed
7     project[Licence, Journey] {V2}
8     symmjoin[(.Licence_V1 # ..Licence_V2) ]
9     filter[ (everNearerThan(
10      .Journey_V1, .Journey_V2, 3.0)) ]
11    extend[
12      QueryLicence: .Licence_V1,
13      OtherLicence: .Licence_V2,
14      Pos: .Journey_V1 atperiods
15        deftime((distance(.Journey_V1, .Journey_V2) < 3.0)
16        at TRUE)]
17    filter[not(isempty(deftime(.Pos)))]
18    project[QueryLicence, OtherLicence, Pos]
19    sort rdup
20    consume;

Figure 4.19: The 10th BerlinMOD Example Query

It first finds the trajectories of the top ten query vehicles, again by probing the Vehicles_Licence_btree index on the relation Vehicles (lines 2-5). Afterward, it uses a symmjoin operation to compare every trajectory-pair composed of one sample and another vehicle from the Vehicles relation (line 8).

If two vehicles have met each other, their licences and meeting points are returned as the result. This is decided by the everNearerThan operator, which determines whether the two vehicles’ trajectories have ever been closer than a certain distance (lines 9-10). Afterward, deftime computes the meeting moments of the two vehicles, and atperiods returns all their meeting positions (lines 14-16). All these special operators are explained in more detail in [26].
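For a single pair of co-temporal units, the combination of distance, deftime and atperiods reduces to solving a quadratic inequality on the linear relative movement. The following Python sketch shows this computation; it is our own simplified reconstruction (2D coordinates, shared time intervals, invented names), not SECONDO's implementation.

import math

def meeting_interval(u1, u2, dist):
    # u = (t1, t2, (x1, y1), (x2, y2)): a linear unit over [t1, t2];
    # returns the sub-interval where the two units are closer than dist,
    # or None -- what deftime(distance(...) < dist at TRUE) yields per unit-pair
    t1, t2 = u1[0], u1[1]
    assert (t1, t2) == (u2[0], u2[1]), "units must share their time interval"
    dur = t2 - t1
    dx0, dy0 = u2[2][0] - u1[2][0], u2[2][1] - u1[2][1]          # offset at t1
    dvx = ((u2[3][0] - u2[2][0]) - (u1[3][0] - u1[2][0])) / dur  # relative velocity
    dvy = ((u2[3][1] - u2[2][1]) - (u1[3][1] - u1[2][1])) / dur
    a = dvx * dvx + dvy * dvy
    b = 2.0 * (dx0 * dvx + dy0 * dvy)
    c = dx0 * dx0 + dy0 * dy0 - dist * dist
    if a == 0.0:                              # constant relative position
        return (t1, t2) if c < 0.0 else None
    disc = b * b - 4.0 * a * c
    if disc <= 0.0:
        return None                           # never closer than dist
    r = math.sqrt(disc)
    lo = max(t1, t1 + (-b - r) / (2.0 * a))   # clip the solution of the
    hi = min(t2, t1 + (-b + r) / (2.0 * a))   # quadratic to the unit interval
    return (lo, hi) if lo < hi else None

# two vehicles crossing: closer than 3 m around their crossing time
print(meeting_interval((0, 10, (0, 0), (100, 0)), (0, 10, (100, 0), (0, 0)), 3.0))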

Most of this query’s elapsed time is spent on the symmjoin operation, which should be converted and processed in parallel. It is of course possible to convert the query straightforwardly, keeping the symmjoin operator, since the joining attribute’s type is string; the converted query would then look very much like the sequential statement. However, in such a conversion, one side of the data needs to be duplicated on every sDS, causing heavy network traffic, since each vehicle tuple contains a complete trajectory over a long period. Note that this is not an equi-join (line 8), thus it is impossible to use other join approaches like hash join to process it in parallel. Regarding this issue, we also adopt the PBSM method to process the join operation, and its statement is listed in Appendix C.2.

Basically, the parallel query is made up of two unexecuted hadoopMap operations and one hadoopReduce2 operation, hence all three create one Hadoop job in the end. The first hadoopMap operation (lines 2-6) works similarly to Q1, probing the distributed B-Tree Vehicles_Licence_btree_Moid_dlo to find the top ten sample vehicles’ trajectories in the distributed relation Vehicles_Moid_dlo. Afterward, trajectories from both sides are decomposed into units and partitioned into the 3D grid WORLD_GRID_3D (lines 7-12 and 15-20). Next, the PBSM method is processed in the Reduce stage with the SDJ-Index approach (lines 21-27). Candidate results are filtered according to the current query condition, and the meeting position of each result unit-pair is calculated.

In the above procedure, trajectories are decomposed into units and their meeting points are calculated per unit. However, it is possible that two vehicles meet more than once during their journeys. Therefore, at the end of the parallel query, a groupby operation is used to aggregate the results over all possible vehicle-pairs, concatenating their meeting positions (lines 36-39), as sketched below.
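The effect of this final aggregation can be pictured with a few lines of Python over assumed unit-level results (licences and periods are invented for illustration):

from collections import defaultdict

# assumed unit-level join results: (query licence, other licence, meeting period)
unit_results = [
    ("B-AB 123", "B-CD 456", (10.0, 12.5)),
    ("B-AB 123", "B-CD 456", (40.0, 41.0)),   # the same pair meets again later
    ("B-AB 123", "B-EF 789", (15.0, 15.5)),
]

meetings = defaultdict(list)
for q, o, period in unit_results:
    meetings[(q, o)].append(period)           # concatenate the periods per vehicle-pair

for (q, o), periods in sorted(meetings.items()):
    print(q, o, periods)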

This parallel query is complicated, about twice as long as the sequential query, since it uses the PBSM method to achieve a better efficiency. In order to demonstrate its performance on Parallel SECONDO, simple evaluations are presented below.

4.3.3 Evaluations On Benchmark Queries

Several evaluations of the BerlinMOD benchmark queries are made on our small-scale private cluster. As mentioned, it is not always worthwhile to transform and process a sequential query in parallel, since a certain constant overhead exists for performing the job itself. However, the improvement brought by using Parallel SECONDO to prepare the data and to process the three expensive queries is still very important.

Figure 4.20 illustrates the performance of generating the BerlinMOD data sets, while Figure 4.21 demonstrates the evaluation of the 10th range query. Both the speed-up and the scale-up are tested. Note that the metric PI is not used in the speed-up evaluation here, since we choose SF 6 as the scale factor of the data set. It is difficult to create such a big data set with a standalone SECONDO system, which can use only one processor core and one disk. In contrast, in the parallel processing, all six processor cores and two disks are fully used even when we create a single-computer cluster. In the scale-up evaluation, both the data and the cluster scale increase from 1 to 6, in steps of 1.

Figure 4.20: Data Generation in Parallel BerlinMOD. (a) Speed-up: Elapsed Time (sec) of the generation for cluster scales 1 to 6. (b) Scale-up: Elapsed Time (sec) of the generation for cluster scales 1 to 6.

Figure 4.21: The 10th Example Query in Parallel BerlinMOD. (a) Speed-up: Elapsed Time (sec) of Q10 for cluster scales 1 to 6. (b) Scale-up: Elapsed Time (sec) of Q10 for cluster scales 1 to 6.

From the evaluation results, it can be seen that Parallel SECONDO keeps a stable performance for both procedures. Besides, in small clusters with fewer than three computers, the system behaves differently when more computers are involved. This is mainly caused by the issue of caching useless FLOB data, as discussed in Section 4.2.2. In order to further improve the system’s performance, some optimization technologies are studied later in Chapter 6, and a set of improved PSFS operators is proposed.


Chapter 5

Cloud Evaluation

Compared with other parallel processing systems, Hadoop has an outstanding scalability. It can be deployed on hundreds or even thousands of computers, while common parallel databases can be deployed on no more than one hundred computers [1]. Consequently, since Parallel SECONDO is developed on top of the Hadoop framework, it is necessary to evaluate its scalability on large-scale clusters as well.

In practice, it is not worthwhile for us to purchase a large number of computers and build large-scale clusters ourselves. This issue exists not only for our group, but also for many other research groups and small enterprises, since acquiring and managing many computers is a big burden. Regarding this problem, many IaaS (Infrastructure as a Service) Cloud Computing vendors, like AWS (Amazon Web Services), Rackspace, Windows Azure etc., have arisen in recent years. These Cloud providers lease out computing resources as virtual computers, hence large-scale clusters can be built up, and charged for, only when they are needed.

Among these Cloud infrastructures, we select AWS as the evaluation platform for Parallel SECONDO, for the following reasons. First, AWS is built upon many large data centers all over the world, maintained by Amazon.com; therefore its service should be stable enough for our long-term research work. Second, many other parallel processing studies [45, 1] are also set up on AWS, hence we can compare our system with theirs on the same platform. The last reason, for which we are also very grateful, is that AWS provides a considerable grant allowing us to rent their services for free, making it possible to carry out our work on hundreds of computers with a limited investment.

AWS includes many kinds of web services, like S3 (Simple Storage Service), DynamoDB, VPC (Virtual Private Cloud) etc. The computing resources are mainly provided through its EC2 (Elastic Compute Cloud) service, where various virtual computers can be rented by the hour at different prices. In the following, we use the term Amazon EC2 to refer to all the services that we need to build up our large-scale clusters within the AWS infrastructure.

In this chapter, we introduce the deployment of Parallel SECONDO on large-scale clusters consisting of Amazon EC2 virtual computers. First, the basic Amazon EC2 infrastructure is introduced in Section 5.1, including its main differences from private clusters. Next, several auxiliary tools are introduced in Section 5.2, in order to set up Parallel SECONDO on EC2 clusters as quickly as possible. At last, several experiments on EC2 clusters consisting of 50 to 150 virtual computers are demonstrated in Section 5.3, illustrating that Parallel SECONDO well inherits the scalability of Hadoop and is able to process large-scale data that could never be dealt with in a common SECONDO system before.

5.1 Amazon EC2 Services

5.1.1 Hardware Configuration

Amazon.com operates many data centers worldwide, which are named by their geographical locations, including: US East (N. Virginia), US West (Oregon and Northern California), Europe (Ireland), Asia (Singapore, Tokyo and Sydney) and South America (Sao Paulo). Except during a few specific periods, like Christmas time, these computing resources are not fully used by the company itself. Many are idle for a long time, costing a lot of time and money only to maintain them on the racks. In order to use these computers effectively, AWS (Amazon Web Services) is provided.

Through AWS, end-users can use these idle resources inside the data centers by renting virtual computers called EC2 instances. Each instance is assigned certain computing, memory and storage resources. According to the assigned resources, the instances are classified into various types and charged at different prices. For example, in our evaluations we usually chose the “large” instances, each having 4 EC2 Compute Units as the CPU, 7.5 GB memory, 850 GB storage and a 64-bit platform. One such instance costs $0.34 per hour. Besides the “large” type, there exist many other instance types; some provide many more Compute Units, while others have extremely large memory. Therefore, end-users can select different instances to create clusters based on their own demands.

Instead of describing instances by physical processor models, AWS measures the computing capability in so-called ECUs (EC2 Compute Units). One ECU provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. EC2 instances are built on massive amounts of commodity computers, where the physical processors are upgraded frequently over time. Besides, end-users cannot designate specific computers to provide their needed resources; all details about allocating and scheduling the physical computers are invisible to end-users. In the end, each instance type provides a consistent and predictable amount of ECUs, making it easy for developers to compare the computing resources of different instance types.

Although the “large” EC2 instance provides a big storage space, it cannot simply be treated as a normal hard drive. It is mainly divided into two parts: a 10 GB root partition and 2×420 GB ephemeral drives. The architecture of EC2 storage is shown in Figure 5.1. The root partition is set inside the instance; its data exist during the whole lifetime of the instance. The ephemeral drives, also named the instance store, are much larger than the root partition and are attached to the instance when it is launched. However, they are not permanent storage devices, thus their lifetime varies according to the instance’s status.

In Amazon EC2, instances have four states: started, stopped, rebooted and terminated. When an instance is started, both its root partition and its instance store are created and attached, hence end-users can store data in either of them. If the instance reboots (intentionally or unintentionally), the data on both devices survive, since it is turned off only for a very short time. Furthermore, when an instance is stopped, it no longer incurs any cost, but its instance stores are detached and their data are lost; only the root partition data are kept. Finally, the data on both devices are lost when the instance is terminated.

Figure 5.1: EC2 Storage Architecture. A host computer carries the instances’ root partitions and the ephemeral drives forming the instance store; EBS volumes attach to the instances from outside, and their snapshots are archived in S3.

Since the instance store is available only while the instance is running, EC2 offers an alternative solution providing elastic and permanent storage among instances: Amazon EBS (Elastic Block Store). EBS volumes work like portable bulk storage devices among the instances. They are requested independently, with arbitrary sizes, and can be attached to any running instance. They can also be detached from one instance and reattached to another, in order to migrate data between them. Data inside an EBS volume are kept permanently until the device itself is deleted. In order to back up the data periodically, snapshots can be made of EBS volumes and kept in Amazon S3 (Simple Storage Service), which provides a much cheaper solution for storing massive amounts of permanent data.

5.1.2 Software Configuration

Besides the different types of hardware platforms, EC2 also provides abundant software resources. Each instance is started from an AMI (Amazon Machine Image), which contains all the information necessary to boot the instance, at least a ready-to-use operating system like Linux or Windows. In addition, many other software packages can be prepared inside the AMI, so that end-users can use them directly by launching the image on an EC2 instance. For example, a LAMP (Linux + Apache + MySQL + PHP) based website can be quickly built up by starting an instance with an AMI that has all required software installed.

In order to meet various customized requirements, Amazon and the EC2 community provide thousands of public AMIs. Therefore, end-users can search for AMIs matching their specific criteria, then launch instances based on them without installing the software themselves. Additionally, software providers can promote their work by publishing their own AMIs containing the latest system, so that other people can use the software directly by launching an EC2 instance. Note that an AMI saves only the information inside the root partition, since in practice the AMI is loaded into the root partition of the instance when it is started.

Our group also provides the EC2 community with a free AMI, which contains a pre-installed Parallel SECONDO system. With this AMI, end-users can quickly start up Parallel SECONDO on a large-scale EC2 cluster and use the system to process their customized data immediately. Its details are introduced later in Section 5.2.

To help end-users get familiar with Amazon EC2 services, AWS provides a point-and-click web-based console to manage EC2 instances, shown in Figure 5.2. In the left panel of this console, end-users can select various EC2 services, like EC2 instances, EBS volumes or AMIs. All used instances are listed in the top-right panel, while each instance’s specific properties are shown in the bottom-right corner.

Figure 5.2: The Amazon EC2 Web-based Console

While the web-based console provides an intuitive interface for end-users, Amazon EC2 also provides a set of CLI (Command Line) tools that handle the management more efficiently. They are written in Java and can hence be invoked on multiple platforms. Therefore, they are also used in our own auxiliary tools to deploy Parallel SECONDO quickly on large-scale EC2 clusters.

5.1.3 EC2 Instance Performance

In order to rationally evaluate different parallel procedures on clusters consisting of EC2 instances, it is necessary to profile the Amazon EC2 instances with different benchmarks.

Regarding this problem, [53] first compares the performance of an HPC cluster at NCSA (National Center for Supercomputing Applications) with EC2 clusters containing instances of similar hardware configuration. The benchmark used in this evaluation is NPB (NAS Parallel Benchmark), which was developed particularly for evaluating the performance of HPC systems.

The study first evaluates the performance on single computers, revealing that EC2 instances have an approximately 7-21% degradation relative to the NCSA computers, although their configurations are quite similar. Furthermore, it compares the performance between clusters and observes a degradation of about 40-1000%. However, this is mainly caused by the InfiniBand switch that the NCSA cluster uses, which provides a much larger network bandwidth (at least 2 Gbit/s) than the EC2 cluster.

Apart from comparing EC2 clusters with private clusters, the performance variance between EC2 clusters themselves is also great. [45] made horizontal comparisons from several perspectives, like CPU, I/O and network, over EC2 clusters constructed in different regions and at different times. The results show that EC2 performance varies a lot and often falls into two bands with a large performance gap in-between.

For example, although EC2 instances provide their computing capacity in ECUs, two different processor types, Intel Xeon and AMD Opteron, can still be identified in the running instances. Surprisingly, there exists a high performance variance between these two processor types; the former is about 35% more efficient than the latter. In addition, the performance is also influenced by the time when the instances are requested. Usually, within one data center, the performance variance is higher during work hours, since end-users normally run their applications during their work time.

Based on the above observations and analyses, we use the following two methods in our evaluations to reduce the performance variance as much as possible:

1. It is currently impossible to specify the instances’ processor types before launching them, and the instances are charged immediately once they are started. Therefore, we normally repeat the same experiment on several different clusters, each for several times.

2. We usually start the experiments in the data center located in Northern Virginia during the German day-time, which is night time in the USA. In this way, our experiments are less influenced by other applications running in the same data center.

5.2 Set up Parallel SECONDO on EC2 Clusters

Normally, in private clusters, each DS is set on a hard disk with all its components. Since the data are permanently stored, it is possible to turn off the cluster when it is not needed, while all data are still kept on the disks. However, on an EC2 instance, the DS can only be set upon the instance store, since the root partition is too small. Consequently, its data are lost once the instance is not needed any more, as the instance store provides only ephemeral storage. Therefore, in Amazon EC2, when a cluster is no longer demanded, no matter whether it is turned off or terminated, its Parallel SECONDO data are lost.


Besides, when a new instance is started, it is assigned two IP addresses: a public and a private one. The instance is accessed from the Internet via the public address, which is mapped through NAT (Network Address Translation). This address is associated with a running instance only, hence it is released when the instance is turned off. In contrast, the private addresses are set based on the RFC 1918 standard, accessing instances within the Amazon EC2 network with a much higher efficiency (nearly 1 Gbit/s) and at a lower cost. Furthermore, a private address is bound to the instance during its whole lifetime and is only released when the instance is terminated.

Regarding these special settings of Amazon EC2 instances, an initialization procedure is prepared in the Parallel SECONDO AMI. It is carried out only when the instance is launched for the first time, migrating the DS from the root partition to the instance store, since the AMI can only be loaded into the root partition. Moreover, when a number of instances are started to make up a cluster, all their private IP addresses are recorded in the DS Catalog and duplicated on every DS, in order to make sure that all network traffic inside Parallel SECONDO follows the highest-bandwidth, lowest-cost and lowest-latency path through the EC2 network.

In order to process the initialization procedure and set up Parallel SECONDO on large-scale EC2 clusters easily, a special AMI is prepared. It is based on Ubuntu Server 12.04 64-bit as the operating system, encapsulating all DS components, including a Hadoop 0.20.2 node and a Mini-SECONDO 3.3.2. This DS is duplicated all over the cluster by starting all instances from the same image.

In addition, we provide a set of auxiliary tools, built upon the EC2 CLI tools, to prepare EC2 clusters for Parallel SECONDO, shown in Table 5.1. Via these tools, we are able to create a large-scale EC2 cluster at once based on the same Parallel SECONDO AMI, then broadcast all instances’ private IP addresses to all DSs. As a result, a Parallel SECONDO system can be deployed on hundreds of instances within several minutes, instead of end-users installing the system from the very beginning themselves.

When an instance is newly started, the embedded ps-ec2-initialize is processed automatically to finish the initialization procedure. If the instance is launched alone through the EC2 web-based console, it is set up as a single-computer Parallel SECONDO after the initialization. Otherwise, if it is started by the ps-ec2-startInstances script and used as one cluster node, it is initialized only as one DS of the whole system.

The ps-ec2-startInstances script first starts up a set of instances based on the Parallel SECONDO AMI. Next, it assigns one instance as the master node, while the others are all set as slave nodes. At last, it collects the private IP addresses of all started instances and broadcasts them, grouping all instances into the same Parallel SECONDO system. During this procedure, it happens that a few instances cannot be correctly launched, especially when we attempt to start tens or hundreds of instances at the same time. In this case, the script abandons the failed instances and starts new ones to replace them. Detailed steps for preparing Parallel SECONDO on AWS EC2 clusters with the AMI are described in Appendix A.2.2.

Kind        Name                    Function
EC2 Setup   ps-ec2-initialize       Initialize the local instance
            ps-ec2-startInstances   Start up an instance cluster with prepared
                                    Parallel SECONDO
EBS         ps-ec2-createVolumes    Create a set of EBS volumes
            ps-ec2-attachVolumes    Attach EBS volumes to one instance
            ps-ec2-mapVolumes       Map EBS volumes to a set of instances,
                                    one to each
            ps-ec2-detachVolumes    Detach EBS volumes from their instances

Table 5.1: Auxiliary Tools for Parallel SECONDO on EC2

As mentioned above, during our evaluations, Parallel SECONDO is often used to repeatedly process a large amount of benchmark data. In private clusters, these data are stored on one or several computers and loaded before processing the queries. The loading procedure is trifling in those experiments, since we own all the resources all the time.

Nevertheless, this procedure becomes crucial on Amazon EC2. On the one hand, we often use large amounts of benchmark data for the cloud evaluations, as large as hundreds of gigabytes, kept in large-sized EBS volumes. These data are loaded into the instances when a new cluster is created: they are first imported into the master database and then distributed to the sDSs with the Flow operators. During this process, all involved instances must stay online. On the other hand, since we usually use large-scale clusters here, tens or even hundreds of instances are started together during the evaluations. Consequently, the cost of keeping all these instances running just for loading the benchmark data becomes very high.

Regarding this issue, in addition to the EC2 setup scripts, we provide a set of EBS tools to reuse the massive amounts of benchmark data at a relatively lower cost. At present, these tools are used only for our own experiments and will be published in the future. With these tools, the loading procedure consists of the following steps (a code sketch follows the list):

1. Firstly, the master instance is started and attached to the existing EBS volume that contains the large-scale benchmark data. It then partitions the data into as many pieces as there are sDSs.


2. Secondly, the ps-ec2-createVolumes script is used to create a set of small EBS volumes, their number again equal to the number of sDSs.

3. Afterward, ps-ec2-attachVolumes attaches the new volumes to the master instance as well, which then distributes the partitioned data pieces to the volumes, one piece per volume. The volumes are detached after the distribution finishes.

4. At last, the EC2 cluster with the prepared Parallel SECONDO is started, and each instance attaches one small EBS volume containing one piece of benchmark data, by using the ps-ec2-mapVolumes script.
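For illustration, the following is a minimal sketch of steps 2 and 3, written with the modern boto3 Python API rather than the EC2 CLI tools our ps-ec2-* scripts actually wrap. The volume size, region, zone, device names and instance id are assumptions made for the example.

import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")
client = boto3.client("ec2", region_name="us-east-1")

NUM_SDS = 4                                  # number of sDSs / data pieces (assumed)
MASTER_INSTANCE_ID = "i-0123456789abcdef0"   # hypothetical master instance id

# Step 2: create one small EBS volume per sDS.
volumes = [ec2.create_volume(AvailabilityZone="us-east-1a", Size=20)
           for _ in range(NUM_SDS)]

# Step 3: attach all volumes to the master instance, which then copies
# one partitioned data piece onto each volume before detaching them.
for i, vol in enumerate(volumes):
    client.get_waiter("volume_available").wait(VolumeIds=[vol.id])
    vol.attach_to_instance(InstanceId=MASTER_INSTANCE_ID,
                           Device="/dev/sd" + chr(ord("f") + i))  # /dev/sdf, /dev/sdg, ...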

With the help of these EBS tools, the EC2 cluster does not have to be started during the first three steps, i.e., not until the data have already been partitioned. Besides, the last mapping step is processed within each instance’s initialization procedure, costing no additional overhead. Therefore, we can run the large-scale evaluations on EC2 clusters in a more economical way.

In Section 4.3, we used our small cluster to generate the BerlinMOD benchmark data with SF set to 1, shortening the generation time from two hours on one computer to fifteen minutes. Similarly, we can use the large-scale EC2 clusters to generate the data for the cloud evaluation.

In total, three data sets are prepared. The largest one sets the SF to 30, containing the trajectories of 10,954 vehicles over 153 days. Such a big data set has never been generated before with SECONDO on a single computer, but Parallel SECONDO makes it possible by integrating the computing resources of hundreds of computers. The generation is processed on a cluster consisting of 110 EC2 “large” instances, taking nearly five hours in total and producing a relation as large as 350 GB. The other two data sets set the SF to 10 and 20, with relation sizes of 150 and 250 GB, respectively. All these data sets are used in the evaluations below.

5.3 Evaluations In EC2 Clusters

The cloud evaluations for Parallel SECONDO are all processed on large-scale EC2 clusters. Restricted by the budget, only the spatio-temporal join query is examined, with the above three large-sized BerlinMOD data sets.

The experiments again measure the system’s speed-up and scale-up performances. In the speed-up evaluation, the data scale factor is 10, while the cluster increases from 50 to 150 “large” type instances, in steps of 50. In the scale-up evaluation, the same sized clusters are used to process the data sets with scale factors of 10, 20 and 30, respectively. Both evaluate the system based on the running time, since such heavy computations cannot be finished on one computer, making it impossible to measure the PI value.

The experiments also compare the performances of the HDJ and SDJ approaches, based on the 6th query of the BerlinMOD benchmark. However, in order to process them without spending too much of our grant, certain adjustments are made to the queries. First, HDJ naturally sorts the intermediate data by their cell numbers during the Shuffle stage in Hadoop. Consequently, in the interest of a fair comparison, SDJ also keeps the sorting on the cell numbers. In this case, the join is performed cell by cell in both HDJ and SDJ. Afterward, for the entries of each cell, the join is performed with the itSpatialJoin operation, which builds an in-memory index structure on the fly, rather than evaluating the Cartesian product with the symmjoin operation.
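To illustrate the difference between the two per-cell strategies, here is a small Python sketch of our own (not SECONDO code): symmjoin corresponds to the Cartesian product, while itSpatialJoin corresponds to probing an index built on the fly. A list sorted by xmin with bisect pruning stands in for the operator’s actual in-memory spatial index; boxes are (xmin, ymin, xmax, ymax) tuples and overlap is the MBR intersection test passed in by the caller.

from bisect import bisect_right

def cartesian_join(left, right, overlap):
    # symmjoin-style: test every pair of tuples in the cell
    return [(l, r) for l in left for r in right if overlap(l.box, r.box)]

def index_join(left, right, overlap):
    # itSpatialJoin-style: "index" the left side once, then probe it;
    # entries whose xmin lies right of the probe box cannot overlap it
    index = sorted(left, key=lambda l: l.box[0])
    xmins = [l.box[0] for l in index]
    result = []
    for r in right:
        for l in index[:bisect_right(xmins, r.box[2])]:
            if overlap(l.box, r.box):
                result.append((l, r))
    return result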

Furthermore, instead of filtering the trajectories for trucks only, we now apply the join to one quarter of the data set. The original query filters for trucks to reduce the data size and limit the complexity of the query. However, the selectivity of this condition is too small and varies with the scale factors of the data sets; after the filter operation, the data sets are not large enough to evaluate the parallel processing, especially on large clusters. This problem could be solved by spending more of the grant on generating much larger data sets. Instead, we remove the filter condition and use the first 25% of all data sets in the evaluations.

After all these adjustments, we name the two approaches HDJ-Index’ and SDJ-Index’, respectively. Since the changes to the queries are slight, their statements are not specially illustrated.

[Figure 5.3: Evaluation On Spatio-temporal Data in Cloud — elapsed time (sec) against cluster scale (50, 100, 150 instances) for HDJ-Index’ and SDJ-Index’: (a) Speed-up (times up to 1600 sec); (b) Scale-up (times up to 3000 sec).]

The comparison of these two approaches on large-scale EC2 clusters is shown in Figure 5.3. In the speed-up evaluation, shown in Figure 5.3a, we see that HDJ-Index’ exhibits a distinctly linear speed-up along with the increase of the cluster scale. SDJ-Index’ performs better than HDJ-Index’ when the cluster scale is not very large, but it degrades considerably once the cluster consists of more than 100 slaves.

Figure 5.4 shows the detailed overhead of every MapReduce stage for these two approaches. Note that here the sum of the stage overheads differs from the result shown in Figure 5.3a, since they were observed on clusters of the same scales but consisting of different EC2 instances. Apparently, both methods show the same tendency in the Map stage: along with the growth of the cluster, each Map task processes less data, hence the overhead decreases. Besides, the map tasks are more efficient in SDJ-Index’. Next, in the Shuffle stage, SDJ-Index’ also performs better, since it only needs to shuffle the lightweight synopsis data instead of the complete data set.

Nevertheless, the performance of SDJ-Index’ is much worse than HDJ-Index’ in the last Reduce stage, where HDJ-Index’ spends decreasing overhead since it processes less data when more computers are added. In contrast, SDJ-Index’ needs almost the same overhead no matter how large the cluster becomes. This is mainly caused by the fact that SDJ-Index’ has to collect files from PSFS in the Reduce stage, and each task copies hundreds of files from the other computers one after another. By comparison, HDJ-Index’ relies on HDFS to transfer the intermediate results within the Shuffle stage; HDFS is well designed for transferring large amounts of data, and the master node can optimize the network traffic according to the current status. Therefore, its performance is much better than PSFS at large scale.

[Figure 5.4: Average Step Overhead for the Cloud Speed-up Evaluation — average elapsed time (sec) of the MAP, SHUFFLE and REDUCE stages for HDJ-Index’ and SDJ-Index’ at cluster scales 50, 100 and 150.]


Figure 5.3b depicts the result of the scale-up evaluation, revealing that HDJ-Index’ keeps a relatively stable scale-up along with the growth of the cluster, thanks to Hadoop’s optimized data transfer mechanism. On the other hand, SDJ-Index’ starts to be more expensive than HDJ-Index’ when the cluster scale exceeds 100 nodes, and here too the main cost is spent on copying intermediate results in the re-distribution step. Besides, the larger the cluster, the more time SDJ-Index’ spends on shuffling the intermediate data; this cost increases linearly with the expansion of the cluster.

Nevertheless, when fewer than 100 nodes are used in the cluster, SDJ-Index’ still gains considerable advantages over HDJ-Index’ by removing as much data migration overhead as possible. In other words, the effect of this overhead is so big that it is only outweighed in clusters with more than 100 slaves, and it is better to remove it in any kind of hybrid parallel processing system coupling Hadoop and database engines like SECONDO.


Chapter 6

Optimization

So far, Parallel SECONDO has been established by combining the Hadoop platform and a set of SECONDO databases. Besides, a parallel data model has been proposed, enabling end-users to state their queries in SECONDO executable language.

Like many Hadoop extensions, Parallel SECONDO could use HDFS as both the data level and the task communication level, with the distributed SECONDO databases only processing database procedures. Instead, we adopt the native store mechanism, so the databases can shuffle the intermediate data by themselves through a simple distributed file system named PSFS, while HDFS is merely used for assigning and scheduling the parallel tasks. In this way, the data migration overhead between HDFS and the SECONDO databases is avoided and the system’s overall efficiency is improved.

To measure the improvement gained by the native store mechanism, a set of evaluations was performed on our small cluster, comparing the two parallel join approaches HDJ (Hadoop Distributed Join) and SDJ (SECONDO Distributed Join) as well as their respective variants. Besides, clusters consisting of hundreds of AWS EC2 instances were also created to evaluate the system’s performance at large scale.

In most cases, SDJ performs better than HDJ by omitting the data migration overhead, but two main defects are also revealed. Firstly, SDJ spends a much higher cost than HDJ on transferring the intermediate results on large-scale clusters, since it cannot make full use of the network resources. Secondly, SDJ loads and caches much useless data during the Reduce stage, causing expensive overhead for flushing the overflowing cache to disk. Furthermore, both SDJ and HDJ use the PBSM approach to process the parallel join on multi-dimensional objects, shuffling the complete data sets over the network. Nevertheless, the spatial join is normally processed in two steps, filter and refinement, and only a few objects’ detailed data are required in the second step. Therefore, a large amount of detailed data is shuffled over the network although it is never used during the whole procedure.

Regarding all the above issues, a series of optimization techniques is proposed in this chapter, in order to further improve the SDJ performance. All of them improve the data access efficiency in PSFS and are listed in Table 6.1. The operators are still classified into the two basic kinds export and import. However, they are further divided into three access modes, where each mode contains a set of operators performing the same functions. Their names differ according to the mode they belong to, but operators performing the same function always follow the same syntax.

In principle, the Mode2 operators improve the PSFS efficiency by omitting unnecessary disk I/O overhead, while the Mode3 operators are more efficient still by further reducing useless network traffic. All of them are elaborated in the following subsections.

        Mode1        Mode2      Mode3         Signature
Export  fconsume     -          fconsume3     stream(tuple(T)) × FileInfo(T) × PSFSInfo → bool
        fdistribute  -          fdistribute3  stream(tuple(T)) × Key × FileInfo(T) × PSFSInfo → stream(tuple(Suffix:int, Num:int))
Import  ffeed        ffeed2     ffeed3        FileInfo(T) × PSFSInfo → stream(tuple(T))
        pffeed       pffeed2    pffeed3       LocInfoRel × FileInfo(T) × PSFSInfo → stream(tuple(T))
        -            fetchFlob  fetchFlob     stream(tuple((incomplete Ti)...)) × Ti → stream(tuple(Ti...))

Table 6.1: Extended Operators for Optimized PSFS Access

6.1 Pipeline File Transfer

As concluded in Section 5.3, the declining performance of SDJ-Index’, an SDJ variant, is mainly caused by sequential file transfer. In its Reduce stage, each task collects files from the other sDSs and loads them into the local Mini-SECONDO database. Each file is transferred and loaded with the ffeed operator, and all imported tuples are concatenated into one stream through the concat operation. During this procedure, the files are transferred sequentially, based on the basic file transfer protocol.

For each remote file, ffeed needs to connect to the target sDS first and then start the data transfer.


[Figure 6.1: Shuffling Overhead on the Cluster — elapsed time (sec) for transferring 1 GB and 5 GB data sets: (a) Sequential Copy, plotted against the number of file pieces (100–400); (b) Parallel Copy, plotted against the thread number (1–10) with 200 file pieces.]

Since the files are collected one after another, no data is delivered while a connection is being prepared, so network capacity is actually wasted during that phase. To demonstrate the low performance caused by sequential file transfer, a simple experiment is prepared. It transfers two data sets of 1 GB and 5 GB, respectively; each time they are divided into a number of files that are transferred sequentially on the cluster. We cannot evaluate this procedure on real large-scale clusters since our AWS grant has expired, but we can still simulate it by increasing the number of file pieces, based on the fact that the larger a cluster is, the more file pieces are transferred on it. The transfer cost relative to the number of files is illustrated in Figure 6.1a. Apparently, the overhead increases linearly with the number of file pieces, explaining why SDJ-Index’ becomes so expensive on large-scale clusters.
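A simple back-of-the-envelope cost model (our own illustration, not taken from the measurements) makes the linear growth plausible. If a data set of total size S is split into n pieces, each transfer pays a fixed connection setup time t_c, and the payload moves at bandwidth B, then the sequential cost is roughly

T_seq(n) = n · t_c + S / B

The payload term S/B is constant, so the total grows linearly in n with slope t_c, matching the trend in Figure 6.1a. With p parallel threads the setup times overlap, and the cost approaches max((n/p) · t_c, S/B), which explains why Figure 6.1b flattens once enough threads keep the network saturated.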

Regarding this issue, a new approach is proposed that transfers the files in a pipeline. It starts a number of threads, each collecting one file from a remote sDS. Therefore, multiple files can be collected simultaneously and the network throughput is increased. This procedure is also evaluated by transferring the file pieces of the above two data sets with a number of threads running in parallel. Figure 6.1b depicts the result. Clearly the performance improves when more threads are used, and it becomes stable when the files are delivered with more than five threads, meaning the network capacity is fully used by the data transfer.

Based on this approach, a new PSFS import operator is proposed. It uses parallel threads to collect a set of files belonging to the same PS-Matrix, then returns their tuples as a stream. Since this operator uses a pipeline to collect files, it is named pffeed (Pipeline File Feed).

The syntax of pffeed is listed in Table 6.1. Compared with the ffeed operator, it accepts one more relation argument named locInfoRel. This is a fixed-type relation containing three integer attributes: Row, Column and Dest.


The first two indicate the two integer suffixes of the data file within the PS-Matrix, while the last denotes the first possible sDS storing the file. Apart from that, the FileInfo parameters indicate the files’ common prefix name and disk path, while the PSFSInfo contains parameters like the typeDS and the duplicateTimes.

The pseudocode of this operator is shown in Algorithm 4. Once a file is collected, it is inserted into the fileList, from which the tuples are then loaded one by one (line 4). Tuples begin to be returned as soon as the first file has been collected (line 13). During the pffeed operation, the number of parallel threads is limited (line 9) to prevent disk interference among the threads. This threshold is called PipeWidth; it is usually set to 10 based on the result of the above evaluation.

Algorithm 4: Pipeline File Feed (pffeed)

 1 function CollectFile(locInfo):
 2     targetFile = getRemoteFilePath(locInfo) ;
 3     copyFile(targetFile) ;
 4     insert targetFile to fileList ;
 5     threadNum−− ;

 6 function GetNextTuple(locInfoRel):
 7     while fileList.isEmpty() do
 8         for locInfo ∈ locInfoRel do
 9             if threadNum < PipeWidth then
10                 threadNum++ ;
11                 let t be a new thread ;
12                 t.start(CollectFile, locInfo) ;
13     return fileList.first.readTuple() ;
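For readers who prefer runnable code, the following Python sketch captures the idea of Algorithm 4 under simplifying assumptions (the real operator is implemented in C++ inside SECONDO, and here all transfers are scheduled up front rather than on demand); fetch_file and read_tuples are hypothetical helpers standing in for the PSFS copy and the tuple reader.

import threading, queue

PIPE_WIDTH = 10  # upper bound on concurrent transfers, as in the text

def fetch_file(loc_info):
    """Hypothetical: copy one remote PSFS file, return its local path."""
    return "/tmp/" + str(loc_info)            # stand-in

def read_tuples(path):
    """Hypothetical: yield the tuples stored in a local PSFS file."""
    yield from ()                             # stand-in

def pffeed(loc_info_rel):
    files = queue.Queue()                     # plays the role of fileList
    slots = threading.Semaphore(PIPE_WIDTH)   # limits the parallel threads

    def collect(loc_info):
        try:
            files.put(fetch_file(loc_info))
        finally:
            slots.release()

    for loc_info in loc_info_rel:
        slots.acquire()                       # block once PIPE_WIDTH is reached
        threading.Thread(target=collect, args=(loc_info,)).start()

    # Tuples flow out as soon as the first file has arrived.
    for _ in range(len(loc_info_rel)):
        yield from read_tuples(files.get())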

This newly created pffeed operator and all other existing PSFS operators are cataloged as Mode1 in Table 6.1, since they provide the basic data access in the distributed environment.

6.2 FLOB Accessing in PSFS

This section proposes an optimized approach to reading FLOBs in PSFS, caching only the required data. As explained in Section 3.1, the FLOB structure was created in SECONDO to hold an object’s large detail data, in order to reduce unnecessary disk I/O overhead. To explain the problem of accessing FLOBs in PSFS, it is necessary to first review the current FLOB management in SECONDO and in the Mode1 PSFS operations.


Figure 6.2 illustrates an example of the tuple storage mechanism in SECONDO.

All tuples with the same schema are saved in a Tuple file as a relation. The Tuple files are managed by Berkeley DB, which SECONDO uses as its underlying storage. Every tuple is stored as a record with a unique record id. The example tuple here contains several attributes (a, b, c, d), some of which (a, c, d) have FLOBs. All these attributes are stored together within the Root block, which does not include the FLOB data. For the attributes a and c, their FLOBs (X and Y) are smaller than a certain threshold, hence these FLOBs are stored along with the tuple record in the trailing Extension part. In contrast, a large FLOB like Z in the attribute d is stored in a separate FLOB file, which is in principle also a record file, storing FLOBs only. Usually the small FLOBs are cached together with their tuples; only the large FLOBs are stored and accessed differently. Unless otherwise specified, in the rest of this chapter FLOB means this externally stored data.

An attribute uses a logical pointer named flobId to locate its FLOB, with four elements: fileId, recordId, offset and mode. The first two identify the FLOB file and the slot of the FLOB record. The third stands for the offset inside the record, while the last denotes the FLOB status.

Originally, two FLOB modes are provided: 0 and 1. In mode 0, the FLOB is stored with the tuples in a permanent SECONDO relation. When a tuple is requested, only its data inside the Tuple file is cached; the FLOBs are not read at first. Only when a FLOB is actually needed by some operation is it read from disk and buffered in the PersistentFlobCache. This mechanism saves SECONDO a lot of disk I/O, since the FLOB data is not needed in many operations. In contrast, a FLOB with mode 1 is temporarily created during the query procedure and cached in the NativeFlobCache. When the NativeFlobCache overflows, the excess FLOBs are flushed into a temporary FLOB file named NativeFlobFile following the LRU (Least Recently Used) principle. In this case, the FLOB is always cached until it is no longer used.

In Mode1 PSFS operations, a tuple is exported together with its FLOB data into the data file as one binary block, in order to transfer them as a whole over the cluster. When it is read from PSFS, its FLOBs are also read and cached in the NativeFlobCache, no matter whether they are really needed by the upcoming operations. Consequently, during blocking procedures like the sort operation, these FLOB data cannot be released from the cache until the whole operation finishes. As the data set grows, more FLOBs are cached, forcing the cache to be flushed to disk frequently and hurting the system’s performance.

Regarding this issue, we adjust the flobId structure to identify FLOB data stored in external data files. Since flobId is implemented as a logical pointer of constant size, we cannot extend it, but only alter the meanings of its existing elements.


[Figure 6.2: The Tuple Storage in SECONDO — a tuple with attributes (a, b, c, d) is stored in the Tuple file; its Root block is followed by an Extension part holding the small FLOBs X and Y, while the large FLOB Z of attribute d resides in a separate FLOB file.]

Firstly, all SECONDO record files are created under specific paths and named with a long integer value, thus the fileId is defined as a long integer too. However, data files in PSFS can be named with arbitrary strings. In order to map the data file names to fileId values, a list structure is created that caches all data files opened during the query procedure, and the files’ serial numbers are used as the fileId values. The recordId is useless here, since the data files are not Berkeley DB record files. Next, the offset indicates the FLOB’s position inside the data file. Finally, the mode is set to 2, in order to tell it apart from the existing FLOB modes.
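The following Python dataclass is an illustrative summary of our own of the logical flobId pointer and its mode values as described in this chapter (the real structure is a fixed-size C++ class inside SECONDO; mode 3 is introduced in Section 6.3):

from dataclasses import dataclass

@dataclass
class FlobId:
    file_id: int    # modes 0/1: id of the Berkeley DB FLOB file;
                    # mode 2: serial number of the opened PSFS data file;
                    # mode 3: the X in the flobFile.X name
    record_id: int  # modes 0/1: slot of the FLOB record;
                    # mode 2: unused; mode 3: serial number of the source DS
    offset: int     # offset inside the record (modes 0/1)
                    # or inside the data file (modes 2/3)
    mode: int       # 0: permanent relation, 1: native cache,
                    # 2: external PSFS data file, 3: remote FLOB data file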

Based on this approach, two new operators, ffeed2 and pffeed2, are introduced; they are cataloged as Mode2 import PSFS operators in Table 6.1. They import tuples from PSFS without reading the FLOB data, creating only mode 2 flobIds that denote it. Therefore, FLOBs are read from disk only when they are needed, and the disk overhead of flushing useless FLOB cache entries is avoided.

FLOB data indicated by a mode 2 flobId is always read from disk. Therefore, when it is requested repeatedly, the disk read overhead becomes considerable. In this case, it is better to cache the FLOB in a memory buffer like the NativeFlobCache after reading it for the first time. Consequently, the flobId must then be changed as well.

However, the current SECONDO FLOB management does not allow changing the flobId while reading the data, hence a new PSFS import operator fetchFlob is proposed. It accepts a stream of tuples and the names of the attributes whose FLOBs need to be read. It then reads the FLOB data from disk and caches it in the NativeFlobCache. At the same time, it sets a new mode 1 flobId to indicate the cached FLOB, so that it can be read from the memory buffer later.


Note that the fetchFlob operator reads and caches all incoming FLOB data, hence it should be used cautiously by end-users. For example, it should usually be placed after blocking operators, to avoid caching useless FLOBs again.

6.3 Distributed Filter and Refinement

Although the Mode2 operators improve the PSFS import efficiency by not reading useless FLOBs, considerable network traffic is still wasted on transferring them. Regarding this issue, this section uses the spatial join procedure as an example and proposes a novel mechanism that delivers only the requested FLOB data.

In principle, a join operation on multi-dimensional objects follows a two-step strategy [38]:

• Filter step: It rapidly eliminates the objects that cannot satisfy the query predicate, based on their approximate information. For example, if the MBRs (Minimum Bounding Rectangles) of two polygons do not overlap, then it is completely unnecessary to check the intersection of their exact shapes. This step generates the candidates for the final result.

• Refinement step: Each candidate is examined with the objects’ detailed information, and it is removed from the result if the check fails.

Apparently, the objects’ FLOB data are only used in the second step. Since in most spatial join queries a large share of the objects can already be eliminated in the first step, their FLOBs are completely useless during the whole procedure. In standalone systems, this two-step strategy improves the efficiency by reducing the overhead of accessing useless FLOBs.
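As a minimal Python sketch of this two-step strategy (a naive nested-loop formulation of our own; PBSM additionally partitions the data into grid cells first), the filter step touches only the bounding boxes, and only the surviving candidate pairs access the detailed geometry, i.e. the FLOB data. exact_intersects is a hypothetical stand-in for the exact geometric test.

def mbrs_overlap(a, b):
    # a, b: bounding boxes (xmin, ymin, xmax, ymax)
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def exact_intersects(geom_a, geom_b):
    """Hypothetical: the expensive test on the detailed geometries."""
    raise NotImplementedError

def two_step_join(lands, buildings):
    results = []
    for l in lands:
        for b in buildings:
            if mbrs_overlap(l.mbr, b.mbr):                        # filter step
                if exact_intersects(l.geometry(), b.geometry()):  # refinement
                    results.append((l, b))
    return results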

Nevertheless, as mentioned in Section 4.2.1, Parallel SECONDO uses the PBSM method to process the spatial join in the MapReduce paradigm. It is implemented as a reduce-side join, thus both the filter and the refinement step are processed in the Reduce stage. Since the candidates are generated in the filter step, all tuples of both input relations have to be shuffled over the network in order to ensure the correctness of the query. In other words, it is impossible to tell which FLOB data will be used before the Shuffle stage, hence all of it has to be transferred over the cluster, and the transfer overhead for the useless FLOBs is completely wasted.

Regarding this issue, an optimized mechanism is developed. It is named DFR (Distributed Filter Refinement), since it is designed specifically to improve the performance of the parallel spatial join procedure. Furthermore, a set of operators is proposed; they are cataloged as the Mode3 PSFS operators and listed in Table 6.1.

In SECONDO, an object’s approximate information is always kept within the tuple’s “Root” block, shown in Figure 6.2. All of it is required in the filter step of the spatial join, while only a part of the objects’ FLOB data is needed in the refinement step. Therefore, the tuples’ Root and Extension blocks should be exported separately from the FLOB data. Accordingly, two Mode3 export PSFS operators, fconsume3 and fdistribute3, are developed, exporting the tuples into two kinds of files: tuple data files and FLOB data files. A tuple data file contains the tuples’ Root and Extension blocks, while their large FLOBs are exported to a FLOB data file. The tuple data files are named by the same rule as the data files in the other PSFS modes, while the FLOB data files are named flobFile.X. Here X is a long integer value, since these files later need to be referenced by flobIds.

For each exported tuple, all flobIds of its attributes, if existing, are replaced with new values whose mode is set to 3, indicating that the FLOB data is kept in a FLOB data file on the cluster. The other elements of the new flobIds are also changed: the offset stands for the data position in the FLOB data file, and the fileId is set to the X of the file name. In particular, we set the recordId to the current sDS’s serial number in the DS Catalog, denoting on which sDS the FLOB data file is stored. This sDS is called the source DS for this FLOB.

Correspondingly, two import operators, ffeed3 and pffeed3, are proposed, which work similarly to the operations of the other two modes. In particular, only the tuple data files are transferred and read, while the FLOB data files always stay on their source DSs.

With all these newly created operators, the SDJ approach can be processed with the DFR mechanism, as shown in Figure 6.3. In the Map stage, both input relations are processed by the map function and the intermediate results are exported to PSFS with the fdistribute3 operator. All tuple data files build up the intermediate PS-Matrix. Each Reduce task then uses the pffeed3 operator to collect all required tuple data files belonging to the same column of the intermediate PS-Matrix and loads the tuples to process the filter step.

So far, all tuples containing their objects’ approximate information are redistributed over the cluster to generate the candidates. Afterward, each candidate must be examined in the refinement step with its FLOBs, which however are still kept on their source DSs. In order to fetch these remote FLOB data, the fetchFlob operator is extended for Mode3, with the following five steps:

1. It scans all input tuples to extract the requests for the FLOB data. Each request is called a flobOrder and is inserted into a list file named flobSheet. A flobOrder is created for each flobId produced by the fdistribute3 operator in the Map stage.


[Figure 6.3: SDJ with the Distributed Filter Refinement Mechanism — Map tasks apply the map function to the input relations and use fdistribute3 to export tuple data files and FLOB data files separately. Each Reduce task collects the tuple data files with pffeed3 and runs the filter step; fetchFlob then (1) adds orders to flobSheets, (2) sends each sheet to its source DS, (3) has the collectFlobServer collect the FLOBs, (4) receives the result FLOB file, and (5) caches the FLOBs before the refinement step.]

In total, N − 1 flobSheets are created, one for each other sDS, where N stands for the number of sDSs. If one tuple contains FLOBs from several source DSs, then its flobOrders are inserted into different flobSheets.

2. Once a sheet is prepared, it is sent to the source DS in order to collect all required FLOBs. The current sDS is then called the requester DS.

3. On the source DS, a daemon program called collectFlobServer processes the received flobSheets. For each sheet, it traverses all flobOrders, extracts the data from the original FLOB data files, and appends it to a result FLOB file.

4. The result FLOB file is sent back to the requester DS when all needed FLOBs have been extracted.

5. In the end, a tuple is returned once all its required FLOBs have been collected. After being read from the result FLOB files, the data is cached in the NativeFlobCache and the flobIds are adjusted accordingly.

Besides the above basic steps, several methods are used in this operator to reduce unnecessary disk I/O overhead as much as possible. Firstly, a size threshold limits the size of the flobSheets, to prevent generating overly large result FLOB files. Secondly, all flobOrders within the same flobSheet are sorted


based on their offset values. Therefore, a FLOB data file needs to be traversed only once when the result file is collected. Thirdly, the collectFlobServer processes its received flobSheets sequentially, so as to prevent disk interference from preparing several result FLOB files at once.
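To make the bookkeeping concrete, here is a small Python sketch of our own for the first step: grouping FLOB requests by source DS into sheets and sorting each sheet by file and offset, so that the collectFlobServer can traverse every FLOB data file once. The accessors flob_attributes, flob_id and flob_size are hypothetical; the FLOB size is assumed to come from the attribute, since the flobId itself carries no size field.

from collections import defaultdict

def build_flob_sheets(tuples, my_ds):
    """Return {source DS number: sorted list of flobOrders}."""
    sheets = defaultdict(list)
    for t in tuples:
        for attr in t.flob_attributes():          # hypothetical accessor
            fid = attr.flob_id                    # a mode 3 flobId
            if fid.record_id != my_ds:            # record_id = source DS
                sheets[fid.record_id].append(
                    (fid.file_id, fid.offset, attr.flob_size))
    for orders in sheets.values():                # one pass per data file
        orders.sort(key=lambda o: (o[0], o[1]))   # sort by file, then offset
    return sheets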

In accordance with the PSFS operators of the different modes, the hadoopReduce and hadoopReduce2 operators are extended with a new mode argument, indicating the mechanism for shuffling the intermediate data. This argument has three possible values, M1, M2 and M3, each indicating one PSFS mode. Besides, in modes 2 and 3, end-users need to place the fetchFlob operator themselves, in order to ensure that only the required FLOB data is collected and read.

6.4 Evaluation

In order to assess the performance of the different PSFS modes, an evaluation is carried out on our own cluster. Since these new mechanisms are designed especially for queries involving massive FLOB data, we test them with a parallel spatial join on region objects (polygons), where the FLOB data is much larger than the other attributes.

Two relations, Lands and Buildings, are prepared, again provided by the OpenStreetMap project and containing geographical information about Germany. The first describes areas with different purposes, like land used for airports, the military, agriculture, etc. The second describes the areas belonging to all kinds of buildings, like apartments, hotels, houses, etc. It is common that a land object contains a number of buildings, hence the query consists of finding which buildings intersect which lands. Both relations contain two attributes:

{OSM_id:long, Region:region}

The first is the unique id of the object, while the second contains its geographical shape, saved as FLOB data. The Lands relation is extended with one more attribute named Cnt, which serves as a tuple counter. It helps us to select a subset of the land objects in the later evaluations.

The Lands relation contains in total 254,613 polygons; on average each polygon has 20 segments, and the complete relation is as large as 1 GB. The Buildings relation has 11,999,773 polygons, each made up of 5 segments on average. The whole relation is as large as 14 GB, and nearly 86% of it is FLOB data. In order to highlight the improvement gained by reducing useless data access, we duplicate the Buildings relation once in our evaluation and make the duplicated objects disjoint from the original data set. In total, the two relations are thus as large as 29 GB, containing 24 GB of FLOB data.


1  query Lands_ID_DLF hadoopMap[ DLF, FALSE
2  ; . filter[.Cnt < real2int(0.1 * LandsNum)]
3    extend[Box: bbox(.Region)]
4    extendstream[Cell: cellnumber(.Box, LBJoin_CellGrid)]]
5  Buildings_ID_DLF hadoopMap[ DLF, FALSE
6  ; . extend[Box: bbox(.Region)]
7    extendstream[Cell: cellnumber(.Box, LBJoin_CellGrid)]]
8  hadoopReduce2[Cell, Cell, DLF, PS_SCALE, M2
9  ; . sortby[Cell] .. sortby[Cell]
10   parajoin2[Cell, Cell; . {l} .. {b}
11     itSpatialJoin[Box_l, Box_b, 4, 8]
12     filter[gridintersects(
13       LBJoin_CellGrid, .Box_l, .Box_b, .Cell_l)]]
14   fetchFlob[Region_l, Region_b]
15   filter[(.Region_l intersects1 .Region_b)]
16   count feed namedtransformstream[PartCnt] ]
17 collect[]
18 sum[PartCnt];

Figure 6.4: Parallel Spatial Join on Lands and Buildings in SDJ-Index’

The query is formulated with the SDJ-Index’ method, shown in Figure 6.4. Lands_ID_DLF and Buildings_ID_DLF are two DLF flist objects created by spreading the relations over the cluster based on their OSM_id attribute. In the Map stage, both relations are partitioned based on the cell grid LBJoin_CellGrid (lines 4 and 7). This grid is created by setting its bounding box to the intersection of the two data sets’ areas, while the cell size is set to ten times the land objects’ average size. The mode of the hadoopReduce2 operation is set to M2 (line 8), hence the intermediate data are shuffled with Mode2 PSFS operations. Later, the fetchFlob operator is used for reading FLOBs (line 14); it is intentionally placed after the filter step, in order to retrieve only the useful FLOBs. Each Reduce task computes the cardinality of its partial result, and these counts are summed up at the end to obtain the result size of the join query (lines 16 - 18).

In order to demonstrate how performance is affected by reading useless FLOB data, the queries are processed with different partial selectivities on the relation Lands. As shown in line 2 of the query, each time we select a part of the land objects, from 10% to 100% in steps of 10%, to take part in the join query; LandsNum denotes the size of the Lands relation. In contrast, all building objects are always used, since it is impossible to filter this relation based on the other relation’s condition. In addition, each query is evaluated with all three PSFS modes, and their performances are compared based on the elapsed time.


All queries are processed on all twelve sDSs of our cluster. The result is shown in Figure 6.5.

[Figure 6.5: The Performance Comparison on Different PSFS Modes — elapsed time (sec, up to 2000) against the partial selectivity on the relation Lands (0.1 to 1.0), for Mode1, Mode2 and Mode3.]

Apparently Mode1 shows the worst performance, since it always has to transfer and cache all FLOB data. Because of the caching mechanism in SECONDO, its overhead fluctuates along with the increasing partial selectivity on Lands, but overall it grows as more join results are generated. Mode2 cuts roughly half of the cost of Mode1 by avoiding reading the useless FLOBs, while Mode3 further improves the performance by not transferring them at all. When the partial selectivity is 1.0, Mode3 achieves nearly four times better performance than Mode1 and about twice that of Mode2.

The DFR mechanism does have a certain overhead of its own, since it needs to prepare and transfer the flobSheets and the result FLOB files. Yet even with the complete Lands relation, where 15.4% of the FLOB data (3.7 GB) of the Buildings relation is actually used in the refinement step, the Mode3 approach still has a significant advantage over the other two modes.


Chapter 7

Conclusions

7.1 Summary

In order to process massive amounts of multi-dimensional data, including spatial and moving objects, a hybrid system named Parallel SECONDO is proposed in this thesis. By combining the Hadoop platform with a set of SECONDO database systems, Parallel SECONDO not only achieves good scalability but also processes specialized data types efficiently. It maintains the independence of both components while using the best technologies of each. The main issues studied in this Ph.D. project are briefly summarized in the following.

A hybrid infrastructure is established to take advantage of both sides. It uses Hadoop as the task communication level, dividing a large job into multiple small tasks which can then be processed by the cluster computers simultaneously. Each computer relies on its local DS (Data Server), which contains a SECONDO database, to process the assigned tasks with the best possible performance. Besides, multiple DSs can be set up on the same computer in order to make full use of the underlying cluster resources. Most importantly, a simple distributed file system, PSFS, is developed, so that the distributed SECONDO databases can exchange intermediate data entirely by themselves; the data migration overhead between Hadoop and SECONDO is thereby reduced to a minimum.

A parallel data model is developed in the form of two SECONDO algebras. It includes a type constructor flist to denote distributed data, as well as related operators to access PSFS data and describe MapReduce procedures. Therefore, all parallel queries can be stated in SECONDO executable language, getting rid of the rigid programming model of Hadoop. Besides, most existing SECONDO data types and operators are inherited in the new system, enabling it to process the specialized data types as well. As an example, the parallel spatial join method PBSM (Partition Based Spatial Merge) is presented in depth in this thesis. It can process the join on both spatial and moving objects data, using different approaches to shuffle the intermediate results over the cluster in order to achieve the best performance. All these approaches can be stated as Parallel SECONDO queries with only slight adjustments, demonstrating the remarkable flexibility of the data model.

Parallel SECONDO is fully evaluated with different PBSM approaches on our small private cluster, demonstrating stable speed-up and scale-up performance. Besides, it is also deployed on large-scale clusters composed of hundreds of AWS virtual computers, showing outstanding scalability. Furthermore, a set of optimization techniques is proposed to further improve the system’s efficiency on large-scale clusters. Firstly, the pipeline mechanism is introduced to make full use of the network resources. Secondly, regarding the special storage mechanism for spatial and moving objects data, three PSFS modes are proposed in order to reduce unnecessary data access and transfer overhead during the parallel procedures.

In order to make Parallel SECONDO a user-friendly system, a set of auxiliary tools and virtual machine images is provided, so that end-users can easily set up the system on clusters of different scales. In addition, Parallel SECONDO uses its master database as the main entrance, which can also be used as a standalone SECONDO database. Therefore, end-users can flexibly choose either system to process queries of different scales, with the impression of handling an ordinary single-computer database.

7.2 Future Work

So far, Parallel SECONDO inherits the executable language of SECONDO, hence query plans can be stated precisely with database objects and operators. Although this improves the system’s efficiency, it creates barriers for beginners because of the language’s complexity. Therefore, a query optimizer is needed in the future, enabling end-users to state their queries in simple SQL-like expressions.

Besides, at the current stage, SECONDO and Hadoop are loosely coupled in order to maintain their independence. However, this makes it difficult for them to detect each other’s status. For example, after submitting queries to its Mini-SECONDO database, a Hadoop task is completely blind to their progress, thus it is impossible to estimate the progress of the whole system. Therefore, a message transmission system is also required in future development.


Appendix A

Install Parallel SECONDO

Parallel SECONDO can be installed on either a single computer or a cluster consisting of tens or even hundreds of computers. All its components have been published since SECONDO 3.3.2 within two algebras: Hadoop and HadoopParallel. All libraries required by SECONDO should be installed on all involved computers in advance; the installation steps for SECONDO can be found on our website 1.

Note that although SECONDO itself supports both Unix-based and Windows platforms, Parallel SECONDO cannot be installed on Windows systems yet. At the current stage, several platforms including Ubuntu 10.04, Ubuntu 12.04 and MacOSX 10.6 have been fully tested. We recommend that all involved computers run the same operating system, ideally even the same version.

At present, Parallel SECONDO is built upon Hadoop 0.20.2, which meets all our current demands although it is not the latest release. The Hadoop archive must be downloaded by end-users themselves, with the following Linux command:

$ wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz

Parallel SECONDO can be prepared in several ways. End-users can install it on a single computer or on a private cluster. They can also get familiar with the system by starting its VMWare image in a virtual machine. Finally, they can set it up on large-scale clusters consisting of AWS EC2 virtual computers with our AMI image. All these possibilities are elaborated in the following subsections.

1http://dna.fernuni-hagen.de/secondo/


Prerequisites

If end-users attempt to install the system by themselves, please ensure that all computers, or at least the master node, have an Internet connection, in case additional resources need to be fetched. Each cluster computer should be assigned a static IP address by which the other computers can reach it directly. Besides, the same account must be created on all computers for installing Parallel SECONDO. Finally, the following utilities must be prepared on all computers; they are ordinary Linux tools available on most Linux platforms and on MacOSX.

• Java™ 1.6.x or above. The openjdk-6 package is automatically prepared along with the installation of SECONDO. However, it is better to check the Java version with the command:

$ java -version

• SSH connection. Both the DSs and the Hadoop platform rely on the secure shell as their basic communication layer. In Ubuntu, the SSH server is not installed by default, but can be installed with the command:

$ sudo apt-get install ssh

• The screen utility is also required by many Parallel SECONDO auxiliary tools. In Ubuntu, it can be installed by:

$ sudo apt-get install screen

• It is important for Hadoop and Parallel SECONDO to have a passphraseless SSH connection. It can be tested with the command:

$ ssh <IP>

Here <IP> indicates the IP address of any computer within the cluster. The command logs into the remote computer through the secure shell, using the current user name. If a password is requested for this connection, then an authentication key pair should be created with the commands:

$ cd $HOME
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Afterward, try to enter the target computer again. This time it may ask to add the current IP address to the known hosts list, like:


$ ssh <IP>
The authenticity of host '... (...)' can't be established.
RSA key fingerprint is .......
Are you sure you want to continue connecting (yes/no)?

This verification happens only once, when the SSH connection is established for the first time, and it can simply be confirmed by typing “yes”. It can also be prevented by putting the following three lines into the file $HOME/.ssh/config.

Host *
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null

• Follow the installation guide to prepare all libraries required by SECONDO on every cluster computer. If no errors occur during the installation procedure, the following command should return output like:

$ env | grep '^SECONDO'
SECONDO_CONFIG=.... /secondo/bin/SecondoConfig.ini
SECONDO_BUILD_DIR=... /secondo
SECONDO_JAVA=.... /java
SECONDO_PLATFORM=...

If all the above prerequisites are fulfilled, Parallel SECONDO can be installed by the following steps, using the auxiliary tools introduced in Section 3.3.

A.1 Installation Instructions

Parallel SECONDO is deployed on a cluster only through the master node, hence the procedure follows almost the same steps as a single-computer installation. Both installations consist of the following steps:

1. Download the SECONDO source code and compile it, in order to prepare the basic configuration file and verify the correctness of the local environment.

$ cd $HOME
$ wget http://dna.fernuni-hagen.de/secondo/files/secondo-v332-LAT1.tar.gz
$ tar -xzf secondo-v332-LAT1.tar.gz
$ cd $SECONDO_BUILD_DIR
$ make

2. Normally, the following line is set at the end of the shell profile $HOME/.bashrc to initialize all SECONDO-related environment variables:

Page 126: Parallel SECONDO: Processing Moving Objects Data At Large ......Parallel SECONDO: Processing Moving Objects Data At Large Scale Der Fakultat f¨ ur Mathematik und Informatik¨ der

112 APPENDIX A. INSTALL PARALLEL SECONDO

source $HOME/.secondorc $HOME/secondo

However, especially in Ubuntu, this line should be placed at the top of the file, or at least before the line:

[ -z "$PS1" ] && return

This change must be made in the shell profile of every involved computer.

3. The downloaded Hadoop archive file must be stored in the directory $SECONDO_BUILD_DIR/bin without changing the file name:

$ cd $SECONDO_BUILD_DIR/bin
$ wget http://archive.apache.org/dist/hadoop/hadoop-0.20.2/hadoop-0.20.2.tar.gz

4. Set up the parameter file ParallelSecondoConfig.ini for Parallel SECONDO. It can be prepared automatically, with default values, through the graphical preference editor introduced in Section 3.3.2. Otherwise, it can be set manually with the following parameters. An example file is kept in the clusterManagement directory of the Hadoop algebra.

• In the Cluster section, the mDS and the sDSs are set with entries of the form:

Master = <IP>:<DS_Path>:<Port>
Slaves += <IP>:<DS_Path>:<Port>

The IP address of the computer, not its hostname, is required first. Then DS_Path indicates the disk path for the DS, like /tmp. The last field is the listening port for the Mini-SECONDO.

• Set hadoop-env.sh:JAVA_HOME to the location where the Java SDK is installed. This path must be identical on every computer.

• If the SECONDO database has already been created and used before, then the NS4Master parameter can be set to true to make Parallel SECONDO use it as the master database. Note that if the database has not been created, setting NS4Master to true may cause the rest of the installation to fail.

• Finally, the transaction feature is normally turned off in Parallel SECONDO, in order to improve the efficiency of exchanging data among DSs. For this purpose, the following RTFlags parameter should be uncommented in the SECONDO configuration file SecondoConfig.ini, which is normally kept in $SECONDO_BUILD_DIR/bin.

Page 127: Parallel SECONDO: Processing Moving Objects Data At Large ......Parallel SECONDO: Processing Moving Objects Data At Large Scale Der Fakultat f¨ ur Mathematik und Informatik¨ der

A.1. INSTALLATION INSTRUCTIONS 113

RTFlags += SMI:NoTransactions

5. After setting all required parameters, copy the file ParallelSecondoConfig.ini to $SECONDO_BUILD_DIR/bin, and start the installation with the auxiliary tool ps-cluster-format.

$ cd $SECONDO_BUILD_DIR/Algebras/Hadoop/clusterManagement
$ cp ParallelSecondoConfig.ini $SECONDO_BUILD_DIR/bin
$ ps-cluster-format

During the installation, all DSs are created and Hadoop is deployed as well. At the end, the Hadoop Namenode is formatted.

6. Leave the current terminal and start a new one. Verify the correctness of the installation with the following command:

$ cd $HOME
$ env | grep '^PARALLEL_SECONDO'
PARALLEL_SECONDO_MASTER=.../conf/master
PARALLEL_SECONDO_CONF=.../conf
PARALLEL_SECONDO_BUILD_DIR=.../secondo
PARALLEL_SECONDO_MINIDB_NAME=msec-databases
PARALLEL_SECONDO_MINI_NAME=msec
PARALLEL_SECONDO_PSFSNAME=PSFS
PARALLEL_SECONDO_DBCONFIG=.../SecondoConfig.ini
....
PARALLEL_SECONDO_SLAVES=.../conf/slaves
PARALLEL_SECONDO_MAINDS=.../dataServer1/...
PARALLEL_SECONDO=.../dataServer1
PARALLEL_SECONDO_DATASERVER_NAME=...

7. Although the DSs are initialized, the Mini-SECONDO system is not distributed yet, since the Hadoop algebras cannot be compiled before the above initialization. Both algebras can now be activated by adding the following lines to the algebra list $SECONDO_BUILD_DIR/makefile.algebras and recompiling SECONDO.

ALGEBRA_DIRS += HadoopALGEBRAS += HadoopAlgebra

ALGEBRA_DIRS += HadoopParallelALGEBRAS += HadoopParallelAlgebra

Afterward, end-users can distribute Mini-SECONDO to all DSs:

Page 128: Parallel SECONDO: Processing Moving Objects Data At Large ......Parallel SECONDO: Processing Moving Objects Data At Large Scale Der Fakultat f¨ ur Mathematik und Informatik¨ der

114 APPENDIX A. INSTALL PARALLEL SECONDO

$ cd $SECONDO_BUILD_DIR
$ make
$ cd $SECONDO_BUILD_DIR/Algebras/Hadoop/clusterManagement
$ ps-secondo-buildMini -l

Here the -l parameter of the last command indicates that Mini-SECONDO is distributed only on the local computer. It can be set to -c to distribute the system to the whole cluster.

So far, Parallel SECONDO has been installed on the local computer. In order to start the system, several commands are needed:

$ start-all.sh
$ cd $SECONDO_BUILD_DIR/Algebras/Hadoop/clusterManagement
$ ps-start-AllMonitors
$ ps-startTTYCS -s 1

Here the first command starts the services provided by the Hadoop platform. The third command starts all Mini-SECONDO monitors that have been distributed. The last command opens a text interface for Parallel SECONDO; it accepts the parameter -s to indicate which DS on the local computer is requested. Since the main interface of the system is always identified with the first DS on the master computer, the parameter is set to 1.

The system can be stopped with the following two commands. The first turns off all Mini-SECONDO monitors, while the second stops the Hadoop services.

$ cd $SECONDO_BUILD_DIR/Algebras/Hadoop/clusterManagement
$ ps-stop-AllMonitors
$ stop-all.sh

If end-users want to remove the system from the computer completely, an easy-to-use tool ps-cluster-uninstall is also provided.

$ cd $SECONDO_BUILD_DIR/Algebras/Hadoop/clusterManagement
$ ps-cluster-uninstall

A.2 Deploy Parallel SECONDO With Virtual Images

Two virtual machine images are prepared for Parallel SECONDO, in order to help end-users become familiar with the system as quickly as possible. One image is built upon VMWare; it sets up a single-computer system immediately once a virtual computer is started from it. The other image is prepared as an AMI (Amazon Machine Image), by which end-users can deploy a large-scale Parallel SECONDO on clusters consisting of AWS EC2 instances.

Page 129: Parallel SECONDO: Processing Moving Objects Data At Large ......Parallel SECONDO: Processing Moving Objects Data At Large Scale Der Fakultat f¨ ur Mathematik und Informatik¨ der

A.2. DEPLOY PARALLEL SECONDO WITH VIRTUAL IMAGES 115

A.2.1 VMWare Image

This image is prepared and generated with VMWare and distributed as a zip archive. It can be loaded into the VMware Player, free software that can be downloaded from its official website 1 for both Windows and Linux platforms. VMWare Fusion can also be used to load the system on the MacOSX platform, although it only offers a free 30-day trial license.

The operating system of the image is Ubuntu 12.04.02 LTS 32bit, on which Parallel SECONDO is prepared along with SECONDO 3.3.2. Both the account name and the password in this image are set to psecondo. The system can be set up with the following steps:

1. Install VMware Player on Windows or Linux, or VMWare Fusion on MacOSX.

2. Download the VMware image of Parallel SECONDO from our website. Its MD5 checksum is: d17f5602ca74e8ba7211dfca6ae561a0

3. Start up the virtual machine with the image and open a terminal console, where the prompt looks like:

Preparing the Data Server ...
Initialize the Data Server based on the new IP address ...

This output comes from a bash script prepared in the shell profile file. It is automatically executed when the image is started for the first time, preparing and initializing the DS. Afterward, format the Namenode of Hadoop with the command:

$ hadoop namenode -format

This also needs to be done only once.

4. Normally, after the system has booted, Parallel SECONDO can be started with the commands:

$ start-all.sh
$ ps-startMonitors

5. Start up the Internet browser, which has the Hadoop file system and job monitors prepared as its two home pages: localhost:50070 and localhost:50030. Look them over to ensure that Hadoop is working, i.e., HDFS has left the safe mode and the JobTracker has a living TaskTracker.
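The same checks can also be made from the terminal; the following is a minimal sketch using standard Hadoop 1.x commands (these are general Hadoop tools, not part of the Parallel SECONDO scripts):

$ hadoop dfsadmin -safemode get    # should report: Safe mode is OFF
$ hadoop job -list                 # should answer without errors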

1 http://www.vmware.com/


6. Back in the terminal console, start up the Parallel SECONDO text interface with the command:

$ ps-startTTYCS -s 1

Finally, Parallel SECONDO can be stopped with the following two commands:

$ ps-stopMonitors
$ stop-all.sh

A.2.2 Amazon Machine Image

The Parallel SECONDO AMI 1.2 is publicly provided on AWS EC2, in order to simplify the preparation work for end-users. At present, it is available in the zone US-East (located in Northern Virginia) and its AMI id is ami-f3167d9a. It is derived from the public AMI ami-3d4ff254, uses Ubuntu Server 12.04 64bit as the operating system, and contains all DS components on a single computer.

In the following, the steps for setting up Parallel SECONDO on a single EC2 instance and on a virtual cluster consisting of multiple EC2 instances are introduced, respectively. Most steps apply to general AWS usage, hence their further details are explained in the User Guide of Amazon EC2 1.

Set Up on a Single EC2 Instance

It is possible to set up Parallel SECONDO on any one of the 18 kinds of EC2 instances, with the following steps:

1. Sign up for Amazon EC2. An AWS account is required for using the services. This account must be linked to a credit card, but is charged only for the rented resources 2.

2. Launch an EC2 instance based on the Parallel SECONDO AMI 1.2. Amazon provides a browser-based console dashboard to manage EC2 resources, where end-users can start a single instance by simply clicking the Launch Instance button on the start page. Nevertheless, a few options should be specifically set for Parallel SECONDO.

• End-users can select the classic wizard to start the instance, finding and indicating the AMI with id ami-f3167d9a.

1 http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/
2 http://aws.amazon.com/ec2/pricing/


• Up to now, Parallel SECONDO is only published in the zone US-East, hence the started instance should also be located in that region; otherwise the image cannot be found.

• In the security group settings, the ports used by Hadoop include at least 49000, 49001, 50010, 50030, 50070 and 50075. The Mini-SECONDO of the DS needs 11234. At last, port 22 must be opened for the SSH connection. The name of the security group can be set to any value; here we call it PSGroup.

• A key-pair is created and its private key file can be downloaded. The name of the downloaded private key should not be changed, but its permission should be set to read-only for the owner. Here it is named myKey.pem.

$ mv myKey.pem $HOME/.ssh
$ chmod 400 $HOME/.ssh/myKey.pem

3. In order to connect to a started instance, end-users can simply right-click the instance on the console dashboard and choose the Connect option. Hereby, the instance is connected with the browser-based terminal MindTerm. Note that the account name is ubuntu; end-users should also indicate the location of the downloaded private key file and the name of the security group.

4. When the instance is started for the first time, two processes are performed automatically.

• The first is $HOME/.parasecrc. It prepares the DS of the started instance based on the encapsulated DS example. If the instance is of micro type, the DS is placed on the boot volume. Otherwise it is placed on the instance store, in order to obtain a much larger disk space.

• The second is ps-ec2-initialize, prepared at $HOME/secondo/Algebras/Hadoop/clusterManagement. It sets various parameters in the local DS according to the assigned IP address.

5. After initializing the instance, Parallel SECONDO is already set up. Format the Namenode of Hadoop, then start the Hadoop platform and the local Mini-SECONDO monitor. At last, start up the text interface for the whole system.

$ hadoop namenode -format
$ start-all.sh
$ ps-startMonitors
$ ps-startTTYCS -s 1


Afterward, end-users can look into the Hadoop platform through its two monitor ports 50070 and 50030, based on the instance's public DNS address, which can be found on the Instances panel of the console dashboard.
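For example, with a placeholder DNS name (the actual address differs for every instance), the two monitor pages are reached at:

http://<public-DNS>:50070    (HDFS monitor)
http://<public-DNS>:50030    (MapReduce job monitor)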

The instance can be stopped when it is temporarily not required. Normally we store the Parallel SECONDO components on the boot volume, while all custom data resides on the temporary instance store. Therefore, the stopped instance can be restarted and used, but all database objects are lost.

Set Up on an EC2 Cluster

The Parallel SECONDO AMI can also be used to quickly set up an EC2 cluster, with the auxiliary tool named ps-ec2-startInstances. Different from the other Parallel SECONDO scripts, it can run without installing SECONDO. However, it uses many EC2 Command Line Interface Tools (CLI Tools), which should be prepared first. Details about installing the CLI Tools are also introduced in the user guide for EC2. Afterward, Parallel SECONDO can be deployed on EC2 clusters with the following steps:

1. Sign up for AWS EC2 and create the key-pair and the security group. Both are created once per account, hence they can be reused for many clusters.

2. Download ps-ec2-startInstances and start it. All its arguments are listed in Table A.1; most of them can be left at their default values.

$ export EC2_KeyPair=$HOME/.ssh/myKey.pem
$ ps-ec2-startInstances -n 5 -g PSGroup

The above commands create a virtual cluster consisting of five micro type EC2 instances in the us-east-1a zone. The file myKey.pem is the private key, while PSGroup is the security group. Both were prepared when setting up the system on the single instance.

Among the five instances, one is viewed as the master node and its Name tag is set to Master. The other four instances are all marked Slaves in their Name tags. In addition, the option master-is-slave is set to yes by default, hence the master node is also used as a slave in the started system. Thereby, this cluster contains in total one mDS and five sDSs.
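For instance, a second, larger cluster with its own Name tags could be started as follows; the flag values are purely illustrative and follow the arguments of Table A.1:

$ export EC2_KeyPair=$HOME/.ssh/myKey.pem
$ ps-ec2-startInstances -n 20 -t m1.large -g PSGroup -m Master2 -s Slaves2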

The start-up procedure usually takes several minutes. Sometimes, especially when creating a cluster consisting of tens or even hundreds of instances, it happens that several instances cannot be started correctly. In this case, a prompt like the following appears:


Till now, 4 instances are started,
and there are still 1 instances pending.
Would you like to:

1) Keep waiting for another ten seconds.
2) Terminate all unstarted instances, and start new ones.
3) Abort. (Note!! All started instances are not stopped.)

Accordingly, end-users can select one of the options based on the current situation. Besides, when the process is about to finish, a prompt like the following may appear:

The authenticity of host '... (...)' can't be established.
RSA key fingerprint is .......
Are you sure you want to continue connecting (yes/no)?

Here the user is asked to add the master instance to the known-host list of the local computer, which can simply be answered with yes. This question shows up every time a new cluster is created. If users prefer to avoid it, the following lines can be added to the $HOME/.ssh/config file:

Host *
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null

3. The final output of the start-up script is:

The initialization is finished,
you can log into the master node with command:
ssh -i *.pem ubuntu@ec2-*.amazonaws.com

Therefore, end-users can directly access the master node of the cluster with the given ssh command.
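For example (the host name below is only a placeholder; the real one is printed by the script):

$ ssh -i $HOME/.ssh/myKey.pem ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com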

4. The first time the master node is accessed, the Parallel SECONDO initialization process starts automatically, which also takes several minutes. In the end, the whole system is fully prepared and can be used with the commands:

$ hadoop namenode -format
$ start-all.sh
$ cd $HOME/secondo/Algebras/Hadoop/clusterManagement
$ ps-start-AllMonitors
$ ps-startTTYCS -s 1

All instances within the cluster can be stopped or terminated in the console dashboard or by the CLI tools when they are no longer required. It is recommended not to reuse the Name tags of one cluster in another cluster, in order to prevent unnecessary problems. Even when a cluster has already been terminated, it still takes a certain time to remove the instances from the account, thus their Name tags cannot be recycled during that time.

Argument  Description                  Type     Default Value
-h        Print the help information
-i        AMI ID                       string   ami-f3167d9a
-n        Number of instances          int      1
-m        Master node's Name tag       string   Master
-s        Slave nodes' Name tag        string   Slaves
-g        Security group name          string
-k        The private key file         string   $EC2_KeyPair
-t        Instance type                string   t1.micro
-z        Availability zone            string   us-east-1a
-o        Additional options:
          master-is-slave              yes/no   yes

Table A.1: Cluster Preparation Arguments


Appendix B

Evaluation Queries

Here we list all queries used in this thesis's evaluations, which are too long to be displayed in the main text. They are all join queries, divided into three parts according to the joining attributes' data types. All approaches like sequential, HDJ, SDJ and their respective variants are included. Of course, those that have been explained in the main text are not repeated here.

B.1 Join on Standard Data Types

The sequential expression of the 12th TPC-H example query is listed below. It processes the join on the two relations LINEITEM and ORDERS based on their ORDERKEY attribute, with the hash join algorithm (lines 2-8). Afterward, the join results are aggregated by the SHIPMODE attribute and the query results are calculated for each group (lines 9-16).

1  query
2  LINEITEM feed
3  filter[ (.lSHIPMODE in
4  [const vector(string) value ("MAIL" "SHIP")])
5  and (.lCOMMITDATE < .lRECEIPTDATE)
6  and (.lSHIPDATE < .lCOMMITDATE) ]
7  ORDERS feed
8  hashjoin[lORDERKEY,oORDERKEY, 99997 ]
9  sortby[lSHIPMODE]
10 groupby[lSHIPMODE;
11 high_line_count: group feed
12 filter[(.oORDERPRIORITY = "1-URGENT")
13 or (.oORDERPRIORITY = "2-HIGH")] count,
14 low_line_count: group feed
15 filter[(.oORDERPRIORITY # "1-URGENT")
16 and (.oORDERPRIORITY # "2-HIGH")] count]
17 consume;

Next, in the parallel queries, two constants Cluster_Size and PS_SCALE are first created. The first indicates the number of sDSs, while the second tells how many Reduce tasks can run in parallel on the cluster. They are set according to our own six-computer cluster, which is also used in all parallel queries of this thesis. Afterward, both participating relations are distributed over the cluster as DLF flist objects, with the spread operation.

1  let Cluster_Size = 12;
2  let PS_SCALE = 36;
3  let LINEITEM_Cnt_DLF = LINEITEM feed addcounter[Cnt, 1]
4  spread[;Cnt,Cluster_Size;];
5  let ORDERS_Cnt_DLF = ORDERS feed addcounter[Cnt, 1]
6  spread[;Cnt,Cluster_Size;];
7
8  query
9  LINEITEM_Cnt_DLF hadoopMap[DLF, FALSE
10 ; . filter[ (.lSHIPMODE in [const vector(string)
11 value ("MAIL" "SHIP")])
12 and (.lCOMMITDATE < .lRECEIPTDATE)
13 and (.lSHIPDATE < .lCOMMITDATE) ]]
14 ORDERS_Cnt_DLF
15 hadoopReduce2[lORDERKEY, oORDERKEY,
16 DLF, PS_SCALE, TRUE
17 ; . .. hashjoin[lORDERKEY,oORDERKEY, 99997 ]
18 sortby[lSHIPMODE] groupby[lSHIPMODE;
19 high_line_count: group feed
20 filter[(.oORDERPRIORITY = "1-URGENT")
21 or (.oORDERPRIORITY = "2-HIGH")] count,
22 low_line_count: group feed
23 filter[(.oORDERPRIORITY # "1-URGENT")
24 and (.oORDERPRIORITY # "2-HIGH")] count] ]
25 collect[]
26 sortby[lSHIPMODE] groupby[ lSHIPMODE;
27 high_line_count: group feed sum[high_line_count],
28 low_line_count: group feed sum[low_line_count]]
29 consume;

The above HDJ query basically expresses the hash join procedure with Hadoop operators. Both input relations are redistributed based on the joining attribute ORDERKEY (line 15). Since the tasks are not partitioned based on the aggregation attribute, the groupby operation is divided into two stages. A partial aggregation is first processed in each Reduce task (lines 18-24), then the groups are aggregated globally and the partial results are summed up (lines 26-28).

For the SDJ approach, the query is almost the same as the above statement, except that the isHDJ argument should be set to false (line 16). The parajoin2 operator is also not needed in the SDJ query: since we use hashjoin as the internal join procedure for each Reduce task, it is not necessary to divide the tuples into smaller groups. Therefore, the statement of the SDJ query is not listed separately here; a minimal sketch of the only changed fragment is shown below.
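The sketch replaces lines 15-16 of the HDJ query above, while the Reduce UDF and all other lines remain unchanged:

15 hadoopReduce2[lORDERKEY, oORDERKEY,
16 DLF, PS_SCALE, FALSE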

B.2 Join on Spatial Data Types

As introduced in Section 4.2.5, we use the data set ROADS to evaluate the parallel join on spatial objects. It is provided by the OpenStreetMap project and describes the road network of the federal state North Rhine-Westphalia in Germany. It contains the following attributes:

{OSM_id: int, Name: string, Ref: string, Type: string, OneWay: int,
 Bridge: int, Maxspeed: int, No: int, Shape: line}

Basically, our queries use only a few attributes of the relation: OSM_id is a street's unique id and Name denotes its name. The streets' geographical shapes are stored in the Shape attribute, as line (polyline) objects in SECONDO.
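As a small illustration of these attributes, the following sequential query (a sketch, assuming the ROADS relation has been restored in the open database) lists the first five street names together with the bounding boxes of their shapes:

query ROADS feed head[5]
  projectextend[Name; Box: bbox(.Shape)]
  consume;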

1  query
2  ROADS feed extend[Box: bbox(.Shape)]
3  projectextendstream[Name, No, Box, Shape;
4  Cell: cellnumber(.Box, Roads_CellGrid)]
5  sortby[Cell] {r1}
6  ROADS feed extend[Box: bbox(.Shape)]
7  projectextendstream[Name, No, Box, Shape;
8  Cell: cellnumber(.Box, Roads_CellGrid)]
9  sortby[Cell] {r2}
10 parajoin2[Cell_r1, Cell_r2
11 ; . .. symmjoin[(.No_r1 < ..No_r2)
12 and gridintersects(Roads_CellGrid,
13 .Box_r1, ..Box_r2, .Cell_r1)
14 and (.Shape_r1 intersects ..Shape_r2) ]]
15 count;

In the above sequential query, the PBSM method is applied to process the spatial join. Both input streets' bounding boxes are first partitioned into a 2D cell grid named Roads_CellGrid, then joined cell by cell with the parajoin2 operation (lines 2-10). In each cell, duplicate results are removed by using gridintersects, and the refinement is processed with the operator intersects on the two polylines' detailed coordinates (lines 12-14).

This query can simply be converted to the following parallel query in the SDJ approach. It first distributes the relation over the cluster with the Flow operator spread, based on the OSM_id attribute. Afterward, just like the spatio-temporal join in Section 4.2, it shuffles the streets based on the cells they belong to, after partitioning them into the grid (lines 5-13). Within each Reduce task, a parajoin2 operation is performed by cells (lines 14-19). At last, the partial results are calculated in the tasks and then summed up in the master database.

1  let Roads_Id_dlf = ROADS feed
2  spread[;OSM_id,Cluster_Size, TRUE;];
3
4  query
5  Roads_Id_dlf hadoopMap[ FALSE
6  ; . extend[Box: bbox(.Shape)]
7  projectextendstream[Name, No, Box, Shape
8  ; Cell: cellnumber(.Box, Roads_CellGrid)] ]
9  Roads_Id_dlf hadoopMap[ FALSE
10 ; . extend[Box: bbox(.Shape)]
11 projectextendstream[Name, No, Box, Shape
12 ; Cell: cellnumber(.Box, Roads_CellGrid)] ]
13 hadoopReduce2[Cell, Cell, DLF, PS_SCALE, FALSE
14 ; . sortby[Cell] {r1} .. sortby[Cell] {r2}
15 parajoin2[ Cell_r1, Cell_r2
16 ; . .. symmjoin[(.No_r1 < ..No_r2)
17 and gridintersects(Roads_CellGrid,
18 .Box_r1, ..Box_r2, .Cell_r1)
19 and (.Shape_r1 intersects ..Shape_r2) ]]
20 count feed namedtransformstream[partCnt] ]
21 collect[]
22 sum[partCnt];

Stating this query in the HDJ approach is also simple, as listed below. It is almost the same as the SDJ query, with two exceptions. One is that the isHDJ parameter is set to true (line 10). The other is that the parajoin2 operator is removed, since the parajoin operator is used implicitly within the HDJ approach. Therefore, the internal join operator symmjoin can be used directly in the Reduce UDF to process the join on every cell (lines 11-15).

1  query
2  Roads_Id_dlf hadoopMap[ FALSE
3  ; . extend[Box: bbox(.Shape)]
4  projectextendstream[Name, No, Box, Shape
5  ; Cell: cellnumber(.Box, Roads_CellGrid)] ]
6  Roads_Id_dlf hadoopMap[ FALSE
7  ; . extend[Box: bbox(.Shape)]
8  projectextendstream[Name, No, Box, Shape
9  ; Cell: cellnumber(.Box, Roads_CellGrid)] ]
10 hadoopReduce2[Cell, Cell, DLF, PS_SCALE, TRUE
11 ; . {r1} .. {r2}
12 symmjoin[(.No_r1 < ..No_r2)
13 and gridintersects(Roads_CellGrid,
14 .Box_r1, ..Box_r2, .Cell_r1)
15 and (.Shape_r1 intersects ..Shape_r2) ] count
16 feed namedtransformstream[partCnt] ]
17 collect[]
18 sum[partCnt];

The SDJ-Index approach, described in Section 4.2.2, can also be used for the parallel spatial join. As denoted below, it also abandons the parajoin2 operator, to avoid the sorting procedure. Instead, the itSpatialJoin operator is used in each task to process the index-based nested-loop join, creating the in-memory index structures on the fly (line 12).

1  query
2  Roads_Id_dlf hadoopMap[ FALSE
3  ; . extend[Box: bbox(.Shape)]
4  projectextendstream[Name, No, Box, Shape
5  ; Cell: cellnumber(.Box, Roads_CellGrid)] ]
6  Roads_Id_dlf hadoopMap[ FALSE
7  ; . extend[Box: bbox(.Shape)]
8  projectextendstream[Name, No, Box, Shape
9  ; Cell: cellnumber(.Box, Roads_CellGrid)] ]
10 hadoopReduce2[Cell, Cell, DLF, Core_Size, FALSE
11 ; . {r1} .. {r2}
12 itSpatialJoin[Box_r1, Box_r2, 10, 20]
13 filter[(.No_r1 < .No_r2)
14 and (.Cell_r1 = .Cell_r2)
15 and gridintersects(Roads_CellGrid,
16 .Box_r1, .Box_r2, .Cell_r1)
17 and (.Shape_r1 intersects .Shape_r2)]
18 count feed namedtransformstream[partCnt] ]
19 collect[]
20 sum[partCnt];


B.3 Join on Spatio-Temporal Data Types

Most join queries on spatio-temporal data have been exhibited in the main text; here we only present the remaining ones for the thesis's completeness. The SDJ-Index approach is listed first, although its details have already been explained in Section 4.2.4.

1  let OBACRres060 =
2  Vehicles_Moid_dlf
3  hadoopMap[DLF, FALSE; . feed filter[.Type = "truck"]
4  extendstream[UTrip: units(.Journey)]
5  extend[Box: enlargeRect(
6  WORLD_SCALE_BOX(bbox(.UTrip)), 5.0, 5.0, 0.0)]
7  projectextendstream[Licence, Box, UTrip
8  ;Cell: cellnumber(.Box, CELLGRID) ]]
9  Vehicles_Moid_dlf
10 hadoopMap[DLF, FALSE; . feed filter[.Type = "truck"]
11 extendstream[UTrip: units(.Journey)]
12 extend[Box: enlargeRect(
13 WORLD_SCALE_BOX(bbox(.UTrip)), 5.0, 5.0, 0.0)]
14 projectextendstream[Licence, Box, UTrip
15 ;Cell: cellnumber(.Box, CELLGRID) ]]
16 hadoopReduce2[Cell, Cell, DLF, PS_Scale, FALSE
17 ; . {V1} .. {V2}
18 itSpatialJoin[Box_V1, Box_V2, 10, 20]
19 filter[(.Licence_V1 < .Licence_V2)
20 and (.Cell_V1 = .Cell_V2)
21 and gridintersects(
22 CELLGRID, .Box_V1, .Box_V2, .Cell_V1)
23 and sometimes(
24 distance(.UTrip_V1,.UTrip_V2) <= 10.0)]
25 project[Moid_V1, Licence_V1, Moid_V2, Licence_V2]
26 sortby[Moid_V1, Moid_V2] rdup ]
27 collect[]
28 sortby[Moid_V1, Moid_V2] rdup
29 consume;

In addition, this query is also used to evaluate Parallel SECONDO on large-scale clusters consisting of AWS EC2 instances. However, it is modestly adjusted in order to achieve the most effective use of the AWS resources. Firstly, a certain percentage of the tuples is used in the evaluation (lines 3, 10), instead of filtering the vehicles by their types. This is because the selectivity of the different vehicle types in the BerlinMOD relation is very small and unstable, hence we cannot get enough data for the evaluations on tens and hundreds of computers. Secondly, both the SDJ and HDJ approaches use the in-memory index to improve the query processing, and HDJ applies the sorting on 〈key, value〉 pairs automatically in the Shuffle stage. Therefore, in order to keep the comparison between them fair, the sorting procedure and the parajoin2 operation are kept (lines 17-18). The adjusted query is named SDJ-Index'.
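Note that the constant TupleNum used below is not created by the listed statements themselves; it must be set in the master database beforehand, for instance (the value is purely illustrative):

let TupleNum = 10000;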

1  let OBACRres060 =
2  Vehicles_Moid_dlf
3  hadoopMap[DLF, FALSE; . feed head[TupleNum]
4  extendstream[UTrip: units(.Journey)]
5  extend[Box: enlargeRect(
6  WORLD_SCALE_BOX(bbox(.UTrip)), 5.0, 5.0, 0.0)]
7  projectextendstream[Licence, Box, UTrip
8  ;Cell: cellnumber(.Box, CELLGRID) ]]
9  Vehicles_Moid_dlf
10 hadoopMap[DLF, FALSE; . feed head[TupleNum]
11 extendstream[UTrip: units(.Journey)]
12 extend[Box: enlargeRect(
13 WORLD_SCALE_BOX(bbox(.UTrip)), 5.0, 5.0, 0.0)]
14 projectextendstream[Licence, Box, UTrip
15 ;Cell: cellnumber(.Box, CELLGRID) ]]
16 hadoopReduce2[Cell, Cell, DLF, PS_Scale, FALSE
17 ; . sortby[Cell] {V1} .. sortby[Cell] {V2}
18 parajoin2[Cell_V1, Cell_V2
19 ; . .. itSpatialJoin[Box_V1, Box_V2, 10, 20]
20 filter[(.Licence_V1 < .Licence_V2)
21 and (.Cell_V1 = .Cell_V2)
22 and gridintersects(
23 CELLGRID, .Box_V1, .Box_V2, .Cell_V1)
24 and sometimes(
25 distance(.UTrip_V1,.UTrip_V2) <= 10.0)]]
26 project[Moid_V1, Licence_V1, Moid_V2, Licence_V2]
27 sortby[Moid_V1, Moid_V2] rdup ]
28 collect[]
29 sortby[Moid_V1, Moid_V2] rdup
30 consume;

Consequently, the HDJ approach is also changed for the evaluation on AWS EC2 and named HDJ-Index'. Basically, it introduces the in-memory index to process the join procedure within each cell, since the symmjoin operation is too costly for the large amounts of data used in the cloud evaluation (line 18).

1  let OBACRres060 =
2  Vehicles_Moid_dlf
3  hadoopMap[DLF, FALSE; . feed head[TupleNum]
4  extendstream[UTrip: units(.Journey)]
5  extend[Box: enlargeRect(
6  WORLD_SCALE_BOX(bbox(.UTrip)), 5.0, 5.0, 0.0)]
7  projectextendstream[Licence, Box, UTrip
8  ;Cell: cellnumber(.Box, CELLGRID) ]]
9  Vehicles_Moid_dlf
10 hadoopMap[DLF, FALSE; . feed head[TupleNum]
11 extendstream[UTrip: units(.Journey)]
12 extend[Box: enlargeRect(
13 WORLD_SCALE_BOX(bbox(.UTrip)), 5.0, 5.0, 0.0)]
14 projectextendstream[Licence, Box, UTrip
15 ;Cell: cellnumber(.Box, CELLGRID) ]]
16 hadoopReduce2[Cell, Cell, DLF, PS_Scale, TRUE
17 ; . {V1} .. {V2}
18 itSpatialJoin[Box_V1, Box_V2, 10, 20]
19 filter[(.Licence_V1 < .Licence_V2)
20 and (.Cell_V1 = .Cell_V2)
21 and gridintersects(
22 CELLGRID, .Box_V1, .Box_V2, .Cell_V1)
23 and sometimes(
24 distance(.UTrip_V1,.UTrip_V2) <= 10.0)]
25 project[Moid_V1, Licence_V1, Moid_V2, Licence_V2]
26 sortby[Moid_V1, Moid_V2] rdup ]
27 collect[]
28 sortby[Moid_V1, Moid_V2] rdup
29 consume;


Appendix C

Parallel BerlinMOD Benchmark

Parallel SECONDO enables end-users to process specialized data at large scale. In order to evaluate its performance with a broad range of queries on moving objects, the BerlinMOD benchmark [16] has been converted, so that all of its seventeen range queries can be processed by Parallel SECONDO. Some of its queries, like the 1st, 6th and 10th examples, are introduced in the main text, but we list all queries' statements in this appendix with brief introductions, to demonstrate the flexibility of our system.

The complete benchmark can be downloaded from our website 1. Readers are welcome to try it out on their own Parallel SECONDO systems.

C.1 Data Generation

A parallel generator is specifically prepared for creating a large amount of benchmark data for the evaluation. It is converted from the original BerlinMOD data generator and contains several SECONDO scripts and a Hadoop program. The principle and procedure of the generation have been discussed in Section 4.3; here we only introduce its usage.

In order to simplify the parallel generation process, a bash script named genParaBerlinMOD is provided, with the following parameters:

• -h : Print the usage message.

• -d : Set up the name of the generated database.

• -s : Set up the scale factor of the data set.

• -p : Set up the number of days for the simulation period.

1 http://dna.fernuni-hagen.de/secondo/files/Parallel_BerlinMOD.tar.gz


• -l : Generate the data on a single computer only.

For example, the following command creates a database named berlinmod in the master database of a Parallel SECONDO system. All benchmark objects are then created in this database with the SF (Scale Factor) set to 2.0.

$ ./genParaBerlinMOD.sh -d berlinmod -s 2.0
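Likewise, a smaller data set with a four-week simulation period could be generated with the command below; the parameter values are purely illustrative:

$ ./genParaBerlinMOD.sh -d berlinmod_small -s 0.2 -p 28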

A set of auxiliary objects is also created with a SECONDO script named BerlinMOD_Parallel_CreateObjects, which is processed in the master database after the data generation.

Among all these created objects, some are often used in the following parallel queries, hence they are introduced here first. All generated trajectories (moving objects) are distributed over the cluster as a DLO flist object named Vehicles_Moid_dlo, with the schema:

flist(relation{MoId: int, Licence: string, Model: string,
 Type: string, Journey: mpoint})

Each record contains one vehicle's complete track during the whole simulation period, stored in the Journey attribute.
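As a quick sanity check after the generation, the distributed vehicles can be counted following the pattern of query Q2 below (a sketch; the result name "VehicleCnt" is arbitrary):

query Vehicles_Moid_dlo
  hadoopMap["VehicleCnt", DLF; . feed project[Licence]]
  collect[] count;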

Besides, several index structures are also created, stored in the master database as DLO flist objects. The Vehicles_Journey_sptuni_Moid_dlo creates a spatial R-Tree on every slave database, based on the trajectory units' 2D bounding boxes. The Vehicles_Journey_sptmpuni_Moid_dlo then creates a spatio-temporal R-Tree, based on the units' 3D bounding boxes. In addition, the Vehicles_Licence_btree_Moid_dlo creates a distributed B-Tree on the vehicles' licence plate numbers.

Afterward, three cell grids are created for the benchmark: WORLD_GRID_3D, WORLD_GRID_2D and WORLD_LAYERS_3D. They are all named after "WORLD" since the generated data is scaled up according to the WORLD policy described in Section 4.2.3. The first two are common cell grids built on the 2D and 3D spaces. The last one partitions the space on the time axis only, hence it divides the space into a set of layers. Accordingly, three ratios WORLD_X_SCALE, WORLD_Y_SCALE and WORLD_T_SCALE are used to scale up the units during the query procedures.

Furthermore, the sample relations QueryLicences, QueryPoints, QueryInstants and QueryPeriods are also created. They are all very small, containing only one hundred tuples each, hence we duplicate them on the cluster and build them as DLF flist objects, which will be introduced below along with the queries.


C.2 Parallel Range Queries

In the following, all parallel queries are introduced one after another. They are numbered based on the BerlinMOD benchmark; each comes with a basic explanation in plain English and the formal notation in the SECONDO executable language.

Q1

What are the models of the vehicles with licence plate numbers from QueryLicences?

1  let OBACRres001 =
2  QueryLicences_Dup_dlf
3  hadoopMap[ "Q1_Result", DLF; . {O} loopsel[
4  para(Vehicles_Licence_btree_Moid_dlo)
5  para(Vehicles_Moid_dlo)
6  exactmatch[.Licence_O]] project[Licence,Model]
7  ]
8  collect[] consume;

This query is very easy: each slave database uses the complete sample relation to probe its partial B-Tree of Vehicles_Licence_btree_Moid_dlo. It is also introduced in Section 4.3.2. In the remaining queries, vehicles from QueryLicences are often found in this way.

Q2

How many vehicles are “passenger” cars?

1  let OBACRres002 = Vehicles_Moid_dlo
2  hadoopMap[ "Q2_Result", DLF
3  ; . feed filter[.Type = "passenger"]
4  ]
5  collect[] count;

This is also quite simple. Each slave database scans its partial relation in Vehicles_Moid_dlo and filters the tuples by the Type attribute.

Q3

Where have the vehicles with licences from QueryLicences1 been at each of the instants from QueryInstants1?


1  let OBACRres003 =
2  QueryLicences_Top10_Dup_dlf
3  hadoopMap["Q3_Result", DLF
4  ; . loopjoin[
5  para(Vehicles_Licence_btree_Moid_dlo)
6  para(Vehicles_Moid_dlo)
7  exactmatch[.Licence] {LL}]
8  para(QueryInstants_Top10_Dup_dlo) feed {II}
9  product
10 projectextend[;
11 Licence: .Licence_LL, Instant: .Instant_II,
12 Pos: val(.Journey_LL atinstant .Instant_II)] ]
13 collect[] consume;

QueryLicences1 denotes the first ten tuples of the QueryLicences sample. It is also duplicated and stored in a DLF flist named QueryLicences_Top10_Dup_dlf.

Likewise, QueryInstants1 denotes the first ten tuples of the sample QueryInstants. It is duplicated too, but stored in a DLO flist object named QueryInstants_Top10_Dup_dlo, since it is used in the Map UDF of the hadoopMap operation. Normally, flist objects within Map or Reduce UDFs are created as DLO objects, in order to let each task get one row of the data only. Otherwise, if they are DLF objects, each task attempts to fetch the full data set.

The parallel query can still finish within the Map stage only. Each task first finds the trajectories by probing the B-Tree index (lines 4-7), then creates a Cartesian product of them with all required instants (lines 8-9). At last, the vehicles' positions at every instant are calculated with the SECONDO operator atinstant (line 12).

Q4

Which licence plate numbers belong to vehicles that have passed the points from QueryPoints?

1  let OBACRres004 =
2  QueryPoints_Dup_dlf
3  hadoopMap["Q4_Result", DLF
4  ; . loopjoin[ para(Vehicles_Journey_sptuni_Moid_dlo)
5  windowintersectsS[bbox(.Pos)] sort rdup
6  para(Vehicles_Moid_dlo) gettuples]
7  filter[.Journey passes .Pos] project[Pos, Licence]
8  sortby[Pos, Licence] krdup[Pos, Licence]
9  ]
10 collect[] consume;


Here each Map task first finds all vehicles that possibly passed the sample point, by probing its local spatial R-Tree in Vehicles_Journey_sptuni_Moid_dlo with the windowintersectsS operator (lines 3-6). Since the spatial R-Tree is built upon the trajectory units' bounding boxes, a refinement is processed with the operator passes, to ensure that the trajectory actually passed that point (line 7).

Q5

What is the minimum distance between two vehicles, where one has a licence from QueryLicences1 and the other has a licence from QueryLicences2?

1  let OBACRres005 =
2  QueryLicences_Top10_Dup_dlf
3  hadoopMap[DLF, FALSE; . loopsel[
4  para(Vehicles_Licence_btree_Moid_dlo)
5  para(Vehicles_Moid_dlo)
6  exactmatch[.Licence] ]
7  projectextend[Licence
8  ; Traj: simplify(trajectory(.Journey), 0.000001)]]
9  QueryLicences_2Top10_Dup_dlf
10 hadoopMap[DLF, FALSE; . loopsel[
11 para(Vehicles_Licence_btree_Moid_dlo)
12 para(Vehicles_Moid_dlo)
13 exactmatch[.Licence] ]
14 projectextend[Licence
15 ; Traj: simplify(trajectory(.Journey),0.000001)]
16 intstream(1, PS_SCALE) namedtransformstream[RTID]
17 product ]
18 hadoopReduce2[Licence, RTID, DLF, PS_SCALE
19 ; . {V1} .. {V2} product
20 projectextend[;
21 Licence1: .Licence_V1, Licence2: .Licence_V2,
22 Dist: distance(.Traj_V1, .Traj_V2) ] sort rdup]
23 collect[] sort rdup consume;

Here QueryLicences2 denotes the second ten tuples of the licence sample relation. We also duplicate it in a DLF flist named QueryLicences_2Top10_Dup_dlf.

In principle, this query creates a Cartesian product of both input vehicle sets, then finds the minimum distance for each vehicle pair. Therefore, the parallel query first uses two unexecuted hadoopMap operators to find both input vehicle sets by probing the licence B-Tree (lines 2-6 and 9-13). Note that the right-side result is duplicated on the cluster (lines 16-17) to achieve a full product. The moving objects, which are stored as 3D mpoint data, are projected into 2D trajectories as line data by using the SECONDO operator trajectory (lines 8, 15). Afterward, each Reduce task calculates the minimum distance with the operator distance (line 22).

Q6

What are the pairs of licence plate numbers of "trucks" which have ever been as close as 10m or less to each other?

1  let OBACRres006 = Vehicles_Moid_dlo
2  hadoopMap[DLF, FALSE; . feed filter[.Type = "truck"]
3  extendstream[UTrip: units(.Journey)]
4  extend[Box: scalerect(bbox(.UTrip),
5  WORLD_X_SCALE, WORLD_Y_SCALE, WORLD_T_SCALE)]
6  projectextendstream[Licence, Box, UTrip
7  ;Cell: cellnumber(.Box, WORLD_GRID_3D) ] ]
8  Vehicles_Moid_dlo
9  hadoopMap[DLF, FALSE; . feed filter[.Type = "truck"]
10 extendstream[UTrip: units(.Journey)]
11 extend[Box: scalerect(bbox(.UTrip),
12 WORLD_X_SCALE, WORLD_Y_SCALE, WORLD_T_SCALE)]
13 projectextendstream[Licence, Box, UTrip
14 ;Cell: cellnumber(.Box, WORLD_GRID_3D) ] ]
15 hadoopReduce2[Cell, Cell, DLF, PS_SCALE
16 ; . {V1} .. {V2}
17 itSpatialJoin[Box_V1, Box_V2, 10, 20]
18 filter[(.Licence_V1 < .Licence_V2)
19 and gridintersects(WORLD_GRID_3D,
20 .Box_V1, .Box_V2, .Cell_V1)
21 and sometimes(
22 distance(.UTrip_V1,.UTrip_V2) <= 10.0) ]
23 projectextend[; Licence1: .Licence_V1,
24 Licence2: .Licence_V2]
25 sort rdup]
26 collect[] sort rdup consume;

This query is used as the major example of this thesis and is fully explained with the different approaches in Section 4.2.3. Here it is stated in SDJ-Index, which is also often used in the other benchmark queries for processing the multi-dimensional join.


Q7

What are the licence plate numbers of the "passenger" cars that have reached the points from QueryPoints first during the complete observation period?

1  let OBACRres007PointMinInst_Dup_dlf =
2  QueryPoints_Dup_dlf
3  hadoopMap[ DLF; . loopjoin[
4  para(Vehicles_Journey_sptuni_Moid_dlo)
5  windowintersectsS[bbox(.Pos)]
6  sort rdup para(Vehicles_Moid_dlo) gettuples ]
7  filter[.Type = "passenger"]
8  projectextend[Pos
9  ; Instant: inst(initial(.Journey at .Pos)) ]
10 filter[not(isempty(.Instant))]
11 sortby[Pos asc, Instant asc]
12 groupby[Pos; SlaveFirstTime: group feed min[Instant]]]
13 collect[] sortby[Pos] groupby[Pos
14 ; FirstTime: group feed min[SlaveFirstTime]]
15 intstream(1, CLUSTER_SIZE) namedtransformstream[SID]
16 product
17 spread["OBACRres007PointMinInst_Dup",''
18 ;SID, CLUSTER_SIZE, FALSE;];
19
20 let OBACRres007 =
21 OBACRres007PointMinInst_Dup_dlf
22 hadoopMap[ "Q7_Result", DLF
23 ; . extend[MBR: box3d(bbox(.Pos),.FirstTime)]
24 loopjoin[ para(Vehicles_Journey_sptmpuni_Moid_dlo)
25 windowintersectsS[.MBR]
26 sort rdup para(Vehicles_Moid_dlo) gettuples ]
27 filter[.Type = "passenger"]
28 filter[.Journey passes .Pos]
29 projectextend[Licence, FirstTime,
30 Pos ; Instant: inst(initial(.Journey at .Pos))]
31 filter[.Instant <= .FirstTime]
32 project[ Pos, Licence ]
33 ]
34 collect[] consume;

This query is divided into two parts. The first finds, for each sample point, the earliest instant at which a "passenger" vehicle passes it. In the Map stage, the distributed spatial R-Tree Vehicles_Journey_sptuni_Moid_dlo is probed with every sample point, in order to find all "passenger" vehicles that have passed this point (lines 3-7). Afterward, the at operator calculates the instants when these vehicles pass these points (line 9). Next, an aggregation is processed to find the earliest instant (lines 11-12).

However, each task can only find the first vehicle within its own sub-relation, hence a global aggregation is required after the parallel query (lines 13-14). At the end of the first query, all (Pos, FirstTime) pairs are duplicated on the cluster (lines 15-18). Later, in the second query, the FirstTime of each point is used to probe the distributed spatio-temporal R-Tree Vehicles_Journey_sptmpuni_Moid_dlo (lines 22-30), and the first vehicle is found consequently.

Q8

What are the overall travelled distances of the vehicles with licence plate numbers from QueryLicences1 during the periods from QueryPeriods1?

1  let OBACRres008 =
2  QueryLicences_Top10_Dup_dlf
3  hadoopMap[DLF, FALSE; . {LL} loopsel[
4  para(Vehicles_Licence_btree_Moid_dlo)
5  para(Vehicles_Moid_dlo)
6  exactmatch[.Licence_LL] ]
7  projectextendstream[Licence; UTrip: units(.Journey)]
8  extend[Box: scalerect(bbox(.UTrip),
9  WORLD_X_SCALE, WORLD_Y_SCALE, WORLD_T_SCALE)]
10 extendstream[Cell: cellnumber(.Box, WORLD_GRID_3D )] ]
11 hadoopReduce[Cell, DLF, "Q8_Result", PS_SCALE
12 ; . para(QueryPeriods_Top10_Dup_dlo) feed {PP}
13 product
14 projectextendstream[Licence, Period_PP, UTrip
15 ; UPTrip: .UTrip atperiods .Period_PP]
16 extend[UDist: round(length(.UPTrip), 3)]
17 projectextend[Licence, UTrip, UDist
18 ;Period:.Period_PP]]
19 collect[] sort rdup
20 groupby2[Licence, Period; Dist: fun(t2:TUPLE, d2:real)
21 d2 + attr(t2, UDist)::0.0]
22 consume;

This query decomposes the result trajectories into units in the Map stage, then partitions them into the cell grid WORLD_GRID_3D (lines 3-10). Afterward, the units are further distributed into the Reduce tasks based on their cell numbers. In each Reduce task, the units' travelled distances during the sample periods are calculated with the atperiods and length operators (lines 14-16). At last, each vehicle's travelled distance is summed up by aggregating all its units (lines 20-21).

Q9

What is the longest distance that was travelled by a vehicle during each of the periods from QueryPeriods?

1  let OBACRres009 =
2  QueryPeriods_Dup_dlf
3  hadoopMap["Q9_Result", DLF; . {PP}
4  para(Vehicles_Moid_dlo) feed project[Journey] {V1}
5  product
6  projectextend[Id_PP ; Period: .Period_PP,
7  D: length(.Journey_V1 atperiods .Period_PP)]
8  sortby[Id_PP, Period, D desc]
9  groupby[Id_PP, Period
10 ; SubDist: round(group feed max[D],3) ]
11 project[Id_PP, Period, SubDist]
12 ] collect[]
13 sortby [Id_PP, Period, SubDist desc]
14 groupby[Id_PP, Period
15 ; Dist: round(group feed max[SubDist],3) ]
16 project[Period, Dist]
17 consume;

This query first finds the longest distance within each slave database, also by using the atperiods and length operators (lines 3-11). Afterward, a global groupby operation is processed to find the longest distance of each sample period on the cluster (lines 13-15).

Q10

When and where did the vehicles with licence plate numbers from QueryLicences1 meet other vehicles (distance < 3m), and what are the latter's licences?

1  let OBACRres010 =
2  QueryLicences_Dup_dlf
3  hadoopMap[DLF, FALSE; . head[10] {O} loopsel[
4  para(Vehicles_Licence_btree_Moid_dlo)
5  para(Vehicles_Moid_dlo)
6  exactmatch[.Licence_O] ]
7  extendstream[UTrip: units(.Journey)]
8  extend[Box: enlargeRect(scalerect(bbox(.UTrip),
9  WORLD_X_SCALE, WORLD_Y_SCALE,
10 WORLD_T_SCALE), 1.5, 1.5, 0.0)]
11 projectextendstream[Licence, Box, UTrip
12 ;Cell: cellnumber(.Box, WORLD_GRID_3D) ] ]
13 Vehicles_Moid_dlo
14 hadoopMap[DLF, FALSE; . feed
15 extendstream[UTrip: units(.Journey)]
16 extend[Box: enlargeRect(scalerect(bbox(.UTrip),
17 WORLD_X_SCALE, WORLD_Y_SCALE,
18 WORLD_T_SCALE), 1.5, 1.5, 0.0)]
19 projectextendstream[Licence, Box, UTrip
20 ;Cell: cellnumber(.Box, WORLD_GRID_3D) ] ]
21 hadoopReduce2[Cell, Cell, PS_SCALE, "Q10_Result", DLF
22 ; . {V1} .. {V2}
23 itSpatialJoin[Box_V1, Box_V2, 10, 20]
24 filter[(.Licence_V1 # .Licence_V2)
25 and gridintersects(
26 WORLD_GRID_3D, .Box_V1, .Box_V2, .Cell_V1)
27 and everNearerThan(.UTrip_V1, .UTrip_V2, 3.0) ]
28 projectextend[; QueryLicence: .Licence_V1,
29 OtherLicence: .Licence_V2,
30 DPos: (.UTrip_V1 atperiods
31 deftime((distance(.UTrip_V1,.UTrip_V2) < 3.0)
32 the_mvalue at TRUE )) the_mvalue]
33 filter[not(isempty(deftime(.DPos)))]
34 project[QueryLicence, OtherLicence, DPos] ]]
35 collect[] sortby[QueryLicence, OtherLicence]
36 groupby[QueryLicence, OtherLicence
37 ; Pos: group feed project[DPos]
38 sort transformstream concatS]
39 consume;

The details of this query are explained in Section 4.3.2.

Q11

Which vehicles passed a point from QueryPoints1 at one of the instants from QueryInstants1?

1  let OBACRres011 =
2  QueryPoints feed head[10] project[Pos] {PP}
3  QueryInstants feed head[10] project[Instant] {II}
4  product
5  intstream(1, CLUSTER_SIZE) namedtransformstream[SID]
6  product
7  spread["Query_Points_Instants_top10_dup",''
8  ; SID, CLUSTER_SIZE, FALSE;]
9  hadoopMap[ "Q11_Result", DLF; .
10 loopjoin[ para(Vehicles_Journey_sptmpuni_Moid_dlo)
11 windowintersectsS[
12 box3d(bbox(.Pos_PP), .Instant_II)] sort rdup ]
13 para(Vehicles_Moid_dlo) gettuples
14 projectextend[Licence, Pos_PP, Instant_II
15 ; XPos: val(.Journey atinstant .Instant_II) ]
16 filter[not(isempty(.XPos))]
17 filter[distance(.XPos,.Pos_PP) < 0.5]
18 projectextend[Licence; Pos: .Pos_PP,
19 Instant: .Instant_II] sort rdup]
20 collect[] consume;

In this query, the product of the sample points and instants is created and duplicated on the cluster at run time (lines 2-8). This exhibits the flexibility Parallel SECONDO obtains from the Flow operators.

Afterward, Vehicles_Journey_sptmpuni_Moid_dlo is probed in the Map tasks to find all candidate vehicles (lines 10-13). In the end, a refinement is processed to check whether these vehicles ever came close to these points (distance < 0.5m) at the sample instants.

Q12

Which vehicles met at a point from QueryPoints1 at an instant from QueryInstants1?

1  let OBACRres012allInstants_Dup_dlo =
2  QueryInstants feed head[10]
3  extend[
4  Period: theRange(.Instant, .Instant, TRUE, TRUE)]
5  aggregateB[Period; fun(I1: periods, I2:periods)
6  I1 union I2; [const periods value ()]]
7  feed namedtransformstream[Period]
8  intstream(1, CLUSTER_SIZE) namedtransformstream[SID]
9  product
10 spread["OBACRres012allInstants_Dup"
11 ; SID, CLUSTER_SIZE, FALSE;]
12 hadoopMap[DLO; . extract[Period]];
13
14 let OBACRres012 =
15 QueryPoints_Top10_dlf
16 hadoopMap[DLF, FALSE
17 ; . loopjoin[
18 para(Vehicles_Journey_sptuni_Moid_dlo)
19 windowintersectsS[bbox(.Pos)] sort rdup
20 para(Vehicles_Moid_dlo) gettuples
21 projectextend[Licence; Journey: .Journey
22 atperiods para(OBACRres012allInstants_Dup_dlo)] ]
23 filter[.Journey passes .Pos]
24 projectextendstream[Licence, Pos
25 ; UTrip: units(.Journey)]
26 extend[Box: scalerect(bbox(.UTrip),
27 WORLD_X_SCALE, WORLD_Y_SCALE, WORLD_T_SCALE)]
28 extendstream[ Cell: cellnumber(.Box, WORLD_GRID_3D)]]
29 QueryPoints_Top10_dlf
30 hadoopMap[DLF, FALSE
31 ; . loopjoin[
32 para(Vehicles_Journey_sptuni_Moid_dlo)
33 windowintersectsS[bbox(.Pos)] sort rdup
34 para(Vehicles_Moid_dlo) gettuples
35 projectextend[Licence; Journey: .Journey
36 atperiods para(OBACRres012allInstants_Dup_dlo)] ]
37 filter[.Journey passes .Pos]
38 projectextendstream[Licence, Pos
39 ; UTrip: units(.Journey)]
40 extend[Box: scalerect(bbox(.UTrip),
41 WORLD_X_SCALE, WORLD_Y_SCALE, WORLD_T_SCALE)]
42 extendstream[ Cell: cellnumber(.Box, WORLD_GRID_3D)]]
43 hadoopReduce2[ Cell, Cell, PS_SCALE, DLF
44 ; . {V1} .. {V2}
45 itSpatialJoin[Box_V1, Box_V2, 10, 20]
46 filter[(.Licence_V1 < .Licence_V2)
47 and gridintersects(WORLD_GRID_3D,
48 .Box_V1, .Box_V2, .Cell_V1)]
49 para(QueryInstants_Top10_Dup_dlo) feed
50 symmjoin[val(.UTrip_V1 atinstant ..Instant)
51 = val(.UTrip_V2 atinstant ..Instant)]
52 projectextend[ Pos_V2, Instant
53 ; Licence1: .Licence_V1, Licence2: .Licence_V2]
54 sort rdup ]
55 collect[] consume;


This example also contains two queries. In the first one, all sample instants are encapsulated into one periods object. It is then duplicated on the cluster and built up as a DLO flist object named OBACRres012allInstants_Dup_dlo (lines 1-12).

The second query looks very long, although it is basically composed of three Hadoop operators performing a PBSM procedure. Firstly, all vehicles passing the sample points are found by probing Vehicles_Journey_sptuni_Moid_dlo (lines 17-20 and 31-34). Secondly, these vehicles' partial journeys at the sample instants OBACRres012allInstants_Dup_dlo are calculated with the atperiods operator (lines 21-23 and 35-37). Thirdly, the PBSM method is processed to find all vehicle pairs that ever came close to each other. In the end, the refinement is processed by checking whether two vehicles are at the same position at the same instant (lines 49-51).

Q13

Which vehicles travelled within one of the regions from QueryRegions1 during the periods from QueryPeriods1?

1  let OBACRres013 =
2  QueryRegions feed head[10]
3  filter[not(isempty(.Region))] {RR}
4  QueryPeriods feed head[10]
5  filter[not(isempty(.Period))] {PP}
6  product
7  intstream(1, CLUSTER_SIZE) namedtransformstream[SID]
8  product
9  spread["Query_Region_Period_top10_dup",''
10 ; SID, CLUSTER_SIZE, FALSE;]
11 hadoopMap["Q13_Result", DLF; .
12 loopsel [ fun(t:TUPLE)
13 para(Vehicles_Journey_sptmpuni_Moid_dlo)
14 windowintersectsS[box3d(bbox(attr(t,Region_RR)),
15 attr(t,Period_PP))]
16 sort rdup para(Vehicles_Moid_dlo) gettuples
17 filter[(.Journey atperiods attr(t,Period_PP))
18 passes attr(t,Region_RR) ]
19 projectextend[Licence; Region: attr(t,Region_RR),
20 Period: attr(t,Period_PP),
21 Id_RR: attr(t,Id_RR),
22 Id_PP: attr(t,Id_PP)] ]
23 sortby[Id_RR, Id_PP, Licence]
24 krdup[Id_RR, Id_PP, Licence]
25 project[Region, Period, Licence]]
26 collect[] consume;

This query first creates the product of the sample regions and periods, then duplicates them on the cluster with the spread operator (lines 2-10). Afterward, in the Map tasks, Vehicles_Journey_sptmpuni_Moid_dlo is probed with 3D boxes, each composed of a sample region and a period, in order to find the vehicles that passed these regions during those periods (lines 12-18).

Q14

Which vehicles travelled within one of the regions from QueryRegions1 at one of the instants from QueryInstants1?

1  let OBACRres014 =
2  QueryRegions feed head[10]
3  filter[not(isempty(.Region))] {RR}
4  QueryInstants feed head[10]
5  filter[not(isempty(.Instant))] {II}
6  product
7  intstream(1, CLUSTER_SIZE) namedtransformstream[SID]
8  product
9  spread["Query_Region_Instant_top10_dup", ''
10 ; SID, CLUSTER_SIZE, FALSE;]
11 hadoopMap["Q14_Result", DLF; .
12 loopsel [ fun(t:TUPLE)
13 para(Vehicles_Journey_sptmpuni_Moid_dlo)
14 windowintersectsS[box3d( bbox(attr(t,Region_RR)),
15 attr(t,Instant_II))]
16 sort rdup para(Vehicles_Moid_dlo) gettuples
17 filter[val(.Journey atinstant attr(t,Instant_II))
18 inside attr(t,Region_RR) ]
19 projectextend[Licence; Region: attr(t,Region_RR),
20 Instant: attr(t,Instant_II), Id_RR: attr(t,Id_RR),
21 Id_II: attr(t,Id_II)] ]
22 sortby[Id_RR, Id_II, Licence]
23 krdup[Id_RR, Id_II, Licence]
24 project[Region, Instant, Licence]]
25 collect[] consume;

This query is similar to the 13th example, except that it probes the distributed spatio-temporal R-Tree Vehicles_Journey_sptmpuni_Moid_dlo with the products of the sample regions and instants.


Q15

Which vehicles passed a point from QueryPoints1 during a period from QueryPeriods1?

1  let OBACRres015 =
2  QueryPoints feed head[10]
3  filter[not(isempty(.Pos))] {PO}
4  QueryPeriods feed head[10]
5  filter[not(isempty(.Period))] {PR}
6  product
7  intstream(1, CLUSTER_SIZE) namedtransformstream[SID]
8  product
9  spread["Query_Point_Period_top10_dup",''
10 ; SID, CLUSTER_SIZE, FALSE;]
11 hadoopMap["Q15_Result", DLF; .
12 loopsel [ fun(t:TUPLE)
13 para(Vehicles_Journey_sptmpuni_Moid_dlo)
14 windowintersectsS[
15 box3d(bbox(attr(t,Pos_PO)),attr(t,Period_PR))]
16 sort rdup para(Vehicles_Moid_dlo) gettuples
17 filter[(.Journey atperiods attr(t,Period_PR))
18 passes attr(t,Pos_PO) ]
19 projectextend[Licence; Point: attr(t,Pos_PO),
20 Period: attr(t,Period_PR),
21 Id_PO: attr(t,Id_PO),
22 Id_PR: attr(t,Id_PR)] ]
23 sortby[Id_PO, Id_PR, Licence]
24 krdup[Id_PO, Id_PR, Licence]
25 project[Point, Period, Licence]]
26 collect[] consume;

This query is also similar to the 13th example, except that it probes the distributed spatio-temporal R-Tree Vehicles_Journey_sptmpuni_Moid_dlo with the products of the sample points and periods.

Q16

List the pairs of licences for vehicles, the first from QueryLicences1, the second from QueryLicences2, where the corresponding vehicles are both present within a region from QueryRegions1 during a period from QueryPeriods1, but do not meet each other there and then.

1  let Query_Period_Region_Top10Dup_dlo =
2  QueryPeriods feed head[10] {PP}
3  QueryRegions feed head[10] {RR}
4  product
5  intstream(1, CLUSTER_SIZE) namedtransformstream[SID]
6  product
7  spread["Query_Period_Region_top10_dup"
8  ; SID, CLUSTER_SIZE, FALSE;]
9  hadoopMap[; . consume];
10
11 let OBACRres016 =
12 QueryLicences_Top10_Dup_dlf
13 hadoopMap[DLF, FALSE; . loopsel[
14 para(Vehicles_Licence_btree_Moid_dlo)
15 para(Vehicles_Moid_dlo)
16 exactmatch[.Licence] ]
17 para(Query_Period_Region_Top10Dup_dlo) feed
18 product
19 projectextend[Licence, Region_RR,
20 Period_PP, Id_RR, Id_PP
21 ; Journey: (.Journey atperiods .Period_PP)
22 at .Region_RR]
23 filter[no_components(.Journey) > 0]
24 projectextendstream[Licence, Region_RR,
25 Period_PP, Id_RR, Id_PP
26 ; UTrip: units(.Journey)]
27 extend[Box: scalerect(bbox(.UTrip),
28 WORLD_X_SCALE, WORLD_Y_SCALE,
29 WORLD_T_SCALE)]
30 extendstream[ Layer:
31 cellnumber(.Box, WORLD_LAYERS_3D )]]
32 QueryLicences_2Top10_Dup_dlf
33 hadoopMap[DLF, FALSE; . loopsel[
34 para(Vehicles_Licence_btree_Moid_dlo)
35 para(Vehicles_Moid_dlo)
36 exactmatch[.Licence] ]
37 para(Query_Period_Region_Top10Dup_dlo) feed
38 product
39 projectextend[Licence, Region_RR,
40 Period_PP, Id_RR, Id_PP
41 ; Journey: (.Journey atperiods .Period_PP)
42 at .Region_RR]
43 filter[no_components(.Journey) > 0]
44 projectextendstream[Licence, Region_RR,
45 Period_PP, Id_RR, Id_PP
46 ; UTrip: units(.Journey)]
47 extend[Box: scalerect(bbox(.UTrip),
48 WORLD_X_SCALE, WORLD_Y_SCALE,
49 WORLD_T_SCALE)]
50 extendstream[ Layer:
51 cellnumber(.Box, WORLD_LAYERS_3D) ] ]
52 hadoopReduce2[ Layer, Layer, PS_SCALE, DLF
53 ; . sortby[Layer] {C1} .. sortby[Layer] {C2}
54 parajoin2[Layer_C1, Layer_C2
55 ; . .. symmjoin[ (.Licence_C1 < ..Licence_C2)
56 and (.Id_RR_C1 = ..Id_RR_C2)
57 and (.Id_PP_C1 = ..Id_PP_C2) ]
58 filter[ not(sometimes(
59 distance(.UTrip_C1,.UTrip_C2) < 0.1))] ]
60 projectextend[; Licence1: .Licence_C1,
61 Licence2: .Licence_C2,
62 Region: .Region_RR_C1, Period: .Period_PP_C1,
63 Id_RR: .Id_RR_C1, Id_PP: .Id_PP_C1 ]
64 sortby[Id_RR, Id_PP, Licence1, Licence2]
65 krdup[Id_RR, Id_PP, Licence1, Licence2]
66 project[Region, Period, Licence1, Licence2] ]
67 collect[] sort rdup consume;

In principle, this query also applies the PBSM method for the parallel processing, despite its considerable length. It intends to find the vehicles that do not meet each other, hence PBSM cannot be used directly: the method requires that for each result trajectory pair, at least one unit pair is partitioned into the same cell of the grid. However, in this query the result vehicles may be very far away from each other all the time. Therefore, we use the grid WORLD_LAYERS_3D, which partitions the moving objects' units only on the time axis (lines 24-31 and 44-51). In each layer, a Cartesian product of both input unit sets is created to find the vehicles that have never met (lines 53-59).

Q17

Which points from QueryPoints have been visited by a maximum number of different vehicles?

1  let OBACRres017PosCount_Cell_dlf =
2  QueryPoints_Dup_dlf
3  hadoopMap[DLF, FALSE; . project[Pos] {PP}
4  loopjoin[ fun(t:TUPLE)
5  para(Vehicles_Journey_sptuni_Moid_dlo)
6  windowintersectsS[bbox(attr(t,Pos_PP))]
7  sort rdup
8  para(Vehicles_Moid_dlo) gettuples
9  filter[.Journey passes attr(t,Pos_PP)]
10 project[Licence] ]
11 projectextend[Licence; Pos: .Pos_PP]
12 extend[Box: scalerect(bbox(.Pos),
13 WORLD_X_SCALE, WORLD_Y_SCALE)]
14 extendstream[ Cell:
15 cellnumber( .Box, WORLD_GRID_2D )] ]
16 hadoopReduce[
17 Cell, DLF, "OBACRres017PosCount", PS_SCALE
18 ; . sortby[Pos asc, Licence asc]
19 groupby[Pos; Hits: group feed rdup count]
20 ];
21
22 let OBACRres017PosMaxCount =
23 OBACRres017PosCount_Cell_dlf
24 hadoopMap["OBACRres017PosMaxCount", DLF
25 ; . max[Hits] feed
26 namedtransformstream[DisMaxHits]]
27 collect[] max[DisMaxHits];
28
29 let OBACRres017 =
30 OBACRres017PosCount_Cell_dlf
31 hadoopMap["OBACRres017", DLF
32 ; . filter[.Hits = OBACRres017PosMaxCount]
33 project[Pos, Hits]
34 ]
35 collect[] consume;

This example uses three queries in total. The first calculates the number of vehicles passing by each sample point. In the Map stage, each slave database applies all sample points to its sub-relation (lines 4-11). Next, these points are distributed based on the 2D cell grid WORLD_GRID_2D (lines 12-15) in order to aggregate the partial results (lines 17-18).
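The data flow of this first query can be illustrated with the following Python sketch; it is not SECONDO code, and the cell width GRID_CELL as well as the toy (position, licence) visit records are assumptions. The map side tags every visit with its grid cell, and the reduce side, which handles one cell per task, counts the distinct licences per point, mirroring the sortby and groupby with rdup count above.

    from collections import defaultdict

    GRID_CELL = 1000.0  # hypothetical cell width of the 2D grid

    def cell_of(pos):
        # 2D cell index of a point, in the spirit of cellnumber over WORLD_GRID_2D.
        x, y = pos
        return (int(x // GRID_CELL), int(y // GRID_CELL))

    def map_stage(visits):
        # visits: (pos, licence) pairs produced on one slave's sub-relation.
        for pos, licence in visits:
            yield cell_of(pos), pos, licence

    def reduce_stage(records):
        # One reduce task receives all records of one cell and counts the
        # distinct vehicles per point (rdup: each licence counts only once).
        per_point = defaultdict(set)
        for _cell, pos, licence in records:
            per_point[pos].add(licence)
        return {pos: len(lics) for pos, lics in per_point.items()}

Since every point falls into exactly one cell, grouping first by cell and then by point already yields the correct global counts; no further merging step is needed.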

The second query uses a hadoopMap operator to find the maximum visit number, which is then used in the last query to select the result points.
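Behind these two queries lies the usual two-level maximum pattern: every partition reports its local maximum, the master picks the global one, and a final parallel scan keeps the points that reach it. A minimal Python sketch, again over assumed toy inputs:

    def most_visited(partitions):
        # partitions: one {pos: hits} dictionary per slave, as produced above.
        global_max = max(max(counts.values())
                         for counts in partitions if counts)
        return {pos: hits
                for counts in partitions
                for pos, hits in counts.items()
                if hits == global_max}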


Bibliography

[1] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Proc. VLDB Endowment, 2(1):922–933, 2009.

[2] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop GIS: A High Performance Spatial Data Warehousing System Over MapReduce. Proc. VLDB Endowment, 6(11):1009–1020, 2013.

[3] E. Anderson and J. Tucek. Efficiency Matters! SIGOPS Oper. Syst. Rev., 44:40–45, March 2010.

[4] S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A Comparison of Join Algorithms for Log Processing in MapReduce. In Proc. ACM SIGMOD, 2010.

[5] N. Borenstein and N. Freed. MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies. RFC 1521, 1993.

[6] E. Bouillet and A. Ranganathan. Scalable, Real-time Map-Matching Using IBM's System S. In Proc. MDM, 2010.

[7] R. Chaiken, B. Jenkins, P.A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. Proc. VLDB Endowment, 1(2):1265–1276, 2008.

[8] D. J. DeWitt and M. Stonebraker. MapReduce: A Major Step Backwards. The Database Column, January 2008.

[9] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proc. OSDI, 2004.

[10] J. Dean and S. Ghemawat. MapReduce: A Flexible Data Processing Tool. Communications of the ACM, 53:72–77, January 2010.

[11] D. DeWitt and J. Gray. Parallel Database Systems: The Future of High Performance Database Systems. Communications of the ACM, 35(6):85–98, 1992.

[12] D.J. DeWitt, R.H. Gerber, G. Graefe, M.L. Heytens, K.B. Kumar, and M. Muralikrishna. GAMMA: A High Performance Dataflow Database Machine. Computer Science Dept., University of Wisconsin, 1986.

[13] S. Dieker and R.H. Güting. Plug and Play With Query Algebras: SECONDO, A Generic DBMS Development Environment. In Proc. IDEAS, 2000.

[14] J. Dittrich, J.A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making A Yellow Elephant Run Like A Cheetah (Without It Even Noticing). Proc. VLDB Endowment, 3(1), 2010.

[15] J.P. Dittrich and B. Seeger. Data Redundancy and Duplicate Detection in Spatial Join Processing. In Proc. ICDE, 2000.

[16] C. Düntgen, T. Behr, and R.H. Güting. BerlinMOD: A Benchmark for Moving Object Databases. The VLDB Journal, 18(6):1335–1368, 2009.

[17] A. Eldawy and M. F. Mokbel. A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data. Proc. VLDB Endowment, 6(12), 2013.

[18] M. Erwig, R. H. Güting, M. Schneider, and M. Vazirgiannis. Spatio-Temporal Data Types: An Approach to Modeling and Querying Moving Objects in Databases. GeoInformatica, 3(3):269–296, 1999.

[19] L. Forlizzi, R.H. Güting, E. Nardelli, and M. Schneider. A Data Model and Data Structures for Moving Objects Databases. ACM SIGMOD Record, 29:319–330, 2000.

[20] E. Friedman, P. Pawlowski, and J. Cieslewicz. SQL/MapReduce: A Practical Approach to Self-Describing, Polymorphic, and Parallelizable User-Defined Functions. Proc. VLDB Endowment, 2(2):1402–1413, 2009.

[21] S. Fushimi, M. Kitsuregawa, and H. Tanaka. An Overview Of The System Software Of A Parallel Relational Database Machine GRACE. In Proc. VLDB, pages 209–219, 1986.

[22] G. Graefe. Encapsulation of Parallelism in the Volcano Query Processing System. SIGMOD Rec., 19:102–111, May 1990.

[23] G. Graefe. Volcano: An Extensible and Parallel Dataflow Query Processing System. IEEE Trans. on Knowledge and Data Eng., 6(1), 1994.

[24] G. Graefe and D.L. Davison. Encapsulation of Parallelism and Architecture-Independence in Extensible Database Query Execution. IEEE Trans. on Software Engineering, 19(8):749–764, 1993.

[25] G. Graefe. Parallel Query Execution Algorithms. In Encyclopedia of Database Systems, pages 2030–2035. Springer-Verlag, 2008.

[26] R. H. Güting, M. H. Böhlen, M. Erwig, C. S. Jensen, N. A. Lorentzos, and M. Schneider. A Foundation for Representing and Querying Moving Objects. ACM Trans. on Database Systems, 25:1–42, 2000.

[27] R.H. Güting, T. Behr, and C. Düntgen. SECONDO: A Platform for Moving Objects Database Research and for Publishing and Integrating Research Implementations. IEEE Data Eng. Bull., 33(2):56–63, 2010.

[28] R.H. Güting, V.T. De Almeida, and Z. Ding. Modeling and Querying Moving Objects in Networks. The VLDB Journal, 15(2):165–190, 2006.

[29] M.M. Haklay and P. Weber. OpenStreetMap: User-Generated Street Maps. IEEE Pervasive Computing, pages 12–18, 2008.

[30] H.I. Hsiao and D.J. DeWitt. Chained Declustering: A New Availability Strategy for Multiprocessor Database Machines. In Proc. Data Engineering, pages 456–465. IEEE, 1989.

[31] D. Jiang, B.C. Ooi, L. Shi, and S. Wu. The Performance of MapReduce: An In-depth Study. Proc. VLDB Endowment, 3(1-2), 2010.

[32] V. Kumar, H. Andrade, B. Gedik, and K.L. Wu. DEDUCE: At the Intersection of MapReduce and Stream Processing. In Proc. EDBT, 2010.

[33] J.A. Cotelo Lema, L. Forlizzi, R.H. Güting, E. Nardelli, and M. Schneider. Algorithms for Moving Objects Databases. The Computer Journal, 46(6):680–712, 2003.

[34] B. Li, E. Mazur, Y. Diao, A. McGregor, and P. Shenoy. A Platform for Scalable One-Pass Analytics Using MapReduce. In Proc. ACM SIGMOD, pages 985–996, 2011.

[35] M. Lo and C.V. Ravishankar. Spatial Hash-Joins. In SIGMOD Record, volume 25, pages 247–258. ACM, 1996.

[36] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed Data-Parallel Programs From Sequential Building Blocks. SIGOPS Oper. Syst. Rev., 41:59–72, March 2007.

[37] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In Proc. ACM SIGMOD, 2008.

[38] J. Orenstein. A Comparison of Spatial Query Processing Techniques for Native and Parameter Spaces. SIGMOD Rec., 19(2):343–352, 1990.

[39] O. O'Malley and A.C. Murthy. Winning a 60 Second Dash with a Yellow Elephant. Sort Benchmark, 2009.

[40] E. Pacitti. Parallel Query Processing. In Encyclopedia of Database Systems, pages 2038–2040. Springer-Verlag, 2008.

[41] J. Patel, J.B. Yu, N. Kabra, K. Tufte, B. Nag, J. Burger, N. Hall, K. Ramasamy, R. Lueder, C. Ellmann, J. Kupsch, S. Guo, J. Larson, D. DeWitt, and J. Naughton. Building a Scalable Geo-spatial DBMS: Technology, Implementation, and Evaluation. SIGMOD Rec., 26(2):336–347, June 1997.

[42] J.M. Patel and D.J. DeWitt. Partition Based Spatial-Merge Joins. ACM SIGMOD Record, 25(2):259–270, 1996.

[43] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D.J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In Proc. ACM SIGMOD, 2009.

[44] M. Sakr, G. Andrienko, T. Behr, N. Andrienko, R.H. Güting, and C. Hurter. Exploring Spatiotemporal Patterns by Integrating Visual Analytics with a Moving Objects Database System. In Proc. ACM GIS, pages 505–508, 2011.

[45] J. Schad, J. Dittrich, and J.A. Quiané-Ruiz. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. Proc. VLDB Endowment, 3(1-2):460–471, 2010.

[46] J. Shute, R. Vingralek, B. Samwel, B. Handy, C. Whipkey, E. Rollins, M. Oancea, K. Littlefield, D. Menestrina, S. Ellner, J. Cieslewicz, I. Rae, et al. F1: A Distributed SQL Database That Scales. Proc. VLDB Endowment, 6(11), 2013.

[47] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Proc. MSST, pages 1–10. IEEE, 2010.

[48] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and Parallel DBMSs: Friends or Foes? Communications of the ACM, 53:64–71, January 2010.

[49] M. Stonebraker and J. Hellerstein. What Goes Around Comes Around. Readings in Database Systems, pages 2–41, 2005.

[50] A. Thusoo, J.S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A Warehousing Solution Over a Map-Reduce Framework. Proc. VLDB Endowment, 2(2):1626–1629, 2009.

[51] F. Valdés, M.L. Damiani, and R.H. Güting. Symbolic Trajectories in SECONDO: Pattern Matching and Rewriting. In Proc. DASFAA (2), pages 450–453, 2013.

[52] P. Valduriez. Parallel Database Management. In Encyclopedia of Database Systems, pages 2026–2029. Springer-Verlag, 2008.

[53] E. Walker. Benchmarking Amazon EC2 for High-Performance Scientific Computing. Usenix Login, 33(5):18–23, 2008.

[54] G. Wang, M.V. Salles, B. Sowell, X. Wang, T. Cao, A. Demers, J. Gehrke, and W. White. Behavioral Simulations in MapReduce. Proc. VLDB Endowment, 3(1-2):952–963, 2010.

[55] Y. Xu, P. Kostamaa, and L. Gao. Integrating Hadoop and Parallel DBMS. In Proc. ACM SIGMOD, 2010.

[56] C. Yang, C. Yen, C. Tan, and S.R. Madden. Osprey: Implementing MapReduce-Style Fault Tolerance in a Shared-Nothing Distributed Database. In Proc. ICDE, 2010.

[57] H.C. Yang, A. Dasdan, R. L. Hsiao, and D. S. Parker. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. In Proc. ACM SIGMOD, 2007.

[58] S. Zhang, J. Han, Z. Liu, K. Wang, and Z. Xu. SJMR: Parallelizing Spatial Join With MapReduce on Clusters. In Proc. CLUSTER, 2009.

[59] X. Zhou, D.J. Abel, and D. Truffet. Data Partitioning for Parallel Spatial Join Processing. GeoInformatica, 2(2):175–204, 1998.


Author’s Biography

Personal Information

• Name: Jiamin Lu

• Birthday: Sept. 9, 1983

• Birthplace: NanTong, JiangSu, China

Education

• 09/2001 – 07/2005: B.E., Computer Science, Hohai University, Nanjing, China

• 09/2005 – 07/2008: M.E., Computer Science, Hohai University, Nanjing, China

Academic Experience

• Joined the group "Database Systems for New Applications" at the FernUniversität in Hagen, Germany, in 2009, supported by a grant from the China Scholarship Council (CSC).

• Joined the project Research on Indexing Technology for Moving Objects Based on Spatial Networks, funded by the National Natural Science Foundation of China (Grant No. 60673141).

Publications

• J. Lu and R.H. Güting. Parallel SECONDO: A Practical System for Large-Scale Processing of Moving Objects. ICDE 2014

• J. Lu and R.H. Güting. Parallel SECONDO: Practical and Efficient Mobility Data Processing in the Cloud. BigData Conference 2013: 17-25

• J. Lu and R.H. Güting. Simple and Efficient Coupling of Hadoop with a Database Engine. SoCC 2013: 32

• J. Lu and R.H. Güting. Parallel Secondo: Boosting Database Engines with Hadoop. ICPADS 2012: 738-743

• J. Lu and R.H. Güting. Simple and Efficient Coupling of Hadoop With a Database Engine. FernUniversität in Hagen, Informatik-Report 366 - 10/2012.