SPARKSQL vs RDBMS Database Query Benchmark · RDBMS (Relational Database Management System)....



Università degli Studi di Cagliari

Facoltà di Scienze Matematiche Fisiche e Naturali

Corso di Laurea in Informatica

Laurea Magistrale

SPARKSQL vs RDBMS Database Query Benchmark

Candidate

Carlo Corona (matr. 65009)

Supervisor: Prof. Diego Reforgiato Recupero

Coordinator: Prof. G. Michele Pinna

Academic Year 2016/2017


Abstract

In the near future everything will be connected to the network: people, things,

machines and operating processes will daily contribute to a permanent channel

between the real world and the virtual dimensions enabled by the Internet.

The amount of data generated by these connections will be enormous.

Big Data, their analysis and exploitation will enable the birth of a new society and a new economy based on the value of digital data: the Data-Driven Society.

The term “Big Data” tends to refer to the use of predictive analytics, user

behavior analytics, or certain other advanced data analytics methods that

extract value from data, and seldom to a particular size of data set.

Analysis of datasets can find new correlations to business trends, prevent

diseases, combat crime and so on.

Scientists, business executives, practitioners of medicine, advertising and

governments regularly meet difficulties with large datasets in areas including

Internet search, fintech, urban informatics, and business informatics.

Scientists encounter limitations in e-Science work, including meteorology,

genomics, connectomics, complex physics simulations, biology and environmental research.

Relational database management systems (RDBMS) and desktop statistics and visualization packages often have difficulty handling big data.

The work may require massively parallel software running on tens, hun-

dreds, or even thousands of servers.


Contents

1 Introduction 3

1.1 Argument of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Context of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Purpose of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Big Data 4

3 RDBMS Database 6

3.1 Oracle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.2 MySQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.3 PostgreSQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

4 Map Reduce 10

5 Apache Spark 12

5.1 Spark Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

5.2 Spark SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

5.3 Spark Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

5.4 MLlib Machine Learning Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

5.5 GraphX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

5.6 Cluster Managers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

5.7 Spark Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5.7.1 The Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5.7.2 The Executor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

5.7.3 The Cluster Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

6 Test Environment 20

6.1 Hardware Requirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

6.2 Software Requirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

6.3 Query List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

6.3.1 OnTime1/OnTime2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

6.3.2 Unica1/Unica2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

6.4 Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

7 Conclusions 30

Appendices 31

Bibliography 32


Chapter 1

Introduction

1.1 Argument of the thesis

Some queries have been selected and executed on two datasets that contain data from United States air traffic and from the University of Cagliari's database.

1.2 Context of the thesis

The optimization and processing speed of large amounts of data will attract the attention of several sectors with totally different purposes: study, research, commerce, security and so on.

This has stimulated the development of new, highly sophisticated methodologies, algorithms and data processing tools.

1.3 Purpose of thesis

The purpose of this thesis was to compare the speed of data extraction between an Apache Spark engine and three relational databases (RDBMS).


Chapter 2

Big Data


The term Big Data indicates any data collection so large that it is difficult or impossible to store it in a traditional database system such as an RDBMS (Relational Database Management System). The term does not refer to a particular quantity, although it is commonly used for quantities at least as large as a terabyte, i.e., when data can no longer be stored or processed by a single machine.

Big Data has many features that differentiate it from traditional data collections. The most important is Volume, that is, the amount of data that must be stored.

Another Big Data feature is Variety: data can come from different sources and in different forms; for example, it can be structured, semi-structured or unstructured. Think about the text of a tweet, pictures or data from sensors: they obviously correspond to different types of data, which means that their integration requires special effort.


Unstructured data cannot be stored in an RDBMS; it is stored in NoSQL databases instead, because they are better suited to managing data variability.

RDBMSs require the database structure to be fixed before use, so that it remains unchanged.

An increasing percentage of the population has Internet access and a smartphone, and there is an explosion of sensors due to the emerging Internet of Things. For this reason a great amount of data must be stored quickly.

The third feature of Big Data is Speed, which indicates how quickly new data can be made available. Technologies to control this aspect of Big Data are called streaming data and complex event processing; they analyze data as it arrives and answer questions like: “How many times was the word ‘apple’ searched yesterday?”

The fourth feature, Variability, refers to data inconsistency, which obstructs the manipulation process and effective data management.

Complexity, the fifth and last feature of Big Data, indicates that data coming from different sources needs to be linked together to obtain useful information.

The need for high scalability and the necessity to store unstructured data make traditional DBMS databases unsuitable for storing Big Data. For this reason, new systems now allow non-relational data types to be stored, offering horizontal scalability and, consequently, improved performance. This is in contrast to assigning more resources to a single machine (vertical scaling) to improve its overall performance.


Chapter 3

RDBMS Database

A Relational Database Management System (RDBMS) is a database man-

agement system (DBMS) that is based on the relational model invented by

Edgar F. Codd, of IBM’s San Jose Research Laboratory. In 2017, many of the

databases in widespread use are based on the relational database model.

RDBMSs have been a common choice for the storage of information in new

databases used for financial records, manufacturing and logistical information,

personnel data, and other applications since the 1980s. Relational databases

have often replaced legacy hierarchical databases and network databases be-

cause they are easier to understand and use.

A database is a collection of structured, logically related data. A database consists of one or more tables, and each table is composed of records and fields. Each table must contain a field that identifies each record uniquely; this field is defined as the primary key. When designing a database, you start from the definition of the tables that are part of it. For each table you define the fields that represent the table structure. Then you set the relationships between tables, which allow you to normalize the schema (breaking the “fat” table, containing all the information, into leaner tables), avoiding redundancy, achieving an adequate degree of efficiency, and providing a check on errors (insert, delete and update anomalies) by enforcing referential integrity.

Some examples of RDBMS databases are Oracle, MySQL and PostgreSQL.
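The design steps above (a primary key per table, lean tables linked by relationships, referential integrity catching insert anomalies) can be sketched with Python's built-in sqlite3 module; the table and column names here are invented for illustration:

```python
import sqlite3

# In-memory database; the schema splits a "fat" enrollment table
# into two lean ones linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity

conn.execute("""CREATE TABLE student (
    id   INTEGER PRIMARY KEY,   -- uniquely identifies each record
    name TEXT NOT NULL)""")
conn.execute("""CREATE TABLE exam (
    id         INTEGER PRIMARY KEY,
    student_id INTEGER NOT NULL REFERENCES student(id),  -- the relationship
    course     TEXT NOT NULL,
    grade      INTEGER)""")

conn.execute("INSERT INTO student VALUES (1, 'Alice')")
conn.execute("INSERT INTO exam VALUES (1, 1, 'Databases', 28)")

# The foreign key rejects an insert anomaly: an exam for a nonexistent student.
try:
    conn.execute("INSERT INTO exam VALUES (2, 99, 'Algorithms', 30)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# A join recomposes the original "fat" view without storing it redundantly.
row = conn.execute("""SELECT s.name, e.course, e.grade
                      FROM exam e JOIN student s ON s.id = e.student_id""").fetchone()
```

The join shows why normalization costs nothing in expressiveness: the wide view is recomputed on demand instead of being stored with redundancy.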


Figure 3.1: RDBMS Database Architecture

3.1 Oracle

Oracle Database is one of the most popular database management systems

(DBMS).

Oracle Corporation, one of the largest companies in the world, was founded

in 1977 by Lawrence J. Ellison (current chief executive officer, Chief Technology

Officer and major shareholder), Bob Miner and Ed Oates, headquartered in

California.

The first publicly available version of Oracle Database dates

back to 1979, and since then, numerous changes and improvements have been

introduced to follow technology developments, up to version 12c R2.

Figure 3.2: Oracle Database 12c Architecture


3.2 MySQL

MySQL is an open-source relational database management system (RDBMS).

Its name is a combination of ”My”, the name of co-founder Michael Wide-

nius’s daughter, and ”SQL”, the abbreviation for Structured Query Language.

The MySQL development project has made its source code available under the

terms of the GNU General Public License, as well as under a variety of pro-

prietary agreements. MySQL was owned and sponsored by a single for-profit

firm, the Swedish company MySQL AB, now owned by Oracle Corporation.

For proprietary use, several paid editions are available, and offer additional

functionality.

Figure 3.3: Mysql Architecture

3.3 PostgreSQL

PostgreSQL, often simply Postgres, is an object-relational database manage-

ment system (ORDBMS) with an emphasis on extensibility and standards

compliance. As a database server, its primary functions are to store data

securely and return that data in response to requests from other software

applications. It can handle workloads ranging from small single-machine ap-

plications to large Internet-facing applications (or for data warehousing) with

many concurrent users.

PostgreSQL is an ACID-compliant, transactional database; it has updatable views and materialized views, triggers and foreign keys; it supports functions and stored procedures, and offers other extensibility features, like Oracle Database.


PostgreSQL is developed by the PostgreSQL Global Development Group,

a diverse group of many companies and individual contributors. It is free and

open-source, released under the terms of the PostgreSQL License, a permissive

software license.

Figure 3.4: PostgreSQL Architecture


Chapter 4

Map Reduce

Figure 4.1: Map Reduce

MapReduce is a programming model to process large datasets on parallel computing systems. A MapReduce job is defined by:

- input data

- a Map procedure that generates, for each input element, a number of key/value pairs

- a network shuffle phase

- a Reduce procedure, which receives the input elements with the same key and generates summary information from those elements

- output data

MapReduce ensures that all items with the same key will be processed by the same reducer, since the mappers all use the same hash function to decide which reducer each key/value pair is sent to. This programming paradigm is very complicated to use directly, given the number of jobs needed to perform complex data operations. Tools like Pig and Hive have been created to offer a high-level language (Pig Latin and HiveQL) and transform their queries into a set


of MapReduce jobs that are run in succession.
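The job structure listed above can be sketched in plain Python; the point of the shuffle is that one shared hash function routes every pair with a given key to the same reducer. This is a toy single-process sketch, not a distributed implementation:

```python
from collections import defaultdict

def map_phase(records):
    # Map: each input line yields (key, value) pairs.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs, n_reducers):
    # Shuffle: the same hash function routes a given key to the same reducer.
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in pairs:
        buckets[hash(key) % n_reducers][key].append(value)
    return buckets

def reduce_phase(buckets):
    # Reduce: each reducer summarizes all values that share a key.
    out = {}
    for bucket in buckets:
        for key, values in bucket.items():
            out[key] = sum(values)
    return out

data = ["big data big", "data"]
counts = reduce_phase(shuffle(map_phase(data), n_reducers=2))
# counts == {'big': 2, 'data': 2}
```

Whichever reducer a key lands on, all of its values land there together, so each per-key sum is complete.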


Chapter 5

Apache Spark

Figure 5.1: Web Console

Apache Spark is a cluster computing platform designed to be fast and

general-purpose. Spark provides an interface to program entire clusters through

implicit data parallelism and fault-tolerance.

Speed is important in processing large datasets, as it means the difference

between exploring data interactively and waiting minutes or hours. One of

the main features Spark offers for speed is the ability to run computations in

memory, but the system is also more efficient than MapReduce for complex

applications running on disk.

On the generality side, Spark is designed to cover a wide range of workloads

that previously required separate distributed systems, including batch applica-

tions, iterative algorithms, interactive queries, and streaming. By supporting

these workloads in the same engine, Spark makes it easy and inexpensive to

combine different processing types, which is often necessary in production data


analysis pipelines.

Apache Spark provides programmers with an application programming in-

terface centered on a data structure called the Resilient Distributed Dataset

(RDD), a read-only multiset of data items distributed over a cluster of ma-

chines, that is maintained in a fault-tolerant way. It was developed in response

to limitations in the MapReduce cluster computing paradigm, which forces a

particular linear dataflow structure on distributed programs: MapReduce pro-

grams read input data from disk, map a function across the data, reduce the

results of the map, and store reduction results on disk. Spark’s RDDs function

as a working set for distributed programs that offers a (deliberately) restricted

form of distributed shared memory.

The availability of RDDs facilitates the implementation of both iterative

algorithms, that visit their dataset multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data.

The latency of such applications (compared to a MapReduce implementation,

as was common in Apache Hadoop stacks) may be reduced by several orders

of magnitude.

Spark is designed to be highly accessible, offering simple APIs in Python,

Java, Scala, and SQL, and rich built-in libraries. It also integrates closely with

other Big Data tools.

Apache Spark requires a cluster manager and a distributed storage system.

For cluster management, Spark supports standalone (native Spark cluster),

Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface

with a wide variety, including Hadoop Distributed File System (HDFS), MapR

File System (MapR-FS), Cassandra, OpenStack Swift, Amazon S3, Kudu,

or a custom solution can be implemented. Spark also supports a pseudo-

distributed local mode, usually used only for development or testing purposes,

where distributed storage is not required and the local file system can be used

instead; in such a scenario, Spark is run on a single machine with one executor

per CPU core.

5.1 Spark Core

Spark Core contains the basic functionality of Spark, including components for

task scheduling, memory management, fault recovery, interacting with storage

systems, and more.

Figure 5.2: Spark Stack

Spark Core is also home to the API that defines resilient

distributed datasets (RDDs), which are Spark’s main programming abstraction.

RDDs represent a collection of items distributed across many compute nodes

that can be manipulated in parallel. Spark Core provides many APIs for

building and manipulating these collections.

Spark Core provides distributed task dispatching, scheduling, and basic

I/O functionalities, exposed through an application programming interface

(for Java, Python, Scala, and R) centered on the RDD abstraction, but is

also usable for some other non-JVM languages. This interface mirrors a

functional/higher-order model of programming: a driver program invokes par-

allel operations such as map, filter or reduce on an RDD by passing a function

to Spark, which then schedules the function’s execution in parallel on the clus-

ter. These operations, and additional ones such as joins, take RDDs as input

and produce new RDDs. RDDs are immutable and their operations are lazy;

fault-tolerance is achieved by keeping track of the “lineage” of each RDD (the

sequence of operations that produced it) so that it can be reconstructed in

the case of data loss. RDDs can contain any type of Python, Java, or Scala

objects.

Aside from the RDD-oriented functional style of programming, Spark pro-

vides two restricted forms of shared variables: broadcast variables reference

read-only data that needs to be available on all nodes, while accumulators can

be used to program reductions in an imperative style.
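A rough single-process analogy of the two shared-variable forms (Spark's broadcast and accumulator objects are cluster-aware; this sketch only mirrors their read-only versus add-only roles, with invented data):

```python
# "Broadcast": read-only lookup data every task needs.
# In Spark this is shipped once to every node instead of with every task.
country_of = {"CA": "Canada", "IT": "Italy"}

# "Accumulator": tasks may only add to it; the driver reads the total.
bad_records = 0

def process(record):
    global bad_records
    code = record.get("country")
    if code not in country_of:          # consult the broadcast value
        bad_records += 1                # imperative reduction via the accumulator
        return None
    return country_of[code]

rows = [{"country": "IT"}, {"country": "XX"}, {"country": "CA"}]
results = [r for r in map(process, rows) if r is not None]
# results == ["Italy", "Canada"]; bad_records == 1
```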

A typical example of RDD-centric functional programming is the following

Scala program that computes the frequencies of all words occurring in a set of

text files and prints the most common ones. Each map, flatMap (a variant of

map) and reduceByKey takes an anonymous function that performs a simple

operation on a single data item (or a pair of items), and applies its argument


to transform an RDD into a new RDD.
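The Scala listing itself is not reproduced here; the same flatMap → map → reduceByKey pipeline can be sketched in plain Python over ordinary lists (illustrative only; Spark chains these operations lazily over distributed RDDs, and the input lines are invented):

```python
from itertools import groupby

lines = ["spark makes big data simple", "big data big value"]

# flatMap: one line yields many words
words = [w for line in lines for w in line.split()]
# map: word -> (word, 1)
pairs = [(w, 1) for w in words]
# reduceByKey: sum the 1s for each distinct word
pairs.sort(key=lambda kv: kv[0])
freq = {k: sum(v for _, v in grp) for k, grp in groupby(pairs, key=lambda kv: kv[0])}

# the most common words first
top = sorted(freq.items(), key=lambda kv: -kv[1])
```

In Spark the sort-and-group step is replaced by the shuffle, and each stage runs in parallel across the cluster instead of over a local list.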

5.2 Spark SQL

Spark SQL is a component on top of Spark Core that introduced a data ab-

straction called DataFrames, which provides support for structured and semi-

structured data. Spark SQL provides a domain-specific language (DSL) to

manipulate DataFrames in Scala, Java, or Python. It also provides SQL lan-

guage support, with command-line interfaces and ODBC/JDBC server.

Spark SQL is Spark’s package for working with structured data. It allows querying data via SQL as well as via the Apache Hive variant of SQL, called the Hive Query Language (HQL), and it supports many sources of data, including

Hive tables, Parquet, and JSON. Beyond providing a SQL interface to Spark,

Spark SQL allows developers to intermix SQL queries with the programmatic

data manipulations supported by RDDs in Python, Java, and Scala, all within

a single application, thus combining SQL with complex analytics. This tight

integration with the rich computing environment provided by Spark makes

Spark SQL unlike any other open source data warehouse tool.
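The intermixing of declarative SQL with programmatic manipulation can be illustrated by analogy with Python's built-in sqlite3 standing in for a SparkSession (the table and data are invented; Spark SQL would run the same kind of query over distributed DataFrames):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (airline TEXT, delay INTEGER)")
conn.executemany("INSERT INTO flights VALUES (?, ?)",
                 [("AA", 12), ("AA", 0), ("DL", 45), ("DL", 3)])

# Declarative step: a SQL aggregation...
rows = conn.execute("""SELECT airline, AVG(delay) AS avg_delay
                       FROM flights GROUP BY airline""").fetchall()

# ...intermixed with a programmatic step in the host language.
worst = max(rows, key=lambda r: r[1])
# worst == ('DL', 24.0)
```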

5.3 Spark Streaming

Spark Streaming is a Spark component that enables processing of live streams

of data. Examples of data streams include logfiles generated by production

web servers, or queues of messages containing status updates posted by users

of a web service.

Spark Streaming provides an API for manipulating data streams that closely matches Spark Core’s RDD API, making it easy for programmers to learn the project and move between applications that manipulate data stored in memory, on disk, or arriving in real time. Underneath its API, Spark Streaming was designed to provide the same degree of fault tolerance, throughput,

and scalability as Spark Core.

Spark Streaming leverages Spark Core’s fast scheduling capability to per-

form streaming analytics. It ingests data in mini-batches and performs RDD

transformations on those mini-batches of data. This design enables the same

set of application code written for batch analytics to be used in streaming ana-

lytics, thus facilitating easy implementation of lambda architecture. However,


this convenience comes with the penalty of latency equal to the mini-batch

duration. Other streaming data engines that process event by event rather

than in mini-batches include Storm and the streaming component of Flink.

Spark Streaming has built-in support for consuming from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.
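The mini-batch design can be sketched in plain Python: the very same per-batch function a batch job would use is applied to each small window of arriving events. This illustrates the design only, not the DStream API; the batch size stands in for the mini-batch duration:

```python
from collections import Counter
from itertools import islice

def mini_batches(stream, batch_size):
    # Slice the (conceptually endless) stream into fixed-size mini-batches.
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def process_batch(events):
    # The same code a batch job would run, applied per mini-batch.
    return Counter(e["word"] for e in events)

stream = [{"word": w} for w in ["apple", "pear", "apple", "apple", "fig", "pear"]]
per_batch = [process_batch(b) for b in mini_batches(stream, batch_size=3)]
total = sum(per_batch, Counter())  # merging batch results gives the batch answer
```

The latency penalty is visible here: an event is not counted until its whole mini-batch has been collected.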

5.4 MLlib Machine Learning Library

Spark comes with a library containing common machine learning (ML) func-

tionality, called MLlib. MLlib provides multiple types of machine learning

algorithms, including classification, regression, clustering, and collaborative fil-

tering, as well as supporting functionality such as model evaluation and data

import. It also provides some lower-level ML primitives, including a generic

gradient descent optimization algorithm. All of these methods are designed to

scale out across a cluster.
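As an illustration of the generic gradient descent primitive mentioned above, a minimal single-machine version (not MLlib) fitting a one-parameter least-squares model y ≈ w·x, with invented data:

```python
def gradient_descent(xs, ys, lr=0.01, steps=500):
    # Minimize the mean squared error of y ≈ w * x over one parameter w.
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of (1/n) * sum((w*x - y)^2) with respect to w.
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= lr * grad
    return w

# Data generated from y = 3x; the fitted weight should approach 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]
w = gradient_descent(xs, ys)
```

MLlib's version computes the same gradient, but as a sum distributed across the partitions of an RDD.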

5.5 GraphX

GraphX is a library for manipulating graphs and performing graph-parallel

computations. Like Spark Streaming and Spark SQL, GraphX extends the

Spark RDD API, allowing us to create a directed graph with arbitrary proper-

ties attached to each vertex and edge. GraphX also provides various operators

for manipulating graphs (e.g., subgraph and mapVertices) and a library of

common graph algorithms (e.g., PageRank and triangle counting).
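One of the algorithms named above, PageRank, in a minimal pure-Python form (GraphX distributes this computation over an RDD-backed graph; here the graph is a plain dictionary with invented nodes):

```python
def pagerank(graph, damping=0.85, iterations=50):
    # graph: node -> list of nodes it links to (every node has out-links here)
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outlinks in graph.items():
            share = rank[n] / len(outlinks)   # each node splits its rank
            for m in outlinks:
                new_rank[m] += damping * share
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(graph)
# "c" receives links from both "a" and "b", so it ranks highest
```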

5.6 Cluster Managers

Under the hood, Spark is designed to efficiently scale up from one to many

thousands of compute nodes. To achieve this while maximizing flexibility,

Spark can run over a variety of cluster managers, including Hadoop YARN,

Apache Mesos, and a simple cluster manager included in Spark itself called

the Standalone Scheduler. If you are just installing Spark on an empty set of

machines, the Standalone Scheduler provides an easy way to get started; if you

already have a Hadoop YARN or Mesos cluster, however, Spark’s support for

these cluster managers allows your applications to also run on them.


5.7 Spark Architecture

In general there are a number of running processes for each Spark application (one driver and many executors).

The driver is the manager of a Spark program, deciding the tasks to be performed by the executor processes running in the cluster. The driver itself, on the other hand, may run on the client machine.

In the main program of a Spark application (the driver) there is an object

called SparkContext, whose instance communicates with the cluster resource

manager to request a set of resources (RAM, cores, etc.) for the executors.

Several cluster managers are supported including YARN, Mesos, EC2 and

Spark’s Standalone Cluster Manager. A master/slave architecture is used,

where there is a coordinator process (the driver) and many worker processes (executors).

Since each executor runs in a separate process, different applications cannot share data unless they first write it to disk. If you work on a single node you

only have one process that contains both the driver and an executor, but this

is a special case. Working in a single node allows you to test applications, as

you use the same API that you would use if you were working in a cluster. A

Spark application consists of jobs, one for each action. Each job consists of a

set of stages that depend one on the other, performed in sequence and each of

which is executed by a multitude of tasks, carried out in parallel by the executors.

Figure 5.3: Spark Architecture

5.7.1 The Driver

The driver is the main process; it contains the main method and the user code. The user code applies transformation operations and actions on RDDs (distributed datasets); these are run in parallel by the executor processes deployed

in the cluster. The driver can be run both within the cluster and on the client

machine that is running the Spark application. It performs the following two

functions:

- convert the user program into a set of tasks, the smallest working unit in Spark. Every Spark program is structured in this way: you read data from disk into one or more RDDs, transform them and recover the computation result. Transformation operations are performed only when a result is requested. In fact, Spark stores a directed acyclic graph (DAG) of the operations needed to obtain the contents of an RDD. Processing or save/retrieval operations are transformed into a series of stages performed sequentially, each of which is composed of a set of tasks that are carried out by the executors.

- schedule tasks on the executor nodes. Scheduling is based on where the data files are stored, to avoid as much as possible transferring them over the network. If a node fails, the platform automatically reschedules the work on another node, and only the lost data is recalculated.

5.7.2 The Executor

Executors are the processes that perform the tasks assigned by the driver.

Each application has its own executors (i.e., its own processes), each of which can run multiple threads.

Each executor has a certain (configurable) amount of memory assigned, which allows it to store data in memory when requested by the user application (via the cache operation on an RDD).

Executors of different Spark applications do not communicate with each other, so different applications cannot share data unless it is first written to disk.

Executors live for the duration of an application; if an executor fails, Spark can continue to run the program by recalculating only the lost data.

The driver and the executor nodes should be on the same network, since the driver continually communicates with them.

5.7.3 The Cluster Manager

Cluster managers handle resources within a cluster.


For example, when multiple applications require cluster resources, the cluster manager schedules them on nodes based on the free memory and CPU cores available.

Some cluster managers also allow you to give different priorities to different

applications.

Spark supports the following cluster managers:

- YARN: Hadoop’s new resource manager

- Mesos

- Standalone cluster manager

In addition, Spark provides a script to run on an Amazon EC2 cluster.
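A toy first-fit placement loop in the spirit of this scheduling (real managers such as YARN are far more elaborate; the node names and resource figures are invented):

```python
# Free resources per node: [cores, memory in GB]
nodes = {"node1": [8, 12], "node2": [4, 12]}

# Executor requests: (app, cores, memory in GB), highest priority first
requests = [("appA", 4, 8), ("appB", 4, 8), ("appC", 2, 2)]

placement = {}
for app, cores, mem in requests:
    for node, free in nodes.items():
        if free[0] >= cores and free[1] >= mem:   # enough free cores and RAM?
            free[0] -= cores
            free[1] -= mem
            placement[app] = node
            break
```

Here appB does not fit on node1 (not enough free memory after appA) and lands on node2, while the small appC slips back onto node1.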


Chapter 6

Test Environment

The test environment consists of two separate components: the database server (Oracle, MySQL, PostgreSQL) installed on one virtual machine, and the Spark Standalone Cluster installed on two other servers.

The test measures the performance of selected SQL queries that are executed locally on the three databases, and the times are compared with the same SparkSQL queries executed remotely via the Spark Cluster.

To execute the SparkSQL queries, the tables involved were mapped to Spark’s DataFrame data abstraction.

To interface the Spark Cluster with the remote databases, it was necessary to use the appropriate JDBC driver for each database.

In this configuration network latency is minimized because all VMs reside on the same physical server and on the same network (no IP routing is performed).

Figure 6.1: Server Interconnection


6.1 Hardware Requirement

The Spark Cluster consists of two VMware virtual machines named Spark-Master and Spark-Slave.

The database server, named SparkDB, and the Spark Cluster have been installed on three VMware virtual machines, each with 64-bit CentOS 7, 8 cores and 12 GB of RAM.

All virtual machines reside on physical server SUN X4550 with 16 Core

and 36GB RAM.

The virtualization software VmWare Esx 5.1 is installed in that server.

6.2 Software Requirements

Three different database servers are installed on the SparkDB virtual machine:

- MySQL Database server (MariaDB) version 5.5.52
- PostgreSQL Database version 9.2.18
- Oracle Database 12c R2

On the Spark Cluster, Apache Spark version 2.2.0 with Hadoop v2.7 is installed and configured in cluster mode. It is necessary to download the appropriate JDBC connectors to connect it to all the database servers.
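Each RDBMS has its own driver class and JDBC URL syntax. As a rough sketch (the driver class names and default ports are the commonly documented ones; the helper function jdbc_url is illustrative, not part of the test scripts):

```python
# Commonly documented JDBC driver classes and URL templates for the three
# databases in the test. Host, port, and schema names are illustrative.
JDBC = {
    "mysql":      {"driver": "com.mysql.jdbc.Driver",
                   "url": "jdbc:mysql://{host}:3306/{db}"},
    "postgresql": {"driver": "org.postgresql.Driver",
                   "url": "jdbc:postgresql://{host}:5432/{db}"},
    "oracle":     {"driver": "oracle.jdbc.OracleDriver",
                   "url": "jdbc:oracle:thin:@//{host}:1521/{service}"},
}

def jdbc_url(dbms, **parts):
    """Build the url option for sqlContext.read.format("jdbc").options(url=...)."""
    return JDBC[dbms]["url"].format(**parts)
```

For example, jdbc_url("mysql", host="spark-db", db="ESSE3") produces the base of the URL used in the MySQL scripts, to which credentials can be appended as ?user=...&password=... query parameters.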

6.3 Query List

Two types of queries have been chosen: one on a single table with millions of records and hundreds of fields, and another built on dozens of tables joined with each other but containing only thousands of rows.

For compatibility reasons, some SQL queries had to be adapted to the ANSI standard SQL format and were subsequently executed in all the environments (RDBMS databases, SparkSQL) considered in the test.

The main reason for choosing these queries is that they are difficult to

optimize in RDBMS databases.

Partitioned tables were also used in the test queries to help reduce contention at the RDBMS level.

At the same time, the partitionColumn option used in the SparkSQL queries does not require the RDBMS table to be partitioned.
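This is because the partitioning happens on the Spark side: the [lowerBound, upperBound] range of partitionColumn is split into numPartitions intervals, and each parallel JDBC read issues its own query with a range predicate in the WHERE clause. A simplified sketch of that splitting logic (the real Spark implementation also handles NULL values and non-integer strides):

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Roughly how Spark's JDBC source turns partitionColumn/lowerBound/
    upperBound/numPartitions into one WHERE predicate per parallel read.
    Simplified sketch assuming an integer stride; no RDBMS-side
    partitioning is needed."""
    if num_partitions <= 1:
        return [""]  # a single partition reads the whole table, no predicate
    stride = (upper - lower) // num_partitions
    preds, bound = [], lower
    for i in range(num_partitions):
        if i == 0:
            # first partition also catches everything below lowerBound
            preds.append(f"{column} < {bound + stride}")
        elif i == num_partitions - 1:
            # last partition also catches everything above upperBound
            preds.append(f"{column} >= {bound}")
        else:
            preds.append(f"{column} >= {bound} AND {column} < {bound + stride}")
        bound += stride
    return preds
```

For instance, splitting the ONTIME year range 2007-2017 into 5 partitions yields the predicates "year < 2009", "year >= 2009 AND year < 2011", ..., "year >= 2015", each executed as a separate concurrent query against the RDBMS.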


The example queries considered are:

Query Name   Description                                  Tables Used   Total Records Processed
OnTime1      Total delayed flights per airline            1             65.971.419
OnTime2      Total flights per day of week                1             65.971.419
Unica1       Number of student evaluations of study       30            21.106.599
             programs for years different from 2016
Unica2       Number of student evaluations of study       30            21.106.599
             programs for years different from 2016
             (grouped)

6.3.1 OnTime1/OnTime2

The table used in this query is:

Table Alias   Table Real Name   Description                    Num Rows
ONTIME        ONTIME            Airlines On-Time Performance   65.971.419

This table has millions of records and hundreds of fields.

Datasets Population

To populate all the databases, follow the instructions in the Readme.txt file at the link:
https://github.com/ccorona70/tesimagistrale/blob/master/QUERY/ONTIME/DataSet%20Population/

Database Query SQL Scripts

OnTime1 (Total delayed flights per airline):

select min(year), max(year) as max_year, Carrier, count(*) as cnt,
       sum(case when ArrDelayMinutes > 30 then 1 else 0 end) as flights_delayed,
       round(sum(case when ArrDelayMinutes > 30 then 1 else 0 end) / count(*), 2) as rate
FROM ontime
WHERE DayOfWeek not in (6, 7)
  and OriginState not in ('AK', 'HI', 'PR', 'VI')
  and DestState not in ('AK', 'HI', 'PR', 'VI')
GROUP by Carrier
HAVING count(*) > 100000 and max(year) > 2010
ORDER by rate DESC, count(*) desc;

OnTime2 (Total flights per day of week):

select dayofweek, count(*) from ontime group by dayofweek;


All the SQL database queries can be found at the link:
https://github.com/ccorona70/tesimagistrale/tree/master/QUERY/ONTIME/DMLScripts

Python SparkSQL Script Example

In Spark, a Python script has been created for each query.

Each script uses the SparkSQL syntax, which relies on the DataFrame concept to manipulate table data.

The queries, in ANSI format, are executed in all relational databases. OnTime1 example Python script for the Oracle database:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
import time

conf = SparkConf().setMaster("spark://spark-master:7077") \
                  .setAppName("ontime1oracle") \
                  .set("spark.executor.memory", "8G") \
                  .set("spark.driver.memory", "4G")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

tab1 = sqlContext.read.format("jdbc").options(
    url="jdbc:oracle:thin:esse3/esse3@//spark-db:1521/orcl.unica.it",
    dbtable="ontime", fetchSize=10000, partitionColumn="year",
    lowerBound=2007, upperBound=2017, numPartitions=11).load()
tab1.registerTempTable("ontime")

q1 = sqlContext.sql(
    "select min(year), max(year) as max_year, Carrier, count(*) as cnt, "
    "sum(if(ArrDelayMinutes > 30, 1, 0)) as flights_delayed, "
    "round(sum(if(ArrDelayMinutes > 30, 1, 0)) / count(*), 2) as rate "
    "FROM ontime "
    "WHERE DayOfWeek not in (6, 7) "
    "and OriginState not in ('AK', 'HI', 'PR', 'VI') "
    "and DestState not in ('AK', 'HI', 'PR', 'VI') "
    "GROUP by Carrier HAVING cnt > 100000 and max_year > 2010 "
    "ORDER by rate DESC, cnt desc LIMIT 10")

start = time.time()
q1.show()
print(time.time() - start)

All the Python scripts are available at the following link:
https://github.com/ccorona70/tesimagistrale/tree/master/QUERY/ONTIME/SparkScripts

6.3.2 Unica1/Unica2

The SQL query code for all databases is available at the link:
https://github.com/ccorona70/tesimagistrale/tree/master/QUERY/UNICA/DMLScripts


The tables used in this query are:

Table Alias                     Table Real Name                 Description                     N. Rows
P02 QUESITI                     P02 QUESITI                     Questions                       8.830
ELEMENTO                        P02 QUESITI ELEMENTI            Element                         1.379
ELEMENTO QUESITI PADRE          ELEMENTI                        Element                         1.379
P02 RISPOSTE                    P02 RISPOSTE                    Answers                         9.875.135
P02 QUEST COMP RISPOSTE         P02 QUEST COMP                  Answers                         724.956
V02 RISPOSTE ROW TESTO LIBERO   V02 RISPOSTE ROW TESTO LIBERO   Free text answers               9.970.651
P02 TIPI FORMATO                P02 TIPI FORMATO                Format Types                    10
Q35 DATI COMP                   Q35 DATI COMP                                                   412.536
Q35 FAC COMP                    P06 FAC                         Faculties                       81
Q35 CDS COMP                    P06 CDS                         Course of study                 892
Q35 DOCENTE AD VAL              DOCENTI                         Teachers                        11.595
Q35 DOCENTE TIT AD VAL          DOCENTI                         Teachers                        11.595
Q35 CDS AD VAL                  P06 CDS                         Course of study                 892
Q35 FAC AD VAL                  P06 FAC                         Faculties                       81
Q35 P09 AD GEN                  P09 AD GEN                      Educational activities          18.607
Q35 SCUOLA                      P01 SCUOLA                      High School                     13.050
Q35 TIPI TITOLO SUP             TIPI TITOLO SUP                 High school degree type         240
Q35 P09 UD CDS                  P09 UD CDS                      Didactic units                  219.255
Q35 TIPI CORSO AD VAL           TIPI CORSO                      Course type                     52
Q35 NORMATIVA CDS AD VAL        P07 NORMATIVA                   Regulations                     10
Q35 INVIO SEGNALAZIONE          Q35 INVIO SEGNALAZIONE          Send Report                     6.508
Q35 NUM QUEST CDS DOC UD        Q35 NUM QUEST CDS DOC UD        Number of questionnaires        27.008
Q35 CARICHE FAC AD VAL          V06 CARICHE SDR VALIDE          List of assignments             416
Q35 PRESIDE FAC AD VAL          DOCENTI                         Faculty members                 11.595
Q35 CARICHE CDS AD VAL          V06 CARICHE SDR VALIDE          List of assignments             416
Q35 PRESIDE CDS AD VAL          DOCENTI                         President of the study program  11.595
Q35 DOC AD VAL DIP AFFERENZA    P06 DIP                         Department                      61
Q35 UD TIPO COPERTURA           P09 UD PDSORD DOC               Didactic unit                   31.596
QUESITI PADRE                   P02 QUESITI                     Questions                       8.830
Q35 FAC CDS AD VAL              P06 FAC CDS                     Relationship between            1.833
                                                                Faculty/Study Courses

Datasets Population

In the Oracle database, tables and data were imported using the expdp/impdp owner commands.

In all the other RDBMS databases (MySQL, PostgreSQL), tables and indexes were recreated with DDL scripts.

The data was migrated from the Oracle database to the remaining RDBMS databases using the Linux program sqldata ("SQLines Data - Database Migration and ETL"), available for download at: http://www.sqlines.com/sqldata

The DDL scripts are available at the following link:
https://github.com/ccorona70/tesimagistrale/tree/master/QUERY/UNICA/DDLScripts


Database Query SQL Scripts

All the SQL database queries can be found at the link:
https://github.com/ccorona70/tesimagistrale/tree/master/QUERY/UNICA/DMLScripts

Python SparkSQL Script Example

In Spark, a Python script has been created for each query.

Each script uses the SparkSQL syntax, which relies on the DataFrame concept to manipulate table data.

The queries, in ANSI format, are executed in all relational databases.

To speed up the queries, the SparkSQL option "partitionColumn" has been used to parallelize the reads on some large tables. A table is thus imported into a Spark DataFrame with a command like:

tab1 = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://spark-db:3306/ESSE3?user=esse3&password=esse3",
    dbtable="ontime", fetchSize=10000, partitionColumn="Year",
    lowerBound=2007, upperBound=2017, numPartitions=12).load()
tab1.registerTempTable("ontime")

The Python scripts are available at the following link:
https://github.com/ccorona70/tesimagistrale/tree/master/QUERY/UNICA/SparkScripts


6.4 Test Results

OnTime queries

Figure 6.2: Benchmark OnTime Queries


Figure 6.3: OnTime1

Figure 6.4: OnTime2


Unica queries

Figure 6.5: Benchmark Unica Queries


Figure 6.6: Unica1

Figure 6.7: Unica2


Chapter 7

Conclusions

Retrieving data from an RDBMS and loading it into Spark is not free.

Spark does not work well for fast queries, i.e. those that use indexes or can efficiently use an index.

Spark is recommended when the tables used in queries have millions or

billions of records.

Spark's advantage is even more pronounced with RDBMS databases like MySQL or Oracle when large tables are indexed and partitioned.

It can increase the performance of the OnTime queries by up to four times, and that of the Unica queries by more than one hundred times.

Using Apache Spark as an additional engine layer on top of RDBMS databases can help speed up slow reporting queries and add more scalability for long-running queries.

In addition, Spark, combined with its query caching feature, can speed up the execution of frequent queries.


Acknowledgments

I thank all my family for the support and patience they have shown during these eight years of work and study.


Bibliography

[1] Matei Zaharia, Holden Karau, Andy Konwinski, Patrick Wendell. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, 2015.

[2] Alexander Rubin. How Apache Spark makes your slow MySQL queries 10x faster (or more). https://www.percona.com/blog/2016/08/17/apache-spark-makes-slow-mysql-queries-10x-faster/