
FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Implementing Hadoop Distributed File System (HDFS) Cluster for BI Solution

Jorge Afonso Barandas Queirós

Mestrado Integrado em Engenharia Eletrotécnica e de Computadores

Company Supervisor: Engenheiro Francisco Capa

Supervisor: Professor João Moreira

February 24, 2021


Abstract

Currently, there is a large influx of online information services, where the amount of information that goes through each user of the system is enormous. This data, like any other information, follows a general life cycle: storage, processing, and loading (the ETL concept). To this end, information storage systems have been created and have evolved technologically ever since, with several implementation options for different purposes. World-renowned services such as Facebook or Instagram rely on this type of system to store their information. However, each system has its advantages and disadvantages. The most important indicators for evaluating a storage system are cost-benefit and performance (speed of analysis and storage capacity), given the amount and flow of data presented. This work aims to implement a possible low-cost solution to safely store a great amount of data, based on a Hadoop Cluster together with other frameworks that can jointly create an efficient and viable Big Data solution.

Besides that, this work presents a study of other possible distributed solutions, comparing different frameworks and distinguishing between solutions based on local and cloud-based environments.

The company responsible for this project is known for creating Business Intelligence solutions, that is, solutions and indicators derived from conventional information analysis that present important results for a specific case study. The analysis, formatting, and simplicity of the data are central to this concept, and the company's request for a large-scale storage system motivated carrying out this study in a business environment. Furthermore, to test the viability of the implemented solution, a web page extraction mechanism was created, more specifically for the stock market, storing these values in tables according to the universal row-column format, so that the stored data could later be analyzed and presented in a data visualization tool. The stock market was chosen because a large amount of data is essential if the ultimate goal is to study in depth some type of behavior in this area. Another factor is the company's interest in creating a BI solution based on stock market values, for future implementations and studies, and, if possible, in creating predictive models that forecast the behavior of the extracted values, to improve the quality of the final decision.


Resumo

Atualmente, há um grande fluxo de serviços de informação online, onde a quantidade de informação que passa por cada utilizador do sistema é gigantesca. Esses dados, como qualquer outra informação, obedecem a um determinado comportamento geral: armazenamento, processamento e carregamento (conceito ETL). Para tal, foram criados sistemas de armazenamento de informação, que evoluíram tecnologicamente desde então, com várias opções de implementação e para diferentes finalidades. Serviços de renome mundial, como Facebook ou Instagram, são baseados neste tipo de informação como base para o armazenamento de informação. No entanto, cada sistema tem suas vantagens e desvantagens. Os indicadores mais importantes para avaliar um sistema de armazenamento são: custo-benefício e desempenho (velocidade de análise e capacidade de armazenamento), considerando a quantidade e o fluxo de dados apresentados. Este trabalho visa implementar uma possível solução de baixo custo para armazenar com segurança uma grande quantidade de dados, baseada num Cluster Hadoop, com outras frameworks que juntas possam criar uma solução de Big Data eficiente e viável.

Além disso, este trabalho apresenta um estudo para outras possíveis soluções distribuídas, onde será avaliada a comparação entre diferentes frameworks, bem como a distinção entre soluções baseadas em ambientes locais e em nuvem.

A empresa responsável em questão é conhecida por criar soluções de Business Intelligence, ou seja, criar soluções e indicadores derivados da análise de informação convencional, para apresentar resultados importantes num estudo de caso específico. A análise, formatação e simplicidade dos dados é um fator presente neste conceito, pois o seu desenvolvimento refere-se à solicitação de um sistema de armazenamento em larga escala, daí a grande necessidade de realização deste estudo em ambiente empresarial. Além disso, para testar a viabilidade da solução implementada, foi criado um mecanismo de extração de páginas web, mais especificamente relacionadas ao mercado de ações, armazenando esses valores em tabelas, de acordo com o formato universal de linhas-colunas, para posteriormente analisar os dados armazenados e apresentá-los numa ferramenta de visualização de dados. O motivo pelo qual a análise do mercado de ações foi realizada deve-se à grande importância do uso de uma grande quantidade de dados se o objetivo final do utilizador for estudar profundamente algum tipo de comportamento relacionado com esta área. Outro fator é o interesse mútuo por parte da empresa em criar uma solução de Business Intelligence baseada em valores de bolsa, para futuras implementações e estudos. Se possível, criar também alguns modelos preditivos, ou que dêem alguma previsão futura do comportamento dos valores extraídos, para melhorar a qualidade da decisão final.


Acknowledgements

Firstly, I would like to thank B2F, the enterprise responsible for this project. They welcomed me and gave me all the material needed to perform this project. I would like to thank my supervisors from B2F and from the faculty, namely Engenheiro Francisco Capa, Engenheiro Jorge Amaral, and Professor João Moreira, for helping me through this period with the greatest patience and knowledge they could offer. No less important, I also give many thanks to B2F's Pedro Roseira for helping me as well, whether after hours or not, even though he was not directly involved in the dissertation project. His knowledge about the stock market and HDFS technologies improved the level of my work. I would also like to express my gratitude to all the professors I had throughout this course, for their availability and patience in making me a better person.

Finally, and most importantly, I need to emphasize my gratitude to my parents. Without you two, I would never be able to fulfill all my childhood dreams, and one of them is to complete this work successfully.

Jorge Afonso Barandas Queirós


Contents

1 Introduction
  1.1 Business to Future (B2F) Presentation
  1.2 Context
  1.3 Motivation
  1.4 Objectives
  1.5 Dissertation's Document Structure

2 Hadoop Distributed File System (HDFS) Cluster for BI Solution: State of the Art
  2.1 Introduction
    2.1.1 Hadoop Cluster Base Architecture
    2.1.2 BI Tools
    2.1.3 Data format
    2.1.4 Stock Market
  2.2 Stock Market Behavior's Prediction Architectures: State of the Art
    2.2.1 Traditional time series prediction
    2.2.2 Artificial Intelligence: Neural Networks
    2.2.3 High-speed Learning Algorithm - Supplementary Learning
    2.2.4 Traditional time series prediction vs Neural Networks
  2.3 Related works
    2.3.1 Hadoop MapReduce vs Apache Spark

3 Implemented Solution
  3.1 Comparison between Hadoop and other Frameworks
    3.1.1 Hadoop: Modern Data Warehouse versus Traditional Data Warehouses
    3.1.2 Hadoop versus Azure Databricks
  3.2 Hadoop Cluster Solution: Presentation of all used frameworks
    3.2.1 Name Node
    3.2.2 Data Nodes
    3.2.3 YARN: a task/job scheduler and manager
    3.2.4 Spark Framework: Data loading framework
    3.2.5 Apache Hive: MySQL database and JDBC Driver
    3.2.6 Power BI
  3.3 Setup of Cluster: Installation of Hadoop / YARN on three machines
  3.4 Stock Market's Web Scraping: Extracting stock indicators
    3.4.1 Important stock market indicators to extract
    3.4.2 Python Script
  3.5 Spark Framework: Advantages over other solutions and configuration
    3.5.1 Apache Spark vs Hadoop MapReduce for running applications
    3.5.2 Apache Spark configuration over HDFS
  3.6 Apache Spark Script to store extracted data in HDFS
  3.7 Connection to Power BI with Apache Spark framework: Apache Hive and Spark Thrift Server Configuration
    3.7.1 Power BI: Data Load and Processing in real time

4 Result of Implementation and Tests
  4.1 HDFS architecture availability
  4.2 HDFS extraction mechanism
  4.3 HDFS Performance Results - Data Extracting and Load: Spark Jobs with CSV files vs Parquet
  4.4 Power BI Results

5 Conclusions and Future Work
  5.1 Conclusion
  5.2 Future Work

References


List of Figures

2.1 Hadoop Architecture [1]
2.2 MapReduce Architecture [2]
2.3 Spark Architecture [3]
2.4 Power BI Desktop Report
2.5 Pfizer maximum stock market value after first dose injected in a person (Google Finance)
2.6 Neural Network Architecture for learning the behaviour of the stock market based on key indicators [4]
2.7 Prediction Simulation [4]
2.8 Prediction results with collected data [5]
2.9 Prediction Model Diagram [6]
2.10 Time series Algorithm [6]
2.11 Predictive Results [6]
2.12 ARIMA model [7]
2.13 ARIMA prediction results using data mining [7]
2.14 ARIMA prediction results using data mining [7]
2.15 MapReduce versus Apache Spark tests [8]

3.1 General Architecture of a Data Warehouse [9]
3.2 Internet usage between 1990 and 2016 [10]
3.3 Azure Databricks and Oracle RAC pricing [11] [12]
3.4 Implemented HDFS Data Extracting Architecture
3.5 Hive Architecture [13]
3.6 Hosts file located at /etc folder
3.7 Configuration of core-site file
3.8 Configuration of hdfs-site file
3.9 Memory allocation configurations on yarn-site xml file
3.10 yarn-site xml file
3.11 Daemons in master machine
3.12 Daemons in slave machines
3.13 HDFS Local Web Site
3.14 HDFS Local Web Site
3.15 YARN Job Manager Local Web Site
3.16 Anaconda framework's interface
3.17 Google Finance's code for extraction
3.18 Inspector tool to find id of data tags
3.19 MarketWatch's extraction algorithm
3.20 Output list of stock data
3.21 Apache Spark versus Hadoop MapReduce
3.22 Apache Spark download page
3.23 spark-default.conf file
3.24 History Server Web Interface
3.25 Bash Script to Extract and Load Data using Crontab
3.26 Python Imports
3.27 First part: Extraction of data
3.28 HDFS folder with extracted files
3.29 Final stage of loading data to HDFS, part 1
3.30 Final stage of loading data to HDFS, part 2
3.31 Apple January extracted stock market values' HDFS file (portion of the file)
3.32 Hive downloaded compressed file (version 2.3.7)
3.33 Hive's hive-conf.sh file
3.34 MySQL schema metastore creation
3.35 Permissions to new Hive and MySQL user
3.36 Metastore server username
3.37 Metastore server password
3.38 Connection URL
3.39 Driver Name
3.40 Created Hive table for Apple's stock values in December 2020
3.41 Created Hive table for Apple's stock values in December 2020
3.42 Hive's table in MySQL domain
3.43 Thrift Server configuration
3.44 Connectors available at Power BI
3.45 Thrift Server Connection
3.46 Credentials to connect to HDFS
3.47 Hive's table on Power BI: preview
3.48 Power BI Fields toolbar
3.49 Moving average of Google Finance's Close Price indicator, on Apple, in December
3.50 Hive's table, with source column
3.51 Power BI Implemented Line Charts and Tables

4.1 HDFS availability check
4.2 YARN scheduling and monitoring test
4.3 Spark History Server test
4.4 Hive server test with beeline
4.5 Hadoop Folder of Extracted data
4.6 Output of data in HDFS files
4.7 CSV File sizes: 1, 10, 100 million rows
4.8 1 Million row Parquet file
4.9 10 Million row Parquet file
4.10 100 Million row Parquet file
4.11 Time performance test in Spark: CSV files
4.12 Time performance test in Spark: Parquet files
4.13 Power BI Final Dashboard

5.1 MLlib library for data training [14]


List of Tables

2.1 Correlation Function


Abbreviations and Symbols

AI Artificial Intelligence

BI Business Intelligence

CSV Comma-separated Values

DAX Data Analysis Expressions

ELT Extract, Load, Transform

ETL Extract, Transform, Load

HDFS Hadoop Distributed File System

JDBC Java Database Connectivity

MDA Multiple Discriminant Analysis

ML Machine Learning

NN Neural Network

SSH Secure Shell

URL Uniform Resource Locator


Chapter 1

Introduction

This document reflects the work performed in the final curricular unit of the Integrated Master in Electrical and Computer Engineering, Telecommunications major, in the academic year 2020/2021.

This dissertation was done in a professional environment at B2F - Business to Future. My work was supervised by Professor João Moreira of the Faculty of Engineering of the University of Porto and co-supervised by Engineer Francisco Capa and Engineer Jorge Amaral of B2F.

1.1 Business to Future (B2F) Presentation

Business To Future (B2F) is the organization responsible for carrying out this project, providing all the material and help needed for its development. Based in Porto, it focuses on Business Intelligence (BI) solutions, with experience in large-scale projects with other well-known organizations, such as Amorim, HBFuller, Sonae, STCP, among others.

1.2 Context

Currently, all companies need an information storage system with capacity for a high volume of data, where it is possible to guarantee integrity, ease, and speed of access to the stored information. In the current market there are several solutions, but in general they present factors that compromise their use. These reasons may be the high cost of the equipment, the difficulty in adapting the system to a previously implemented architecture (lack of compatibility with frameworks already used in the business environment), security and data integrity, among others.

To create a customized solution that meets the requirements demanded by the entity responsible for the project, the implementation of a cluster based on the Hadoop Distributed File System (HDFS) was put to the test. Its implementation, together with tests and studies carried out on other identical solutions based on the same concept (Big Data), will be the main points explored in this project. This concept is increasingly present in current times.


The main focus of the responsible organization is to build Business Intelligence solutions for its customers, where the entire storage and analysis process is done. So, there is curiosity to find out whether a solution with these specifications would be useful for future implementations, or even for a possible internal product, if it proves to be an asset for other projects/organizations.

1.3 Motivation

In order to store a large volume of data in a secure, affordable, scalable, and customized way, the Hadoop Distributed File System (HDFS) was created. In parallel with other widely used frameworks, such as Power BI, Apache Spark, and Apache Hive, it is possible to generate a visual report with the acquired data, in this case indicators related to the stock exchange that are calculated from this information.

To test Big Data performance with a large amount of data, stock market quotes are extracted and stored in the HDFS system so that they can afterwards be analyzed and presented on a user-friendly interface.

The main reason for doing this work is to find out whether a solution based on Hadoop, together with some tools already used in a business environment, is viable to implement as a Big Data and BI solution, replicating this idea in future works. Therefore, if the results are satisfactory, there would be the possibility of implementing this technology, replacing the traditional data storage technologies still in use, which present several limitations in terms of performance.

Another reason to implement this work is, after building a Big Data infrastructure, to try to implement some statistical models on the extracted data (stock market quotes) in a data analysis framework, if possible applying linear regression models or machine learning algorithms, in order to create a predictive analysis of stock market quotes. This requirement is not as important as the first one presented, but, if implemented, the value of the solution would increase significantly.

If this step is not carried out successfully, it is suggested that a small analysis could be made in a final reporting tool, based on some values and subsequent basic analytical calculations, followed by its presentation, for example, in the Power BI framework, which is widely used in the enterprise domain.

In order to carry out the implementation, the main advice is to use the tools best rated by the majority of the existing community around this concept and, as much as possible, tools that already exist in the business domain, in order to reduce additional costs and strengthen compatibility and prior knowledge of their use.

Another important factor required by the company is to test different types of formats used to store data, more specifically a theoretical and practical analysis of two main formats for storing information: CSV, recognized worldwide, and Parquet (specially designed for solutions that use the Hadoop distribution). In this way, new solutions could be adopted, bypassing general problems that these files bring, such as a high amount of information that is unnecessary for the solution in question and may compromise it in some aspects.


1.4 Objectives

The main objective of the project is to create an architecture for the extraction, storage, analysis, and presentation of a large amount of data, able to process and query millions of records in a short period, using the least expensive software and hardware possible. There are several paths that can be taken, where each one can present different methods for the required processes. Therefore, as the project progresses, it will also be useful to make a comparative analysis between the various tools and frameworks available, to understand their advantages and disadvantages for the work in question. Finally, the ultimate objective will be to turn this project into a product that will compete with other solutions on the market, standing out from the rest for its specifications and performance.

1.5 Dissertation’s Document Structure

Firstly, the document presents a brief introduction to explain the purpose of the project and why the enterprise is betting on it.

Next, the document has a state-of-the-art section, where the implemented architecture is compared to other solutions on the market, with different variations in the components used. It also includes a brief analysis of some models for stock market prediction, where methods like Machine Learning and Artificial Intelligence are essential to perform the required mathematical calculations.

After that, it is explained how this solution was actually built, making it clear why certain technologies were used in preference to other available tools.

To validate the idealized project, practical results are presented, in parallel with explanations for each procedure. Finally, conclusions are drawn, pointing to some aspects that can be explored further in the future, with more time and experience with all the technologies used.


Chapter 2

Hadoop Distributed File System (HDFS) Cluster for BI Solution: State of the Art

2.1 Introduction

Nowadays, some architectures can store and process a lot of information, but few bring the best to the user. Aspects like price, efficiency, scalability, and robustness are essential to make a product reliable in the market. This chapter introduces some technologies, comparing different works made around the world according to this concept.

The main idea of this work is to implement a Big Data architecture to support a BI solution.

A Big Data solution consists of the collection of a large amount of data, which may keep growing, stored in multiple machines that work together to analyze it and prevent loss of information. So, this technology demands a big allocation of space and fast, responsive behavior if the purpose is to produce real-time results.

Business Intelligence is another concept directly related to Big Data, where data is processed to create better decision options for a certain implementation. Putting these two concepts together in an innovative architecture allows the creation of a strong decision-making tool.

2.1.1 Hadoop Cluster Base Architecture

In the present days, there are several architectures for implementing a Big Data solution, but all of them present four main steps:

• A program to extract data and store it in HDFS;

• A framework to run Hadoop jobs in order to process the data;

• A connector from the stored data to the presentation framework;

• Presentation of the data in a BI tool.


Hadoop (Hadoop, 2020) [15] is the open-source implementation of Google's MapReduce model. Hadoop is based on HDFS (Hadoop Distributed File System) (HDFS, 2020) [16]. It is a system that tolerates faults in a given node and allows high-throughput data streaming and robustness. Hadoop provides storage spread over many nodes and parallel processing across the cluster using the MapReduce paradigm, a programming paradigm that enables scalability across the entire cluster.

Figure 2.1: Hadoop Architecture. [1]

In this work, MapReduce (MapReduce Tutorial, 2020) [17] jobs are replaced by Spark jobs (Apache Spark is a unified analytics engine for large-scale data processing) [18], because the latter works in a different way, making it possible to run about 100 times faster, due to in-memory processing instead of MapReduce's read-write-from-disk way of working. So, Spark is a framework that runs over Hadoop and can be faster and more efficient. This comparison is detailed further in the section Spark Framework: Advantages over other solutions and configuration.
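
As an illustration of the kind of Spark job used in place of a MapReduce job, the following minimal PySpark sketch reads stock quotes from HDFS and aggregates them in memory; the HDFS host, path, and column names are hypothetical and only serve to show the idea.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal sketch of a Spark job running over HDFS (paths and columns are illustrative).
spark = (SparkSession.builder
         .appName("stock-quotes-aggregation")
         .getOrCreate())

# Read raw CSV quotes stored in HDFS by the extraction step.
quotes = spark.read.csv("hdfs://master:9000/stocks/quotes.csv",
                        header=True, inferSchema=True)

# In-memory aggregation: average close price per ticker.
avg_close = quotes.groupBy("ticker").agg(F.avg("close").alias("avg_close"))
avg_close.show()

spark.stop()
```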

Figure 2.2: MapReduce Architecture. [2]


Figure 2.3: Spark Architecture. [3]

Therefore, the brain of Hadoop resides in the Resource Manager layer, provided by YARN (Apache Hadoop YARN, 2020). YARN is a manager that splits the functionalities of resource management and job scheduling/monitoring into separate processes. The main idea is to have a Resource Manager process to manage all the jobs and an Application Master to specify the resources for each job, working along with each slave node's Node Manager to execute the different processes in a Hadoop cluster [19].

If the system is only intended to store data and process it, these frameworks are enough. But to present it on a Business Intelligence platform, Hadoop must have a universal connector driver. In this case, Apache Hive [20] is used, in parallel with a MySQL connector, to create a metastore database with the generated and processed data and to query it from the BI tool.

"The MetaStore serves as Hive's table catalog. In conjunction with SerDes, the MetaStore allows users to cleanly decouple their logical queries from the underlying physical details of the data they are processing. The MetaStore accomplishes this by maintaining lookup tables for several different types of catalog metadata." [21]
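
As a rough sketch of how processed data can be registered in the Hive metastore so that a BI tool can later query it, the snippet below uses Spark's Hive support; the warehouse path, database, and table names are assumptions for illustration, not the exact configuration used in the project.

```python
from pyspark.sql import SparkSession

# Sketch: write a processed DataFrame as a Hive table backed by the metastore.
# Warehouse location, database and table names are illustrative.
spark = (SparkSession.builder
         .appName("hive-metastore-sketch")
         .config("spark.sql.warehouse.dir", "hdfs://master:9000/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

quotes = spark.read.parquet("hdfs://master:9000/stocks/quotes_parquet")

# Register the data in the metastore; a JDBC/ODBC client (e.g. Power BI via the
# Thrift server) can then query stocks.daily_quotes without knowing the file layout.
spark.sql("CREATE DATABASE IF NOT EXISTS stocks")
quotes.write.mode("overwrite").saveAsTable("stocks.daily_quotes")

spark.stop()
```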

Some of the existing architectures use the same frameworks, but they differ in the way they are linked, either in the language or in the task execution engine. The Hadoop architecture is the successor of the traditional data warehouse, used by several projects but now outdated. In fact, Hadoop is widely used because of its commodity hardware and low-cost processing, and companies such as Facebook or Yahoo use this architecture to store and process their large amounts of data [22].

2.1.2 BI Tools

After processing the data into the required format, there is the need to present it, in graphs or tables, to finish the entire Big Data process. There are some frameworks on the market to execute this final step. The most used applications are Power BI (by Microsoft), Tableau, Qlik, and Grafana, among others. Power BI is the solution adopted in this project because there is already a paid license for this application in other implementations of the enterprise. Other tools are widely used in this type of project, but imply different costs, connection configurations, numbers of connectors, and overall performance. Power BI is a framework that was created to perform Business Intelligence analytics.

This application allows connecting to multiple data sources and creating visual reports of the data. Power BI supports multiple platforms, like Windows, Linux, macOS, and even Android and iOS. The reports can be accessed from every platform where the user is logged in. Besides that, it is possible to implement calculations over the data using DAX (Data Analysis Expressions). The figure below shows an example of a report generated in this project, where it is possible to see the data in the form of charts, according to the timestamp of each record.

Figure 2.4: Power BI Desktop Report.

2.1.3 Data format

The data to be analyzed next has to be stored in the Hadoop Cluster, even if it is temporary or will be discarded in the future. For that, most projects that use this concept adopt a couple of formats. The most common and well known is the CSV format, due to its simplicity and high compatibility with every language or system. But, to create a more efficient way to store data, the Apache Foundation created the Apache Parquet format, which is basically "a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language." [23] That is, Parquet is a columnar, compressed file format that stores data in a more efficient form, with a compressed dictionary based on the record shredding and assembly algorithm. For example, a 1 TB CSV file will be reduced to approximately 130 GB, according to the Databricks platform [24]. The next chapter contains tests of these two formats, showing the performance battle between them.
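
As a small illustration of the difference in practice, the following PySpark sketch converts a CSV data set to Parquet so that on-disk sizes and scan times can be compared; the HDFS paths and column names are assumptions, not the thesis's exact script.

```python
from pyspark.sql import SparkSession

# Sketch: convert a CSV data set to Parquet so file sizes / scan times can be compared.
# HDFS paths and column names are illustrative.
spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

df = spark.read.csv("hdfs://master:9000/stocks/quotes_csv",
                    header=True, inferSchema=True)

# Parquet is columnar and compressed (Snappy by default in Spark).
df.write.mode("overwrite").parquet("hdfs://master:9000/stocks/quotes_parquet")

# A column-pruning query only reads the needed columns from Parquet.
spark.read.parquet("hdfs://master:9000/stocks/quotes_parquet") \
     .select("ticker", "close").show(5)

spark.stop()
```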


2.1.4 Stock Market

In the present day, many people are trying to work and make money with stock market variations. It is a difficult area to analyze, since stock volatility is high and the concept is complex, involving thousands of indicators and key values that define a market. For example, due to the recent pandemic, Pfizer's stock increased from a maximum of 38 dollars at the beginning of 2020 to 42 dollars in December 2020, reaching its peak.

To study the stock market's behavior, a great amount of data has to be processed, as well as well-known financial indicators. Since the enterprise focuses its solutions on products related to financial or business problems, and its interest in exploring new horizons such as the stock market is high, this theme was chosen as the core of the solution.

Besides that, it is well suited to testing a brand-new Hadoop architecture, as mentioned in the previous paragraph: it requires millions upon millions of records of stock values to give a more precise prediction. Some indicators are common to all the sources presenting a given market, but there are differences, small ones, that can change an entire view of a stock's prediction. One percent of a dollar in a collection of 200 tracked stocks can lead to significant losses, or even to a bad correlation calculation between all of the scraped sources.

All online stock market sources provide indicators like Price, Open, Close, High, Low, Market Cap, PE Ratio, and Dividend. These values give an idea of how a stock has changed historically, but they are not sufficient to support a prediction. For that, there are other important indicators, like the interest rate, vector curves, turnover, or foreign exchange rates. A strong correlation can be built using, for example, algorithms based on Machine Learning (Artificial Intelligence) and Deep Learning. It is also possible to create a prediction model based on financial news or earnings reports.


Figure 2.5: Pfizer maximum stock market value after first dose injected in a person (Google Finance).

2.2 Stock Market Behavior's Prediction Architectures: State of the Art

2.2.1 Traditional time series prediction

Traditional statistical models are widely used, even nowadays, due to their capability of modeling linear relationships with the key values that influence a certain stock market value. There are two types of time series models: simple regression and multivariate regression. Paul D. Yoo et al. studied these methods, presenting the advantages and disadvantages of each mathematical model [25].

The Box-Jenkins model is a widely used example of a univariate model, also known as simple regression. This model contains an equation with only one unknown. However, this model is not appropriate to use here because it requires a lot of data to make the result precise. That is, if the amount of data is big and covers a low-interval period, the efficiency is high, but it is still not a great value compared to the other architectures presented below (60 percent).

Multivariate models are univariate models with more complexity attached, i.e., they contain more than one variable. One example is regression analysis, which is compared with neural networks. It estimates relationships between variables using a criterion, which is normally an equation. Nowadays, neural networks have largely replaced these algorithms, due to their performance.


2.2.2 Artificial Intelligence: Neural Networks

There are two main approaches to predicting stock market values with a considerable trust rate. One is based on historical values, correlated with key indicators like the interest rate, price records over a long time interval, etc. The other consists of collecting text information related to companies' behavior or even about current events important to the majority of people and organizations.

A neural network is a series of algorithms inspired by the human neural system to create relations within a big amount of data. It is useful for prediction systems because it has the power to perform many calculations and give outputs with high correlation.

The figure below presents the general architecture of a neural network system. It consists of three layers: the input, hidden, and output layers. Each unit in the network receives inputs from units in the lower-level layer and performs a weighted addition to calculate its output. The usual output function for this algorithm is the sigmoid function, with the expression:

S(x) = \frac{1}{1 + e^{-x}}    (2.1)

where e is Euler's number (approx. 2.71828).
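
To make the three-layer structure concrete, the following minimal NumPy sketch performs a forward pass through one hidden layer with sigmoid activations; the layer sizes and random weights are purely illustrative assumptions, not the architecture used in [4].

```python
import numpy as np

def sigmoid(x):
    # Sigmoid activation, S(x) = 1 / (1 + e^{-x}), as in equation (2.1).
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: 5 input indicators, 3 hidden units, 1 output (buy/sell signal).
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(5, 3))   # input -> hidden weights
W_output = rng.normal(size=(3, 1))   # hidden -> output weights

x = rng.random(5)                    # one normalized input pattern in [0, 1]
hidden = sigmoid(x @ W_hidden)       # weighted sum + sigmoid at the hidden layer
output = sigmoid(hidden @ W_output)  # network output in (0, 1)
print(output)
```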

Figure 2.6: Neural Network Architecture for learning the behaviour of the stock market based on key indicators. [4]

2.2.3 High-speed Learning Algorithm - Supplementary Learning

The first algorithm to be presented is the High-speed Learning Algorithm with Supplementary Learning, described in [4]. It takes the error backpropagation proposed by Rumelhart [26] and improves it by automatically scheduling pattern presentation and changing the learning constants. In this algorithm, weights are updated according to the sum of the error after all data has been processed. During learning, errors are backpropagated only for the data whose output exceeds the maximum tolerance; in other words, only the largest unit errors are learned again. Besides that, supplementary learning allows automatic change of the learning constants, i.e., the parameters are flexible for each type and volume of data. The resulting weight update is expressed by:

\Delta w(t) = -\frac{\varepsilon}{\text{learning patterns}} \cdot \frac{\partial E}{\partial W} + \alpha \, \Delta w(t-1)    (2.2)

where \varepsilon is the learning rate, \alpha is the momentum, \partial E / \partial W is the gradient of the error with respect to the weights, and "learning patterns" is the number of learning data items that require error backpropagation. The learning rate is automatically reduced when the amount of learning data increases, which allows the same constants to be used regardless of the amount of data.

2.2.3.1 Teaching Data

In the mentioned article [4], the authors create space patterns based on time-space patterns of key indicators and convert them to analog values in the range [0, 1] (due to the sigmoid function).

After that, the timing to sell or buy stocks is indicated, through the sigmoid function, in one output unit. This timing is used as teaching data and is the weighted sum of weekly returns. The authors study the TOPIX, which is a stock exchange index like the NASDAQ, defining the weekly return as:

r_t = \ln \frac{\mathrm{TOPIX}(t)}{\mathrm{TOPIX}(t-1)}    (2.3)

where TOPIX(t) is the TOPIX average at week t. The teaching data is then

r_n = \sum_i \varphi_i \, r_{t+i}    (2.4)

where \varphi_i are the weights.

Input indexes converted to space patterns and teaching data usually come with variations. So, to eliminate this error, the data is pre-processed by log or error functions in order to make it as regular as possible.

After that, a normalization function maps the data into the [0, 1] range to correct the data distribution.
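
A minimal sketch of this kind of pre-processing, assuming a simple pandas series of weekly index values (the synthetic values and the min-max normalization choice are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative weekly index values (e.g. TOPIX weekly averages).
topix = pd.Series([1500.0, 1512.5, 1498.3, 1520.1, 1535.7])

# Weekly log returns, r_t = ln(TOPIX(t) / TOPIX(t-1)), as in equation (2.3).
returns = np.log(topix / topix.shift(1)).dropna()

# Simple min-max normalization into [0, 1] to regularize the distribution.
normalized = (returns - returns.min()) / (returns.max() - returns.min())
print(normalized)
```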

2.2.3.2 Prediction Simulation

Four different learning data sets are used to simulate the neural network. The authors used a 33-month interval and calculated the values presented in the table below.

2.2.3.3 Buying and Selling Simulation

To demonstrate the practical behavior of the implemented algorithm, the authors create a simulation with the teaching data. The strategy to buy and sell stocks is called one-point, where all money is used to buy stocks and all stocks held are sold at the same time. An output above 0.5 in the analog system determines the buying of a stock, and a value below it means selling the stock.

Table 2.1: Correlation Function

              Correlation Coefficient
  Network 1   0.435
  Network 2   0.458
  Network 3   0.414
  Network 4   0.457
  System      0.527

The figure shows a plot illustrating the prediction of the TOPIX Stock.

Figure 2.7: Prediction Simulation.[4]

This simulation shows that it is possible to predict the expected behavior of a stock. The buy-and-hold line is the real price of TOPIX, and the line above represents the stock prediction of the modular neural network architecture. The two lines show identical evolution behaviors, differing only in the real value of the stock. But the more important factor is that in almost all the studied periods, the prediction system can confirm the increase or decrease of the TOPIX stock price, giving a great trust index for buying and selling opportunities.

2.2.4 Traditional time series prediction vs Neural Networks

Paul D. Yoo et al. [25], mentioned in the subsection Traditional time series prediction, compared traditional prediction systems with recent architectures, namely neural network algorithms [27].

The first difference is related to speed and fault tolerance. Neural networks execute their tasks in parallel channels, giving the system reliability and high response speed. Compared to traditional statistical models, like Box-Jenkins, they perform better and are more efficient.

Lawrence [27] used the JSE system, a system that combines neural networks with genetic algorithms. It is widely used due to its high efficiency. It uses 63 indicators with the aim of getting an overall view of the market. This system normalizes all data into the range [-1, 1], which is a normal procedure in this type of system. Simpler systems only use indicators like historical prices and chart information. This method shows the capability of predicting the stock market correctly 92% of the time, against only 60% for Box-Jenkins.

Paul D. Yoo et al. [28] studied the performance of Multiple Discriminant Analysis (MDA), a multivariate regression model, versus neural networks. Their NN-based stock prediction presents 91% accuracy against only 74% for the MDA methods. In fact, the conclusion is that neural networks outperform any traditional time series algorithm. They learn from their own data rather than from an induced relationship. Another advantage is that NNs have non-linear and non-parametric learning properties that improve forecasting and prediction efficiency. So, NNs are compatible with stock market data, which is hard to model and non-linear.

2.3 Related works

Currently, there are several projects and studies regarding Big Data and Business Intelligence solutions. Some of them also present implementations based on complex data analysis, using sophisticated mathematical models.

Zhihao [5] developed an architecture for Big Data analysis that uses Hadoop, Spark, and Machine Learning, in order to calculate stock market variations based on a stock indicator: the return. He mentions the power of HDFS for storing a great amount of data, capable of failure detection and non-stop operation, that is, handling system failures so as not to compromise the whole system. He uses the system to try to predict US oil stocks, retrieving data from the Yahoo Finance website.

He uses the MapReduce framework to write jobs and applications, and Spark over the MapReduce framework to process data, since Hadoop by itself cannot handle it in real time.

The data comes directly from the website, in CSV format, collecting the return of 13 oil stocks from 2006 until 2019. He uses Flume to inject the data into HDFS. For processing the data, he uses Spark's Python API, PySpark, creating data sets from the data. After that, he uses the MLlib library available in Spark, creating a linear regression model to predict the next prices of US oil stocks. The evaluation metric used is the R-squared value (squared correlation).
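
A rough sketch of such a PySpark MLlib linear regression is shown below; the column names and HDFS path are assumptions for illustration, and the regParam value of 0.3 mirrors the regularization parameter reported in [5].

```python
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

# Sketch of an MLlib linear regression on stock returns; columns and path are
# illustrative, regParam=0.3 mirrors the value reported in [5].
spark = SparkSession.builder.appName("oil-stock-regression").getOrCreate()

data = spark.read.csv("hdfs://master:9000/stocks/oil_returns.csv",
                      header=True, inferSchema=True)

assembler = VectorAssembler(inputCols=["lagged_return"], outputCol="features")
train, test = assembler.transform(data).randomSplit([0.8, 0.2], seed=42)

model = LinearRegression(featuresCol="features", labelCol="return",
                         regParam=0.3).fit(train)

r2 = RegressionEvaluator(labelCol="return", predictionCol="prediction",
                         metricName="r2").evaluate(model.transform(test))
print("R squared:", r2)

spark.stop()
```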

The results of the experiment are based on the results of the regression model. The figure below represents the prediction of the data, calculated with the MLlib library.


Figure 2.8: Prediction results with collected data. [5]

It is possible to see that some values are negative, meaning that the oil stocks are not related among themselves. The model was built using a regularization parameter equal to 0.3. The Mean Average Error is equal to 1.95%, suggesting that this model is not suitable for data with high dimensionality.

He concludes that it is hard to predict stock data, but this technology brings to the table the possibility of better performance if used with other tools, such as neural networks.

The project proposed by Mrs. Lathika et al. [6] consists of using the MapReduce technique along with the HDFS architecture to develop a prediction model. They created a system that uses a historical stock dataset acquired from finance.yahoo.com to predict a company's next-quarter prices.

After that, they create a prediction model. The workflow of this prediction model is shown in the next figure.


Figure 2.9: Prediction Model Diagram. [6]

Also, they use a time series algorithm to calculate all past movements of a certain company's behaviour, based on several stock market analysis variables.

Figure 2.10: Time series Algorithm. [6]

In the end, using these two main concepts, they can present some results in a CSV file, calling it a test dataset. Using MapReduce, they can simplify the processed data in order to present it, in the end, in a prediction model result table. MapReduce reduces the information by about 98.5 percent, and the final results, relative to the predicted average stock held in three different companies, are presented in the figure below [6].

Figure 2.11: Predictive Results. [6]

The results are good, but the method of presenting the information is not pretty, because they did not create a BI interface. Besides that, the maximum trust interval calculated for these three companies is only 92 percent. This value is not high enough, because this market is very inconstant and presents high variations: for example, eight percent of a stock position valued at 100,000 euros is 8,000 euros. This difference can easily lead an investor to invest or not in a certain company's stock. An acceptable value for this prediction should be around 98 percent. So, the Achilles' heel of this project is the low trust percentage.

Arkili [29] created an architecture using Mahout and Pydoop technologies, along with high-performance computing tools, to try to predict stock movements over various periods. Mahout is a distributed algebra framework that provides mathematical models that can be used to predict the stock market. Pydoop is a Python interface that interacts with HDFS directly, allowing applications to be written on Hadoop. They use a linear regression model for the prediction of the stock market based on the Python scikit-learn library. The results are based on a ten-year stock analysis of the Home Depot enterprise, giving an accuracy of 0.85, or 85%, based on comparisons between the actual values and the calculated values.

M. D. Jaweed et al. [30] created an architecture for analyzing large datasets of stock market data using the HDFS architecture, along with QlikView, a Business Intelligence tool, to present the data, creating plots and graphics to illustrate the stock market tendency. In this project, there is no post-processing of the data to feed a predictive model or similar analytics, but QlikView's results can help the user find some patterns in a user-friendly interface.

Mahantesh C. Angadi et al. [7] created a data mining stock market analysis using a time series model: the auto-regressive integrated moving average (ARIMA). They wrangle the data, collected from the Google Finance website, and store it in an R data frame. The ARIMA model, based on pre-processed historical data, is shown in the figure below.


Figure 2.12: ARIMA model [7]

The ARIMA model uses the autocorrelation function and the partial autocorrelation function to identify p, d, and q, respectively the order of the autoregressive part, the degree of first differencing, and the order of the moving average part.
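
As a minimal sketch of this kind of model (using Python's statsmodels rather than the R implementation used in [7]; the order (p, d, q) = (1, 1, 1) and the synthetic price series are assumptions):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic daily closing prices standing in for the historical stock series.
rng = np.random.default_rng(1)
prices = pd.Series(100 + np.cumsum(rng.normal(0, 1, 250)))

# Fit an ARIMA(p, d, q) model; the order would normally be chosen from the
# autocorrelation and partial autocorrelation functions.
model = ARIMA(prices, order=(1, 1, 1)).fit()

# Forecast the next 10 values.
print(model.forecast(steps=10))
```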

The results are very conclusive, as they present some plots that illustrate the predicted values, compared to the actual and past ones. The figure below shows the two plots of real versus predicted stock values.


Figure 2.13: ARIMA prediction results using datamining [7]

Figure 2.14: ARIMA prediction results using datamining [7]

The second plot presents the predicted data, and both represent INFY stock values between 2014 and 2015. Both are similar, so this algorithm is effective on a short-term basis. This work lacks file storage, giving good responses only over short-term intervals, because no distributed file storage was built to hold a great amount of data. The stock market is a non-linear concept, and it requires a lot of data to forecast its behavior. This architecture may produce bad results over a larger amount of stock predictions.


2.3.1 Hadoop MapReduce vs Apache Spark

Satish Gopalani et al. [8] compared the performance of the MapReduce and Spark frameworks for processing data in a Hadoop Cluster. Both frameworks use the MapReduce paradigm to distribute files in the system, but the two present different architectures.

They use a data set that allows performing clustering with the K-means algorithm: "k-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster." [31]
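
For reference, a clustering job of this kind can be sketched in PySpark with MLlib's KMeans; the feature columns, number of clusters, and input path are illustrative assumptions, not the benchmark setup used in [8].

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

# Sketch of a K-means clustering job on a Spark cluster; columns, k and the
# input path are illustrative.
spark = SparkSession.builder.appName("kmeans-benchmark-sketch").getOrCreate()

data = spark.read.parquet("hdfs://master:9000/benchmark/points")
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(data)

model = KMeans(k=5, seed=42, featuresCol="features").fit(features)
print(model.clusterCenters())

spark.stop()
```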

Speed performance tests were made, and the results are in the figure below, where it is possible to see different tests with different file sizes, using the same clustering algorithm.

Figure 2.15: MapReduce versus Apache Spark tests [8]

These results show the significant difference between the two, giving Spark better chances for performing data streaming and machine learning in the context of Big Data.

They stated that the better way to create a Big Data architecture is to mix these two frameworks, where Hadoop is used to cluster the information while Spark is used to process the data. This combination brings to the table the advantages of both frameworks.


Chapter 3

Implemented Solution

This chapter presents the architecture created, explaining all the technologies and methodologies used to meet the required parameters.

The hardware used was provided by the company, B2F, as well as the framework that presents the processed data, Power BI.

The first section compares different file system technologies, highlighting Hadoop's advantages in terms of multiple performance and security factors.

The second section describes the installation of the Hadoop framework and how it is distributed across all the nodes, that is, the configurations made to create the shared environment.

The third section explains the stock market web scraping script, covering all the implemented code and its relation with the rest of the architecture.

The fourth section contains an explanation of the framework responsible for creating a schema for the data and loading it in the final stage.

The final section discusses how the data is loaded and presented, and the alterations made to plot and calculate a predictive behavior of the scraped stock values.

All sections present a brief comparison with other solutions existing on the market, and why the chosen ones are better for this solution.

The next couple of sections present the implemented solution in the developed Hadoop architecture, naming all the frameworks used, the technologies installed, and the configurations made to store the data. After that, all the algorithms created to extract, load, and transform the final data (ELT process) are presented.

3.1 Comparison between Hadoop and other Frameworks

There are some distributed file systems available on the market. Some of them are very expensive and present other disadvantages compared to the Hadoop framework. This architecture allows the number of available nodes to be scaled easily, while being flexible and resilient to failure, thanks to configurations and files replicated on every node of the system. This section presents advantages and disadvantages relative to other similar systems available on the market.


3.1.1 Hadoop: Modern Data Warehouse versus Traditional Data Warehouses

Traditional Data Warehouses are systems responsible for analyzing data, just like Hadoop and other similar systems. Hadoop can act as a Data Warehouse, but it brings several upgrades over the older models that make it more reliable. A Data Warehouse is a system that stores, processes, and retrieves data to support a decision. [9] The figure below represents the general architecture of a Data Warehouse.

Figure 3.1: General Architecture of a Data Warehouse. [9]

This architecture is the base for every file-storage system, but traditional Warehouses have been surpassed by newer systems. Nowadays, technology is in a state that was unthinkable twenty years ago, when the Internet did not have such an important role in our lives. Hadoop, in contrast to older Data Warehouse systems, can run ETL processes in parallel, that is, on multiple processes. It also presents a failure-handling mechanism that prevents data or processes from being lost forever in the system, creating backup files linked to the original one on the other machines of the system, while the master node always keeps track of the information in the system's domain.

The figure below represents the growth of Internet usage, as a curiosity indicator, from 1990, when the first Data Warehouses were implemented, until recent years, when every person can access the Internet easily.


Figure 3.2: Internet usage between 1990 and 2016 [10]

With the great growth of the Internet across the world, Traditional Data Warehouses began to have difficulties keeping up with the requirements, because the technologies implemented in these systems are not capable of processing a great amount of data in a small time interval. This was a key factor in the creation of new storage mechanisms.

So, the main difference between these two kinds of systems is related to scalability and fast response: a traditional Data Warehouse was created to run on a single machine, instead of deploying files and, respectively, the jobs and tasks related to them, on multiple machines. Therefore, the concept of Big Data cannot be applied to these older systems. For analyzing a large amount of data, Hadoop is the better mechanism, and this is why this technology was chosen for implementing this project.

3.1.2 Hadoop versus Azure Databricks

Recently, Microsoft introduced a cloud-based analytics platform: Azure Databricks. It is Spark-based, allowing languages like Python, R, and SQL to be used. It is a system that integrates easily with other frameworks, such as Power BI and the Azure SQL database. The performance of this system can be better than Hadoop's, if it is configured to work with other Microsoft frameworks to process and load data.

The objective of this work is to create a solution that is low-cost and able to perform as well as high-cost solutions like Databricks, which requires a high monthly cost. Hadoop is a technology that delivers good indicators for Big Data analysis on regular machines, not requiring the best software and hardware on the market. So, in a long-term analysis, opting for Hadoop brings the same performance results, but at a much smaller cost.

There are other similar technologies on the market, like Oracle, but, once again, these technologies do not meet one of the principal requirements of this work, namely being a low-cost solution.


Figure 3.3: Azure Databricks and Oracle RAC pricing [11] [12]

The figure presents the cost of owning an Oracle and an Azure cluster, and it is possible to see that these two solutions are very expensive compared to a Hadoop system, where the only real investment is in the physical machines. An Azure Databricks cluster with 32 GB of RAM costs about 708 dollars a month, while a good physical machine with 32 GB of RAM can be bought for 500 dollars or less. So, the cloud option is a high recurring investment, and opting for physical instead of virtual machines can save an enterprise a lot of money.

3.2 Hadoop Cluster Solution: Presentation of all used frameworks

This section describes all the work done with the available machines to create a Hadoop Distributed File System (HDFS). The first thing discussed is the number of nodes needed to perform this task. After a lot of research and a couple of meetings with the enterprise, the conclusion reached was the number of nodes best fitting the requirements: three. That is, the implemented solution has three working machines. One of them is the master node, also known as the "Name Node".

The two remaining machines are known as "Data Nodes". HDFS works with a master/slave architecture. The figure below presents the implemented architecture, showing all the components of the system.

3.2.1 Name Node

The Name Node is the engine responsible for managing the entire system and regulating access from the outside. This software runs on one machine, connected to all machines in the system.

Figure 3.4: Implemented HDFS Data Extracting Architecture.

The Name Node ensures that data never flows through this machine itself, keeping it on the Data Nodes. It works with metadata about all the files stored: in other words, this node maps all the files in the system, managing all the tasks executed on them.

In parallel, it is possible to run a secondary Name Node to mitigate the single point of failure, so the system has a backup manager available if the primary one fails or goes down for some reason.

3.2.2 Data Nodes

Data Nodes are the machines that run the engine that stores and manages HDFS' files. These files are split into blocks, and all blocks have the same specifications. Their size can be configured, with a default of 128 MB in recent versions and 64 MB in older ones. Data Nodes can also create, delete, and change blocks, as requested by the Name Node. Usually, one instance of this software is allocated to each slave node. It is not impossible to run more than one on the same machine, but that is not recommended in most cases. The better option is always to have one Data Node process for each slave node.

3.2.3 YARN: a task/job scheduler and manager

YARN is the framework responsible for managing and scheduling all the jobs submitted to the system. YARN works on top of Hadoop and creates two distinct engines: the Resource Manager and the Node Manager. The first one is allocated on the Name Node, and the second one is created on all slave nodes.

The Resource Manager has two components: the Scheduler and the Applications Manager.

The Scheduler allocates resources for all applications running on top of Hadoop. This engine does not guarantee failure recovery, so the Applications Manager was created to monitor the


state of each job. By default, the jobs running on Hadoop come from the MapReduce framework. In this project, it was defined that no MapReduce jobs would be scheduled, only Spark jobs. The first reason is compatibility with the Web scraping algorithm, and the second reason is performance. The section Spark Framework: Advantages to other solutions and configuration presents the advantages of using this system on top of Hadoop, showing and explaining the performance gains compared to MapReduce jobs.

3.2.4 Spark Framework: Data loading framework

In order to take the input data from the Web scraping algorithm presented in the section Stock Market's Web Scraping: Extracting stock indicators below, and load it in the final stage (where all data will be processed and presented), the Spark framework was implemented.

Apache Spark replaces MapReduce jobs, as stated above, and takes advantage of in-memory processing instead of persisting temporary data on disk, which is a lot faster. The main requirement is that the machine performing these tasks meets a minimum specification, so that the jobs do not consume all available memory. 8 GB of RAM is enough to perform data loading in Spark, and this amount is nowadays cheap and easy to have on a single machine.

3.2.5 Apache Hive: MySQL database and JDBC Driver

To connect all the data in real time to Power BI, Apache Hive was implemented, along with MySQL. Joining these two components makes it possible to send data to the outside, where MySQL is used to create a metastore database for the tables created by Hive, responsible for holding their schema. The Hive tables contain all the extracted data, and a live connection is used to preview this data in the tool used at the final stage.

Apache Hive ships with a JDBC Driver along with the Thriftserver script, allowing connections to Hive tables from the outside over HTTP communication.

Figure 3.5: Hive Architecture. [13]


3.2.6 Power BI

This framework is used to process and represent data calculations from HDFS files, connecting to the whole system via the JDBC (Java Database Connectivity) Driver from the Apache Hive tool. Other tools available on the market, like Tableau or QlikView, presented in chapter two, also perform similar computations. But this framework is already deployed at the enterprise responsible for this project.

So, extra costs are avoided, and the internal support, if any problem arises, is much stronger. Power BI presents a user-friendly interface, with hundreds of connectors, like MySQL databases, CSV, Excel, Spark, among others. Besides that, this framework allows some calculations related to Machine Learning on the data, like linear regressions, correlations, Moving Averages, etc.

So, after a brief study and some discussions with the B2F enterprise's experts in this area, this tool is the most suitable to bridge the local file system (HDFS) and live data analysis (Power BI).

3.3 Setup of Cluster: Installation of Hadoop / YARN on three machines

First of all, in order to store all the scraped data, a Hadoop Cluster was implemented. It consists of three working machines, connected by Ethernet, each with an Intel i5 CPU and 8 GB of RAM, running the Linux distribution Ubuntu 20.04 (it can be downloaded from https://ubuntu.com/download/desktop/thank-you?version=20.04.1&architecture=amd64), connected to each other and sharing information about the stored files. The first step is to connect them to the same network and test if the three nodes can ping the whole architecture.

For that, the /etc/hosts file is edited on all machines. The picture below represents the actual configuration of this file on the three nodes.

Figure 3.6: Hosts file located at /etc folder.
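Since the figure content is not reproduced here, a minimal sketch of such a hosts file is shown below. The hostnames node-master, node1, and node2 are the ones used later in this chapter, and the master's address 192.168.1.50 appears later in the ThriftServer configuration; the worker addresses are hypothetical.

127.0.0.1      localhost
192.168.1.50   node-master
192.168.1.51   node1
192.168.1.52   node2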

The master node uses an SSH connection to speak to the other nodes, so, for security and persistent-connection reasons, key-pair authentication is used. On the master node, the SSH key was generated with the following command in the terminal:

ssh-keygen -b 4096

When running this command, a password prompt is opened. It was left blank so that Hadoop can communicate unprompted.


After that, on each worker node, a master.pub file was created. This file contains the public key of the master node, creating a single valid authentication between the three machines. The command used is:

cat ~/.ssh/master.pub >> ~/.ssh/authorized_keys

With that configuration, every machine can communicate with all the nodes, and the next step is to download the Hadoop repository, containing the framework version used, 2.10.1, released on 21 September 2020. It can be downloaded directly from the Hadoop site, located at https://hadoop.apache.org/. From the terminal, the next commands are executed to download and extract the files to the home folder:

cd
wget http://apache.cs.utah.edu/hadoop/common/current/hadoop-2.10.1.tar.gz
tar -xzf hadoop-2.10.1.tar.gz
mv hadoop-2.10.1 hadoop

The next step is to set the environment variables, that is, giving the system the location of all Hadoop's files. For this, the line below is added to the .profile file located in the home folder:

PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH

In the .bashrc file, located in the same folder, the following lines were set, in order to give the path to the shell:

export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

The system can now locate the Hadoop directory. Hadoop communicates using the Java language, so the next step is to download or find the Java path and set it in the Hadoop environment file, hadoop-env.sh. In this case, the Java version installed is 8, so the line included is:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

The next step is to define a location for the NameNode. In the file /hadoop/etc/hadoop/core-site.xml, the name and port of the NameNode are defined, as represented in the figure below.

Figure 3.7: Configuration of core-site file.

The port used is 9000, and the location is the localhost, also known as node-master in this case.
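As the configuration figure is not reproduced here, a minimal core-site.xml along the lines of what is described above might look like the following sketch; fs.defaultFS is the standard Hadoop 2.x property, and the host and port are the ones stated in the text.

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node-master:9000</value>
  </property>
</configuration>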

The path for HDFS is the next step. The figure below shows the configuration made on the file

located at /hadoop/etc/hadoop/hdfs-site.xml.
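A hedged sketch of an hdfs-site.xml matching this description is shown below; the storage paths and the replication factor of 2 (one copy per Data Node) are assumptions, not values taken from the figure.

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/data/nameNode</value>   <!-- assumed path -->
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/data/dataNode</value>   <!-- assumed path -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>                            <!-- assumed replication factor -->
  </property>
</configuration>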

Figure 3.8: Configuration of hdfs-site file.

The Hadoop configuration is now concluded. After that, the YARN configuration was made. YARN keeps track of all jobs executed on the system, and this engine allocates memory and process resources to each task. The figure below presents the configuration created in the file yarn-site.xml, located at /hadoop/etc/hadoop, to give an address to YARN.

In the workers file, at /hadoop/etc/hadoop/workers, the names of the Data Nodes are required, so that the NameNode recognizes them and initializes their daemons when the Cluster is started.
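Assuming the worker hostnames node1 and node2 used elsewhere in this chapter, the workers file would simply list them, one per line:

node1
node2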

For YARN to allocate all the necessary resources, the memory allocation must be configured. For this architecture, a few calculations were made to find the configuration values. To choose the right values, the total RAM, the disks, and the CPU cores are considered. [32]

In the implemented solution, the number of disks is three, each machine has 8 GB of RAM, and each CPU has two cores. For 8 GB of RAM, it is recommended to allocate 2 GB to system memory and 1 GB to the HBase process, if used. But, since HBase is not used here, the only important value is the first one, the memory reserved for the system; the remaining RAM is available to run jobs.

number of containers = min(2 × CORES, 1.8 × DISKS, Total available RAM / MIN_CONTAINER_SIZE)    (3.1)

The documentation gives 512 MB as the minimum container size for a total RAM per node between 4 and 8 GB, which is this solution's case. So, using 8 GB of available RAM, 6 cores, and 3 disks, the minimum of the three terms is 5.4, which rounds to 5 containers. The final calculation determines the amount of RAM per container, presented in the next equation.


RAM per container = max(MIN_CONTAINER_SIZE, Total available RAM / number of containers)    (3.2)

The final result, using the parameters presented above, is 3.33 GB of RAM per container. These values (given and calculated) represent all the information needed to configure the memory allocation. The figure below presents the entries that need to be configured in the yarn-site.xml file, located in the /etc/hadoop/ folder.

Figure 3.9: Memory allocation configurations in the yarn-site.xml file.

The yarn-site.xml file is represented below, to give an idea of how this file is responsible for allocating resources to future Hadoop jobs.
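Since the figure content is not reproduced here, a minimal sketch of the memory-related yarn-site.xml entries might look like the following; the property names are standard YARN keys, but the values are only illustrative (8 GB per node minus the 2 GB system reserve, and the 512 MB minimum container size mentioned above), not necessarily the exact ones used.

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node-master</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>6144</value>   <!-- illustrative: 8 GB node minus 2 GB reserved -->
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>6144</value>   <!-- illustrative maximum per container -->
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>    <!-- MIN_CONTAINER_SIZE from the calculation above -->
  </property>
</configuration>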


Figure 3.10: yarn-site.xml file.

With YARN and Hadoop installed, a copy of all configuration files was sent to the worker nodes, using the scp protocol, as stated below:

for node in node1 node2; do scp /hadoop/etc/hadoop/* $node:/home/hadoop/hadoop/etc/hadoop/; done

Now, like any classic file system, HDFS needs to be formatted, so the command below was executed:

hdfs namenode -format

After that, Hadoop is ready to run and perform task scheduling. Starting the Hadoop script "start-dfs.sh", daemons named "NameNode" and "SecondaryNameNode" are created on the master node, the latter to mitigate a single point of failure. On the DataNodes, the daemons, named "DataNode", are also initialized. A web interface is also made available, stating all information about the developed Cluster. It indicates the number of active nodes, the files on HDFS, as well as all


specifications (space available, heap memory, etc.).

Using the jps command, it is possible to see which daemons are running on each machine.

Figure 3.11: Daemons in the master machine.

Figure 3.12: Daemons in the slave machines.

Figure 3.13: HDFS Local Web Site.


Figure 3.14: HDFS Local Web Site.

It is also now possible to run the YARN Resource Manager and Node Manager daemons, by running the script "start-yarn.sh". As stated before, a daemon called Resource Manager is created on the master node, responsible for monitoring and allocating resources. The Node Managers are the workers of the Resource Manager, launching and managing containers on their nodes and reporting the results back to the Resource Manager.

This framework can also be accessed via a web user interface, to monitor all created, running, and finished jobs and tasks.

Figure 3.15: YARN Job Manager Local Web Site.


3.4 Stock Market’s Web Scraping: Extracting stock indicators

After configuring Hadoop, the next phase was to implement a system to extract data from a set of sources, namely stock market data from the Google Finance, Yahoo Finance, MarketWatch, and The Wall Street Journal websites. These sources present quotes for every stock listed on NASDAQ and NYSE, as well as company indicators and historical values. After a brief analysis, the chosen solution was to create a Python script that autonomously scrapes a set of current indicators that are important for later study and data processing. The data is stored in a single list, and after that an Apache Spark script was also implemented, storing the data in Parquet format, ready to import into Hive tables and, later, into the analysis and presentation tool, Power BI.

3.4.1 Important stock market indicators to extract

To choose which indicators to extract, a brief study of the financial market was made. One of the most important indicators for modeling a prediction system is the Moving Average. This indicator is not listed on these sources. The moving average is calculated by adding a stock's prices over a certain period and dividing the sum by the total number of periods. In this work, this interval grows with every new input in the system, giving a stronger result in the end. This value is calculated after the extraction of the data and before the presentation of the results.
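As a simple illustration of this growing-window average (not a formula taken from the sources), with P_1, ..., P_n denoting the n prices recorded so far, the value computed at step n would be:

MA_n = (P_1 + P_2 + ... + P_n) / n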

Other indicators are also important for studying the stock market, like the PE Ratio, which relates a company's share price to its earnings per share, or the Market Capitalization, which gives the market value of a company. The dividend is also a good indicator to study, because it states the reward that a company gives to its shareholders, and it is a factor that can influence the behavior of a market.

Taking into account the information available on each source, a set of indicators was defined to structure a general extraction model. This decision was based on Google Finance's indicators.

This source presents the following values for each stock, and all of them have important reasons to be included in the extraction model: Price, Open, High, Low, Market Capitalization, Price-Earnings Ratio, Dividend Yield, Previous Close, 52 Week High, and 52 Week Close.

There are other important values to study, to support a stronger decision about a stock market's behavior, but those indicators are not listed on every source, and some of them are hidden or impossible to extract, due to lack of permission on the website's API or because the values do not exist on the website's free front end. It is important to state again the main factor of this project: a predictive stock market analysis on a low-cost system. So, the main objective is to try to create a solution able to support decision-making at the lowest cost possible.

Together with the values listed above, a timestamp in the format "MM/DD/YYYY HH:MM" and the name of the source are also stored in the system, to help in future analysis.

3.4.2 Python Script

As stated in the subsection above, a Python script was developed to extract this data from all four sources at the same time. In a later section, it is explained how these values are stored on HDFS.


For now, only the extraction model will be presented. At first, the Python version installed was 3.8, the latest one at the time. But in the next phase of the project, a problem occurred with this version in the Apache Spark framework, and the solution was to downgrade the working version to 3.7. To prevent auto-updates of the Python version or changes to the libraries used, a virtual environment tool was installed on the Name Node machine, where the extraction and load are executed: Anaconda. This tool offers an environment that is more resilient to failures and changes in the Python configuration. It can be downloaded from https://www.anaconda.com/products/individual#linux and installed with this command in the Linux terminal:

bash ~/Downloads/Anaconda3-2020.02-Linux-x86_64.sh

After installing, the command conda init starts a managed Python environment. It can be shut down, but in this system it is always kept active, for compatibility reasons across the whole system. This process can also be started with the command anaconda-navigator, which opens an interface to launch the virtual environment.

Figure 3.16: Anaconda framework’s interface.

It is possible to see in the figure above that Anaconda keeps track of all installed libraries. This system allows the current configuration to be saved to a file and easily restored if any external failure occurs.

There are two main libraries used to extract data from HTML websites: urllib and Beautiful Soup. The first one is used to open a connection to the website, while the second one is used to read and parse the page's content. The two libraries work together to perform the extraction task. To add them to the Python environment, pip3, a package installer for Python, was installed with the command sudo apt install python3-pip, and then the libraries were installed with the commands pip3 install beautifulsoup4 and pip3 install urllib3. These two commands install the latest version of each library.


With all dependencies installed, the logic of the algorithm was implemented. The figure below shows the extraction code for Google Finance's website.

Figure 3.17: Google Finance's code for extraction.

The figure above represents the logic created for extracting data from this source. A urllib object is created to open the connection to the website, and a BeautifulSoup object is created to parse all the website's content. After that, using the code inspector tool in Mozilla Firefox, the id of each value is found, to filter the data. In this case, the price is in a span HTML tag identified by "IsqQVc NprOob XcVN5d" and the other indicators are stored in td tags identified by "iyjjgb". A loop was also created because sometimes the object does not return data, so this loop waits until the website returns the values. The BeautifulSoup object returns an ordered list with every record placed in a tag that has the defined identifier.

All websites present the same URL format and the same abbreviation for each listed stock, which makes it easier to automate this kind of task for other markets.


Figure 3.18: Inspector tool to find id of data tags.

The final task is to store the values in a list, appending the timestamp and an abbreviation of the source. In this case, the record is tagged with "gf", to help in future analysis.
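A minimal sketch of this extraction logic is shown below, assuming the URL format of the Google Finance quote pages and treating the attribute values quoted above as CSS classes (they contain spaces, so they cannot be ids); the function name and the retry delay are illustrative, not the author's exact code.

# Sketch of the Google Finance extraction logic described above (assumptions noted in the text).
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
from datetime import datetime
import time

def scrape_google_finance(symbol="AAPL:NASDAQ"):
    url = "https://www.google.com/finance/quote/" + symbol    # assumed URL format
    values = []
    while not values:                                          # wait until the page returns values
        page = urlopen(Request(url, headers={"User-Agent": "Mozilla/5.0"}))
        soup = BeautifulSoup(page.read(), "html.parser")
        price = soup.find("span", {"class": "IsqQVc NprOob XcVN5d"})
        others = soup.find_all("td", {"class": "iyjjgb"})
        if price and others:
            values = [price.text] + [td.text for td in others]
        else:
            time.sleep(5)                                      # retry after a short pause
    values.append(datetime.now().strftime("%m/%d/%Y %H:%M"))   # timestamp in the stated format
    values.append("gf")                                        # source abbreviation
    return values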

The other three algorithms follow the same approach applied in this one, but differ in how the output record list is built, since each site stores the required indicators in different orders and in different tags and identifiers. A small transformation of the output list was therefore made, so that all the algorithms produce the same formatted output. The figure below presents the algorithm created for the extraction of the MarketWatch website's stock values; the logic is analogous for the other sources.


Figure 3.19: MarketWatch’s extraction algorithm.

This source is more complete than Google Finance, as are The Wall Street Journal and Yahoo Finance, retrieving additional indicators, so filtering was applied to keep only the required ones. With a small transformation of the BeautifulSoup list, a formatted output with the required fields can be created, ready to be stored in an HDFS file, on a single line of it.

The figure shows the output format of this list, that is, one record of a file that will be stored in HDFS and later analyzed.

Figure 3.20: Output list of stock data.

These algorithms are used as functions of the main algorithm, the one that stores this data in HDFS files automatically, appending each value to a file that is created to hold the information of each stock for each month. In the next sections, the main function will be presented, also showing the files that are created automatically for all the data, organized and formatted to load at the final stage.


3.5 Spark Framework: Advantages to other solutions and configuration

This section presents the configuration of the framework used to store and read the HDFS files containing stock values, Apache Spark, as well as the implemented algorithm to load this data to the final stage of the architecture: data processing and presentation in the Power BI tool.

Also, a brief comparison with another similar tool is made, to explain why this technology was chosen to perform this task.

3.5.1 Apache Spark vs Hadoop MapReduce for running applications

Some studies have been made around these two frameworks, comparing their performance in Big Data architectures.

The first advantage of choosing Apache Spark over MapReduce is that the extraction algorithm was written in Python, and Spark also presents a Python API, giving better compatibility for embedding it in the whole system. Using MapReduce with Python is more complicated and does not have the same efficiency and library support for data loading and processing.

Taking both functionalities into account, the second advantage of using Spark is speed. Spark can make data processing up to about 100 times faster than MapReduce, since it operates in RAM rather than on disk, as the MapReduce system does, creating a "near-real-time" working environment.

Another great advantage of using Spark is its built-in libraries, like "MLlib", a machine learning library. MapReduce needs a more complicated and less compatible way to use machine learning dependencies. Like MapReduce, Spark also performs parallel distributed operations, allowing multiple jobs to run at once, working on the same files if needed.

In chapter 2, a study comparing Hadoop MapReduce with Apache Spark was presented. Other similar studies have been made, and their results are also conclusive: the Spark framework shows better performance when processing large amounts of information, since the processing is done in memory instead of on disk. Random Access Memory, also known as RAM, is more powerful than conventional disks, allowing reads and writes about 50 to 200 times faster. This is an important factor for this project, since stock values are volatile and always changing, so quick extraction and processing of this data is essential to create a better decision-making solution. So, after this brief analysis of both frameworks, the configuration of Spark over Hadoop was made. The next subsection presents all the configurations required to embed this subsystem in the main architecture.

3.5.2 Apache Spark configuration over HDFS

This subsection describes the configuration of the Apache Spark framework on HDFS. The first step is to download the Spark binaries from the Apache Spark download page: https://spark.apache.org/downloads.html. The recommended Spark version for Hadoop 2.10 is 3.0.1, with the pre-built package for Apache Hadoop 2.7 or later.


Figure 3.21: Apache Spark versus Hadoop MapReduce.

Figure 3.22: Apache Spark download page.

Next, the downloaded Spark package is uncompressed using the tar command, and all the files are moved to a folder named "spark" in the home folder:

tar -xvf spark-3.0.1-bin-hadoop2.7.tgz
mv spark-3.0.1-bin-hadoop2.7 spark

The next step is to include the Spark binaries in the PATH environment variable, enabling the system to locate Spark's configurations and files. In the file /home/hadoop/.profile, the following line is added:

PATH=/home/hadoop/spark/bin:$PATH

The entire system is now linked to the Spark framework.

The next phase consists of integrating Spark with YARN, configuring Spark jobs as the default YARN applications. By default, YARN is configured to run MapReduce tasks, so this change is required. For that, the following lines were added to the file /home/hadoop/.profile:

export HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
export SPARK_HOME=/home/hadoop/spark
export LD_LIBRARY_PATH=/home/hadoop/hadoop/lib/native:$LD_LIBRARY_PATH

Next, the Spark default configuration template was renamed, so that it becomes the default configuration when Spark starts:

mv $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf

In the file $SPARK_HOME/conf/spark-defaults.conf, the following line was added to set YARN as the job manager of Spark: spark.master yarn.


The next step is to choose the run mode of Spark jobs on YARN. There are two modes. In Cluster mode, if a job is started from the master node and this machine goes offline, the job keeps running on the Cluster, and the Spark driver is encapsulated inside the YARN Application Master. In Client mode, if the submitter of a job goes offline the job fails, although the Spark executors still run on the Cluster, and only a small YARN Application Master is created. In this project, Spark jobs will run for a long time, creating a continuous flow in the Spark executor system, so Cluster mode is the more appropriate one and is the mode configured. The following configurations are made for this mode, and they present some differences compared to Client mode from the perspective of memory allocation (Cluster mode requires memory to be allocated on the cluster, while Client mode does not need this configuration).

In the spark-defaults.conf file, the next line is created to set the amount of memory for the Spark driver:

spark.driver.memory 2G

The default value is 1 GB, but that value is calculated for 4 GB machines, so 2 GB is the best value for 8 GB machines, like the master machine, where the Spark Application Master will be running. The next value to set is the Spark executors' memory. In the same file, the following line was introduced:

spark.executor.memory 1024m

The default value is 512m, but that value is calculated for 4 GB machines, so 1024 MB is a better value. The final configurations are made to create a History Server interface that logs all jobs executed on the system, presenting some statistics about their performance. In the same file, the following lines were included to create the History Server interface:

spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://node-master:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080


Figure 3.23: spark-defaults.conf file.

Figure 3.24: History Server Web Interface.

After these steps, Spark is ready to run applications on the Hadoop Cluster, performing tasks on Hadoop's file system domain, scheduled and managed by YARN. Now, the next step is to create an algorithm capable of using these resources to load, process, and send data to the final stage, where the analysis and presentation of the data stored in HDFS is made. The next section presents the logic created, along with all dependencies and the additional configurations made in Spark to access HDFS files via PySpark, the API used to bridge these two technologies.

3.6 Apache Spark Script to store extracted data in HDFS

Firstly, before connecting these two frameworks with the implemented algorithm, some additional configurations need to be made, more precisely in the /home/spark/conf/spark-env.sh file, where the next lines are added to set the variables for Spark's Python:

export PYSPARK_PYTHON=/home/hadoop/anaconda3/bin/python3.7
export PYSPARK_DRIVER_PYTHON=/home/hadoop/anaconda3/bin/python3.7
export SPARK_CLASSPATH=/home/hadoop/apache-hive-2.3.7-bin/lib/mysql-connector-java-8.0.22.jar

The last line was created for a later step, which sets up the connection from Spark to the Hive tables via the MySQL connector, allowing this system to connect to the Power BI tool. This step will be discussed in more detail in the next section.

To put all the records on HDFS, an Apache Spark script was created. This script includes all the extraction algorithms created before, working together to store these values in HDFS files, organized by month and by stock. Files are created in CSV format and compressed into Parquet files. Both formats are allowed in this architecture, but Parquet is especially suited to the Hadoop environment and presents some advantages over the CSV format. This method increases the performance of the entire system, because Parquet files were created specifically for this type of architecture, capable of compressing high-size files to a much lower size. A CSV with 100 million records can take, for example, 1 TB, and the Parquet compression can decrease its size by approximately 10 times, to around 100 GB. This is an easy transformation, allowing the system to hold more data, and Spark processes such a file a lot faster.

In the next chapter, some tests made on records saved in each format will be presented, allowing a reflection on whether Parquet improves the performance of the system.

This script extracts and stores stock market data from all four sites, in a ten-minute loop, using auxiliary software to run this job automatically: Crontab.

This software is used to run the script periodically, making requests to the four websites and storing the information in HDFS files. The command to install this technology is sudo apt-get install cron. After that, crontab -e is executed on the Linux shell, creating a new line in the Crontab configuration file:

*/10 9-17 * * 1-5 cd /home/hadoop/Desktop && ./script.sh

This entry runs the script every 10 minutes, Monday to Friday, from 9 am to 5 pm. This time interval is when the stock markets are open, so the script only has to run in this period. There is no need to run it at all times, because outside this period the values do not change.

The figure below represents the bash file created, which invokes the implemented Spark Python script.


Figure 3.25: Bash Script to Extract and Load Data using Crontab.

It is possible to see that Apple, McDonald's, and Nike stock values are requested every 10 minutes and stored in the HDFS directory.
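A hypothetical sketch of such a wrapper script is given below (Figure 3.25 is not reproduced here); the Python script name and the use of a ticker argument are assumptions for illustration only.

#!/bin/bash
# Hypothetical wrapper invoked by Crontab: submits one Spark job per stock symbol.
cd /home/hadoop/Desktop
for symbol in AAPL MCD NKE; do
    /home/hadoop/spark/bin/spark-submit --master yarn --deploy-mode cluster \
        extract_and_load.py "$symbol"
done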

Now, the actual Python script will be commented on, to explain every step taken.

Figure 3.26: Python Imports.

The first part of the implemented code is the import section, where all libraries are loaded, including the websites' extraction functions and the libraries that ensure the connection of Python with Spark and Hadoop, namely SparkSession, and subprocess, used to run shell commands from Python.

Next, the algorithm runs each extraction function presented before in the subsection Python Script, sequentially, in order to extract a list of values representing the chosen stock indicators of each website at that point in time, defined by the "date" variable.


Figure 3.27: First part: Extraction of data.

The "values" variable will return a row with every value extracted. Next step is to store it on

HDFS file, appending it to other rows that are already on HDFS directory.

For storing extracted data, it was created a directory named "scraping" on HDFS file system.

Executing the command hdfs dfs -ls scraping it is possible to see HDFS files relatively to extracted

data. This files are created in the beginning of a month, according to the algorithm sector that will

be presented below.

Figure 3.28: HDFS folder with extracted files

After the extraction is complete, the next step is to append the new record to the file that holds the records for that month and that stock. Appending information to Hadoop files with Python is a difficult task, so some intermediate steps are taken to create a union between all the values of the HDFS file and the newest records. To perform this task, two DataFrames are created. A DataFrame is an object that represents a table with rows and columns; in this case, they represent the HDFS file and each new record that arrives in the process. The script creates a third DataFrame with the union of the new and old data, storing it in a new HDFS file and thereby updating, every time, the file that contains all the extracted data. The next figures present the final stage of the code, where new data is appended to the HDFS file already placed on Hadoop.
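A minimal sketch of this union-and-rewrite step is shown below; the file paths, the helper function scrape_google_finance (sketched in the previous section), and the use of a temporary location are illustrative assumptions, and the author's actual code is the one shown in Figures 3.29 and 3.30.

# Sketch of the append logic described above (paths and helper names are illustrative).
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StockScraper").getOrCreate()

new_row = scrape_google_finance("AAPL:NASDAQ")        # one freshly scraped record
new_df = spark.createDataFrame([new_row])             # DataFrame with the new record

path = "hdfs://node-master:9000/user/hadoop/scraping/apple_january.parquet"
try:
    old_df = spark.read.parquet(path)                 # records already stored this month
    merged = old_df.union(new_df)                     # union of old and new data
except Exception:                                     # first record of the month
    merged = new_df

tmp_path = path + "_tmp"
merged.write.mode("overwrite").parquet(tmp_path)      # write the updated file to a new location
subprocess.call(["hdfs", "dfs", "-rm", "-r", "-f", path])
subprocess.call(["hdfs", "dfs", "-mv", tmp_path, path])   # replace the monthly file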


Figure 3.29: Final stage of loading data to HDFS part 1.

Figure 3.30: Final stage of loading data to HDFS part 2.

With this implemented algorithm, it is now possible to store this data and load it to the last phase of the architecture. The file will be imported into a Hive table, which is responsible for making the connection to the Power BI platform. The next section will explain all the steps made to load HDFS data into the processing stage.

Figure 3.31: Apple January extracted stock market values' HDFS file (portion of the file).

3.7 Connection to Power BI with Apache Spark framework: Apache HIVE and Spark Thriftserver Configuration

This section presents all the steps made to allow communication between HDFS and the processing and presentation tool, Power BI. All configurations made in the system will be presented, as well as explanations about all the choices and steps implemented. Apache Hive is a tool that provides a connector from HDFS to the exterior via the SQL language. It uses a JDBC (Java Database Connectivity) Driver. Like Hadoop, Hive was created within the same organization, providing the best possible compatibility with any framework that belongs to Apache. In this case, Hadoop with Hive fits perfectly to create a connection to other tools that want to consume its data.

For that, Hive provides warehousing on top of Hadoop, allowing the data to be queried through its database tables. These tables are composed of all the stock market data that HDFS stores. By default, Hive tables are created using a Derby metastore. The metastore is the repository that stores metadata (information about a database) for Hive tables, including their schema and location, creating an environment shared with other frameworks. It can also be configured with a MySQL metastore, which is universal and more compatible with any Business Intelligence environment


than the Derby metastore. Besides that, Derby databases only accept one connection at a time, limiting access to Hive's data and compromising real-time purposes.

The first time this tool was configured, the default metastore schema (Derby) was used, but only Tableau would accept it, and any kind of communication with the Power BI tool was denied. So, after a reconfiguration of the system, embedding MySQL in the system created an instant connection. As stated before, HDFS communicates with the outside via Spark's drivers, opening an internal server that makes the Hive database available online. This software is called ThriftServer. The configurations below are made exclusively with the MySQL metastore. The Derby database was discarded after no success in connecting, so it will not be explained.

3.7.0.1 Apache Hive configuration

The first step is to download the latest release from the Hive website (https://hive.apache.org/downloads.html). In this case, release 2.3.7 was downloaded, which also fits the installed Hadoop version (Hadoop 2.10).

Figure 3.32: Hive downloaded compressed file (version 2.3.7).

After decompressing it, using the terminal command tar xzf apache-hive-2.3.7-bin.tar.gz, the next step is to configure the environment variables, so that the operating system locates the repository. The next lines were added to the .bashrc shell script, in order to create the environment variables for Hive in Linux:

export HIVE_HOME=/home/hadoop/apache-hive-2.3.7-bin
export PATH=$PATH:$HIVE_HOME/bin

The Hadoop environment variables are located in the same file as well, as stated before in the Hadoop configuration section. After editing the environment script, executing source ~/.bashrc saves the changes. Now, the machine knows about the Hive repository and is able to link it to other processes.

This configuration does not yet create a relationship between Hive and HDFS. For that, some changes are made in the Hive configuration file, hive-config.sh, located in the /home/hadoop/apache-hive-2.3.7-bin/bin folder.


The following line was added to this file so that Hive locates the Hadoop/HDFS directory:

export HADOOP_HOME=/home/hadoop/hadoop

Figure 3.33: Hive's hive-config.sh file.

There is another variable that can be declared, namely the jars path. This variable exists by default in the system and represents a path for plugin jars provided by the user. In this case, no additional jars are used, so this variable is not linked to anything in the system.

The next step is to create the Hive directories in the HDFS domain, more concretely two: a temporary folder to store intermediate Hive results, used for sending data if necessary, and a warehouse folder to store all Hive tables. So, the next commands are executed in the shell, performing the steps above:

hdfs dfs -mkdir /user/hive (creates a hive user folder on the HDFS system)
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /tmp (grants write permissions on the temporary folder)
hdfs dfs -mkdir -p /user/hive/warehouse (creates the warehouse folder under the hive user)
hdfs dfs -chmod g+w /user/hive/warehouse (grants write permissions on this folder to Hive)

These directories will be used in the ThriftServer configuration XML file in order to communicate with Power BI, allowing remote access to these databases and their tables. The tables contain all the previously extracted data, which will be imported using Hive commands, explained below. The default Hive configuration already uses the directory names stated above, so creating the folders with the same names requires fewer modifications to the configuration file.

Hive is now ready to use, and the next steps are to configure MySQL and the Thriftserver together, in order to create a metastore server and a schema for Hive's tables, allowing them to be written, read, and connected to Power BI via the server.


3.7.0.2 MySQL and Thriftserver Configuration: Create Metastore Database

Every database has a schema, and before creating Hive's tables, its metastore schema has to be declared and started so that any allowed client can access its tables. In order to define a schema for the metastore database, the following command was executed:

$HIVE_HOME/bin/schematool -initSchema -dbType mysql

Figure 3.34: MySQL schema metastore creation.

Hive offers a server to run the metastore, which can be kept running at any time using the command hive --service metastore.

Hive is now ready for table creation on HDFS, and the next step is to configure Spark's ThriftServer to connect to the MySQL metastore schema just created and, through it, to Power BI or other similar tools, via the JDBC driver. Once again, the default Derby metastore schema does not fit this architecture as well as MySQL does, so additional configurations were made on the system to link MySQL and Hive, explained below.

The first step is to install the MySQL database. It can be installed from the terminal, using the command sudo apt-get install mysql-server. The latest version, and the one used, is 8.0. Next, the Java connector is installed, in order to connect it to Hive, using the command sudo apt-get install libmysql-java.

So that the connector is linked to Hive, a soft link is created between Hive and MySQL, with the command:

ln -s /usr/share/java/mysql-connector-java.jar $HIVE_HOME/lib/mysql-connector-java.jar

Now, Hive has a linked connector to the MySQL schema type. Next, it is important to configure the MySQL permissions, in order to allow only internal access to the Hive tables. For that, a user is created in the MySQL environment, and their credentials are used to access the Hive tables via Power BI.

In the MySQL terminal, the following lines were written:

mysql> CREATE USER 'afonso'@'%' IDENTIFIED BY 'afonso';
mysql> GRANT ALL ON *.* TO 'afonso'@'%';
mysql> FLUSH PRIVILEGES;

The first line creates a user called afonso, with afonso as the password. The second line ensures that afonso can access all tables in the system, and the last line saves all the granted permissions in the MySQL database. The '%' host wildcard means that every machine on the private network can connect to these Hive tables, if the user and password match the ones defined above.


The figure below represents the grants created for this user, and it is possible to see that this user can access and modify any table in the Hive domain.

Figure 3.35: Permissions for the new Hive and MySQL user.

The next step is to modify some lines in the hive-site.xml file, located in Hive's conf folder. This file is also replicated to the conf folder in the Spark directory, so that all repositories have a copy of the configuration made.

The first modification concerns the credentials to access the metastore server. These are only used if the user wants to access it directly; Power BI will access it with the MySQL credentials, so these changes exist only for data protection and will not be used frequently.

Figure 3.36: Metastore server username.

Figure 3.37: Metastore server password.


Figure 3.38: Connection URL.

This property defines the connection URL. It acts as the JDBC connection string and also indicates the metastore location.

Figure 3.39: Driver Name.

This is the name of the JDBC driver, which is a class in the Java MySQL connector, linking it to Hive and to the Spark ThriftServer.
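Putting Figures 3.36 to 3.39 together, one common way to express the username, password, connection URL, and driver name described above is via the javax.jdo.option properties of hive-site.xml, as in the hedged sketch below; the MySQL host, the database name metastore (the one queried later with use metastore), and the reuse of the afonso credentials are assumptions, not values taken from the figures.

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>afonso</value>                        <!-- assumed: MySQL user created above -->
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>afonso</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>      <!-- driver class of MySQL Connector/J 8 -->
</property>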

After saving the configurations, it is time to create the Hive tables, to test whether MySQL can connect to them via the metastore server and expose them to the exterior. The following figure presents the table that will contain all the extracted data from December for the Apple stock. This table will be imported into Power BI, via the ThriftServer and using the MySQL user's credentials, and it is loaded into Hive via Hive's terminal. This is only one of the multiple Hive tables created: for each stock there is one table per month, containing all the extracted records.
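As an illustration of such a table definition (Figure 3.40 is not reproduced here), a hedged HiveQL sketch is shown below; the table name, column names, and HDFS location are hypothetical, and the columns are kept as strings since the text later states that the values arrive in text format.

CREATE EXTERNAL TABLE apple_december (
  price STRING, open_price STRING, high STRING, low STRING,
  market_cap STRING, pe_ratio STRING, dividend_yield STRING, previous_close STRING,
  week52_high STRING, week52_close STRING, record_ts STRING, data_source STRING)
STORED AS PARQUET
LOCATION '/user/hadoop/scraping/apple_december';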


Figure 3.40: Created Hive table for Apple’s stock values in December 2020.

If everything went well, MySQL can access this table. So, to test it, MySQL is opened again and the command use metastore is executed, switching to the metastore database, and then show tables; lists all of Hive's metastore tables. As the figure below shows, MySQL can access Hive's content, so the process of linking these two technologies was a success.

Figure 3.41: Hive metastore tables listed in MySQL.

The metadata corresponding to these tables is stored in the TBLS table of the MySQL database. So, executing select * from TBLS; will list Hive's tables, if everything was done correctly.

Figure 3.42: Hive's table in MySQL domain.

Having verified that MySQL already has connectivity with Hive, the next and final stage for connecting Power BI to HDFS is to configure the ThriftServer. This configuration takes place in the same configuration file presented above, hive-site.xml, and the next lines were added to create an HTTP server that allows SQL queries between HDFS and Power BI via JDBC:

Figure 3.43: ThriftServer configuration.

By default, the ThriftServer address will be the private address of the machine, in this case 192.168.1.50, with port 10001 as configured.
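A hedged sketch of the ThriftServer-related entries in hive-site.xml is shown below; the HTTP transport and port 10001 follow the text above, while the exact set of properties used by the author is not reproduced here.

<property>
  <name>hive.server2.transport.mode</name>
  <value>http</value>                      <!-- HTTP communication, as stated above -->
</property>
<property>
  <name>hive.server2.thrift.http.port</name>
  <value>10001</value>                     <!-- port mentioned in the text -->
</property>
<property>
  <name>hive.server2.thrift.http.path</name>
  <value>cliservice</value>                <!-- default HTTP endpoint path -->
</property>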

After saving these changes, the last step is to run the Thriftserver script, located in the spark/sbin folder:

./start-thriftserver.sh

This creates a log file in the log folder, where every failure is reported. The Thriftserver has to be executed in parallel with the command hive --service metastore, in order to locate Hive's metadata.


3.7.0.3 Power BI Download and Connection

Finally, the last thing to do in order to see Hive's tables in Power BI is to download the platform and connect it to the server.

The download page for this tool is https://powerbi.microsoft.com/en-us/downloads/. For now, the Power BI tool only exists for the Windows operating system, so an external computer on the same network is used to perform this last stage. Opening Power BI, in the "Data" toolbar, it is possible to see all the connectors available in the system.

Figure 3.44: Connectors available at Power BI.


The "Spark" connector is the one to be used in order to connect to HDFS, and next credentials

of the server are inputted.

Figure 3.45: ThriftServer Connection.

Figure 3.46: Credentials to connect to HDFS.

DirectQuery is chosen so that only the queries needed to reflect recent changes are issued when new records are appended. If Import mode were selected, a copy of the entire file would be imported into Power BI, and the objective is to be as quick as possible. After that, a window is prompted with the Hive tables, ready to take in the data and process it.


Figure 3.47: Hive's table on Power BI: preview.

Now, the last phase of the project begins, where there is a working connection to the information on HDFS via Power BI, and the processing and presentation of the data is carried out, creating calculations and plots with the extracted data and studying the stocks' behavior through their values.

In the next section, all calculations and modifications made to the extracted data are presented, creating a wide view of the evolution of the extracted information, in this case values about stock markets.

3.7.1 Power BI: Data Load and Processing in real time

This section represents the final stage of this work, where all the extracted and stored data is taken, modified (since these values are in text format) to allow mathematical calculations, and presented in graphs, according to the timestamp of the records. This tool is very useful for this kind of methodology, since it is a Business Intelligence framework that allows the creation of visual interfaces over external datasets, from all kinds of data sources, like Spark, MySQL, local CSV files, among others.


There are two ways of connecting Hive tables to Power BI, as explained before: Import mode and DirectQuery. The latter is simply a connection that only queries the data and does not import all the tables into Power BI. So, it requires an active connection to the Spark ThriftServer to present the data.

The main objective at this stage is to create dashboards with all the extracted data, containing plot visuals and tables with historical values, for each month and each stock. Each dashboard presents four plots, one for each stock quote source, and a couple of tables with a preview of the table's content, according to the timestamp of each record.

The first step, after a successful connection to the tables from the Hadoop Cluster, is to modify the data types, also known as casting the columns, since the table's columns are expressed as text; this allows calculations to be performed over these values. The columns with price values are cast as currency columns and percentage columns as percentage values; finally, the timestamp of the record is already pre-formatted at file creation time, in the Spark script (MM/DD/YYYY hh:mm), as a valid date format that Power BI is able to cast. The Power BI tool presents a "Fields" toolbar where it is possible to create Measures over the queried tables.

Figure 3.48: Power BI Fields toolbar.
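For reference, the timestamp pre-formatting on the Spark side could look like the minimal PySpark sketch below; the HDFS path, the column name record_time, and the input pattern are illustrative assumptions, not the exact script of the implementation.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, date_format

spark = SparkSession.builder.appName("format-quote-timestamps").getOrCreate()

# Hypothetical extracted quotes file; header and column names are assumptions.
quotes = spark.read.option("header", True).csv("hdfs:///user/hadoop/stocks/apple_quotes.csv")

# Parse the raw text timestamp and rewrite it as MM/dd/yyyy HH:mm,
# the pattern Power BI is expected to recognize as a valid date.
quotes = quotes.withColumn(
    "record_time",
    date_format(to_timestamp(col("record_time"), "yyyy-MM-dd HH:mm:ss"), "MM/dd/yyyy HH:mm"),
)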

For example, to cast all price values for Apple extracted from Google Finance, a DAX (Data Analysis Expressions) formula is created. This language is used in Power BI to perform specific queries on the data, and it is the only method available to cast values and perform calculations over them. It is possible to cast, create moving averages, and count distinct values, among others. In this project's case, the most important calculation is the moving average, which shows the evolution of the stock values.


This calculation can be executed with the help of a function available in DAX, namely AVERAGEX. The function can be executed over the entire column, or it is also possible to filter selected information. In this case, the selection filter is important, because each calculation is made per source. Each table contains records of one stock from all sources, so it is important to compute different values for each source. The FILTER parameter performs this selection.

Figure 3.49: Moving average of Google Finance’s Close Price indicator, on Apple, in December

Every row in the extracted tables presents a text value, appended to the stock's indicators, with an abbreviation of the record's source, to distinguish all the information according to its origin. To create average calculations for every indicator and source, small modifications are made to this formula, changing the selected column and the source name.

Figure 3.50: Hive’s table, with the source column.
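As an illustration of the same per-source logic outside Power BI, a trailing moving average filtered by source could also be pre-computed with a Spark window function. This is only a hedged sketch, reusing the quotes DataFrame from the earlier sketch and assuming the column names source, record_time, and close_price; it is not the DAX measure actually used in the dashboards.

from pyspark.sql.functions import avg, col
from pyspark.sql.window import Window

# 10-row trailing moving average of the closing price, computed separately for each quote source.
window_per_source = (
    Window.partitionBy("source")
          .orderBy("record_time")
          .rowsBetween(-9, 0)
)

quotes = quotes.withColumn("close_ma", avg(col("close_price")).over(window_per_source))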

After performing the required calculations, the final step is to create plots and tables on the dashboard to present the data. For that, there is a "Visualizations" toolbar, where it is possible to create different types of charts, tables, and other presentation formats. For this type of project, since all information relates stock values to time, line charts and a simple table are used to present all data.


Figure 3.51: Power BI Implemented Line Charts and Tables

The X-axis uses the time column and the Y-axis shows all calculated values for each source, so a temporal analysis can be done. Figure 3.51 represents the final state of the data in Power BI, where it is possible to see every record of each Hive table successfully represented in the form of charts or tables, giving a better overview of the results. In the next chapter, all created tables and plots will be presented and discussed, to comment on the results of this final step.

To conclude, this phase represents the last stage of the implemented architecture, where it is possible to show all data extracted and loaded into the Hadoop Cluster.

Initially, the plan was to perform further calculations on this data, more precisely using Machine Learning algorithms and Artificial Intelligence models, but due to lack of time this idea is left for possible future implementation. Since the extracted values, together with calculations such as the moving average, give a good overview of the behavior of each market value, implementing these Power BI calculations and data previews is enough to draw some conclusions about their behavior.


Chapter 4

Result of Implementation and Tests

4.1 HDFS architecture availability

This section presents some results to show that the Hadoop Cluster is available to perform every job, as well as availability checks on the Spark framework and the Hive server.

Every framework provides a web user interface where it is possible to check whether the service is available. The following figures demonstrate successful communication and availability among all services, confirming the availability of the entire system. To test Hive's connection to the outside, Hive provides a script for testing the connection, namely beeline; it is possible to check whether the Hive JDBC endpoint is running by performing a test command with it.

Figure 4.1: HDFS availability check
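Besides beeline, the same Thrift/JDBC endpoint can also be checked programmatically. The sketch below uses the PyHive package, which is an assumption of this example (it is not part of the implemented architecture), with a hypothetical hostname and user.

from pyhive import hive  # assumed extra dependency, not used in the thesis itself

# Hypothetical master-node hostname, default Thrift port and user.
connection = hive.Connection(host="hadoop-master", port=10000, username="hadoop")
cursor = connection.cursor()
cursor.execute("SHOW TABLES")
print(cursor.fetchall())  # an error-free answer means the server is reachable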

Figures 4.1 to 4.4 illustrate the correct operation of all the architecture's frameworks, which together create an environment for data analysis in Power BI or even within the Spark framework. There are some libraries in Spark that allow training on the data to learn patterns, such as MLlib, but due to lack of time this implementation is left for future work.


Figure 4.2: YARN scheduling and monitoring test.

Figure 4.3: Spark History Server test.

Figure 4.4: Hive server test with beeline.


4.2 HDFS extraction mechanism

The figures below present the result of the data extraction, where it is possible to see where the files are stored on the Hadoop Cluster, as well as a preview of the data in row-column format, ready to be imported into Hive and, consequently, into the Power BI dashboards.

Figure 4.5: Hadoop Folder of Extracted data

Figure 4.6: Output of data in HDFS files.

This mechanism appends new records to the older ones and stores them in the current HDFS file, using an auxiliary file to union the two data sets into one (writing the old data to a new file, creating an object with its contents, and finally writing a new file with the old plus the new records). This method is necessary because the Spark framework does not have a built-in function to read from and write to the same file.
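A minimal PySpark sketch of this append-through-an-auxiliary-file approach is shown below. The paths, the assumption that new_quotes is the DataFrame of newly extracted records, and the use of the Hadoop FileSystem API for the final rename are illustrative choices, not the exact script of the implementation.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("append-quotes").getOrCreate()

data_path = "hdfs:///user/hadoop/stocks/apple_quotes"      # hypothetical current data location
tmp_path = "hdfs:///user/hadoop/stocks/apple_quotes_tmp"   # auxiliary location

# 1) Read the old records and union them with the newly extracted DataFrame `new_quotes`.
old_quotes = spark.read.option("header", True).csv(data_path)
combined = old_quotes.unionByName(new_quotes)

# 2) Write the combined data to the auxiliary path
#    (Spark cannot overwrite a path it is currently reading from).
combined.write.mode("overwrite").option("header", True).csv(tmp_path)

# 3) Replace the old data with the auxiliary output, using the Hadoop FileSystem API.
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
fs.delete(jvm.org.apache.hadoop.fs.Path(data_path), True)
fs.rename(jvm.org.apache.hadoop.fs.Path(tmp_path), jvm.org.apache.hadoop.fs.Path(data_path))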


4.3 HDFS Performance Results - Data Extraction and Load: Spark Jobs with CSV files vs Parquet

To determine the best format for storing data on Hadoop, performance tests were made using the Spark framework, to see which format, holding the same information, is faster to process and load into a Hive table. For that, a sample CSV file was created with real records of McDonald's stock prices, extracted from Yahoo Finance.

Then the file's rows were replicated to create three files of different sizes: one with 1 million rows, another with 10 million, and lastly one with 100 million rows. The objective is to see whether data sets of different sizes in these two formats have different performance in the Hadoop environment. The following figures show the total size of the tested files with 1, 10, and 100 million rows (100 distinct rows were used and replicated until the number of rows equaled the required amount).

Figure 4.7: CSV file sizes: 1, 10, and 100 million rows.

It is possible to see the size of each file, where:

• the 1 million row CSV file equals 57,956,253 bytes, or approximately 57.96 megabytes;

• the 10 million row CSV file equals 522,000,043 bytes, or approximately 522 megabytes;

• the 100 million row CSV file equals 5,220,000,043 bytes, or approximately 5.22 gigabytes.
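How such replicated test files might be produced with Spark is sketched below; the sample path, the target size, and the output locations are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-vs-parquet-testdata").getOrCreate()

# 100-row sample of McDonald's quotes (hypothetical path).
sample = spark.read.option("header", True).csv("hdfs:///user/hadoop/tests/mcd_sample.csv")

copies = 1_000_000 // sample.count()          # e.g. 10,000 copies for the 1-million-row file
replicated = spark.range(copies).crossJoin(sample).drop("id")

replicated.write.mode("overwrite").option("header", True).csv("hdfs:///user/hadoop/tests/mcd_1m_csv")
replicated.write.mode("overwrite").parquet("hdfs:///user/hadoop/tests/mcd_1m_parquet")

The same pattern, with a larger value of copies, would produce the 10 million and 100 million row versions.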

The Spark framework writes the Parquet output as several partitioned part-files. This increases the parallelism of the Hadoop architecture, spreading chunks of data across the nodes and, combined with HDFS replication, preventing single points of failure. As a result, the content of each file is located inside a folder, as the figures below show.

Figure 4.8: 1 Million row Parquet file


Figure 4.9: 10 Million row Parquet file

Figure 4.10: 100 Million row Parquet file

Summing the size of each chunk, the result is:

• the 1 million row Parquet file equals 16,152 bytes, or approximately 16.15 kilobytes;

• the 10 million row Parquet file equals 64,700 bytes, or approximately 64.70 kilobytes;

• the 100 million row Parquet file equals 595,800 bytes, or approximately 595.8 kilobytes.

For this example, the size of the Parquet files is approximately 0.02% of the CSV format size. This is an excellent compression value, because no data is lost, and Parquet files carry metadata (describing how the table is split into smaller pieces of data) and a default schema, so the values can be read in the same format as CSV: as rows and columns.

It should be noted that the sample contains only 100 distinct rows, which compresses extremely well; when compressing a large amount of data with fully distinct rows (no repeated rows), the result is less dramatic, with Parquet taking up to around 10% of the CSV size, according to the studies on Parquet and CSV compression discussed in a previous chapter. In this project's case, it was not feasible to perform such tests, because the extracted data does not reach the quantities needed to test this at an extreme level.


Still, it can be concluded that Parquet files offer better performance while remaining as easily readable and writable as CSV.

A couple of tests were also made using Spark's History Server, where it is possible to check the job's performance for each format. For that, a small script was created to read each CSV and Parquet file, to see whether there are relevant differences between the formats in terms of processing time.
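The small comparison script could look like the following sketch; the paths refer to the hypothetical test files from the previous step, and a count() is used to force a full scan of each file.

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-vs-parquet-read").getOrCreate()

def timed_count(read_fn, path):
    start = time.time()
    rows = read_fn(path).count()          # forces Spark to read the whole file
    return rows, time.time() - start

csv_rows, csv_secs = timed_count(lambda p: spark.read.option("header", True).csv(p),
                                 "hdfs:///user/hadoop/tests/mcd_100m_csv")
parquet_rows, parquet_secs = timed_count(spark.read.parquet,
                                         "hdfs:///user/hadoop/tests/mcd_100m_parquet")

print(f"CSV:     {csv_rows} rows in {csv_secs:.1f} s")
print(f"Parquet: {parquet_rows} rows in {parquet_secs:.1f} s")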

The following figures show Spark's History Server logs, in its web interface, where it is possible to compare the performance of each format.

Figure 4.11: Time performance test in Spark: CSV files.

Figure 4.12: Time performance test in Spark: Parquet files.

It is possible to see that Parquet files are read faster than CSV, taking about 50% less time. This shows once more that the Parquet format is a better fit than CSV for this project, where time is an important factor, since real-time analysis is a must in any stock market analysis architecture. So, with these tests, the conclusion is to use the Parquet format for every file, minimizing the space and time used so that more data can be processed in a smaller time interval.

4.4 Power BI Results

The figure below presents the dashboards created in the Power BI tool to visualize all the extracted data. With this, it is possible to get a better overview of the data and also derive patterns and calculations from it, helping to understand its evolution.

This tool is essential to support the end user's decisions, giving a good visual summary of all extracted data and text through charts and other types of visualization. Besides that, as stated before, it is possible to perform calculations on the data and build some basic predictive indicators, such as moving averages and distinct counts, among others, which is a real advantage in this project.


Figure 4.13: Power BI Final Dashboard.


Chapter 5

Conclusions and Future Work

5.1 Conclusion

The development of this dissertation enabled me to study a technology that can bring great support to any enterprise's business decisions. Hadoop is one of the cheapest ways to create a Big Data solution, and this architecture is compatible with almost every data analysis tool.

In contrast with other similar architectures, such as traditional warehouses or cloud-based clusters, HDFS delivers high-performance results at low cost. The machines used to create the distributed system have regular specifications and do not require high-performance hardware. The frameworks used to support all communication and data management are free to use and have extensive online support, which helps to resolve any problem.

The main objective of this work, besides creating a visual decision-making business product, was to study different paths to create an HDFS solution, understanding which methods are more efficient and viable, creating a valid product for the market, and satisfying internal enterprise interests.

Comparisons were also made between different types of HDFS output files, namely Parquet and CSV, which are the most widely used. In this way, it was possible to understand that saving the data in Parquet format is faster and more efficient, since the space required for each file is reduced significantly and its processing time is approximately half the original value. With this, it is possible to store more information in the system with the shortest possible reading/writing time.

The stock exchange market analysis, the topic studied here, fits perfectly into the concept of a cluster. This is the ideal architecture for it, since such a study requires a large amount of data to build tables and evolution indicators of the stock market. Besides, it requires a fast and accurate solution, so that its results are conclusive and in real time, or near real time, since the stock market is quite volatile, that is, its values change with high frequency.

This is just one example used to test the validity of building a data cluster, but it is the most appropriate one, as there is an interest in further exploring this subject in a business environment. During the project, the Spark tool was also compared to the original MapReduce engine promoted by Hadoop.


This comparison was only theoretical, based on external studies, in which Spark had better results in all cases, since its processing takes place in memory instead of on disk. Also, with this tool it is easier to connect the various steps, namely the data extraction script with the access and loading logic, since the same programming language is used and the tool offers good compatibility. Furthermore, with Spark it is easier to create and use predictive models, using libraries embedded in the framework, whereas in MapReduce jobs this is not as easy.

So, the combination of all the processes carried out in this work proves to be the better option to create a Big Data/Business Intelligence solution when time, low cost, and the use of commodity hardware are key indicators for the product. Although all frameworks and components are highly compatible with one another, their configuration was a long process, with ups and downs, where some intermediate steps presented difficulties and problems, compromising a couple of objectives that, if implemented, would have created a better final solution, such as a richer data analysis system. Due to these barriers, data training and complex analysis are postponed to future work.

In conclusion, the project was implemented successfully, with some gaps regarding the lack of complex data analysis, but the main idea was demonstrated, that is, the viability of using a low-cost solution to perform Big Data analysis with good performance. This technology will be useful for business use, where in the future it will be possible to use this solution to analyze and present the various data received from the different sources that work with the company in question, since it handles a large amount of information. In the future, if necessary, this architecture could also be replicated and offered to potential customers who need a relatively inexpensive and viable solution to store and process any type of information.

5.2 Future Work

Some points can be highlighted as future work to improve the current solution.

The first point would be to carry out more performance tests between different distribution technologies, namely between Hadoop and traditional warehouses; that is, implementing a traditional warehouse in a separate environment and running performance tests with the same amount of information. So far, only theoretical case studies, based on external work, were made.

It would also be interesting to run performance tests between different data processing frameworks similar to the one used, namely between Spark and MapReduce jobs. Again, this analysis was done only in theory, and it would be interesting to have tangible results for this comparison.

More important than the points mentioned above, the main improvement over the current implementation would be to apply Machine Learning to the stored data, using libraries available in Spark's framework, such as MLlib, where the data would be trained and processed beforehand and sent to the Power BI tool with additional columns containing more conclusive values about possible market forecasts.
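To give an idea of what this future work could look like, the sketch below trains a very simple regression model with Spark's MLlib (pyspark.ml). The column names and the choice of a plain linear regression are assumptions for illustration only, not part of the implemented solution.

from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# `quotes` is assumed to be a DataFrame of historical records with numeric columns.
assembler = VectorAssembler(inputCols=["open_price", "high_price", "low_price"],
                            outputCol="features")
train_data = assembler.transform(quotes).select("features",
                                                col("close_price").alias("label"))

model = LinearRegression(maxIter=50).fit(train_data)

# Predicted closing prices could then be written back to Hive as an extra column
# and picked up by the Power BI dashboards.
predictions = model.transform(train_data).select("features", "label", "prediction")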


Due to the problems that arose, only a few calculations were made, and these are limited, since the Power BI tool is not as powerful when it comes to processing information with regression/predictive training models. However, given the existing possibilities, the best possible work was carried out, building a reliable and efficient architecture for storing, loading, and processing a large set of information.

Figure 5.1: MLlib library for data training. [14]


References

[1] DataFlair. Hadoop architecture in detail – HDFS, YARN, MapReduce. Available at: https://data-flair.training/blogs/hadoop-architecture/ (Accessed September 2020).

[2] Dongchul Park and Yang-Suk Kee. In-storage computing for Hadoop MapReduce framework: Challenges and possibilities. page 4, July 2015.

[3] Shubham Sinha. Hadoop ecosystem: Hadoop tools for crunching big data. Available at: https://dzone.com/articles/hadoop-ecosystem-hadoop-tools-for-crunching-big-da (Accessed September 2020).

[4] Takashi Kimoto, Morio Yoda, Kazuo Asakawa, and Masakazu Takeoka. Stock market prediction system with modular neural network. page 2.

[5] Zhihao Peng. Stocks analysis and prediction using big data analytics. Technical report, Department of Computer Science, Dalian Neusoft Institute of Information, Dalian 116626, China.

[6] Ms. Shetty Mamatha Gopal and Mrs. Lathika J Shetty. Developing prediction model for stock exchange data set using Hadoop MapReduce technique. Technical report, International Research Journal of Engineering and Technology (IRJET), May 2016.

[7] Amogh P. Kulkarni and Mahantesh C. Angadi. Time series data analysis for stock market prediction using data mining techniques with R. Technical report, Acharya Institute of Technology and Sai Vidya Institute of Technology.

[8] Rohan Arora and Satish Gopalani. Comparing Apache Spark and Map Reduce with performance analysis using k-means. Technical report, International Journal of Computer Applications.

[9] Abderrazak Sebaa, Fatima Chikh, Amina Nouicer, and Abdelkamel Tari. Research in big data warehousing using Hadoop. Technical report, LIMED laboratory, Computer Science Department, University of Bejaia, Bejaia, Algeria.

[10] Our World In Data. Internet. Available at: https://ourworldindata.org/internet (Accessed November 2020).

[11] FlashDBA. The real cost of Oracle RAC. Available at: https://flashdba.com/2013/09/18/the-real-cost-of-oracle-rac/ (Accessed November 2020).

[12] Prosenjit Chakraborty.


[13] Apache Hive architecture. Available at: https://www.tutorialandexample.com/apache-hive-architecture/ (Accessed January 2021).

[14] Introduction of a big data machine learning tool - SparkML. Available at: https://yurongfan.wordpress.com/2017/01/10/introduction-of-a-big-data-machine-learning-tool-sparkml/ (Accessed January 2021).

[15] Hadoop. Apache Hadoop (2020), September 2020. Available at: https://hadoop.apache.org/.

[16] HDFS. Apache Hadoop HDFS (2020), 2020. Available at: http://hadoop.apache.org/hdfs (Accessed September 2020).

[17] Apache Hadoop. MapReduce tutorial, 2020. Available at: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (Accessed September 2020).

[18] Apache Spark. Apache Spark™ is a unified analytics engine for large-scale data processing, 2020. Available at: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (Accessed September 2020).

[19] Apache Hadoop YARN. Apache Hadoop YARN, 2020. Available at: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html (Accessed September 2020).

[20] Apache Hive. Apache Hive. Available at: https://hive.apache.org/ (Accessed September 2020).

[21] Intel IT Server. Apache Hive. page 2, March 2013.

[22] Abderrazak Sebaa, Fatima Chikh, Amina Nouicer, and Abdelkamel Tari. Research in big data warehousing using Hadoop. Journal of Information Systems Engineering Management, pages 3–4, March 30, 2017.

[23] Apache Parquet. Apache Parquet. Available at: https://parquet.apache.org/ (Accessed September 2020).

[24] Databricks. What is Parquet. Available at: https://databricks.com/glossary/what-is-parquet (Accessed September 2020).

[25] Paul D. Yoo, Maria H. Kim, and Tony Jan. Machine learning techniques and use of event information for stock market prediction: A survey and evaluation. Technical report, Faculty of Information Technology, University of Technology, Sydney.

[26] D. E. Rumelhart et al. Parallel distributed processing, vol. 1. page 2, 1986.

[27] Ramon Lawrence. Using neural networks to forecast stock market prices. Technical report, Department of Computer Science, University of Manitoba, December 12, 1997.

[28] Y. Yoon, G. Swales Jr., and T. M. Margavio. A comparison of discriminant analysis versus artificial neural networks. Technical report, Journal of the Operational Research Society.

[29] Arkilic. Stock price movement prediction using Mahout and Pydoop documentation. Technical report, October 2017.


[30] M.D. Jaweed and J. Jebathangam. Analysis of stock market by using big data processing environment. Technical report.

[31] Data clustering algorithms. Available at: https://sites.google.com/site/dataclusteringalgorithms/k-means-clustering-algorithm (Accessed November 2020).

[32] HortonWorks. Determine YARN and MapReduce memory configuration settings. Available at: https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html (Accessed December 2020).