
Big data analysis. Lecture notes

V. Skorniakov

Contents

1 Introduction
  1.1 Foreword
  1.2 Key features, contents and structure of the course
  1.3 Prerequisites

2 An overview of concepts, ideas, methods and tools encountered in big data analysis
  2.1 Concepts
    2.1.1 Big data
    2.1.2 Types of processing and computing
    2.1.3 Types of data repositories
    2.1.4 Parallelization: Traditional and Map–Reduce
  2.2 Software
    Amazon Web Services (AWS) framework
    Python
    R
    Git and GitHub
    Hadoop and Spark
    SAS/STAT
    IBM SPSS
  2.3 Methods
  2.4 Tasks

3 Some preliminaries regarding software and basic tasks
  3.1 Connecting and moving data to the MIF VU cluster*
    3.1.1 Connecting and moving via command-line application
    3.1.2 Moving data to cluster via third-party software
  3.2 Running a program on the MIF VU cluster*
  3.3 Running R on a cluster*
  3.4 Running Python on a cluster*
  3.5 Hadoop distributed file system and flow of a map–reduce job in a nutshell
  3.6 Working with a Hadoop cluster on the MIF VU cluster*
  3.7 A very brief introduction to Apache Spark
    3.7.1 Spark's components
    3.7.2 Basic features of Spark's working model
      Installing Spark
      Typical working model of Spark application
      Linking Spark with R and Python
  3.8 Virtual machines on the MIF VU Cloud*
    Instruction for setup of VM on the MIF VU Cloud
    Several remarks regarding your VMs on the MIF VU Cloud
  3.9 Tasks

4 Several software options for parallel computations
  4.1 Parallelization with R
    4.1.1 Parallelization with different versions of lapply
    4.1.2 Parallelization with foreach
    4.1.3 Parallelization with Hadoop
    4.1.4 Parallelization with Apache Spark
  4.2 Parallelization with Python
    4.2.1 Parallelization with Hadoop
    4.2.2 Parallelization with Apache Spark
      RDDs and shared variables
      A bit on DataFrames
  4.3 Tasks

5 Ordinary models
  5.1 The list
  5.2 Spark's API
  5.3 Tasks

A Glossary

B Listings


1 Introduction

1.1 Foreword

Big data analysis generally refers to the analysis of large datasets. A natural question arises: how large is large? Clearly, the rapid evolution of both hardware and software puts the magnitude of a dataset on a relative scale; a dataset treated as large two years ago may become small rather soon. Therefore, to make the concept clearer, at least within the frame of our exposition, we will use the terms large dataset and analysis of a large dataset whenever the analysis of the dataset at hand is infeasible on a single computer due to any of the following reasons:

• the dataset is too large to store physically;

• the dataset is too large to process because of a lack of other resources such as memory, computational speed, time before the deadline for presentation of results, etc.

Up to this point, the sharp-eyed reader should have noticed that big data analysis poses a challenge of optimal management of physical and timing resources. What are the rest? First of all, these are the usual ones encountered by any statistician: data collection, validation, preprocessing, analysis and presentation. Secondly, there is a bunch of new ones rarely, if ever, faced by a typical statistician involved in the analysis of small datasets, that is, datasets processed on a single machine by making use of one statistical package and targeted at a particular field of application. The list includes complexity arising from:

• the number of software tools involved;

• the variety of fields of application;

• the set of models, including not only those used in typical statistical applications and based on a rigorously defined model of a population, but also those coming from machine learning, data mining and other similar fields, which have a vague, if any, formal background.

That list also includes the rapid evolution of all the items mentioned above, and probably something else I have overlooked at first glance...

All in all, it is clear that big data analytics requires a broad qualification, stepping beyond that of a database manager, a statistician, or any other expert involved in the analysis of data coming from a fairly narrow field. It is therefore not surprising that a new term, data scientist, has appeared to name a professional in the field.

It is not the purpose of these lecture notes to develop such a qualification, but merely to give an introduction and the basic skills required to move forward on your own. Consequently, I do not pretend to any completeness at all. However, I hope that after completion of the course you will attain a certain understanding of the subject, and the introductory foreword given above will sound different.

V. Skorniakov, 2018, February

1.2 Key features, contents and structure of the course

The broadness of the subject requires a specific presentation in order to give at least a preliminary insight and, at the same time, develop some practical skills. Therefore, the key features of the suggested course are as follows:

• brevity;

• exposition in favor of a ”how-to” style versus a ”why-so” style;

• strong emphasis on a minimal set of software tools sufficient to solve the problem under consideration;

• examination of a relatively small number of selected models encountered in big data analysis, with primary focus on the applied side and practical aspects, keeping the mathematical treatment at the minimum required to understand a model, apply it in practice and interpret the results;


• occasional tasks falling into the ”do-it-yourself” category, designed to get familiar with (or even implement to a certain degree) some new or undiscussed method.

It should be noted that the exposition given during the lectures will differ from the one given here. Therefore, I do not list the contents of the sections; they are reflected well enough by the table of contents. Instead, I state the overall goals of the course. These are as follows:

• to introduce the basics of working with a cluster;

• to present several options for parallel computations with R and Python^1;

• to provide the reader with a short list of software which is quite frequently encountered in big data analysis nowadays;

• to introduce two popular and powerful tools for the analysis of big datasets, namely Hadoop and Apache Spark;

• to give examples of several common applications met in big data analysis;

• to address additional big data related themes, chosen from those known to me and selected to benefit the audience as the course proceeds.

Finally, the structure of the course is based on the following frame:

• minimal set of necessary concepts;

• overview of software;

• software preliminaries for basic operations;

• elements of parallelization;

• building of common models on large datasets;

• examples of applications;

• something additional, adjusted to the wishes and background of the audience as well as to my possibilities.

Note that all that is said above applies to the whole course. Regarding these lecture notes, one should keep in mind the following.

• The notes alone are insufficient to gain real value from the course. Though you will find some useful things here, to really benefit from the course you will have to read and practice a lot on your own.

• I have placed tasks at the end of each section. Considering practical proficiency, most of these are not essential. By placing these tasks, I wanted to draw the reader's attention to things which will not attain sufficient highlight, if any, during the lectures. Nonetheless, the things touched upon may be interesting and valuable under certain circumstances. In any case, I recommend at least having a look and convincing yourself that you are able to accomplish them. Practically essential tasks are given during classroom sessions; omitting them would be harmful.

• Overall, the notes represent a very brief overview of certain things one ought to know. It is up to you to select what matters.

During my work on the lecture notes, I have consulted several dictionaries similar to [AB16], [Chr]. Since the terms are sometimes defined differently, I have decided to include a short glossary at the end of the document. It is intended to provide the reader with the meanings I attach to particular terms in this document. At their first occurrence, such terms are linked to Appendix A, which contains the aforementioned glossary.

^1 here the list is far from exhaustive; however, I believe that it is sufficient to reflect the main working models encountered in the huge amount of software targeting the task


1.3 Prerequisites

I do not assume that the reader of the forthcoming text is educated in a rigorous mathematical fashion. Despite that, he or she is expected to meet the following prerequisites:

• an introductory statistics course including descriptive statistics, the simplest models of parametric testing and contingency tables, correlation coefficients and simple linear regression;

• the ability to analyze data within the frame of the statistics course mentioned above, making use of R and/or Python;

• the ability to work with a Linux terminal at a basic level.

If this does not apply, please refer to introductory courses. Below is a list of exemplary ones. It is given so that you can check whether you qualify and fill the gaps, provided you feel there are any.

• [Ros17] — for statistics;

• https://cran.r-project.org/doc/manuals/r-release/R-intro.html — for basic usage of R;

• https://www.statmethods.net/stats/index.html — for doing statistical analysis with R;

• https://docs.python.org/3/tutorial/ — for basic usage of Python;

• http://www.scipy-lectures.org/packages/statistics/index.html — for doing statistical analysis with Python;

• http://linuxcommand.org/index.php — for working with the Linux terminal.

Remark 1.1. Note that both R and Python provide their default consoles. However, these are not very convenient to work with. Therefore, I do recommend using more enhanced editors. In the case of R, my own preference falls on RStudio, whereas in the case of Python, I prefer Jupyter Notebook, which is also suitable for R as well as any other language. Spyder is another popular choice for Python.


2 An overview of concepts, ideas, methods and tools encountered in big data analysis

As mentioned in Subsection 1.2, this section is dedicated to a conceptual overview of very different sorts of things met in big data analysis, and has a very wide range. For this reason, it was hard to structure the section properly, and one may classify it as a chaotic mixture. Nonetheless, it is my hope that, after gaining some insight into the subject, and especially after doing some practical tasks, you will turn back and find that these seemingly loosely related pieces of text fit well under the umbrella of big data analysis, and that there still remains a lot to add.

2.1 Concepts

2.1.1 Big data

I have already mentioned that the concept of big data is context dependent. However, almost all introductory texts (for example, [Mar16], [GH15], [Ell16]) attach the following list of attributes to the concept of big data.

Volume. This attribute refers to the vast amount of data. Though magnitude is relative, it is usually not difficult to distinguish between big and ordinary data on this scale. Processing of big data requires specially designed tools.

Velocity refers to the speed at which new data is generated. Data usually moves to the storage system from different sources. It moves so quickly that quite often it is necessary to process the stream in real time, since only this allows for meaningful utilization of the relevant information coming with that data.

Variety is an attribute pointing out the different types of data met in big data analysis. Differences stem not only from different input types (numeric data, text data, image data, ...) but also from the data generating sources (structured databases' data, Internet data, sensor data, voice data, social networks data, ...).

Veracity of big data refers to its messiness or trustworthiness, an issue frequently present when one deals with unstructured data. For example, consider conversations or images from social networks.

Value of big data generally means that, due to the complexity of the data at hand, extracting valuable information may be a very difficult and, hence, costly task.

The subset of the first three (correspondingly, first four, or all five) attributes is usually termed the 3 Vs (correspondingly, the 4 Vs or the 5 Vs).

2.1.2 Types of processing and computing

When talking about big data, one usually distinguishes between two types of processing.

Batch processing is a model for processing very large and relatively time-insensitive datasets. Here one starts a computation, and after some relatively large amount of time has passed, the results are obtained.

Stream processing differs from batch processing because of the requirement to produce real-time computations. Here, instead of being collected into a batch for later processing as a whole, the individual items of data are processed immediately after arrival into the system.

There are two commonly used terms to distinguish between types of computing.

Cluster computing refers to the pooling of the resources of multiple machines to accomplish the target computing task. Computer clusters are typically composed of individually powerful computers, connected to a network and managed by special software. The latter coordinates the work of the individual computers.

In-memory computing refers to a computational model in which all working datasets are managed within the collective memory of the cluster, without writing intermediate calculations to disk.


Figure 1: Data flow in case of traditional parallelism.

2.1.3 Types of data repositories

There are many ways to classify distributed data storage systems by making use of features of the software employed (see, e.g., reference [AT15]). When it comes to classification of data repositories, one frequently encounters the two terms described below.

Data lake refers to a huge data repository kept in a relatively raw state. That is, the data is highly unstructured and, as a rule, frequently changing.

Data warehouse also refers to a data repository. In contrast to the data lake, it contains data that has been cleaned, integrated with other sources, and is, in general, well structured.

2.1.4 Parallelization: Traditional and Map–Reduce

Parallelization refers to ”doing something in parallel”. In our case, we talk about parallel computations, done within a certain time interval by one or several computers, in order to accomplish the computational task faster. Parallelization naturally occurs when the following constraints are satisfied:

• data at hand is big;

• the data can be divided into independent pieces, in the sense that processing of each piece does not rely on results obtained from processing the rest.

In such a case, the conceptual solution is straightforward (a small shell sketch is given after the list):

• divide the data into the aforementioned pieces;

• process each piece on a single abstract computational node (it may be a single core of a central processing unit (further on, CPU), a computational thread, etc.);

• combine results.
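As an illustration of the divide–process–combine scheme above on a single machine, consider the following shell sketch. The file names and the chunk_ prefix are hypothetical, and the per-piece ”processing” is just a line count; the point is only the shape of the workflow, not a real analysis.

# divide the data into independent pieces (here, 100000 lines each)
split -l 100000 big_input.csv chunk_
# process each piece on its own computational unit (here: up to 4 local processes at a time)
ls chunk_* | xargs -P 4 -I{} sh -c 'wc -l < {} > {}.out'
# combine the partial results
cat chunk_*.out > combined.out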

Both approaches named in the title of the current subsection apply the same conceptual scheme. To get a clean cut between traditional parallelism and Map–Reduce, one has to understand the arrangement of data in either case. For the traditional one, the flow of a parallel job is presented in figure 1^2.

Here the parallel computations are done as follows.

^2 taken from [Loc17]


Figure 2: data flow in case of Map–Reduce parallelism.

1. The master process, numbered 0, reads the input data from storage (e.g., a file server) and sends parts of the data to the remaining workers of the cluster.

2. Each worker processes its piece.

3. Results are combined and shared between workers (here some movement of data occurs), and (if needed) the next iteration starts.

In the case of map–reduce, the flow of a parallel job is presented in figure 2 and has the following structure.

1. The data is divided into pieces before the start of the computation, in such a way that the corresponding pieces of data exist on the corresponding processing nodes (in fact, the data is stored permanently in this form).

2. Each worker processes its piece.

3. Results are combined and shared between workers and (if needed) the next iteration starts.

One can see that the difference is only in the first step: in the traditional approach, the data is read sequentially at the start of the calculation and then divided; in the map–reduce approach, the division is done before the start of the calculation, and some movement of data may occur only within the computational phase; this is usually a movement of aggregated and substantially smaller amounts of data compared to the initial data. It turns out that, in the case of data-intensive calculations, reading the data is the bottleneck of traditional parallelism and may last longer than the computational phase.

Concisely, the difference may be summarized as follows. In the traditional approach, the data is brought to the computational resources; in the map–reduce approach, the direction is the opposite (the computational resources are placed close to the data). Currently, the most popular actual implementations of this approach are the Apache Hadoop software framework and Apache Spark (see Subsection 2.2). The basics of these implementations are described in Subsections 3.5 and 3.7.
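Before turning to those implementations, the map and reduce phases themselves can be mimicked with ordinary shell tools. The sketch below is only a single-machine analogy (it assumes a local text file named input.txt), not Hadoop or Spark code: the first command plays the role of the mappers (emit one word per record), sort plays the role of the shuffle (group identical keys together), and uniq -c plays the role of the reducers (aggregate each group).

# "map":     emit one word per line
# "shuffle": sort brings identical words together
# "reduce":  count each group, then show the most frequent words
tr -s '[:space:]' '\n' < input.txt | sort | uniq -c | sort -rn | head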

2.2 Software

Below I give a brief description of software which is currently in frequent use when it comes to big data. Not all of the mentioned software will be touched upon in this textbook. Nonetheless, mentioning it still has some value, since it supplies the reader with some guidance which may be useful in his future explorations. My intent is to present free tools, since only these are used in the sequel. However, I also mention several commercial ones. To make the distinction between free, semi-free and commercial tools, I highlight each paragraph in a different color: green is used for completely free software, blue for semi-free, and red for commercial software.

Amazon Web Services (AWS) framework AWS is a subsidiary of Amazon.com. It provides a very wide range of on-demand cloud computing services to all types of customers, ranging from individuals to governments. Though most of the services are on a paid subscription basis, there is a free-tier option available for 12 months, starting from the date of subscription. It is a good place to start, since subscribers get at their disposal a full-fledged virtual cluster of computers. The cluster is available all the time; a web browser connected to the Internet is the only thing one needs. AWS's virtual cluster of computers shares most features of a real physical cluster. One can vary the number of CPUs, choose an operating system, manage networking, etc. One can also deploy AWS systems to provide internet-based services for one's own customers.

Currently AWS offers over 90 services targeting different tasks (computing, storage, networking, etc.). In the list of the most popular tools, one finds Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).

EC2 allows users to rent virtual computers (virtual machines) on which they run their applications. It provides a web service through which the user can boot an Amazon Machine Image (AMI) and configure a virtual machine, termed an ”instance”, containing any desired software. The management of server instances is very flexible. One can launch or terminate the required number of instances and pay by the second only for active ones. These opportunities gave rise to the name Elastic Compute Cloud.

Amazon S3 is a web service for data storage and retrieval through web interfaces. It uses an object storage model, which enables storing and retrieving any amount of data from anywhere: web sites, mobile applications, devices, etc. It offers high reliability and scalability and is therefore used by many market leaders. As stated on the web page of S3 (see https://aws.amazon.com/s3/), ”S3 provides the most powerful object storage platform available in the cloud today.”

Python Python is an open source programming language. It has a standard library and a very large and constantly growing set of community-contributed modules. These can be downloaded and installed for free on any machine having Python. As reported on the introductory page of Python (see https://www.python.org/about/), some of the fields covered by the mentioned set of modules are the following.

• Web and Internet Development.

• Database Access.

• Desktop GUIs.

• Scientific & Numeric.

• Education.

• Network Programming.

• Software & Game Development.

Under the hood of Scientific & Numeric, one finds a lot of modules for carrying out statistical analysis, parallel computations, and solving other data-specific tasks encountered in big data analysis. Database Access speaks for itself.

Moreover, due to its growing popularity, Python is included in, or serves as a base for, many (big) data analysis related projects. In our course, we will deal more or less with four of them. Arranged by importance for big data analysis, these are Jupyter, IPython, Apache Hadoop and Apache Spark (for the latter two, see the descriptions below).

Jupyter is an open-source project born out of the IPython project. Its main function is to support interactive data science and scientific computing across all programming languages. It offers a web-based GUI termed Jupyter Notebook. This web application allows creation and sharing of documents on the web that contain live code, equations typeset in LaTeX, visualizations and narrative text. It is tightly integrated with Python, and also allows integration of big data projects.

IPython stands for Interactive Python. It is, therefore, not surprising that, first of all, IPython provides a powerful interactive shell for working with Python. It also provides tools for interactive data visualization as well as a kernel for working with the above-mentioned Jupyter Notebook^3. Consequently, a typical session runs within the Jupyter Notebook environment. Finally, IPython's developers maintain the ipyparallel package designed for parallel computations. Though there are a lot of Python packages targeting this task, the latter offers an interesting and flexible model for working with an arbitrarily distributed computing cluster.

R As stated on the homepage of R, ”R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.” Conceptually, its working model is very similar to that of Python (see the previous paragraph). That is, there is a standard R library offering basic functionality, and there are thousands^4 of packages covering a very broad range of applications. The major difference from Python is that R was initially oriented towards statistical applications (as is seen from the above introductory statement), whereas Python was designed for a broader set of problems. Despite that, one can currently find a lot of R packages suited for solving tasks very loosely related, if at all, to data analysis (e.g., sending an automated email), making R suitable for integration with multipurpose systems. Consequently, much like Python, R is also incorporated into different projects. Usually, one can run R code from other environments designed for data analysis. To mention a few examples, commercial software systems such as SAS and IBM SPSS (see the descriptions below), as well as different free software projects (including Apache Hadoop and Apache Spark, described below), provide this opportunity.

Hence, when talking about big data, it is strongly recommended to get familiar with both R and Python, because it is very likely^5 that you will face one or both of them on your way. Also note that at least one of these languages is in the list of prerequisites for the course.

Git and GitHub Git is a version control system developed to track changes in computer files. It is very useful when it comes to coordination of work on those files among multiple users. A typical use of Git is source code management in the process of software development, since a developer frequently wishes to roll back to a previous version of the code. Like any version control system, it allows reverting to a previous state of a project's code with minimal effort, as well as many other operations (e.g., seeing who last modified something, comparing changes over time, etc.; for a detailed account see [CS14]). Though code control is a frequent reason for deployment of Git, it may be used for any type of files. Git is a distributed version control system. Basically, this means that it functions as follows.

• Controlled files (say, those belonging to a particular project) reside in a dedicated server's directory termed a repository (there may be more than one server for that purpose). The repository includes not only all data, but also all metadata.

• Each client (e.g., a developer) not only checks out the latest snapshots, but also fully mirrors the whole repository with its history, i.e., the metadata.

Hence, if there is a crash on the server (or servers), full recovery is possible thanks to the clones residing on the clients.

Initially, Git was developed for Linux users. Currently, it supports several other OSs too. Git is free, and there are many ways to use it. These include the original command-line tools as well as different wrappers. GitHub is one of them. It is a very popular web host for Git repositories. To start using it, one only needs to visit https://github.com/, create an account and log in. After that, it is possible to use it directly via the web interface, or to download a desktop application.
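As an illustration of the command-line side, a minimal everyday Git workflow might look as follows; the repository URL and the file name here are hypothetical.

# obtain a full local mirror of the repository, history included
git clone https://github.com/your_user/your_project.git
cd your_project
# record a change locally and publish it to the GitHub copy of the repository
git add analysis.R
git commit -m "Add first analysis script"
git push origin master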

GitHub is free for public and open source projects, and it is paid whenever one wishes to have private repositories.

^3 there are, of course, many other editors and GUIs designed to work with Python; however, Notebook is in the list of the most convenient ones

^4 more than 10000 at the beginning of 2017

^5 in my opinion, even inevitable


Hadoop and Spark Apache Hadoop (further on, Hadoop) is an open-source software framework designed for processing large data sets over clusters of computers. Though it can be installed on a cluster consisting of a single computer, it is usually employed on clusters having a considerably larger number of machines. The framework consists of four modules.

Hadoop Common includes utilities that support the other Hadoop modules.

Hadoop Distributed File System (HDFS) is a distributed file system that provides reliable and fast operation on large data sets distributed over the cluster.

Hadoop YARN implements job scheduling (the concept of a job is explained in the sequel) and cluster resource management.

Hadoop MapReduce is a YARN-based system for parallel computing over large data sets.

There are other Hadoop-related software projects designed to solve specific tasks. I shall mention three of them (the most important, in my opinion)^6.

HBase is a scalable, distributed database designed for structured storage of large tables.

Mahout is a scalable machine learning and data mining library.

Spark is a general compute engine. It offers integration with R and Python as well as a rich variety of libraries covering ETL, machine learning, stream processing and graph computation. It can run in standalone mode on a cluster, on EC2, or on Hadoop YARN. Moreover, it can access diverse sources of data, including data residing on HDFS, Cassandra, HBase and S3. To date, it seems to be one of the most promising tools in big data analysis, since it greatly outperforms Hadoop and offers a very flexible implementation of Map–Reduce.

SAS/STAT SAS/STAT is a system designed for statistical analysis. In my opinion, it is to date the most advanced commercial system serving this purpose. It offers a very wide range of tools. The broadness of the tools is reflected both in the set of statistical models available and in the set of environments designed to meet specialized needs. Among other things, SAS/STAT offers high-performance modeling tools for massive data and integration with open source systems. Finally, it has very well written and freely available documentation (http://support.sas.com/documentation/), covering not only the use of the software but also the theoretical background behind it.

IBM SPSS IBM SPSS is another very popular commercial platform for advanced statistical analysis. It is less powerful than SAS. However, it also covers quite a broad range of statistical models, offers open source extensibility and integration with big data. Its main advantage over other commercial and open source systems is its ease of use. The latter feature makes IBM SPSS accessible to users of all skill levels and, therefore, maintains its popularity among less skilled users working in areas such as psychology, medicine, ecology, etc.

2.3 Methods

When dealing with big data, one ought to have heard something about the methods listed below.

• Methods of statistical data analysis.

• Methods of communication via network.

• Methods of working with database management systems.

• Methods of building distributed applications.

The set of methods contained in each of the subsets listed above is very large, and one, of course, need not be an expert in every corresponding field. It is even very likely that you will need only a very limited amount of knowledge in your routine work. On the other hand, it is very unlikely that, when dealing with big data, you will completely avoid any of the fields (i.e., statistics, networking, etc.) intersecting with the list above. Summing up, the above is just guidance on what you may expect on your way.

^6 the whole list may be found at http://hadoop.apache.org/


2.4 Tasks

2.1. Read some introductory text on big data similar to references [Ell16] and [GH15]. Think about the key features of the topic you would extract if asked to give an introductory lecture.

2.2. Refer to [PD15], Chapter 4, for an introduction to different types of big data databases. Read it. Think about the key features of the topic you would extract if asked to give an introductory lecture.

2.3. Refer to [PD15], Chapter 8, for an introduction to different types of geographical information systems for big data. Read it. Think about the key features of the topic you would extract if asked to give an introductory lecture.

2.4. Choose some specific topic (e.g., cloud computing) and think about a presentation. It should include the following:

• basic introduction into the topic;

• overview of software tools;

• recommendations for selection of tools;

• deeper presentation of some selected tool including basic examples with code.

2.5. Get familiar with the different types of cloud computing (hybrid, private, public). Find out more about the different providers on the market.


3 Some preliminaries regarding software and basic tasks

In this section, I have tried to gather the basics regarding software tools one ought to know before moving on to more elaborate examples of ”real world” applications. I have assumed that the reader knows almost nothing, as do I. Therefore, my explanations may seem too detailed. Hence, if one feels confident, he is encouraged to skip this section^7 and save some time.

I have also assumed that the reader will make use of the facilities provided by the Faculty of Mathematics and Informatics of Vilnius University (MIF VU). This assumption implies specific features of the explanations given in what follows. Again, to facilitate selection of material worthwhile to read, I have tried to name the subsections in an informative way. Subsections with many specifics devoted to the MIF VU cluster community are marked by an asterisk *. Within these subsections, relative paths and other similar information are primarily devoted to the MIF VU community. This fact is not emphasized anywhere but here.

3.1 Connecting and moving data to the MIF VU cluster*

Moving data to the cluster from your local or another remote machine is the basic task. Below I describe several possible ways to accomplish it. Once finished, you will be able to access a folder residing on the cluster and named /scratch/lustre/home/your_username, where your_username is the user name you use within the faculty network. Any data used by your programs should be placed there, or in a sub-folder created by you. Please keep this folder as clean as possible; after completing a task, delete the data as well as all output in order to free the resources for others.

3.1.1 Connecting and moving via command-line application

In order to start communicating with the cluster via the command line, you need a terminal emulator. For Windows users, I recommend installing PuTTY or a similar SSH client program; Linux users may make use of any terminal emulator available at hand. The first step is to connect to uosis.mif.vu.lt. If you are sitting in front of a computer which runs Linux and is connected to the faculty network, then all you have to do is log in. Otherwise, you may need to supply the command

ssh your_username@uosis.mif.vu.lt

to your terminal, with a password on prompt, or make use of some GUI. Anyway, once the console of your terminal is launched and you are connected to uosis.mif.vu.lt, independently of your OS, type in its window

kinit

ssh cluster

Text similar to that of figure 3 will appear in case of success. Now you can start execution of your programs as well as exchange data between the cluster and other remote machines. The plainest way is to make use of the scp command-line utility:

scp your_username_on_remote_host@remotehost:full_file_name /scratch/lustre/home/your_MIF_username/subdir

E.g.,

scp visk@uosis.mif.vu.lt:/users3/visk/TMP/MPI_R /scratch/lustre/home/visk/BDA

copies the file MPI_R, which resides on uosis.mif.vu.lt, to my sub-folder named BDA residing on the cluster onto which I have logged in before submitting the command. To move a whole directory, use the option -r after scp. Below is an example which moves the whole directory named TMP, residing on uosis.mif.vu.lt, to my default destination folder (i.e., /scratch/lustre/home/visk; note that, instead of typing the whole path, I use a single dot) on the cluster.

scp -r visk@uosis.mif.vu.lt:/users3/visk/TMP .

^7 and then, probably, all the rest


Figure 3: view of a terminal after successful logging on to the cluster.

3.1.2 Moving data to cluster via third-party software

Option for Windows users

1. Install WinScp.

2. Launch it; hit the Tools button; run PuTTYgen; hit the Generate button and wait until the key is generated.

3. Once finished, copy the text from the dialog Public key for pasting into OpenSSH authorized_keys file, and paste it into a blank text file. Save this file on your local machine under the name authorized_keys (no extension is required; the text should be in one line).

4. Hit the button Save private key, and save the file (the name is arbitrary; we will use private_key.ppk for definiteness) on your local machine.

5. Log onto cluster.mif.vu.lt as described in the previous subsection, and place your authorized_keys file into the directory /scratch/lustre/home/your_username/.ssh. By default, there is no such directory, and you should create it first.

6. Launch WinScp. In the Login window:

a) type cluster.mif.vu.lt for Host name;

b) type your user name for User name;

c) hit the Advanced button and choose Advanced;

d) navigate to SSH → Authentication; specify the path to your private_key.ppk file;

e) go back, log in, and start exchanging data by making use of the drag-and-drop interface.

Remarks.

• If security of your machine is not a big concern, it is useful to save the connection settings in order to avoid the long authentication process each time you log onto the cluster.

• Exchange between the cluster and your local machine via WinScp is very handy. Moreover, you can execute commands on the cluster with the help of the WinScp built-in terminal. However, for this task I recommend the PuTTY client, which handles execution in a more flexible way.


• Check the link. There you will find the Lithuanian counterpart of the above. Moreover, it spells out the relative paths used within the Windows environment.

Option for Linux users

1. Install FileZilla client.

2. Install puttygen or PuTTY with all its utilities^8.

3. Launch your terminal and run

ssh-keygen

command. Since we provide no arguments, it will ask you for a file name and a passphrase. The simplest choice is to hit Enter both times. In that case, after the generation process finishes, you will get a key file named id_rsa residing in your ∼/.ssh sub-folder (otherwise, modify the instructions below accordingly). Note that the latter folder is hidden by default.

4. Run

puttygen id_rsa -o private_key.ppk

in your terminal window. This will convert id_rsa to a ppk file named private_key.ppk. The name is optional, and you can replace it with another one.

5. Run

puttygen -L private_key.ppk

in your terminal window. This will print the public part of your key to the terminal window.

6. Copy the text and paste it into a blank text file. Save this file on your local machine under the name authorized_keys (no extension is required; the text should be in one line).

7. Log onto cluster.mif.vu.lt, as described in the previous subsection, and place your authorized_keys file into the directory /scratch/lustre/home/your_username/.ssh. By default, there is no such directory, and you should create it first.

8. Take your private_key.ppk residing in the ∼/.ssh sub-folder, and place it into another local folder of your machine which is not hidden by default.

9. Launch FileZilla.

a) choose the Edit menu item and navigate to Settings → Connection → SFTP;

b) hit the Add key file... button and locate your private_key.ppk, previously taken from ∼/.ssh and put into the unhidden folder; close the Settings window;

c) in the fields Host: and Username:, located just below the tool bar of the main menu, type sftp://your_username@cluster.mif.vu.lt for Host: and your username for Username:; hit Enter;

d) start exchanging data by making use of the drag-and-drop interface.

Hint. When you start FileZilla afresh next time, there is no need to fill in Host: and Username: again. Just make use of the drop-down arrow at the right of the Quickconnect button.

^8 though, for our particular task, we will need only puttygen, you may find the other components useful in your work


Figure 4: SLURM architecture and functioning.

3.2 Running a program on the MIF VU cluster*

Before running your first program, it is very instructive to get an understanding of the working environment of the MIF VU cluster. The cluster is managed by an open source system named SLURM (the whole documentation is available at https://slurm.schedmd.com/; the text below resembles a quick start guide). The latter has the following key functions:

• allocation of access to computing nodes in time, and arbitration of contention for the allocated resources;

• start, execution and monitoring of a job on allocated nodes.

The architecture and functioning of SLURM are well reflected by figures 4–5^9, from which it is seen that:

• SLURM consists of a central daemon slurmctld, running on a management node (with an optional duplicate twin for the case of failure), and a set of slurmd daemons, running on each of the computing nodes;

• the daemons manage computing nodes, partitions (logical sets of computing nodes which may include intersecting sets of physical nodes), jobs and job steps;

• the user interacts with the cluster through the set of case-sensitive SLURM commands.

An entity named a job includes assigned resources: a set of nodes and an amount of time. Within the allocated job, the user is able to run a parallel task in the form of job steps in any configuration. For example, several job steps may independently use some portions of the allocated nodes, accessing them at the same or at different time intervals. As is seen from the above, to be able to run programs, you have to be familiar with SLURM commands.

^9 taken from https://slurm.schedmd.com/quickstart.html


Figure 5: SLURM architecture and functioning.

Fortunately, it is enough to know a few. On the MIF VU cluster (see http://mif.vu.lt/cluster/#slurm_batch and http://mif.vu.lt/cluster/#slurm_interactive), the execution of a typical program is carried out as follows:

• the user wraps his program into a bash script *.sh (the file may be created with any text editor and then given the extension .sh; take care of line breaks: these should be suitable for Linux);

• the script is invoked by the command^10

sbatch script.sh

with possible options.

Inside the bash script, one usually makes use of the srun or mpirun commands^11, which actually start the execution, since sbatch only submits the job script for execution and allocates resources through its options. To get a better understanding of the whole flow, let us work through an example of a common scenario and, at the same time, shed light on some other important aspects not discussed so far. Suppose that we want to execute a script named sample_script.sh. First of all, we need to connect to the cluster (see subsection 3.1). Then just type

sbatch sample_script.sh

The above says nothing about the mentioned resource allocation. Resources are usually allocated within the body of the script by making use of sbatch options^12. Suppose that we want to gain access to the partition named large and need at least four CPUs for our job. Then our script may have the following code:

1 #!/bin/sh
2 #SBATCH -p large
3 #SBATCH -n4
4 mpirun scriptWithCode

^10 other ways are also possible; however, these will not be discussed

^11 Open MPI is an open source Message Passing Interface implementation (see https://www.open-mpi.org/); in the Debian distribution of Linux, it supports execution of programs through SLURM; the discussion of the benefits of using this way falls outside the scope of our course

^12 for short, one may specify the options within one line; to make our first ”execution” clearer, we break the code into separate lines

Note that line 1, even though it starts with the comment symbol #, is not a comment, since the combination #! is reserved for the so-called shebang line and points to the interpreter which will execute the script. In our example (as well as in a typical ”real life” case) this is a shell under the path /bin/sh. Lines 2–4 are not comments either. They are sbatch options, which can be specified at the beginning of the script (before any executed lines) in the form

#SBATCH -option_keyword option_value

The option given in line 2 informs SLURM that we would like to allocate nodes in the partition large. Omitting it will result in SLURM selecting the default partition designated by the system administrator. Line 3 informs SLURM that the maximum number of tasks launched by our job will be equal to 4. This helps it to allocate a sufficient number of nodes and CPUs; both, however, can also be specified by other options. Finally, line 4 starts execution of our code wrapped into the script scriptWithCode. Strange as it might seem, the start of execution declared in line 4 still does not mean the factual start. It means that, from this point on, SLURM takes the following steps:

• the job submitted by us is assigned a jobid, which is printed to the terminal; this jobid is important because of its relationship to the output, explained below;

• depending on the other jobs already running or submitted to the cluster, as well as on other reasons^13, our job may start immediately or be put into the queue;

• once all resources needed by the job are free, SLURM starts the actual execution; when it is finished, the whole output, together with error messages, may be found in a file named slurm-jobid.out residing in the working folder (by default, /scratch/lustre/home/your_username); this is a default name, and it can be changed by making use of the sbatch options -o and -e, followed by the full-path names of the files devoted to output and errors respectively.

Several useful notes regarding this final stage:

• if one has forgotten his jobid, it is possible to print the current queue of jobs with the SLURM command squeue; it, however, does not provide information on finished jobs;

• squeue -j jobid prints the current state of the job whose id is given by jobid after the option -j;

• you can cancel your job by typing scancel jobid.

This basic example should suffice to clarify the whole process. At this point, note that the whole set of SLURM commands, the full list of which is available at https://slurm.schedmd.com/, can handle very complex resource allocation and parallelization scenarios.
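To tie these options together, the following is a sketch of a slightly fuller submission script; the partition, job name and file names are chosen only for illustration.

#!/bin/sh
# partition, task count, job name, and output/error files are all sbatch options
#SBATCH -p short
#SBATCH -n 4
#SBATCH -J my_test_job
#SBATCH -o my_test_job.out
#SBATCH -e my_test_job.err
# the actual work: each of the 4 tasks prints the name of the node it runs on
srun hostname

After sbatch prints the assigned jobid for such a script, squeue -j jobid shows the job's current state, and scancel jobid removes it from the queue if something goes wrong.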

Since specification of the requested resources is possible only when one is informed about the available ones, it remains to provide instructions on retrieving information about the available resources^14. For this purpose, there are two SLURM commands:

• sinfo prints information about SLURM nodes and partitions;

• scontrol is used for viewing and modification of SLURM configuration and state.

Figure 6 shows the output of sinfo supplied to the terminal after logging onto the cluster. One can see that there is one available partition, named short, having 73 nodes in total. All these nodes have a time resource limited to 2 hours, and are split into groups by state. Based on this information, the command

scontrol show node lxibm121

^13 e.g., it may happen that some of the required nodes are turned off and it takes time for them to start; consequently, a delay occurs

^14 the information applies to any cluster managed by SLURM


Figure 6: output of sinfo.

Figure 7: output of scontrol.

produces the output shown in figure 7. The information provided in figure 6 reflects the factual resources. To be more precise, currently^15 each typical user of the MIF VU community has access to the partition named short, with a time limit equal to 2 hours and 73 nodes having 1200 cores. In case of higher demands, special access is needed. It may be provided by the Center of digital investigations and computations of the faculty (http://kedras.mif.vu.lt/itc/) after submission of an application form.

At the end of this subsection, it is worthwhile to mention one more time that the whole set of SLURM commands, offering very high flexibility, is available at https://slurm.schedmd.com/ and may be employed if there is a need.

Finally, there is one caution for MIF VU users. Before running your program in batch mode, it is highly recommended to test it in interactive mode. For this, it suffices to log onto the cluster and type

srun --pty $SHELL

The command will allocate a node for you and start an interactive session. On the MIF VU cluster, it is prohibited to execute computations on the login node; therefore, this step is necessary. After taking it, you can start interactive R or Python sessions and test your program.

This advice stems from personal experience showing that a program may run well when started interactively (which means that you type all commands of resource allocation and program execution directly into the terminal instead of wrapping them into a *.sh script), and yet cause troubles when launched via the bash script. This is due to the MPI configuration set up during the process of its deployment on the cluster.
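As a sketch of such an interactive sanity check (the R one-liner here is only a stand-in for your real program):

# allocate a compute node and open a shell on it
srun --pty $SHELL
# now on the compute node: run a quick test of your code
R --vanilla -e 'print("test run")'
# leave the compute node and release the allocation
exit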

^15 the date of compilation of this document is July 20, 2018; a detailed account of the resources is listed at http://kedras.mif.vu.lt/cluster/


3.3 Running R on a cluster*

There are several ways to run an R program on a cluster^16 (see, e.g., [Bal17]). I shall describe two of them.

If one needs an interactive session, it suffices to type R, and such a session will start. To finish it, type q() (see figure 8). However, when it comes to big data, it is much more likely that batch execution will be the favorable one. For this, one may use the command

R CMD BATCH script.R

This command prints nothing to the terminal, since the whole output is written to a file named script.R.Rout. Of course, one can wrap both methods of execution into a bash script with resource allocation specifications at the beginning (see the previous subsection). In such a case, it may be necessary to start R with options suppressing interaction. Some of these options are given in table 1. The whole set may be printed by starting R with the option --help, i.e., by typing into the terminal

R --help

Option                Description
--save                Do save the workspace at the end of the session
--no-save             Don't save the workspace
--no-environ          Don't read the site and user environment files
--no-site-file        Don't read the site-wide Rprofile
--no-init-file        Don't read the user R profile
--vanilla             Combine all previous options starting from --no-save
--no-readline         Don't use readline for command-line editing
-q, --quiet           Don't print startup message
--silent              Same as --quiet
--slave               Make R run as quietly as possible
--interactive         Force an interactive session
--verbose             Print more information about progress
--args                Skip the rest of the command line
-f FILE, --file=FILE  Take input from 'FILE'
-e EXPR               Execute R expression (e.g. 'print("hello")') and exit

Table 1: R startup options.

The following is an example. Suppose we have an R script named Hello.R containing one line

print("Hello world!")

Suppose it is wrapped into bash script bashHello.sh listed in listing 1. Then

sbatch bashHello.sh

will produce slurm-jobid.out (here jobid denotes the id of the job assigned by SLURM) with the following content

> print(’Hello world!’)

[1] "Hello world!"

>

Listing 1: contents of bashHello.sh

#!/bin/sh

#SBATCH -p short

#SBATCH --nodes=1

#SBATCH --constraint="ups"

R --silent --file=Hello.R

16 needless to say, it should be installed on the cluster


Figure 8: interactive R session.

What we gain, in addition to a simple invocation of an interactive R session, is the certainty that our script was executed on a node satisfying the constraint ups specified at the beginning of the bash file.

3.4 Running Python on a cluster*

Running Python is almost the same as running R. To start an interactive session, type python. To finish it, type exit(). To run a script, type

python your_script.py

This command has additional options (omitted here for brevity) which can be listed by typing

python --help

into the terminal window. Finally, you can wrap your Python script into a bash file in exactly the same way as described above for R. For a simple test, do the following (an illustrative Hello.py sketch is given after the list):

• modify the above Hello.R by changing its extension into .py;

• modify bashHello.sh by changing its last line into

python Hello.py

• run bashHello.sh by typing

sbatch bashHello.sh

into the terminal window;

• check an output written to the file slurm-jobid.out.
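An illustrative Hello.py, slightly extended beyond the one-line greeting so that it also reveals which node executed the job, could look as follows (this is only a sketch, not the script used elsewhere in these notes):

import socket

# print a greeting together with the name of the executing node
print("Hello world from " + socket.gethostname())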


Figure 9: file in HDFS.

3.5 Hadoop distributed file system and flow of a map–reduce job in a nutshell

The Hadoop project supports map–reduce through its storage module17 named the Hadoop Distributed File System (HDFS). This module handles the division and distribution of data over the computational nodes of the cluster it operates on. HDFS is a specially designed file system. Most file operating commands (e.g. ls, mkdir, rm, mv, cp, cat, tail, chmod, etc.) externally behave in the usual way, common to Linux-type systems. That is, the user does not feel any difference. However, internally things go on differently. When one puts a file into HDFS, the following happens:

• the file is sliced into 128 MB chunks18 (this is the default chunk size and may be changed);

• each chunk is replicated three times (the number of replicas may also be changed) in order to maintain stable and reliable functionality;

• chunks are distributed across the nodes of the cluster.

HDFS handles each file in this way, though the file appears to the user as an indivisible entity (see figure19 9).

The ability to effectively slice, distribute and recombine parts of a large file residing on HDFS makes this system perfect for map–reduce jobs (see the concept of map–reduce parallelism described in Subsection 2.1.4).

In order to benefit from the subsequent exposition, one has to gain a clear understanding of how a map–reduce job is constructed and executed within the Hadoop framework. In total, there are three phases or steps: mapping, shuffle and reducing. The user has to take care of the first and the last, since the shuffle is completely handled by internal processes of Hadoop. The conceptual flow is depicted in figure20 10. Assume that the data comes in a text file. Once the file is copied to HDFS and the job is started, the map phase includes the following.

1. On each computational node holding a chunk of data, a process termed mapper starts.

2. Each mapper splits its chunk into individual lines of text. A line is delimited by the newline character \n.

3. Each line is passed to the map function implemented by the user maintaining the job.

17 see [Inc17], [The17b]
18 except the last one containing the remaining data; it can be smaller
19 taken from [Loc17]
20 taken from [Loc17]


Figure 10: the flow of Hadoop map–reduce job.

4. The map function has to process the given line and return a pair of objects. The first one is termed the key, the second one the value. This (key,value) pair is then passed to the reduce step and processed by the reduce function implemented by the user.

There are several important notes regarding the map phase.

1. The split by the new line character is the default one and can be changed.

2. The map function is allowed to emit zero or more (key,value) pairs.

3. Duplicate keys are allowed.

4. The output of the map function is written into a file residing on HDFS. A single line is devoted to a single (key,value) pair. The key and the value are separated by the tab symbol21. That is, there should be one tab symbol within a single line. Everything that precedes it is treated as the key, whereas everything that follows it is treated as the value.

Summing up, the map phase is expected to transform raw data into a set of (key,value) pairs whose subsequent analysis is carried out in the reduce step. The latter starts right after the shuffle step, which is initiated after the map step, finishes. During the shuffle step, Hadoop sorts and groups the emitted (key,value) pairs and determines which reducer (the process running on some node and responsible for further processing of the emitted pairs) will obtain each pair. It is very important to stress that all pairs sharing the same key are sent to the same reducer, and the corresponding reduce function gets these pairs in a bunch. That is, the reducer receives all pairs with the same key as a contiguous stream, processes them, and that key never appears again.

The reduce function is expected to return unique (key,value) pairs obtained after processing all pairs sharing the same key. This is the result of the map–reduce job, which is written back to HDFS.

It is worthwhile to recall here the nice feature of Hadoop mentioned in Subsection 2.2: it supports map and reduce functions written in R, Python, C, ... Thus, to be able to start a job, it is sufficient to know only one language from the list.

To make the whole process more transparent, let us work through, on the conceptual level, the classical map–reduce example termed "word count", encountered in the big data literature as frequently as "Hello, World!" is encountered in introductory courses on high-level programming languages. The task of "word count" is to count the number of instances of each distinct word in a given text file. With the map–reduce implementation of Hadoop, a conceptual solution would be as follows.

21 it is the default option; if there is a need, it can be customized


Figure 11: schematic "word count".

1. Write a map function which, for a given line of text supplied by the mapper, first splits the line into separate words; after that, treating each word in the obtained list as a key, the function couples each word with the value 1 and emits a list of pairs (key,value)=(word,1).

2. Write a reduce function which counts the occurrences of each word in the set of (key,value) pairs supplied to the reducer running your function. Since the mentioned set of (key,value) pairs equals the sorted set of (word,1) pairs, realizing this task is quite simple. The function should emit a list of pairs (word, number of instances of the word).

Graphical illustration of the above algorithm is given in figure22 11.
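To connect the conceptual solution with code, below is a minimal Python sketch of such a map and reduce pair as it could be written for Hadoop's streaming facility (which is what allows map and reduce functions in R, Python, C, ..., as mentioned above). The file names mapper.py and reducer.py are illustrative; both scripts read lines from standard input and emit tab-separated (key,value) pairs to standard output.

# mapper.py: emit (word, 1) for every word of every input line
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))

# reducer.py: input arrives grouped by key, so counts of a word are contiguous
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))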

I end this section by emphasizing that Hadoop is usually installed on a large cluster of computers devoted to some community of users. In such a case, an interested user has to start Hadoop on some subset of reserved nodes, which constitute his Hadoop cluster and serve as a basis for HDFS and the corresponding manipulations. It is therefore important to get familiar with the details of this process within your community. For the case of the MIF VU cluster, the corresponding subsection follows.

3.6 Working with a Hadoop cluster on the MIF VU cluster*

Typical working scenario consists of the following steps.

1. Reservation of resources.

2. Spinning up a Hadoop cluster.

3. Running map–reduce job and/or doing some file manipulations.

4. Shutting down the Hadoop cluster and freeing the resources.

Listing 2 may serve as a pattern. The code before the comments reserves resources; modify it accordingly. Since the further comments are self-explanatory, I do not provide additional ones. Listings 27–28 in Appendix B show the contents of the scripts startHadoopCluster.sh and stopHadoopCluster.sh. These should be placed in your working folder /scratch/lustre/home/yourUserName and modified, if ever, with care.

When working with a Hadoop cluster, you will usually need to do some file manipulations such as copying files from your local file system to HDFS or vice versa. Listing 3 (with supporting comments) demonstrates how to do these basic operations. For a full list of commands, refer to the HDFS command guide.

22 taken from [Loc17]


Listing 2: typical bash script to work with Hadoop cluster on the MIF VU cluster

#!/bin/sh

#SBATCH -p short

#SBATCH -C alpha

#SBATCH --nodes=2 --ntasks-per-node=1 --cpus-per-task=12

#SBATCH --output=HadoopJob%j.out

#### Initialize the hadoop cluster

. /scratch/lustre/home/$USER/startHadoopCluster.sh

#### Execution of your job

echo "Execution of the job"

# Here comes your code for the scenario

# "start Hadoop cluster -> execute job -> exit".

echo

#### Stop the cluster

. /scratch/lustre/home/$USER/stopHadoopCluster.sh

Listing 3: exemplary manipulations with files within HDFS on the MIF VU cluster (interactive session)

#### Start interactive session. If there is a need ,

#### reserve resources first. Note that command prompt is

#### redirected to allocated node.

visk@lxibm102 :~$ srun --pty $SHELL

#### Initialize the hadoop cluster

visk@lxibm028 :~$ . /scratch/lustre/home/$USER/startHadoopCluster.sh

# Below is an example of moving local data to hdfs and back;

# dfs is a command preceding any command to be run on HDFS;

# basic command syntax is as follows:

# dfs copyFromLocal localFileOrFolder destFileOrFolder

# here comes factual implementation; note that everything what

# precedes dfs command points to our hdfs;

# we first make directory in HDFS; after that , we copy two local

# files and one local directory to HDFS; in HDFS we put all data

# to one directory , which is then copied back to local file system

visk@lxibm028 :~$ $HADOOP_HOME/bin/hadoop --config \

$HADOOP_CONF_DIR dfs -mkdir inputDir

visk@lxibm028 :~$ $HADOOP_HOME/bin/hadoop --config \

$HADOOP_CONF_DIR dfs -copyFromLocal \

/scratch/lustre/home/$USER/LocalData1.txt HDFSFile1.txt

visk@lxibm028 :~$ $HADOOP_HOME/bin/hadoop --config \

$HADOOP_CONF_DIR dfs -copyFromLocal \

/scratch/lustre/home/$USER/LocalData2.txt HDFSFile2.txt

visk@lxibm028 :~$ $HADOOP_HOME/bin/hadoop --config \

$HADOOP_CONF_DIR dfs -copyFromLocal \

/scratch/lustre/home/$USER/LocalFolder inputDir


visk@lxibm028 :~$ $HADOOP_HOME/bin/hadoop --config \

$HADOOP_CONF_DIR dfs -mv HDFSFile1.txt inputDir

visk@lxibm028 :~$ $HADOOP_HOME/bin/hadoop --config \

$HADOOP_CONF_DIR dfs -mv HDFSFile2.txt inputDir

visk@lxibm028 :~$ $HADOOP_HOME/bin/hadoop --config \

$HADOOP_CONF_DIR dfs -copyToLocal inputDir \

/scratch/lustre/home/$USER/LocalOutFolder

#### Stop the cluster

visk@lxibm028 :~$ . /scratch/lustre/home/$USER/stopHadoopCluster.sh

#### Finish session

visk@lxibm028 :~$ exit

3.7 A very brief introduction to Apache Spark

Up to now, we have introduced the main elements of the working model of Hadoop and said nothing about its offspring Apache Spark (further on simply Spark), which was mentioned in Section 2.2 as one of the important tools. It is time to say that Spark, not Hadoop, is the tool we will focus on in the forthcoming exposition. The reason is that Spark greatly improves upon the map–reduce realization of Hadoop. Additionally, it offers a full stack of other tools useful in big data analysis. The latter feature is very important, since one has to deploy only one tool instead of a bunch of others targeted at particular tasks. Hence the maintenance, cost (recall that Spark is free!) and ease of use, combined with the power provided, make Spark a very favorable alternative to other analogs. Though Spark tightly integrates with Hadoop, it does not require Hadoop to run; it can function equally well in standalone mode. It is natural to ask, then, why Hadoop was introduced so broadly. The reasons are below.

• Though Spark is likely to push out23 Hadoop, the latter is still in use (and will be for some time), and one can encounter many clusters where Spark is built over Hadoop. Frequently the users of these clusters do not use Hadoop to run programmes, yet it is used to maintain the file system.

• Hadoop is important due to historical reasons (in particular, due to the implementation of map–reduce).

• It is important to be familiar with Hadoop for reasons of compatibility and migration to other systems (with Spark being one, but not the only one, of them).

Keeping this in mind, we will not discard Hadoop entirely, and will provide several examples of integration with both R and Python.

In the rest of this subsection, we will describe the key elements of Spark. More details will arise in the subsequent sections. The reader interested in an exhaustive treatment right now is referred to the homepage of Spark, in particular to the latest documentation.

3.7.1 Spark's components

Spark's components are presented in figure24 12. Below is a brief description of each.

Spark Core is responsible for basic functionality, which includes scheduling of tasks, memory management, etc. In particular, it maintains the API used to operate on Spark's main abstraction termed Dataset25, which is described soon (the switch to Datasets will become clear after finishing Subsection 3.7) and which represents a model for the distribution of a big dataset and operation on it in parallel.

23 http://spark.apache.org/powered-by.html provides a list of companies making use of Spark; note the giants: Amazon, eBay, Yahoo!, ...

24 taken from [HKZ]
25 earlier versions of Spark (up to 2.0) used the term resilient distributed dataset (RDD)


Figure 12: Spark’s components.

Spark SQL is Spark's package designed to work with data within the usual relational database model by means of the structured query language. It supports various data sources. More than that, it allows the inclusion of SQL queries within programmatic code written in Python and R.

Spark Streaming is devoted to processing live streams of data. It is fast, fault tolerant and scalable, pretty much like Spark Core.

MLlib is a library offering common models encountered in machine learning. In its list one finds typical models for classification, regression and clustering. MLlib is well adapted to run on large clusters, which means that a data analyst gains the ability to build common models over large datasets in an acceptable amount of time and without exhausting workarounds. That is, one can work with big datasets in the same way as with the usual ones.

GraphX is a library designed to work with graphs (a very common example is a social network's friend graph) and perform parallel computations on them.

The bottom components (Standalone scheduler, Hadoop YARN, Apache Mesos) are shown to indicate that Spark can run over a variety of cluster managers. An important thing to note is that Spark has its own standalone manager and, in fact, does not need an external one. However, if such a manager does exist, Spark integrates well with it.
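To make the Spark SQL claim above concrete, here is a minimal Python sketch (assuming a SparkSession object named spark has already been created, as described below; the table and column names are illustrative) of mixing DataFrame code with an SQL query:

# register a tiny DataFrame as a temporary view and query it with SQL
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()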

3.7.2 Basic features of Spark's working model

My main goal is to describe interaction with Spark through the R and Python APIs and, at the same time, to shed light on Spark's basic functioning model. I accomplish this in three steps. First, I explain how to deploy Spark on a single machine in standalone mode. I assume that you do not have an existing Hadoop installation; if this is not the case, you should check version compatibility. Second, I describe the typical working model of a Spark application. Finally, I give short instructions on how to invoke and start using Spark from an R or Python program. More details are uncovered in later sections.

Installing Spark Spark is not available on the MIF VU cluster. That is why I explain in detail how to deploy it on your local machine. This, however, is not a big drawback when it comes to learning, since functioning on a cluster resembles functioning on a local machine. Moreover, it is very unlikely that you will be the person responsible for the deployment of Spark on a cluster.


To be able to work with it, you will only need to know how to launch it and allocate resources. The corresponding instructions should be provided by the team managing the cluster.

To install Spark on Linux :

• check whether you have Java; if this is not the case, install it;

• visit http://spark.apache.org/downloads.html and download the latest version suggested by default;

• place the downloaded file (say, spark-2.2.0-bin-hadoop2.7.tgz) into the directory you wish Spark to be installed in;

• untar the file by typing

tar -xf spark-2.2.0-bin-hadoop2.7.tgz

into your terminal window;

• add Spark's bin directory to your PATH for the current session by typing26

export PATH=$PATH:path_of_inst_dir/bin

alternatively, you can make a persistent entry in the .bashrc file, provided you are allowed to; for this:

– type into terminal

gedit ~/.bashrc

this will open .bashrc file;

– add to its end the previously shown export command

export PATH=$PATH:path_of_inst_dir/bin

– save the file, exit, and then type into the terminal

source ~/.bashrc

• check success of installation by typing

spark-shell

into the terminal window; in case of success, a picture similar to figure 13 will appear (do not worry about the lengthy output); to exit, press Ctrl+D.

To install Spark on Windows:

• check whether you have Java; if this is not the case, install it;

• visit http://spark.apache.org/downloads.html and download the latest version suggested by default;

• place the downloaded file (say, spark-2.2.0-bin-hadoop2.7.tgz) into the directory you wish Spark to be installed in; it is highly recommended to choose a root directory (say C:) without spaces in its name; however, if you do not have many rights, then choose one you can work in without limitations;

• extract contents of the file (free 7-Zip software is suitable for that);

• download winutils.exe from27 https://github.com/steveloughran/winutils and place it into the spark-2.x.x-bin-hadoop2.x\bin sub-folder corresponding to your Spark version;

26 note that there should be no spaces around the equality sign; also note that this command sets the required PATH entry only for the current terminal session, and you will need to execute this export statement each time you log onto the remote working machine

27 navigate to the corresponding hadoop-2.x.x folder, locate the subfolder bin and follow the link winutils.exe


Figure 13: Spark’s shell.

• create the following environment variables28:

SPARK_HOME=Spark's directory\spark-2.x.x-bin-hadoop2.x\bin

HADOOP_HOME=Spark's directory\spark-2.x.x-bin-hadoop2.x

e.g., in my case these look like

SPARK_HOME=C:\spark-2.2.0-bin-hadoop2.7\bin

HADOOP_HOME=C:\spark-2.2.0-bin-hadoop2.7

• append your PATH environment variable by adding

%HADOOP_HOME%;%SPARK_HOME%;

to its beginning (there should be no spaces); e.g., its value may look like (everything in a single line without spaces between the semicolons)

PATH=%HADOOP_HOME%;%SPARK_HOME%;%JAVA_HOME%\BIN;C:\windows\system32

• in your root directory (say C:), create a folder TMP with a sub-folder hive; start the command prompt console and type

%SPARK_HOME%\winutils.exe chmod 777 /TMP/hive

then exit command prompt;

• to check success of installation, start the command prompt again and type

%SPARK_HOME%\spark-shell

a picture similar to figure 13 should appear (do not worry about the lengthy output); to exit, press Ctrl+D.

28 the name of the variable is typed in uppercase, values are typed in lowercase; if you are unfamiliar with the process of creating such variables, then do the following:

a) start the command prompt, type SystemPropertiesAdvanced.exe, and hit enter; a pop-up window will appear;

b) hit the button Environment Variables; you will need the System tab;

c) to create a variable, hit the button New, then enter its name and value into the corresponding fields; note that you do not need the equality signs; e.g., for the first variable, the left hand side of the equality, i.e. SPARK_HOME, is the Variable name whereas the right hand side, i.e. Spark's directory\spark-2.x.x-bin-hadoop2.x\bin, is the Variable value;

modification of environment variables is carried out in the same way


Figure 14: Typical execution of Spark application.

Typical working model of a Spark application A Spark application is governed by a driver program. The latter contains the application's main function, which does the following:

• defines datasets, and, by making use of Spark internal tools, distributes them over the cluster nodes;

• applies parallel operations to the distributed pieces of datasets defined;

• retrieves and returns results of these operations.

There are two main abstractions used in the working model of Spark: distributed datasets and shared variables. Currently, two kinds of distributed datasets are supported. The first, and older, ones are called resilient distributed datasets (further on RDDs). The second are called DataFrames. On a high level, both types of datasets should be treated as collections of data pieces, partitioned across the nodes of the cluster and designed to be accessed and operated on in parallel. The differences under the hood make DataFrames more effective than the older RDDs. Therefore, Spark's documentation recommends switching to DataFrames. Since both types of datasets are still in use, we will make use of both in our future exposition; however, RDDs will be touched on very briefly.

When distributed pieces of data are processed in parallel on the set of executing nodes managed by the driver's main program, one may require shared variables to store and exchange information provided by the nodes. Spark supports two types of such variables: broadcast variables and accumulators. Broadcast variables are read-only and are used to cache in memory a value that all nodes need to know during the execution of a task. Accumulators are variables that are only "added" to, such as counters and sums.

For example, imagine that your program has to count the number of words in a bunch of files shipped to different worker nodes. Each worker node would then count the number of words in its assigned set and return the value. In order to collect the counts from the nodes, one could make use of accumulators. If there is any information to be shipped to the nodes (e.g., we would like to count only words whose length does not exceed n symbols), it could be broadcast by making use of broadcast variables.
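A minimal Python sketch of this word counting example (assuming an existing SparkContext named sc, e.g. spark.sparkContext; the length bound and the input path are illustrative) could look as follows:

# ship the length bound to all executors, collect the count via an accumulator
max_len = sc.broadcast(8)
n_words = sc.accumulator(0)

def count_short_words(line):
    for word in line.split():
        if len(word) <= max_len.value:
            n_words.add(1)

sc.textFile("hdfs:///some/input/dir/*").foreach(count_short_words)
print('Number of short words: %d' % n_words.value)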

In what follows, I will not dive into the actual implementation of the above abstractions, and I will focus mainly on the usage of DataFrames, which provide a high-level interface to the described abstractions. These details will not be of great relevance. In case of need, you can always turn to the documentation and examples given on Spark's homepage. Reference [HKZ] is also very exhaustive, provided you are interested in the older RDDs.

To finish, it is worthwhile to mention that, in the case of our standalone installation on a single machine, the driver program and executing processes will run locally. In the case of a cluster


mode, the flow of execution is well reflected by figure 14. Here the driver runs on one abstract node, whereas the executors run on the rest. An important thing to stress is that Spark automatically takes the pieces of code wrapped into functions in your single driver program and ships them to the executor nodes to be run in parallel. As you will see later, this is not always the case when employing other software designed for parallel computations; that is, it may happen that you have to take care of this part on your own.

Linking Spark with R and Python There are three ways to link an application with Spark, independently of the language (R or Python) in use.

1. Start interactive Spark shell.

2. Write a script and submit it to Spark for an execution.

3. Invoke Spark from your IDE.

Below I describe each.

Interactive shell To start Spark's interactive shell on Linux29:

• open terminal window;

• for R session type

sparkR

for Python session type

pyspark

To start Spark ’s interactive shell on Windows:

• open command prompt window and navigate to your Spark installation directory;

• for R session type

bin\sparkR

for Python session type

bin\pyspark

Assuming you have followed the installation above and have given the SPARK_HOME environment variable a value ending with bin, an alternative way to start is by typing

%SPARK_HOME%\sparkR

for R session, and

%SPARK_HOME%\pyspark

for Python session.

Once done, the corresponding shell will be spawned, and you will be able to work interactively: create datasets, manipulate them, and analyze. In my opinion, working directly from the shell is the least convenient option because of the absence of the facilities supplied by an IDE. Hence, I will not focus on it and rather move on to the remaining methods.
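Still, for a quick check that the shell works, one might type, e.g., the following two lines into the pyspark shell (the session object spark is created for you automatically, as explained below):

df = spark.range(1000)             # DataFrame with a single column named 'id'
df.filter(df.id % 2 == 0).count()  # should return 500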

Script submission Submitting a script to Spark is a good alternative, especially when it comes to working on a cluster. Submission is simple. To submit a script on Linux:

• open terminal window;

• regardless of the type of the script (R or Python), type30

29 recall that you may need to run the previously described export statement (or the .bashrc entry) so that your Spark installation's path is on the PATH

30 again, recall the .bashrc entry


spark-submit your_script

To submit script on Windows:

• open command prompt window and navigate to your Spark installation directory;

• regardless of the type of the script (R or Python) type

bin\spark-submit your_script

or, alternatively (under installation assumption described above),

%SPARK_HOME%\spark-submit your_script

Below is an example. Suppose we want to count the number of lines in a given text file. Scripts 4–5 demonstrate that for R and Python correspondingly. Both create a DataFrame structure from the file of interest and then run its method count(), which does the job. Assuming the R script was given the name31 countLines.R, submission would be done by typing

spark-submit countLines.R

on Linux, and

%SPARK_HOME%\spark-submit countLines.R

on Windows. An important thing to learn for the future is that interaction with Spark is realized via its SparkSession object. When an interactive shell is spawned as described above, a SparkSession is created automatically and is available for referencing as spark in Python and sparkR in R. In case of script submission, one has to initialize the SparkSession object on one's own and then start using its methods for the creation of datasets.

Remark 3.1. In Python, as well as in previous versions of the SparkR package, it is pretty common to use the SparkContext object of the pyspark package to do the described initialization, and you may still encounter plenty of such examples in the literature. To make the code in both languages more similar and easier to follow, I have decided to describe the interaction via the sub-package pyspark.sql. Another advantage of using pyspark.sql.SparkSession is the ability to create datasets of type DataFrame, which are newer and more effective than the earlier mentioned RDDs directly available in pyspark by making use of SparkContext. □

Listing 4: count lines with Spark in R

# import required library

library(SparkR)

# Initialize SparkSession

sparkR.session(appName = "Spark DataFrame example")

# Path to file to process

lPath = "Path_to_file/lines.txt"

# Create a SparkDataFrame from a text file

lines <- read.text(lPath)

# count number of lines and print result

nOfLines <- count(lines)

cat('Number of lines is equal to ', nOfLines, '\n')

# Stop SparkSession before exit

sparkR.session.stop()

# to run this script use

# spark-submit countLines.R

31 in the Linux case, the below submission should work provided your script resides in the $HOME directory; in the Windows case, the script should reside in SPARK_HOME; otherwise, specify the full path


Listing 5: count lines with Spark in Python

# import required object

from pyspark.sql import SparkSession

# Create SparkSession under the name spark

spark = SparkSession\

.builder\

.appName("Spark DataFrame example")\

.getOrCreate ()

# Path to file to process

lPath = "Path_to_file/lines.txt"

# Use SparkSession ’s method to read text file into the

# DataFrame structure

lines = spark.read.text(lPath)

# count number of lines and print result

print('Number of lines is equal to ' + str(lines.count()))

# Stop SparkSession before exiting the program

spark.stop()

# to run this script use

# spark-submit countLines.py

Invocation of Spark within an IDE It is much more convenient to work with Spark from an IDE than from an interactive shell, since here you can make use of the advantages offered by the IDE. Invocation of Spark within an IDE is the same as the invocation within a script described above. That is, one has to make use of the SparkSession object in exactly the same way. However, there are a few small steps to take before starting to use it. In R, one has to execute the following lines

spark_home <- gsub("\\","/",Sys.getenv(c("SPARK_HOME")),fixed=TRUE)

spark_home <- gsub("/bin","",spark_home)

Sys.setenv(SPARK_HOME=spark_home)

.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"),

.libPaths ()))

on Windows, and the following

spark_home <- gsub("/bin","",Sys.getenv(c("SPARK_HOME")))

Sys.setenv(SPARK_HOME=spark_home)

.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"),

.libPaths ()))

on Linux (note that you may need to export SPARK_HOME before that). These lines add the required path to the list of paths containing R libraries, and you will be able to load the package SparkR directly from your environment. Hence, after execution of the above lines, you can run script 4 directly from your IDE, or you can type it in and execute it line by line. Either way, it will produce the same result, with output to the console of your IDE. You can also create new datasets and manipulate them during your session.

In Python (regardless of the OS), you first have to install the package findspark. Then, in your IDE, run the following lines

import findspark

findspark.init('path_to_your_spark_home_dir')

from pyspark import SparkContext , SparkConf

sc = SparkContext(master="local")

After that, you can run script 5 directly from your IDE, or work with Spark interactively. Note, however, that here we initialize by making use of the previously mentioned SparkContext object. By


doing so we can, in fact, stay with this object, and we do not require a SparkSession unless we want to work with DataFrames.
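If you do want DataFrames in the IDE (e.g., in order to run script 5 unchanged), a straightforward variant of the above initialization creates a SparkSession instead; this is only a sketch, using the same placeholder path as above:

import findspark
findspark.init('path_to_your_spark_home_dir')

from pyspark.sql import SparkSession
# the builder reuses an already running context if there is one
spark = SparkSession.builder.master("local").appName("IDE session").getOrCreate()
sc = spark.sparkContext  # the SparkContext remains available if needed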

3.8 Virtual machines on the MIF VU Cloud*

It may turn out that, for some reason, you would like to make use of the virtual resources provided by the MIF VU. In this subsection, I will discuss deployment of a virtual machine (further on abbreviated as VM). Here are several motives for becoming interested in this opportunity:

a) it may happen that you were unable to configure Spark on your local machine; the VM described below comes with the software required for your training sessions and/or assessments already installed;

b) you may be interested in having several VMs at your disposal because of other projects; for example, think about your thesis;

c) you are simply interested in cloud computing; setting up several machines and experimenting with them would yield certain practical insights.

I have created a VM template. It comes with the following: Spark, R, RStudio IDE for R, Python, Spyder IDE for Python. After accomplishing the steps of the instruction below, you will be able to connect remotely to such a VM and modify it as you wish, since this VM will make use of your resources and you will have administrator's rights. You will be able to do virtually everything you want: install additional software, run programmes, etc. Of course, the limits will be set by the amount of resources (memory, CPU, number of machines, etc.). While going through the steps below, you will become familiar with the actual amounts at your disposal as a member of the MIF VU community. While reading the instruction, inspect figures 15–16, and also get familiar with the remarks at the very end of the subsection. These activities should be helpful.

Instruction for setup of VM on the MIF VU Cloud

1. Visit https://grid5.mif.vu.lt/cloud3/one/login and log on (your mif user + password).

2. On the left, choose Settings → Info; activate the field named Public SSH Key and paste your ssh key, which is used by your WinScp client32 (see Subsection 3.1.2).

3. On the left, choose Virtual resources → Virtual machines. In the displayed window, on the right, hit the green plus. Select the template having id=976 (owner=visk, name=win-vm-for-BDA). Hit the button Create. This will start the process of creating the VM.

4. Wait until the status changes from Pending to Running (use the Refresh button to track changes of state). Check the check-box to select your VM. After that, just hit the id of your machine, and you will be redirected to the info page.

5. Refer to the Attributes section. If you are a Windows user, locate the attribute named CONNECT_INFO2; if you are a Linux user, locate the attribute named CONNECT_INFO1. Copy the text of this attribute (similar to mstsc.exe /v:193.219.91.104:9679), open your command prompt (for Windows users) or terminal window (for Linux users), and paste that text. Press Enter.

6. In case of success, a remote desktop dialog will appear. Use the credentials user=Administrator, password=BigData2018 to connect to your remote machine. Once connected, you're done, and you can start working with your fresh copy of the VM. In case of rejection, proceed to the next item.

7. Return to the Virtual resources → Virtual machines tab, check your VM and press the monitor icon at the right end of your VM's record line. This will launch a desktop in the web browser. Use the same credentials as above to connect to the VM. Check whether the Internet browser works, since it may happen that there are problems with the Internet connection of your VM which result in the inability to connect remotely. For a fix, run the usual Windows diagnostics tool for connection problems. I hope this should suffice. However, in the worst case scenario, you

32 this is necessary for the remote desktop connection


Figure 15: Virtual machines tab.

can still work via the browser. Though it is not as convenient as the remote desktop, it is better than nothing.

After finishing your work, do not forget to turn off your virtual machine. For this, return to the Virtual resources → Virtual machines tab; select your machine's ID (check-box near ID); locate the black square button, similar to that found on media players; choose undeploy.

Several remarks regarding your VMs on the MIF VU Cloud

• Do not forget to change the password of your Administrator user account. Also, do not forget that each time you wish to log onto your VM remotely, it should be started first. Hence, you first have to log onto the MIF VU Cloud, start your machine, and, only after making sure that its status is Running, try to log on from your local machine. Even when the VM is running, it may take a moment before it is ready to accept remote connections. Hence, be patient.

• To save disk space, both R and Python come with a minimal base installation. The installation of additional R packages may be accomplished by making use of the RStudio interface or the install.packages('packageName') command. For the Python case, I have installed the miniconda distribution, which supplies you with the Anaconda command prompt (see the list of programmes in the start menu of your fresh VM). By making use of the latter, you can install additional packages. For example, launching the Anaconda prompt and typing

conda install statsmodels

will install statsmodels and some dependencies (with numpy, scipy and pandas in the list). You can also make use of regular pip install packageName, or you can look for alternative ways.


Figure 16: Settings tab.

• You should have noticed that your VM has ≈ 5 GB of free space on the root disk C:\. Probably you have also inspected other characteristics such as RAM, CPU speed, etc. Some of these can be modified via the web interface you have just used to create your VM. For this, go to the Virtual resources → Virtual machines tab, select your machine's ID (check-box near ID), and hit the id of your machine. Once redirected to the info page, browse the other pages (capacity, storage, ...). Some things should be clear at a glance; however, for advanced modification, one should refer to the OpenNebula documentation.

• Continuing on resources, your quotas are available via the same web interface, under Settings → Quotas (see figure 16). For example, each standard user is allowed to have at most 5 VMs.

• Following pretty much the same pattern, yet choosing other templates, you can create VMs running OSs different from Windows. Again, this requires deeper knowledge and reference to scattered sections of the documentation. To have full capabilities, you should also work under the user view (see figure 16), which is the default one; the cloud view provides a simpler interface and is designed for those only wishing to create VMs from existing templates.

Several warnings for users wishing to create Linux VMs:

– after setting up a VM, for most Linux-type OSs, you will be able to connect for the first time only remotely via ssh;

– it is very likely that you will be logged on not as a root user, and management of accounts will pose a challenge.

3.9 Tasks

3.1. Refer to https://slurm.schedmd.com/sbatch.html. Create a script which:

• defers allocation of your job for 10 seconds after the actual submission (option --begin=time);


• requests SLURM to grant one node satisfying some constraint (options --nodes=minNoOfNodes and --constraint=list);

• prints a phrase "Hello, my name is actual name of the node executing the command";

• writes its output to the file specified by you with the help of the option -o.

Once your job is submitted, check its status by making use of squeue.

3.2. Log on to the cluster. By making use of the scp utility and appropriate command line commands, do the following:

• import file from remote machine;

• import folder from remote machine;

• move the imported file to the imported folder and print the contents of the folder (i.e., a list of files and sub-folders, if any) to the console;

• create an empty folder and copy the contents of the imported folder to it;

• create one more file in the folder of the previous item; move the whole folder to your local machine;

• free resources (i.e., delete all created folders) and log out.

Repeat the task by making use of some third-party software. Compare the ease of use.

3.3. Write a bash script which spins up a Hadoop cluster on two nodes, performs some manipulations with files and shuts down the Hadoop cluster. Log on to the cluster and test it.

3.4. Write an R and/or Python script which retrieves user and system information and prints it to the standard output. Log on to the cluster. Reserve two nodes and start an interactive session. Run your script.

Repeat the same task by wrapping your script into a bash script which reserves the same amount of resources and then runs your script.

3.5. Write R and/or Python script which does the following:

• creates a local data structure (in the case of R this should be an ordinary R data.frame; in the case of Python this should be an ordinary list) having elements 1, 2, 3;

• creates a Spark DataFrame (or RDD) from the local structure of the previous step and writes it to a text file on your disk.

Test your script by making use of spark-submit as well as interactive invocation from your IDE.

Instruction. To accomplish the task, you may need to refer to the documentation or make use of the short instructions below.

Python case In the case of Python, you can use the method parallelize of the SparkContext object. The latter method creates a resilient distributed dataset (RDD) from a local structure by the simple assignment

rdd = sc.parallelize(local_data_list)

where sc is an instance of SparkContext. An RDD has a method saveAsTextFile(path). The latter saves its contents to the location provided by the path argument.
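Put together, a minimal sketch of the Python part of the task might look as follows (the application name and the output directory are illustrative; note that saveAsTextFile writes a directory of part files and expects that directory not to exist yet):

from pyspark import SparkContext

sc = SparkContext(master="local", appName="task 3.5")
rdd = sc.parallelize([1, 2, 3])      # local list -> RDD
rdd.saveAsTextFile("task35_output")  # written as a directory of part files
sc.stop()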

R case The SparkR package provides the method createDataFrame(local_data_frame) which converts an R data.frame or list into a Spark DataFrame. Once the Spark DataFrame is created, you can make use of the method write.text(x, path), where x denotes the Spark DataFrame.

3.6. Create two virtual machines on the MIF VU Cloud and carry out several experiments on your own. Suggested types of experiments:

• installation of software;

• communication between two machines.


4 Several software options for parallel computations

Parallel computations lie at the core of big data analysis. Consequently, it is very important to be informed about software tools designed for this task. Since the area is very dynamic, it seems hardly possible to track all developments. Therefore, I shall describe only a few options available to R and Python users. In my opinion, a short list is not a drawback, since it reflects a major part of the variety of approaches used in handling the task of parallel computation. That is, different packages may offer other options, yet they will usually operate in a way similar to one of those described below.

To make the explanations feasible and accessible to readers with different backgrounds, I will set aside complex models and go through a trivial parallelization task, approaching it in the different ways mentioned above. Hence, assume we need to estimate P(X > a), with X ∼ N(0; 1) and a given a, by means of Monte Carlo simulations33. The conceptual algorithm is as follows:

• simulate large sample of independent copies of X;

• calculate proportion of sample members exceeding a.

To achieve good precision, the sample size n should be large. Though the required magnitude may be assessed analytically, this is of little concern in our case, and we will assume that 1000000 independent copies of X suffice. The introduced problem may be treated as a big data problem provided the sample size n is so large that computations on a local machine would last too long. If this is the case, moving to a cluster could solve the problem. In order to illustrate the differences as well as the complexities arising from the employment of different parallelization tools, the "plain" solution is given first34 for each computational environment. In what follows, I will quite often refer to the above problem as the mean MC problem.
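For readers who think in Python, a minimal non-parallel sketch of the same estimator (mirroring the R plain solution in listing 6 below; numpy is assumed to be available) could look as follows:

import numpy as np

def tail_estimate(a, n, seed=1):
    """Monte Carlo estimate of P(X > a) for X ~ N(0, 1)."""
    rng = np.random.RandomState(seed)
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    return (x > a).mean()

# for a = 0 the exact value is 0.5
print(tail_estimate(a=0.0, n=1000000))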

4.1 Parallelization with R

A straightforward solution is given in listing 6.

Listing 6: plain solution of the mean MC problem

# Function

tail.function <- function(a,n){

set.seed (1)

x <- rnorm(n = n, mean = 0, sd = 1)

return(mean(x>a))

}

# Test (difference equals to 0.000187)

a <- 0

n <- 1000000

tail.function(a = a,n = n)-pnorm(q = a,mean = 0,sd = 1,

lower.tail = FALSE)

Below come the options described in [Loc17].

4.1.1 Parallelization with different versions of lapply

Recall that the R function lapply is used to apply a given function over a list. Listing 7 shows an exemplary modification of the plain solution rewritten by making use of this function. Functionally, it is equivalent to that of listing 6, and the added complexity therefore seems redundant. However, it is perfectly suited for trivial parallelization such as that encountered in our Monte Carlo simulation. To see this, note that a speed-up may be attained as follows:

• prescribe execution of tail.function with a smaller value of n to different workers (nodes, CPU cores, abstract threads, etc.); e.g., force each worker to execute tail.function with the sample size argument equal to n/k, where k is the number of workers available;

33 problems of this kind occur quite naturally in different applications where it is difficult or impossible to estimate the quantity of interest analytically

34 this would also suffice, provided the node employed has enough resources


Listing 7: reworked plain solution of the mean MC problem

# Function

tail.function <- function(a,n){

x <- rnorm(n = n, mean = 0, sd = 1)

return(mean(x>a))

}

# Calculation of estimate

a <- 0

n <- 1000000

nOfSubsets <- 10

sizeOfSubset <- n/nOfSubsets

estimates <- lapply(X = seq(1: nOfSubsets), FUN = function(i)

tail.function(a=a,n=sizeOfSubset))

finalEstimate <- mean(unlist(estimates))

• combine estimates obtained from different workers as it is done in the final line of listing 7.

Since the workers execute tail.function with a smaller value of the parameter n and can work independently in parallel, the job should be done faster. The bigger the value of n, the more tangible the gain. In what follows, this approach is demonstrated by making use of several versions of lapply.

mclapply()

The simplest form of parallelization is to employ the maximum possible number of CPU cores of the machine at hand and distribute the task among them. This option is offered by the function mclapply() coming with the parallel package. The corresponding code is given in listings 9 and 8 (the latter being carried over through the rest of the examples of this subsection). Below are explanations of some of the numbered lines.

Listing 8: contents of MC_tail_function.R

1 # Function

2 tail.function <- function(argVec){

3 a <- argVec [1]

4 n <- argVec [2]

5 x <- rnorm(n = n, mean = 0, sd = 1)

6 return(sum(x>a))

7 }

Listing 9: solution of the mean MC problem by making use of the mclapply() function

1 library(parallel)

2

3 #1) Function and arguments

4 source('MC_tail_function.R')

5

6 # Parameters

7 a <- 0

8 n <- 1000000

9 nOfWorkers <- detectCores(logical = TRUE)

10 sizeOfSubset <- as.integer(n/nOfWorkers)

11

12 # Construction of argument

13 argMatrix <- matrix(data = c(rep(c(a,sizeOfSubset),

14 times = nOfWorkers -1),c(a,n-(nOfWorkers -1)*sizeOfSubset)),

15 nrow = nOfWorkers , ncol = 2, byrow = TRUE)


16 argList <- split(x = argMatrix , row(argMatrix))

17

18 #2) Call

19 estimates <- mclapply(X = argList , FUN=tail.function ,

20 mc.cores = nOfWorkers)

21

22 #3) Final estimate

23 sum(unlist(estimates))/n

Line 4 sources the contents of listing 8, which defines a modification of the plain solution given in listing 6. Vector arguments are adopted for variety.

Lines 7–23 do the rest of the task. As compared to the standard lapply function, the call to mclapply

has an additional argument mc.cores prescribing the number of cores to be used. The latter is determined by invocation of the function detectCores(logical = TRUE) with the argument logical = TRUE, pointing out that we wish to detect the maximum number of logical cores available.

Figure 17 depicts the sequential increase in performance on a machine having 12 physical and 24 logical cores. One can see that the gain in performance levels off after the number of cores reaches 12.

Figure 17: visual illustration of gain in performance due to number of cores employed.

The main advantage of mclapply is shared memory. That is, all parallel tasks see variables defined before the parallel portion of code because everything runs on the same machine. Such an approach, however, limits resources (amount of memory and number of cores) to those of the machine at hand. The forthcoming version of lapply bypasses this drawback.

parLapply()

parLapply comes with the package parallel as well as the package snow. In contrast to mclapply, it

offers the ability to execute parallel tasks on a cluster composed of CPUs of more than one node. Listing 10 demonstrates the solution of the mean MC problem by means of parLapply. The latter code was run interactively by making use of the bash commands (the corresponding output is also printed) given in listing 11. Here the first command (line 1) allocates 3 nodes. The second command (line 7) runs the script, whereas the third one (line 9) releases the resources.


Listing 10: R script for solution of the mean MC problem by making use of parLapply() function

1 library(snow)

2 library(Rmpi)

3

4 #1) Initialization

5

6 # Function

7 source('MC_tail_function.R')

8

9 # Parameters

10 a <- 0

11 n <- 1000000

12 noOfNodes <-3

13 noOfcpusPerNode <- 12

14 nOfWorkers <- noOfcpusPerNode*noOfNodes

15 sizeOfSubset <- as.integer(n/nOfWorkers)

16

17 # Construction of argument

18 argMatrix <- matrix(data = c(rep(c(a,sizeOfSubset),

19 times = nOfWorkers -1),c(a,n-(nOfWorkers -1)*sizeOfSubset)),

20 nrow = nOfWorkers , ncol = 2, byrow = TRUE)

21 argList <- split(x = argMatrix , row(argMatrix))

22

23 #2) Call

24 cl <- makeCluster(noOfNodes)

25 clusterExport(cl,"tail.function")

26 tm<-snow.time(sums <- parLapply(cl = cl, x = argList ,

27 fun=tail.function))

28

29 #3) Final estimate

30 finalEstimate <- sum(unlist(sums))/n

31

32 #4) Display of results

33 print(finalEstimate)

34 print(tm)

35

36 #5) Recommended exit

37 stopCluster(cl)

38 mpi.quit()

Listing 11: bash commands for interactive execution of the MC_parLapply.R script

1 visk@lxibm102 :~$ salloc -N3

2 salloc: Pending job allocation 538240

3 salloc: job 538240 queued and waiting for resources

4 salloc: job 538240 has been allocated resources

5 salloc: Granted job allocation 538240

6

7 visk@lxibm102 :~$ R CMD BATCH MC_parLapply.R

8

9 visk@lxibm102 :~$ scancel 538240

10 salloc: Job allocation 538240 has been revoked.

Below are comments on the minor yet significant changes to the code of the R script.

Lines 1–2 load the required packages. Rmpi is needed for manipulating the cluster by making use of the Message Passing Interface (MPI).

Lines 10–21 repeat the steps already met previously.


Line 24 creates the required cluster object. The number of nodes should be set explicitly. We know that we have reserved 3 nodes (each having 12 CPUs). One could also make use of the function mpi.universe.size(), giving the total number of CPUs available in the cluster. However, there is no guarantee that it will function properly. One could also first detect the number of CPUs available on each node by making use of the previously mentioned function detectCores. It should be run on each node separately.

Line 25 calls the clusterExport(clusterObject, variableNames) function. The latter call exports tail.function, making it visible to all members of the cluster object cl. Omitting this line would cause the code to work improperly since, in this case, memory (in contrast to parallelization with mclapply) is not shared. Note the role of the cl argument: it supplies parLapply with the cluster over which parLapply parallelizes execution.

Line 26 executes parLapply over the list argList. Additionally, we make use of the snow.time function, which calculates execution times on each of the nodes (line 34 prints the corresponding information).

Lines 37–38 implement a proper exit and the release of reserved resources.

There are several important remarks to be given at this point.

1. By default35, line 24 of listing 10 creates a cluster which uses MPI for parallel operations and is available provided your platform has MPI installed and properly configured. Clusters of this type are preferable because one can utilize any high-performance networking and does not need to care about the host names of the nodes involved: the function makeCluster takes only the number of nodes, and all the rest is done by the managing environment. However, in addition to the installation of MPI, one needs to have the R package Rmpi. Reserved resources are managed within the executed script MC_parLapply.R by snow on its own.

In case of other types of clusters one should specify host names of nodes (if there is a need,consult documentation of snow or parallel for details), and there could be speed limitationsoccurring due to the type of networking used. On the other hand, there disappears a needof installation of the above mentioned software.

2. One should not forget to explicitly distribute all needed data (including functions as well) to worker nodes using the clusterExport function.

3. snow offers other versions of lapply such as parSapply and parApply, which are parallel versions of sapply and apply. There is also an ability to apply some function on each node with the same or different arguments (see the snow functions clusterCall and clusterApply). These may be useful in case there is a need to retrieve some specific information corresponding to particular nodes. A minimal sketch illustrating remarks 1 and 3 is given right after this list.
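For illustration, here is a minimal sketch (not tied to the MIF VU cluster; the two-worker socket cluster and its host names are hypothetical) showing a cluster specified by host names together with clusterCall and clusterApply from the parallel package:

library(parallel)

# a socket (PSOCK) cluster given by explicit host names; on a single
# machine one could equally write makePSOCKcluster(2)
cl <- makePSOCKcluster(c("node01", "node02"))

# clusterCall runs the very same call on every worker ...
clusterCall(cl, function() Sys.info()[["nodename"]])

# ... while clusterApply ships one element of the list to each worker
clusterApply(cl, x = list(1:10, 11:20), fun = sum)

stopCluster(cl)

Such per-node calls are handy, e.g., for checking which libraries or how many cores each node actually has before launching the main job.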

4.1.2 Parallelization with foreach

The foreach package does essentially the same as lapply, i.e., it is designed to apply some portion of code in parallel while iterating over a given list and then return the results in the form of a list. The main difference lies in the coding. Listing 12 illustrates this: it implements the plain solution with the help of the package foreach. One can see that the only difference, as compared to listing 7, is in the calculation of the estimates list. This time, iteration through the list is done in a more explicit manner.

Listing 12: R script for solution of the mean MC problem by making use of foreach package

1 library(’foreach ’)

2

3 # Function

4 tail.function <- function(a,n){

5 x <- rnorm(n = n, mean = 0, sd = 1)

6 return(mean(x>a))

7 }

[Footnote 35] Assuming that the version of Rmpi is not an old one.


8

9 # Calculation of estimate

10 a <- 0

11 n <- 1000000

12 nOfSubsets <- 10

13 sizeOfSubset <- n/nOfSubsets

14

15 estimates <- foreach(x = seq(1: nOfSubsets)) %do%

16 {tail.function(a=a,n=sizeOfSubset)}

17

18 finalEstimate <- mean(unlist(estimates))

Thus, it is not surprising that the logic behind parallelization with foreach is essentially the same as in the case of lapply. One can parallelize by means of additional cores of the single machine at hand (the analogue of the mclapply version) or by employing the CPUs of a cluster at hand. In either case, one simply needs to load an additional library and register an appropriate parallel back-end. Listings 13–14 demonstrate the corresponding implementations. One sees that the changes of code are negligible. In the case of script 13, the code

source('MC_function_plus_args.R')

includes the portion of code equal to lines 4–16 of script 9. In the case of script 14, it includes the lines of code equal to lines 7–25 of script 10.

Listing 13: R script for solution of the mean MC problem by making use of the foreach package and a specified number of cores of a single machine

1 library(foreach)

2 library(doMC)

3 library(parallel)

4

5 #1) Function and arguments

6 source(’MC_function_plus_args.R’)

7

8 #2) Call

9 # registerDoMC takes the number of cores as a parameter , i.e. it

10 # prescribes the number of cores to be used for parallelization;

11 # the number may be determined prior to writing code or on the run.

12 registerDoMC(nOfWorkers)

13 sums <- foreach(i = argList) %dopar% {tail.function(i)}

14

15 #3) Final estimate

16 sum(unlist(sums))/n

Listing 14: R script for solution of the mean MC problem by making use of the foreach package and a specified cluster

1 library(foreach)

2 library(doSNOW)

3 library(Rmpi)

4

5 #1) Function and arguments

6 source(’MC_function_plus_args.R’)

7

8 #2) Call

9 # registerDoSNOW takes cluster object as an argument;

10 # the rest remains unchanged.

11 registerDoSNOW(cl)

12 sums <- foreach(i = argList) %dopar% {tail.function(i)}

13

14 #3) Final estimate


15 finalEstimate <- sum(unlist(sums))/n

16 print(finalEstimate)

17

18 #4) Recommended exit

19 stopCluster(cl)

20 mpi.quit()

4.1.3 Parallelization with Hadoop

All considered solutions of our Monte–Carlo mean problem worked well and gained a good increase in performance. One of the main reasons for that was the fact that the problem itself did not involve any external data: all the data was generated on the fly. It is obvious that in practice, though being met, this case is not the most frequent one, and external data (large or small) is usually present. In such a case, the lapply and foreach based parallelization discussed up to now works well, provided the amount of data involved is relatively small in the sense that it may fit and be effectively processed on a single node. To clarify the idea, imagine that our problem of estimation of P(X > a) should be carried out by making use of an external datafile containing the observed sample. Then the plain solution would be obvious:

• load the datafile into a vector type structure;

• compute the fraction of elements > a.

Here is the corresponding code:

x <- read.table("/sample.csv", header=TRUE, sep=",")

estimate <- mean(x>a)

What if sample.csv is too large to load into memory? For such a simple case, there are a lot of packages offered by the R community (e.g., bigmemory, ff, LaF and their wrappers). One could also split the file on one's own, process the pieces on different nodes, and combine their outputs. However, proceeding in this way is, in fact, proceeding in the spirit of Map–Reduce (see subsection 2.1.4). Since several R packages offer integration with Hadoop (see footnote 36), considering the tools introduced up to now, it is seemingly the best way to deal with problems of this kind, both in terms of performance and programming effort. Therefore, I shall focus on this approach and omit other alternatives.
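To make the "split and combine" idea concrete, here is a minimal single-machine sketch using only base R (illustrative only; it assumes a hypothetical file sample.csv with a header line and one numeric value per line): the file is read through a connection in chunks, so it never has to fit into memory at once.

a <- 0
con <- file("sample.csv", open = "r")
invisible(readLines(con, n = 1))          # skip the header line

total <- 0; exceed <- 0
repeat {
  chunk <- readLines(con, n = 100000)     # the chunk size is arbitrary
  if (length(chunk) == 0) break           # end of file reached
  x <- as.numeric(chunk)
  total <- total + length(x)
  exceed <- exceed + sum(x > a)
}
close(con)

estimate <- exceed/total

Distributing such chunks over several nodes and summing their partial counts is precisely what the Map–Reduce approach below automates.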

Beyond the previously mentioned R packages, developed by the R community and intended for integration with Hadoop, there is the Hadoop streaming utility, which is intended for the same purpose and comes with the Hadoop installation. In this subsection, I shall describe an implementation of a Map–Reduce job by making use of it, leaving the rest of the packages untouched (the interested reader can consult reference [Pra13]). The choice may be reasoned as follows:

• though I did not conduct any testing, and my impression is based on readings of the literature, it seems that there is no big difference between the tools when talking about convenience or performance;

• Hadoop streaming is the utility which implements the nice feature of Hadoop mentioned previously; namely, the use of that utility is essentially the same with any of the languages supported by Hadoop for an implementation of a Map–Reduce job; thus, readers favoring other languages can immediately start using it just after getting familiar with the examples provided below;

• I have already mentioned (see Subsection 2.2) that my major focus does not lie in extensive use of Hadoop for big data analysis, since (talking about this purpose) it may be viewed as a retreating tool; for analysis, I do recommend Apache Spark; consequently, the goal of this subsection is to introduce, but not to dive into details.

The conceptual implementation of a Hadoop Map–Reduce job was already described at the very end of Subsection 3.5. According to it, we need to write map and reduce functions in R and to deliver them to Hadoop. Our Monte–Carlo problem is quite trivial, and since all we need for it to be solved is an ability to operate with a large file, the conceptual solution will be as follows:

[Footnote 36] To name a few appearing most often in the literature up to date, the list would include such packages as RHipe, HadoopStreaming and RHadoop, the last of which is, in fact, a collection of three packages [Pra13].


• the mapper will take a chunk of the data file, compute the sum required for the computation of the final estimate, and emit its value with a key equal to one;

• the reducer will compute the final estimate from the mappers' output.

The code of the mapper's and reducer's functions is given in listings 15–16. In order to run a Hadoop Map–Reduce job, one, first of all, needs to spin up the Hadoop cluster as described in subsection 3.6 (for the case of the MIF VU cluster). Then, one also needs to run the Hadoop streaming utility with an appropriate set of parameters. Finally, it remains to get the output and shut down the Hadoop cluster. The corresponding code is given in listing 17. Here are the comments to selected lines.

Listing 15:

line 1 is a shebang line, necessary to inform the managing environment about the type of the script to be run; because of this line, the managing environment is able to start the corresponding interpreter;

line 6 opens a file from which the data will be read; 'stdin' is an abbreviation for standard input; in the present case the standard input is the file passed by us to the Hadoop streaming utility via its option -input (line 26 of listing 17);

lines 8–14 produce (key, value) pairs as described previously;

line 15 writes the obtained (key, value) pair to the standard output, which in our case is the mapper's output (footnote 37), and at the same time may be viewed as the reducer's input (footnote 38), which is opened on line 5 of listing 16; note the format of writing: the key is followed by a tab symbol, which is followed by the value, which, in turn, is followed by the newline symbol (see Subsection 3.5 for a description of this format);

line 16 is a recommended exit; that is, to handle the process properly, one needs to close the connection to the standard input.

Listing 15: mapper’s code corresponding to the MC problem

1 #!/usr/bin/env Rscript

2

3 running_sum <- 0

4 a <- 0

5

6 input <- file('stdin', open='r')

7

8 while (length(currentLine <- readLines(input , n=1)) > 0) {

9 running_sum <- running_sum +

10 as.numeric(as.numeric(currentLine)>a)

11 }

12

13 value <- running_sum

14 key <- 1

15 cat(key ,’\t’, value ,’\n’,sep=’’)

16 close(input)

Listing 16:

line 5 opens the reducer's input obtained after the map and shuffle phases;

lines 7–10 read the lines from the mapper's output, split them, and extract the values, which are then added to the total sum;

lines 12–14 compute the final estimate and write it to the standard output; the latter output is the output of the whole Map–Reduce job; it is copied from HDFS to the local directory on lines 34–39 of listing 17. The user then finds the final estimate in a file named finalEstimate.txt.

[Footnote 37] In a usual execution of an R programme, the standard output would be the console.

[Footnote 38] This is not exactly so, since, as we know from the previous exposition, the map phase is followed by the shuffle phase, and it is the shuffle's output which goes to the reducer as an input.


Listing 16: reducer’s code corresponding to the MC problem

1 #!/usr/bin/env Rscript

2

3 running_sum <- 0

4

5 input <- file('stdin', open='r')

6

7 while (length(currentLine <- readLines(input , n=1)) > 0) {

8 keyvalue <- unlist(strsplit(currentLine , split=’\t’, fixed=TRUE))

9 running_sum <- running_sum + as.numeric(keyvalue [[2]])

10 }

11

12 value <- running_sum /1000000

13 key <- ’Final estimate:’

14 cat(key ,’\t’, value ,’\n’,sep=’’)

15 close(input)

Listing 17 follows the typical pattern of execution described in Subsection 3.6. The most important are the options listed in lines 24–31, since they govern the flow of execution of the Hadoop streaming utility:

line 24 defines the number of reducers to be run; by default, Hadoop streaming uses one reducer; thus, in our case this option is redundant; it is presented here for demonstration;

line 25 defines the number of mappers to be run; if one does not specify this option, then, by default, there would be two mappers, each operating on a chunk of the initial data and producing the sum of observations having value > a; consequently, it is again a redundant statement, which is written only for the sake of demonstration;

line 26 defines an input file with the data to be processed; note that first it has to be copied to HDFS (lines 18–19);

line 27 specifies an output directory within HDFS; it is used in the sequel to move the output to the local directory (lines 34–35);

lines 28–29 make the file mc_mapper.R available to Hadoop and point out that it contains the mapper's code; lines 30–31 do the same for the reducer.

Listing 17: wrapper corresponding to the MC problem

1 #!/bin/sh

2

3 #SBATCH -p short

4 #SBATCH -C alpha

5 #SBATCH --nodes=1 --ntasks-per-node=1 --cpus-per-task=12

6 #SBATCH --output=MCJob%j.out

7

8 #### Initialize the hadoop cluster

9 . /scratch/lustre/home/$USER/startHadoopCluster.sh

10

11 echo "Execution of the map -reduce job"

12

13 # A command to create the input directory of HDFS

14 $HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR \

15 dfs -mkdir inputDir

16

17 # A command to move the data from local directory to hdfs

18 $HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR dfs \

19 -copyFromLocal /scratch/lustre/home/$USER/MCdata.txt inputFile.txt

20

21 # A command to execute the job with a help of Hadoop streaming


22 $HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR jar \

23 $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.2.1.jar \

24 -D mapred.reduce.tasks=1 \

25 -D mapred.map.tasks=2 \

26 -input inputFile.txt \

27 -output output \

28 -mapper mc_mapper.R \

29 -file /scratch/lustre/home/$USER/mc_mapper.R \

30 -reducer mc_reducer.R \

31 -file /scratch/lustre/home/$USER/mc_reducer.R

32

33 # A command to move the output from hdfs directory to the local one

34 $HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR dfs \

35 -copyToLocal output /scratch/lustre/home/$USER/outputs

36

37 # A command to write the output to the text file

38 $HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR dfs \

39 -cat output/part -* > finalEstimate.txt

40

41 echo

42

43 #### Stop the cluster

44 . /scratch/lustre/home/$USER/stopHadoopCluster.sh

One may observe that the reducer, in our case, adds nothing and is, in fact, a redundant component of the whole job. All we need is an ability to work with a large file. Then we could compute our estimate in a single map phase by simply dividing the only output value by 1000000. That is, we have to be able to run only one mapper, change line 13 of listing 15 to the following:

value <- running_sum/1000000

and, after that, treat the mapper's output as the final one. For all this, it suffices to modify the options of listing 17 as follows: -D mapred.reduce.tasks=0, -D mapred.map.tasks=1. This will instruct Hadoop streaming not to run the reduce phase, to run only one mapper, and to treat the output of the map phase as the final one.

I finish the section with the classical example of word count introduced in Subsection 3.5. In contrast to our MC problem, the reduce phase is meaningful here. Recall that the problem was to count the number of instances of each word in a given text file. The conceptual solution was as follows.

1. One needs to write a map function, which operates in the following way:

• given a line of text supplied by the mapper, the function splits this line into separate words;

• treating each word in the obtained list as a key, the function couples each key with the value 1 and emits a list of pairs (key, value) = (word, 1).

2. One needs to write a reduce function, which, for the set of (key, value) pairs supplied to the reducer running the function, counts the occurrences of each word. The function should emit a list of pairs (word, no of instances of word), which is the final output of the job.

Listings 18–19 demonstrate the code corresponding to the mapper's and reducer's functions, respectively. These should be wrapped into a bash script very similar to that given in listing 17. Since the code in listings 18–19 involves only well documented ordinary R functions, and the logic behind it is clear, I do not provide any comments, but suggest instead to test the code on your own.

Listing 18: mapper’s code for the word count problem

#!/usr/bin/env Rscript

print.keyval = function(key) {

value <- 1

cat(key ,’\t’,value ,’\n’,sep=’’)


}

input <- file('stdin', open='r')

while (length(currentLine <- readLines(input , n=1)) > 0) {

currentLine <- gsub (’(^\\s+|\\s+$)’, ’’, currentLine)

keys <- unlist(strsplit(currentLine , split=’\\s+’))

lapply(keys , FUN=print.keyval)

}

close(input)

Listing 19: reducer’s code for the word count problem

#!/usr/bin/env Rscript

last_key <- ""

running_total <- 0

input <- file('stdin', open='r')

while ( length(line <- readLines(input , n=1)) > 0 ) {

line <- gsub (’(^\\s+)|(\\s+$)’, ’’, line)

keyvalue <- unlist(strsplit(line , split=’\t’, fixed=TRUE))

this_key <- keyvalue [[1]]

value <- as.numeric(keyvalue [[2]])

if ( last_key == this_key ) {

running_total <- running_total + value

}

else {

if ( last_key != "" ) {

cat(last_key ,’\t’,running_total ,’\n’,sep=’’)

}

running_total <- value

last_key <- this_key

}

}

if ( last_key == this_key ) {

cat(last_key ,’\t’,running_total ,’\n’,sep=’’)

}

close(input)

4.1.4 Parallelization with Apache Spark

I have already described the basic features of the Spark working model in Subsection 3.7. To remind in a nutshell, recall that here one creates a Spark dataset and applies parallel operations on it. The bright side of Spark is in handling much of the work under the hood. Though the data is split across the cluster nodes and operated on in parallel, the user may not even feel that unless there is an explicit need for doing something with separate partitions of that split.

In case of R, one simply loads the package SparkR in one's script, and the work then goes on almost in the usual way. The differences are as follows:

• instead of operating with usual R data.frames, one operates with Spark DataFrames; summarized results of these operations are usually translated to a common R data.frame for easier subsequent processing;

• not all functions operating on a Spark DataFrame have exactly the same syntax as the usual analogues operating on R data.frames; therefore, one has to refer to the documentation of the package SparkR (for the latest release check the url of reference [The18a]);

• though the coding resembles the usual R coding, there are more explicit declarations; one also has to take some care when the data at hand is too large to fit in the memory of a single machine; if this is the case, one has to choose appropriate SparkR functions; e.g., consider the functions dapplyCollect and dapply, both devoted to running a user defined function over a partitioned dataset; dapplyCollect is more convenient because there is no need to provide additional information specifying the structure of the dataset; however, it may fail when the dataset is very large (see https://spark.apache.org/docs/latest/sparkr.html, Subsection SparkDataFrame Operations, Applying User-Defined Function).

To demonstrate the usual work flow, we provide an implementation of the solution of our MC problem when the data resides on external data storage. That is, we assume that we need to calculate the fraction of i.i.d. copies of X ∼ N(0; 1) exceeding a given threshold value a; this has to be done when the corresponding sample comes in a large text file. The corresponding code is given in listing 20. The comments are extensive enough; however, there are several important things to note:

• though in general the data is partitioned across the nodes of the cluster, one does not need to ship the functions executed on the nodes to the nodes explicitly (compare with the lapply and foreach parallelization models);

• the example utilizes the Map–Reduce mechanism; however, in contrast to parallelization with Hadoop, it is implicit and well masked under the hood; the user feels only minor differences as compared to working with a usual dataset;

• one may ask, what if we want to simulate data in the program and only after that produce some calculations? Exemplary options would be as follows:

– make use of the previously introduced parallelization models (lapply, foreach) or other similar alternatives;

– collect your data to a SparkDataFrame and work with this dataset (a minimal sketch follows this list); for the creation of the corresponding SparkDataFrame make use of the functions (some arguments are omitted; for a full list consult [The18a]) createDataFrame(data, numPartitions = NULL) and rbind(df1,df2,...), where:

∗ argument data takes a usual R data.frame or list;

∗ numPartitions may be used to specify the required number of file partitions in advance, and is equal to 1 by default;

∗ rbind returns the union of rows of two or more SparkDataFrames (these are given in a list df1,df2,... and must have identical structure, i.e. identical schemas (see the forthcoming code)).
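Here is a minimal sketch of the second option (illustrative only; it assumes that a SparkSession has already been started with sparkR.session(), as in listing 20): two small samples are simulated locally, collected into SparkDataFrames and stacked with rbind.

# simulate two samples on the driver and collect them to SparkDataFrames
df1 <- createDataFrame(data.frame(obs = rnorm(1000)))
df2 <- createDataFrame(data.frame(obs = rnorm(1000)))

# rbind stacks the rows; both inputs must share the same schema
x <- rbind(df1, df2)

# the estimate is then computed exactly as in listing 20
a <- 0
count(filter(x, x$obs > a))/count(x)

For data simulated in large amounts, one would, of course, prefer to simulate directly on the workers (e.g., via the lapply or foreach models above) rather than on the driver.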

Listing 20: MC problem with a help of SparkR package

1 ## import required library

2 library(SparkR)

3

4 ## Initialize SparkSession

5 sparkR.session(appName = "MC problem via Spark")

6

7 ## Path to file to process

8 # note: csv are more handy to read as compared to txt

9 lPath <- "MCData.csv"

10

11 ## Create SparkDataFrame from a csv file;

12

13 # reading without specifications

14 # will yield all columns of type string

15 # with default names; therefore , below we

16 # make some initial transformations to have our single

17 # column named ’obs ’ and of double type; creation of data defining

18 # schemas is quite straightforward;

19 # for more details refer to documentation


20

21 xschema <- structType(structField("obs", "double"))

22 x <- read.df(lPath, source='csv', schema=xschema)

23

24 # Check the schema of created SparkDataFrame

25 printSchema(x)

26

27 ## Compute estimate

28 # function filter(df, cond) filters dataset df by

29 # applying given condition cond; function count(df) counts the number

30 # of lines in a dataset df

31

32 a <- 0

33 finalEstimate <- count(filter(x, x$obs > a))/count(x)

34 cat(’Final estimate is equal to ’, finalEstimate , ’\n’)

35

36 ## Stop SparkSession before exit

37 sparkR.session.stop()

38

39 ## to run this script use

40 ## spark -submit MC_spark.R

I do not provide additional examples, yet I strongly recommend the reader to work through the self-practicing tasks devoted to SparkR. This should undoubtedly increase familiarity with the working model and provide additional insights, useful for comparisons with the utilization of Spark via Python. The latter may play an important role when choosing between SparkR and pyspark for targeting the particular problem at hand. When dealing with the exercises, keep in mind two important things.

• Currently, SparkR is not the most stable software, and a lot may depend on the version as well as your machine. Therefore, if some functions do not run as expected (it may happen even with code copied from the documentation), do not waste time on debugging. Better proceed to other tasks.

• In my opinion, the most important thing to grasp lies in the usage of the previously mentioned function dapply (or dapplyCollect). In a word, this function takes three arguments: (x, func, schema). The meaning of the latter is as follows:

– x is a Spark DataFrame to operate on;

– func is the user defined function with a single argument, which corresponds to one partition of x and is treated as a regular R data.frame;

– schema defines the structure of the resulting Spark DataFrame.

dapply applies func to each partition of x and returns a DataFrame having the structure defined by the supplied schema argument. Within the working model of SparkR, it is an essential thing to understand. All the rest is pretty much the same as in ordinary R. For a quick and very explanatory example, have a look at the example given in the documentation (on the left, choose dapply); a minimal sketch is also given below.
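The following sketch (illustrative only; it reuses the SparkDataFrame x with the single double column obs from listing 20 and assumes a running SparkSession) adds an exceedance indicator to every row; note that each partition arrives in func as a regular R data.frame and that the output structure must be declared via schema.

# schema of the DataFrame returned by dapply: the original column
# plus the indicator of exceeding the threshold
outSchema <- structType(structField("obs", "double"),
                        structField("indicator", "double"))

# func receives one partition as an ordinary data.frame and must
# return a data.frame matching outSchema
res <- dapply(x, function(part) {
  data.frame(obs = part$obs, indicator = as.numeric(part$obs > 0))
}, outSchema)

head(res)

With dapplyCollect, the schema argument is dropped and the combined result comes back as a regular R data.frame, which is convenient only as long as it fits into the driver's memory.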

Finally, note that SparkR provides a very well documented API [The18a] in which every function is accompanied by a simple example. This should suffice for a quick adoption of the API without any additional explanations. Mastering the details is a question of time.

4.2 Parallelization with Python

As in the case of R, we start with the plain solution given in listing 21. Since there is not a lot to comment on, we proceed to other options.


Listing 21: MC problem: plain solution with Python

import scipy.stats as st

import numpy as np

import pandas as pd

a = 0

n = 1000000

def tail_function(a,n):

np.random.seed(seed =123)

sample = pd.DataFrame({’x’:st.norm.rvs(loc=0, scale=1, size=n)

})

return(np.mean(sample[’x’]>a))

# test difference equals to -3.1000000000003247e-05

st.norm.cdf(x=a)-tail_function(a,n)

4.2.1 Parallelization with Hadoop

In case of Python, parallelization with Hadoop is essentially the same, provided we make use of the Hadoop streaming utility. Therefore, below I simply give the corresponding implementation of the mapper's function, similar to that of R given in listing 15. To avoid empty coding corresponding to the reducer (see Subsection 4.1.3), I also assume here that one makes the changes discussed in Subsection 4.1.3 in the corresponding *.sh wrapper. The code of the latter is not provided, since it almost completely repeats listing 17 (for the negligible changes, see the comments provided in Subsection 4.1.3).

Listing 22: mapper’s code corresponding to the MC problem

1 #!/usr/bin/env python3

2

3 # import of required module

4 import sys

5

6 # reading of data coming from Hadoop input file ,

7 # which is standard input for our MC problem

8

9 running_sum = 0

10 a = 0

11

12 # computation of sum

13 for line in sys.stdin:

14 # convert string line to number and add it to running sum

15 running_sum = running_sum + (float(line)>a)

16

17 value = running_sum /1000000

18 key = 1

19

20 # write (key , value) pair to standard output

21 print(key ,’\t’,value)

4.2.2 Parallelization with Apache Spark

Before proceeding to the contents of this subsection, I again remind that the basic features of the Spark working model were described in Subsection 3.7. In a word:

• one spawns a program which runs main function on the driving node of the cluster;


• via the main function, the driving node ships workers with tasks and collects their output;

• the whole job spins around one or several Datasets, which are distributed over the cluster by Spark's internal mechanism.

In case of Python, there is a bit more to uncover as compared to R. Therefore, I divide my forthcoming story into two paragraphs.

RDDs and shared variables In contrast to R, Python is among the native Spark languages. Therefore, its API has evolved longer, and a lot of objects coming from the old API are still in use. The resilient distributed dataset (RDD), introduced in 3.7, is one of them. Since Spark announces that in the near future RDDs will not be maintained and one should avoid their usage, from the long standing applied point of view, there is little sense even in getting familiar with this data structure. Nonetheless, I have decided to give a very brief overview because of the following reasons.

• On the Web, there is still much active code utilizing RDDs. Hence, some familiarity may be useful both for reproducibility and for maintenance.

• The old working model contains several concepts which may be useful for a more advanced tuning of a Spark based application. This makes sense even if you do not plan to use the old APIs at all.

In Subsection 3.7, I have also mentioned two types of shared variables, namely, broadcast variables and accumulators. Recall that broadcast variables are read-only and their main function is to provide some immutable pieces of data to the worker nodes. E.g., one may have some dataset frequently queried by the worker nodes. Instead of keeping this data on the driving node, one can ship the workers with it at the very beginning of execution. This would reduce traffic and save computational resources as well. Accumulators are add-only, and their main function is to store values of counters updated by the workers during the whole interval of execution. E.g., each worker may add its time of execution to an accumulator. The corresponding accumulator will then store the total time of execution. In my opinion, at the present state of Spark's evolution, the above information about shared variables is sufficient. Therefore, I shall not provide additional details. If you feel that there is a need to employ these variables in your code, then reference [HKZ] contains all the required information.

Turning back to RDDs, below I list the most important (as it seems to me) facts, and then provide a simple example of an implementation of our MC problem by means of the RDD API. I suggest finishing reading the whole subsection and only then deciding what matters. I am inclined to think that, after getting familiar with DataFrame, you will drop RDD from the list of potential options for parallelization overall. Nonetheless, if for certain reasons you will require them, [HKZ] is a recommended reading. Also, do not forget the pyspark documentation. Among the rest, it contains a detailed description of RDD as well as of the classes for shared variables, and there are many short informative examples too.

Important remarks regarding RDDs

• Structural features

– Generally, RDDs should be viewed as partitioned datasets, containing ordinary Python objects (tuples, lists, etc.).

– RDDs do not provide API access to their objects similar to that of a Pandas DataFrame. That is, there is no handy column referencing and slicing mechanism.

– Spark takes care of the number of partitions. The latter depends on the size of the data. However, the user can repartition an RDD. The documentation provides recommendations regarding the number of partitions.

• Operational features

– By inspecting the documentation, one finds that RDD methods are primarily intended to work on data under the Map–Reduce approach. Listing 23 provides exemplary code.

– Spark resolves two types of operations applied over an RDD: transformations and actions. Transformations define how the data should be transformed, yet they are not applied immediately, and the delay lasts until the user requests the factual data defined by a transformation for subsequent processing. This is called the lazy evaluation model. It allows Spark to optimize execution. Actions are taken to produce the data defined by transformations and, therefore, take place immediately.

E.g., RDD.map(f) applies the function f over the RDD element-wise. Its type is transformation, since it defines some transformation of the initial data, yet there is no need to apply that transformation immediately. However, RDD.collect(), which translates the contents of an RDD to a native Python collection (e.g., a list), is an action, since it must take place immediately at the point of its factual occurrence in the code. The whole list of actions and transformations may be found in the documentation. However, for us, it is important to grasp the model of execution.

– During the application of transformations, some data movement occurs across the partitions. It is called shuffling. This is an expensive operation. To optimize code, one should keep the number of shuffles at a minimum.

– RDDs can be created from: a) regular Python structures by making use of the SparkContext method parallelize; b) local text files; c) Hadoop structures with predefined input formats. In our exemplary codes, I make use of the first two options. As for c), one should refer to the documentation.

– An RDD has two methods for keeping it close to the CPU, namely, persist and cache. The latter is a short version of persist(args) and is equivalent to persist with default arguments. Both methods instruct Spark not to spill the RDD to disks but to keep it closer to the CPU. The level of closeness depends on the optional argument args. By default, persist() asks to keep the RDD in memory after the first action on it has taken place. Other options, such as memory+disk, are also possible. It is recommended to persist RDDs with a frequent subsequent use. Such a strategy saves resources when working with big datasets.

Listing 23: Several RDDs’ methods

# initialization

import findspark

sparkHome = "/home/visk/spark -2.3.0 -bin -hadoop2 .7"

findspark.init(sparkHome)

from pyspark import SparkContext

sc = SparkContext(master="local")

# 1)

# parallelize converts native Python datastructure to RDD

rddWithKeys = sc.parallelize ([(’a’,1) ,(’b’,2) ,(’a’,1) ,(’b’,3),

(’a’,2) ,(’c’,0)])

# RDD.mapValues(f) applies function f element -wise to values in

# (key ,value) pairs without altering the key; it returns RDD ,

# which is assigned to res;

# RDD.collect () converts contents of RDD to regular Python list

res = rddWithKeys.mapValues(lambda x:x**2)

res.collect ()

# output:

# [(’a’, 1), (’b’, 4), (’a’, 1), (’b’, 9), (’a’, 4), (’c’, 0)]

# the same result in a single line coding

rddWithKeys.mapValues(lambda x:x**2).collect ()

# 2)

# RDD.reduceByKey(f) applies associative and commutative function f

# to the groups of elements having the same key

rddWithKeys.reduceByKey(max).collect ()

# output:


# [(’a’, 2), (’b’, 3), (’c’, 0)]

# stopping of SparkContext

sc.stop()

Listing 24: the MC problem via RDD

# initialization

import findspark

sparkHome = "/home/visk/spark -2.3.0 -bin -hadoop2 .7"

findspark.init(sparkHome)

from pyspark import SparkContext

sc = SparkContext(master="local")

# reading of data from the text file

pathToFile = '/home/visk/Documents/MCdata.txt'

mcRDD = sc.textFile(pathToFile)

# persist forces RDD to be kept in a memory for subsequent use

# just after the first action on it has taken place;

# this significantly speeds up forthcoming calculations and is

# recommended for RDDs frequently used in the sequel

mcRDD.persist ()

# computation of mean , long version:

mcRDDofFloats = mcRDD.map(lambda x:float(x))

mcRDDofFloats.mean()

# 1) mcRDDofFloats = mcRDD.map(lambda x:float(x)) takes each

# element of mcRDD and applies lambda function , which converts

# strings , read initially , into floats , required for further

# processing; returned value is again an RDD; however , this

# time it contains numbers; map() is a TRANSFORMATION; it does

# not force factual computation

# 2) mcRDDofFloats.mean() computes required mean; this is

# an ACTION

# computation of mean , single line version:

mcRDD.map(lambda x:float(x)).mean()

# result = 3.277271733705661e-05

# check via native Python API:

# RDD.collect () converts RDD to ordinary Python list

mcList = mcRDDofFloats.collect ()

import numpy as np

np.mean(mcList)

# result = 3.277271733705661e-05

# stopping of SparkContext

sc.stop()

A bit on DataFrames The DataFrame class belongs to the pyspark subpackage pyspark.sql. Therefore, the full class reference would read as pyspark.sql.DataFrame. However, for the sake of convenience, I shall stay with the abbreviation DataFrame.

Excluding beginners, every Python user works with pandas.DataFrame. Taking this into account, adoption of pyspark's DataFrame should be straightforward and painless. Suffice it to mention that pyspark's DataFrame supports named and strongly typed columns as well as built-in functions for aggregation, slicing, filtering and user-defined transformations. Syntactic differences, though present, should not prolong the mentioned adoption, and since I do not pretend to give any API reference at all, but merely to indicate important points, I shall not discuss syntax as I did in the R case. That is, below I provide only important remarks to take into account when dealing with your practical tasks devoted to increasing proficiency with pyspark. Again, to hold the line, I finish the paragraph with an implementation of the MC problem and several arbitrarily chosen examples for gaining some quick insight. I hope that this will suffice for a start. For the rest, pyspark provides a well documented API reference [The18b], which is supported by plenty of examples devoted to the usage of dedicated functions.

Important remarks regarding DataFrames

• For pyspark.sql.DataFrame, many concepts explicitly present in the RDD API have been masked under the hood or behind a light front-end API. For example, actions and transformations, though being present, are not stressed as much as in the case of the old fashioned RDD API. However, one should keep in mind that partitioning, lazy evaluation, shuffling, persistence and other previously untouched RDD-related concepts carry over and, in case of need, can be used to optimize performance. This explains my choice of a brief touch, taken in the previous paragraph: being informed provides some credits. Looking at the pyspark.sql.DataFrame API, you will find dedicated functions, e.g., DataFrame.persist, DataFrame.explain, DataFrame.localCheckpoint, DataFrame.repartition, DataFrame.storageLevel, etc.

• To run a user-defined function over the rows of a DataFrame, one has to:

– implement an ordinary Python function;

– register it by making use of the pyspark.sql.functions.udf or pyspark.sql.functions.pandas_udf functions.

Listing 26 provides corresponding examples. The same pattern applies to functions dedicated to run in SQL mode.

• In the R case, the reduced DataFrame is usually collected back to a regular R data.frame. In the case of Python, the dedicated function DataFrame.collect() returns a Python list containing pyspark.sql.Row objects. The latter represents a row with named columns termed keys. Accessing the values is available in two ways: Row.key and Row[key]. The forthcoming listing 26 contains examples.

Listing 25: the MC problem via pyspark.sql.DataFrame

# 1) initialization

import findspark

sparkHome = "/home/visk/spark -2.3.0 -bin -hadoop2 .7"

findspark.init(sparkHome)

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("MC problem") \
    .config("spark.some.config.option", "some-value").getOrCreate()

# 2) reading of data

# types modules is required to define schema of the data set;

# it contains constructors for definition of usual types

# (StringType (),IntegerType (), DoubleType ()) as well as schema

# constructors:

# StructType --- to define the whole schema of the dataset ,

# StructField --- to define separate columns (name , type , True

# if nullable)

import pyspark.sql.types as T

MCSchema = T.StructType ([T.StructField("x",T.DoubleType (),True)])

path = '/home/visk/Documents/MCdata.txt'

mcDF = spark.read.csv(path=path ,schema=MCSchema ,header=False)

mcDF.show() # shows top 20 rows by default


mcDF.dtypes # shows colnames with data types

# output: [(’x’, ’double ’)]

# 3) main computations:

# DataFrame.filter(cond) returns DataFrame satisfying condition

# cond; DataFrame.count () returns the number of rows in DataFrame

probEst = mcDF.filter(mcDF.x>0).count()/mcDF.count()

probEst

# output: 0.499802

# stop session

spark.stop()

Listing 26: User-defined functions for pyspark.sql.DataFrame and some other demos

# 1) initialization

import findspark

sparkHome = "/home/visk/spark -2.3.0 -bin -hadoop2 .7"

findspark.init(sparkHome)

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("MC problem") \
    .config("spark.some.config.option", "some-value").getOrCreate()

# 2) required modules:

# T -- for definition of return types;

# F -- for different functions;

import pyspark.sql.types as T

import pyspark.sql.functions as F

# 3) some exemplary dataset created from pandas.DataFrame

import pandas as pd

age = pd.Series ([20.0 , 30.0, 22, 24])

gender = pd.Series([’f’, ’m’, ’f’,’m’])

# s - single; m - married; d - divorced

status = pd.Series([’s’, ’d’, ’s’,’m’])

salary = pd.Series ([2000.0 , 3100.0 , 2500, 2000])

pdf = pd.DataFrame({’age ’:age ,’gender ’:gender ,’status ’:status ,

’salary ’: salary })

sDF = spark.createDataFrame(pdf)

sDF.dtypes

# output:

#[(’age ’, ’double ’),

# (’gender ’, ’string ’),

# (’salary ’, ’double ’),

# (’status ’, ’string ’)]

sDF.show()

# output:

#+----+------+------+------+

#| age|gender|salary|status|

#+----+------+------+------+

#|20.0| f|2000.0| s|

#|30.0| m|3100.0| d|

#|22.0| f|2500.0| s|

#|24.0| m|2000.0| m|

#+----+------+------+------+

# 4) user defined function for getting decoded status

# (see remarks regarding running such functions over


# the rows of DataFrame)

# ordinary python implementation

def statusFunc(abbrev):

status = ’married ’

if abbrev == ’s’:

status = ’single ’

if abbrev == ’d’:

status = ’divorced ’

return status

# registration:

# -- F.udf(arg1 ,arg2) registers custom Python function

# written by the user and makes it available for running

# over DataFrames

# -- first arg supplies native Python function name;

# -- second arg supplies return type;

statusF = F.udf(statusFunc ,T.StringType ())

# application: we select only variables of interest with status

# decoded; alias is used to provide desired colname;

# note that:

# --we supply single colname to the function; to code function

# operating on multiple cols make use of F.pandas_udf;

# --return type must be from pyspark.sql.types.DataType

sDFReduced = sDF.select(’age ’,statusF(’status ’).alias(’status ’))

sDFReduced.show()

# output:

#+----+--------+

#| age| status|

#+----+--------+

#|20.0| single|

#|30.0| divorced|

#|22.0| single|

#|24.0| married|

#+----+--------+

# 5) examples of ordinary aggregations

# 5.1) ungrouped average age

# note that we collect to ordinary Python list

# containing rows; to obtain value we use r[colname] or

# r.colname

avgAge = sDF.agg({"age": "avg"}).collect ()

avgAge

# output: [Row(avg(age)=24.0)]

avgAge [0][’avg(age)’]

# output: 24.0

# another way with alias giving desired name to aggregated

# output column containing aggregated data

avgAge = sDF.agg(F.avg(sDF.age).alias(’meanAge ’)).collect ()

avgAge

# output: [Row(meanAge =24.0)]

avgAge [0]. meanAge

# output: 24.0

# 5.2) gender grouped maximum salary

gsDF = sDF.groupBy(sDF.gender)

maxS = gsDF.agg(F.max(sDF.salary).alias(’maxSalary ’)).collect ()

maxS


#output:

# [Row(gender=’m’, maxSalary =3100.0) ,

# Row(gender=’f’, maxSalary =2500.0)]

# 6) SQL based selection:

# --DataFrame.select(’col1 ’,’col2 ’ ,...) is used for untransformed

# selection;

# --DataFrame.selectExpr(’col1 ’,’expr1 ’) is used for more

# elaborated selection;

filteredDF = sDF.selectExpr("age", "salary *0.74 as taxedS")

filteredDF.show()

#+----+------+

#| age|taxedS|

#+----+------+

#|20.0|1480.0|

#|30.0|2294.0|

#|22.0|1850.0|

#|24.0|1480.0|

#+----+------+

spark.stop()

4.3 Tasks

4.1. Visit CRAN Task View for High-Performance and Parallel Computing with R.

• Check the abundance of packages for parallel computations and get familiar with one or several of them. Think about the key points of an introductory lecture for your classmates.

• Subsection ”Parallel computing: Hadoop” offers several wrappers for Hadoop. Get familiar with at least one of them (reference [Pra13] is useful for this). Summarize the pros and cons of using Hadoop this way as compared to the Hadoop streaming utility.

4.2. Visit Python Wiki page for Parallel Processing and Multiprocessing in Python.

• Check subsection ”Cluster Computing” and get familiar with one or several libraries excluding IPython. Think about the key points of an introductory lecture for your classmates.

• Get familiar with IPython's working model. In case the model appears interesting to you, study it deeper and experiment with different setups of clusters. For example, you may wish to compose a cluster consisting of several machines, among which there are local and remote physical as well as virtual ones.

5 Ordinary models

5.1 The list

The goal of this section is to provide the reader with a short list of models every data scientist should be familiar with. To see this, note that, after turning to the devoted Spark API, you will find out that all the models of the forthcoming list are implemented by Spark's developers. More than that, they belong to the MLlib library. In the R case, it is scattered over several SparkR functions. In the Python case, it is contained in the sub-package pyspark.ml (pyspark.mllib for the case of the older RDD based interface). MLlib stands for Machine Learning library, which means that all the models are treated in the frame of the Machine Learning paradigm. By taking such an approach, Spark's developers did not deviate from the global tendency observed within the world of big data software for statistical inference, since almost all big data software providers offer some libraries spanning Machine Learning models and, as a rule, include all the models listed below. In case you feel that the list below includes unknown models and the corresponding tasks for self-practicing appear hard to tackle, you are advised to turn to references [JWHT13, HTF16]. [JWHT13] is less technical, whereas [HTF16] provides more rigor.

Below is the announced list of models.


• Linear regression.

• Logistic regression.

• Support vector machines.

• Decision trees.

• Random forests.

• Gradient boosted trees.

• K-means clustering model.

• Gaussian mixture clustering model.

• Neural networks.

It is also necessary to be familiar with the usual concepts and methods met in Machine Learning: test and training sets; classification error; mean squared error (MSE); cross-validation, etc. (a minimal reminder is sketched below). Finally, note that some of the above list items span not only the basic model having the corresponding name. For example, linear regression is first of all tied to the classical population model with i.i.d. normally distributed errors. However, the same name spans linear regression with non-normal errors, shrunken regression (Ridge, Lasso), Bayesian regression, etc. Though it is not necessary to be an expert, it is advisable to know the basic model well, and also to be familiar with at least several popular variations. The tables of contents of [JWHT13, HTF16] may serve as a pattern.
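As a reminder of the train/test split and the MSE, here is a minimal base R sketch (illustrative only and independent of Spark; the simulated data and the 70/30 split are arbitrary choices):

set.seed(1)
n <- 1000
x <- rnorm(n); y <- 1 + 2*x + rnorm(n)          # simulated data

trainIdx <- sample(seq_len(n), size = 0.7*n)    # 70/30 train/test split
train <- data.frame(x = x[trainIdx],  y = y[trainIdx])
test  <- data.frame(x = x[-trainIdx], y = y[-trainIdx])

fit <- lm(y ~ x, data = train)                  # fit on the training set
mse <- mean((test$y - predict(fit, newdata = test))^2)  # test MSE
mse

Cross-validation repeats such a split several times (e.g., over K folds) and averages the resulting errors; Spark's MLlib provides analogous utilities for its own models.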

5.2 Spark's API

As in the case of parallel computations, I do not provide any details regarding Spark's API devoted to the ordinary models listed in the previous subsection. You should refer to the documentation and work through the tasks devoted to self-practicing. However, before doing that, you are advised to look at the examples hosted in the corresponding GitHub directory. Note that the examples are available only for Python, and that they are implemented by making use of the older RDD based API. Therefore, it is a good exercise to implement the corresponding examples by making use of the DataFrame based API provided by the pyspark.ml package.

5.3 Tasks

5.1. Refer to [RLOW15]. Select several models of most interest to you and implement the given examples by making use of SparkR and pyspark.ml.

5.2. Think about the key points of a presentation in which you are asked to:

• introduce the isotonic regression model;

• give examples of carrying out ordinary analysis based on the isotonic regression model by making use of pyspark.

A Glossary

Cluster. Unless stated otherwise, refers to a computer cluster, which is a set of connected computers working together so that one can treat them as a single system.

Cloud computing. It is the on-demand delivery of various computing resources, including (but not limited to) virtual machine clusters, storage, databases, networking, software. The necessary feature is that the whole delivery is carried out over the Internet. Hence the term ”cloud”.

Daemon. A program running in a background process and not controlled by an interactive user.

ETL. It is an abbreviation for ”extract, transform and load”. The latter generally refers to a type of data integration from multiple sources. During this process, data is taken from some source system(s); then it is transformed into an appropriate format; finally, it is loaded into a data warehouse or other system for storage and/or processing.


Interpreter. Any program which handles the commands input by the user as successive lines of text. shell is a synonym.

Job. A unit of work given by the master (in this context usually termed job scheduler) to the operating system.

Master or master process. A process responsible for the management of execution of some task implemented by other processes running on the same or other computers.

Node. A computer in a cluster under consideration.

Partition. Unless stated otherwise, a chunk of file split over the cluster.

Shebang line. A line at the very beginning of a *.sh script. It starts with the symbols #! and points out the interpreter which should execute the forthcoming commands. For example, the shebang line

#!/bin/sh

instructs the operating system to execute the file using the Bourne shell.

Shell. See interpreter.

Slave. Unless stated otherwise, refers to a process executing a particular task of some job and managed by some master process.

SSH client. SSH stands for Secure Shell, which is a cryptographic network protocol designed to communicate securely over an unsecured network by means of a client–server architecture. An SSH client is a client application used to connect to a remote server.

Terminal or terminal emulator. A program which enables the host system to be operated by the guest system.

Virtual machine. Software which mimics the behavior of a real physical computer having some operating system installed. A virtual machine is usually abbreviated as VM and called an image. Virtual machines are mostly used in cloud computing.

Worker. May be viewed as a synonym of slave. A major difference is that by worker I usually mean a computer governed by some master process. However, it may also refer to a single slave process.

B Listings

Listing 27: typical bash script to start a Hadoop cluster on the MIF VU cluster

#!/bin/sh

# The code below makes initial settings required for typical MIF

# VU cluster user to spin up a hadoop cluster. Each action is

# commented by making use of three comment symbols. Take care in

# case you decide to modify this code.

### Setting the directory where Hadoop configs should be generated

# Don ’t change the name of this variable (HADOOP_CONF_DIR) as it is

# required by Hadoop (all config files will be picked up

# from here).

export HADOOP_CONF_DIR="/scratch/lustre/home/$USER/hadoop-config"

### Setting the location of myHadoop

export MY_HADOOP_HOME="/soft/myHadoop"


### Setting the location of the Hadoop installation

export HADOOP_HOME="/soft/hadoop"

export HADOOP_HOME_WARN_SUPPRESS=1

### Setting the location used for HDFS

export \

HADOOP_DATA_DIR="/scratch/lustre/home/${USER}/hadoop-${USER}-data"

### Setting the location of the Hadoop logfiles

export \

HADOOP_LOG_DIR="/scratch/lustre/home/${USER}/hadoop-${USER}-log"

### Setting the Java path and SSH options

export JAVA_HOME="/usr/lib/jvm/default-java"

HADOOP_SSH_OPTS="-q -i $HADOOP_CONF_DIR/ssh_host_rsa_key -p 22222"

HADOOP_SSH_OPTS+=" -o UserKnownHostsFile=/dev/null"

HADOOP_SSH_OPTS+=" -o StrictHostKeyChecking=no"

export HADOOP_SSH_OPTS

### Setting up the configuration.

echo "Set up the configurations for myHadoop"

# This is the default non-persistent mode

$MY_HADOOP_HOME/bin/slurm-configure.sh -c $HADOOP_CONF_DIR

echo

### Formatting of HDFS (required for a non-persistent instance)

echo "Format HDFS"

$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR namenode -format

echo

### Starting the Hadoop cluster

echo "Start all Hadoop daemons"

$HADOOP_HOME/bin/start-all.sh

echo

Listing 28: typical bash script to stop a Hadoop cluster on the MIF VU cluster

#!/bin/sh

### The code below stops the Hadoop cluster and cleans the

# working directories.

echo "Stop all Hadoop daemons"

$HADOOP_HOME/bin/stop-all.sh

echo

echo "Clean up"

$MY_HADOOP_HOME/bin/slurm-cleanup.sh

rm -rf $HADOOP_CONF_DIR

echo


References

[AB16] Andrew Butterfield, Gerard Ekembe Ngondi, and Anne Kerr. A Dictionary of Computer Science. Oxford Quick Reference. Oxford University Press, 7 edition, 2016.

[AT15] Alexander Tormasov, Anatoly Lysov, and Emil Mazu. Distributed data storage systems: Analysis, classification and choice. Proceedings of ISP RAS, 27(6):225–252, 2015.

[Bal17] James Balamuta. The coatless professor. http://thecoatlessprofessor.com/, 2017. [Online; accessed July-2017].

[Chr] Per Christensson. The tech terms computer dictionary. https://techterms.com/. Launched in 2005.

[CS14] Scott Chacon and Ben Straub. Pro Git, 2nd ed. edition. Apress, 2014.

[Ell16] Justin Ellingwood. An introduction to big data concepts and terminology. https://www.digitalocean.com/community/tutorials/an-introduction-to-big-data-concepts-and-terminology, 2016.

[GH15] Amir Gandomi and Murtaza Haider. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2):137 – 144, 2015.

[HKZ] Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. Learning Spark: Lightning-Fast Data Analysis. O'Reilly Media, Inc.

[HTF16] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer-Verlag New York, 2016.

[Inc17] Yahoo! Inc. Hadoop tutorial from Yahoo! https://developer.yahoo.com/hadoop/tutorial/, 2017.

[JWHT13] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Springer-Verlag New York, 2013.

[Loc17] Glenn K. Lockwood. Data-intensive computing. http://www.glennklockwood.com/, 2017. [Online; accessed July-2017].

[Mar16] Bernard Marr. Big Data in Practice: How 45 Successful Companies Used Big Data Analytics to Deliver Extraordinary Results. Chichester, 2016.

[PD15] Kim H. Pries and Robert Dunnigan. Big Data Analytics: A Practical Guide for Managers. Auerbach Publications, Boston, MA, USA, 2015.

[Pra13] Vignesh Prajapati. Big Data Analytics with R and Hadoop. Packt Publishing, 2013.

[RLOW15] Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills. Advanced Analytics with Spark. O'Reilly Media, Inc., 2015.

[Ros17] Sheldon M. Ross. Introductory Statistics. Academic Press, 2017.

[The17a] The Apache Software Foundation. Apache spark. https://spark.apache.org, 2017.

[The17b] The Apache Software Foundation. Hadoop. https://hadoop.apache.org, 2017.

[The18a] The Apache Software Foundation. R frontend for Apache Spark. https://spark.apache.org/docs/latest/api/R/index.html, 2018.

[The18b] The Apache Software Foundation. Spark Python API docs. https://spark.apache.org/docs/latest/api/python/index.html, 2018.
