CIS 612
Advanced Topics in Database
Big Data Project
Lawrence Ni, Priya Patil, James Tench
Abstract
Implementing a Hadoop-based system for processing big data and performing analytics is a task
that many others have completed in the past, and there is an ample amount of
documentation regarding the process. From the perspective of a beginner, or even someone
with little experience implementing a big data system from scratch, the process can be
overwhelming. Wading through the documentation, making mistakes along the way, and
correcting those mistakes are considered by many to be part of the learning process. This paper
shares our experience installing Hadoop on an Amazon Web Services cluster, collecting Twitter
data, and analyzing that data in a meaningful way. The goal is to highlight the areas where we
encountered trouble so the reader may benefit from our learning.
Introduction
The Hadoop-based installation was implemented by Lawrence Ni, Priya Patil, and James Tench
as a group, working on the project over a series of Sunday afternoons. Between meetings, the
individual members performed additional research to prepare for the following meeting.
The Hadoop installation on Amazon Web Services (AWS) consisted of four servers
hosted on micro EC2 instances. The cluster was set up with one NameNode and three
DataNodes. In a production implementation, multiple NameNodes would be configured to
account for machine failures. In addition to running Hadoop, the NameNode ran Hive as the
data warehouse used to query the data.
In addition to processing data on the AWS cluster, every step was first implemented and tested
on a local machine prior to running any job on the cluster. On our local machines we ran
MongoDB to query the JSON data easily. The team also implemented a custom Flume agent to
handle streaming data from Twitter's firehose.
AWS
Amazon Web Services offers various products that can be used in a cloud environment.
Running an entire cluster of machines in the cloud this way is commonly referred to as
infrastructure as a service. To get started with setting up a cloud infrastructure, you begin by
creating an account with AWS. AWS offers a free tier, which provides low-end machines. For
our implementation, these low-end machines served our needs.
After creating an account with AWS, the documentation for creating an EC2 instance is
the place to start. An EC2 instance is the standard type of machine that can be launched in the
cloud. The entire setup for AWS was as easy as following a wizard to launch the instances.
Configuration
After successfully launching four instances, getting the machines to run Hadoop requires
downloading the Hadoop files and configuring each node. This is the first spot where
the group encountered configuration issues. The trouble was minor and easy to resolve, and
mostly a matter of remembering the installation steps used for Hadoop in pseudo-distributed
mode. Hadoop communicates between nodes via SSH and must be able to do so without being
prompted for a password. AWS machines likewise communicate via SSH, and the digitally
signed key must be available. To remedy the communication problem, a copy of the PEM file
used locally was placed on each machine. Once the file was copied to each machine, a
connection entry was made in the ~/.ssh/config file with the IP address information for the other
nodes, similar to the example below.
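As an illustration, an entry of the following form on the NameNode lets it reach a DataNode without further prompts. The host alias, private IP address, login user, and key file name are placeholders, not the actual values from our cluster:

# ~/.ssh/config entry on the NameNode (values are placeholders)
Host datanode1
    HostName 172.31.20.11
    User ec2-user
    IdentityFile ~/.ssh/cluster-key.pem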
The next step after configuring the SSH connection settings was to set up each of
the Hadoop config files. Again, this process was straightforward; following the documentation
on the Apache Hadoop website was all that was needed to set up the configuration. The key
differences between installing on a cluster and installing in pseudo-distributed mode were
creating a slaves file, setting the replication factor, and adding the IP addresses of the
DataNodes, as sketched below.
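A minimal sketch of those cluster-specific pieces follows; the IP addresses and the hdfs:// URI are placeholders. The slaves file on the NameNode lists one DataNode per line, core-site.xml points every node at the NameNode, and hdfs-site.xml sets the replication factor:

# etc/hadoop/slaves (on the NameNode)
172.31.20.11
172.31.20.12
172.31.20.13

<!-- etc/hadoop/core-site.xml -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://172.31.20.10:9000</value>
</property>

<!-- etc/hadoop/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>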
Flume
The Twitter firehose API was chosen as the data source for our project. The firehose is a
live stream of "tweets" coming from Twitter. To connect to the API, it is necessary to go to
Twitter's developer page and register as a developer. Upon registration you may create an
"app" and obtain API keys for it. These keys are used to connect to, and download data from,
the various Twitter APIs. Because the data arrives as a stream (as opposed to responses from a
REST API), a method for moving the data from the stream into HDFS is needed. Flume
provides this capability.
Flume works by using sources, channels, and sinks. A source is where data enters the agent; in
our case it is the Twitter streaming API. A channel is the mechanism used to buffer data as it
moves toward permanent storage; for this project, memory is used as the channel. Finally, the
sink is where data ends up being stored; in our case, data is stored in HDFS. A minimal agent
wiring is sketched below.
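This is how such an agent might be wired in its properties file; the agent and component names are placeholders, and the source type points at the custom class discussed in the next paragraphs:

# flume agent wiring (names and paths are placeholders)
TwitterAgent.sources  = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks    = HDFS

TwitterAgent.sources.Twitter.type     = com.example.flume.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel

TwitterAgent.channels.MemChannel.type = memory

TwitterAgent.sinks.HDFS.type      = hdfs
TwitterAgent.sinks.HDFS.channel   = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://namenode:9000/user/flume/tweets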
Flume is also very well documented, and the documentation will guide you through the
majority of the process of creating a Flume agent. One area documented on the Flume website
references the Twitter API and warns the user that the code is experimental and subject to
change. This was the first area of configuring Flume where trouble was encountered. For the
most part, the Apache Flume example worked for downloading data and storing it in HDFS.
However, the Twitter API allows the data to be filtered via keywords passed with the API
request, and the default Apache implementation did not provide a way to pass keywords, so
there was no filter. To get around this problem, there is a well documented Java class from
Cloudera that supports a Flume agent with a filter condition. For our project we elected to copy
the Apache implementation and modify it by adding in the filter code from Cloudera. Once we
had this in place, Flume was streaming data from Twitter to HDFS.
After letting Flume run on a local machine for a few minutes, the program began throwing
exceptions, and the exceptions kept increasing. To solve this problem it was necessary to
modify the Flume agent config file so that the memory channel was drained to permanent
storage often enough. After modifying the transaction capacity setting, and some trial and error,
the Flume agent began running without exceptions. The key to getting the program to run
cleanly was to set the channel's transaction capacity higher than the sink's batch size, as
illustrated below. Once this was working as desired, the Flume agent was copied to the
NameNode on AWS, where it was launched and allowed to download data for several days.
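The settings involved look roughly like the following; the numbers are illustrative, and the point is simply that the channel's transactionCapacity must be at least as large as the HDFS sink's batchSize:

TwitterAgent.channels.MemChannel.capacity            = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
TwitterAgent.sinks.HDFS.hdfs.batchSize               = 100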
Flume Java code
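The custom source is not reproduced here verbatim; the sketch below shows the general shape of an Apache-style Flume source with the Cloudera keyword filter added, using the twitter4j library. The class name and property names are illustrative, and the twitter4j calls shown (TwitterObjectFactory, FilterQuery) are the 4.x names; older versions used DataObjectFactory instead.

import org.apache.flume.Context;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;
import twitter4j.FilterQuery;
import twitter4j.StallWarning;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterObjectFactory;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.conf.ConfigurationBuilder;

public class TwitterSource extends AbstractSource implements EventDrivenSource, Configurable {

  private TwitterStream stream;
  private String consumerKey, consumerSecret, accessToken, accessTokenSecret;
  private String[] keywords;

  @Override
  public void configure(Context context) {
    // OAuth credentials and the keyword filter come from the agent's properties file.
    consumerKey = context.getString("consumerKey");
    consumerSecret = context.getString("consumerSecret");
    accessToken = context.getString("accessToken");
    accessTokenSecret = context.getString("accessTokenSecret");
    keywords = context.getString("keywords", "").split(",");
  }

  @Override
  public void start() {
    final ChannelProcessor channel = getChannelProcessor();
    ConfigurationBuilder cb = new ConfigurationBuilder()
        .setOAuthConsumerKey(consumerKey)
        .setOAuthConsumerSecret(consumerSecret)
        .setOAuthAccessToken(accessToken)
        .setOAuthAccessTokenSecret(accessTokenSecret)
        .setJSONStoreEnabled(true);          // needed to recover the raw JSON below
    stream = new TwitterStreamFactory(cb.build()).getInstance();

    stream.addListener(new StatusListener() {
      @Override
      public void onStatus(Status status) {
        // Push each raw tweet into the Flume channel as one event.
        String json = TwitterObjectFactory.getRawJSON(status);
        channel.processEvent(EventBuilder.withBody(json.getBytes()));
      }
      @Override public void onDeletionNotice(StatusDeletionNotice n) { }
      @Override public void onTrackLimitationNotice(int i) { }
      @Override public void onScrubGeo(long userId, long upToStatusId) { }
      @Override public void onStallWarning(StallWarning w) { }
      @Override public void onException(Exception e) { }
    });

    // The filter condition missing from the Apache example: track only the keywords.
    stream.filter(new FilterQuery().track(keywords));
    super.start();
  }

  @Override
  public void stop() {
    stream.shutdown();
    super.stop();
  }
}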
MongoDB
The Twitter API sends data in JSON format. MongoDB handles JSON naturally because
it stores data in a binary JSON format called BSON. For these reasons, we used MongoDB on a
local machine to better understand the raw data. Sample files were copied from the AWS cluster
to a local machine and imported into MongoDB via the mongoimport command. Once the data
was loaded, the mongo query language was used to view the format of the tweets, test for valid
data, and review simple aggregations, as in the example below.
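For example (the database, collection, and file names are illustrative), a sample file can be loaded and inspected like this:

# import one-tweet-per-line JSON into a local database
mongoimport --db twitter --collection tweets --file sample_tweets.json

# from the mongo shell: look at one raw tweet, then count tweets per user
db.tweets.findOne()
db.tweets.aggregate([
  { $group: { _id: "$user.screen_name", tweets: { $sum: 1 } } },
  { $sort: { tweets: -1 } },
  { $limit: 10 }
])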
Realizing we wanted a way to process large amounts of data directly on HDFS, the
group decided that MongoDB would not be the best choice for direct manipulation of the data on
HDFS. For that reason, MongoDB usage was limited to analyzing and reviewing sample data.
MapReduce
The first attempt to process large queries on the Hadoop cluster involved writing a
MapReduce job. The JSONObject library created by Douglas Crockford was used to parse the
raw JSON and extract the components being aggregated. A MapReduce job for a single
summary metric was easily implemented by using the JSONObject library to extract
screen_name as the key and followers_count as the value.
Once again, the job was tested locally first, then run on the cluster. With about
3.6 GB of data, the cluster processed our count job in about 90 seconds. We did not consider
this bad performance for four low-end machines processing almost 4 GB of data.
Although the MapReduce job was not difficult to create in Java, it lacked the flexibility of
running various ad hoc queries at will. This led to the next phase of processing our data on the
cluster.
MapReduce code
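The job is summarized below as a hedged sketch rather than reproduced verbatim: the mapper extracts screen_name and followers_count with the org.json JSONObject library, and the reducer shown here simply keeps the largest followers_count seen for each user (the exact aggregation used in the project may have differed).

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.json.JSONObject;

public class FollowerCount {

  public static class TweetMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      try {
        // Each input line is one raw tweet in JSON form.
        JSONObject tweet = new JSONObject(value.toString());
        JSONObject user = tweet.getJSONObject("user");
        String screenName = user.getString("screen_name");
        long followers = user.getLong("followers_count");
        context.write(new Text(screenName), new LongWritable(followers));
      } catch (Exception e) {
        // Skip malformed or incomplete tweets rather than failing the job.
      }
    }
  }

  public static class MaxReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long max = 0;
      for (LongWritable v : values) {
        max = Math.max(max, v.get());
      }
      context.write(key, new LongWritable(max));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "follower count");
    job.setJarByClass(FollowerCount.class);
    job.setMapperClass(TweetMapper.class);
    job.setReducerClass(MaxReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}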
HIVE
Apache Hive, like the other products mentioned earlier, was also very well documented
and easy to install on the cluster by following the standard docs. Getting data into Hive proved
to be the challenge.
For Hive to process data it needs a method for serializing and deserializing the data when a
query is run. This is referred to as a SerDe. Finding a JSON SerDe was the easy part; we used
the Hive-JSON-Serde from user rcongiu on GitHub. The initial trouble with setting up the Hive
table was telling the SerDe what the format of the data would look like. Typically, a create table
statement needs to be generated that defines what each field looks like inside the nested JSON
document. During the development and implementation of the table, many of the data fields
that we expected to hold a value were returning null. This is where we learned that in order for
the SerDe to work properly, the table definition needed to be very precise. Because each tweet
from Twitter did not always contain complete data, our original implementation was failing.
To create a correct schema definition, another library, hive-json-schema by user quux00 on
GitHub, was used. This tool was very good at auto-generating a Hive schema when provided
with a single sample JSON document. After using the tool to generate the create table
statement, the data was tested again. Once again, queries returned null values for fields that
should have had values. This ended up being one of the most tedious areas of the project to
debug. After spending time researching and debugging, the problem was discovered: it once
again stemmed from Twitter data sometimes being incomplete. Because of this, the sample
tweet that was fed to the tool to generate the create table statement was itself not complete. To
correct the problem, a sample tweet was reconstructed with dummy data in any field that we
found to be missing, and the Twitter API was used to validate what each field should look like in
terms of data types and nested structures. After fixing a few typos, we finally got it right and
constructed a full tweet. Using this new tweet sample, a create table statement was generated
with the same tool, and queries began returning the expected values.
HIVE code
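A condensed sketch of the table definition and one sample query follows. The real create table statement generated by hive-json-schema covered every field of a tweet and was far longer; the jar path and HDFS location below are placeholders.

-- register the JSON SerDe (path is illustrative)
ADD JAR /home/hadoop/lib/json-serde-jar-with-dependencies.jar;

-- simplified subset of the generated schema
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  retweet_count INT,
  `user` STRUCT<screen_name: STRING, followers_count: INT, lang: STRING>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/flume/tweets';

-- sample aggregation: the most-followed accounts in the data set
SELECT `user`.screen_name, MAX(`user`.followers_count) AS followers
FROM tweets
GROUP BY `user`.screen_name
ORDER BY followers DESC
LIMIT 20;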
Python code
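The plotting script is likewise a hedged sketch rather than the original: it assumes a tab-separated summary file exported from Hive (hour of day and tweet count) and uses the hosted plotly.plotly API that was current at the time; the file name and credentials are placeholders. The Queries & Visualization section below describes how it was used.

# Sketch of a Plotly script for one of the summary files (assumptions noted above).
import plotly.plotly as py
from plotly.graph_objs import Bar, Figure, Layout

py.sign_in('your_username', 'your_api_key')   # Plotly developer account credentials

hours, counts = [], []
with open('tweets_per_hour.tsv') as f:        # Hive output: hour<TAB>tweet count
    for line in f:
        hour, count = line.strip().split('\t')
        hours.append(hour)
        counts.append(int(count))

figure = Figure(data=[Bar(x=hours, y=counts)],
                layout=Layout(title='Tweets per hour of day'))

# plot() uploads the figure and returns the shareable chart URL.
print(py.plot(figure, filename='tweets-per-hour', auto_open=False))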
Queries & Visualization
Now that we had Hive up and running, we generated some sample queries that
aggregated the data in various ways. Creating Hive queries is much like creating standard SQL
queries, and it was easy to use Java-style string manipulation to aid in processing the data.
After querying and aggregating the data in different ways, we moved the aggregated results
to summary files. The aggregated data included information about who tweeted, how often they
tweeted, and even the hours of the day when users were most actively sending tweets; one
such query is sketched below.
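For example, tweets per hour of day can be counted by slicing the fixed-width created_at string (for example "Wed Aug 27 13:08:45 +0000 2014", where the hour occupies characters 12-13); this is a sketch against the table defined earlier, not the exact query we ran.

SELECT substr(created_at, 12, 2) AS hour_utc,
       COUNT(*)                  AS tweets
FROM tweets
GROUP BY substr(created_at, 12, 2)
ORDER BY hour_utc;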
Watching Hive generate MapReduce jobs in the terminal window was fun the first one
or two times, but then we realized we should find a better way to present our data. The final
piece of software we used to process our data was Plotly, a Python library that
offers multiple graphing options. To use Plotly's hosted charts, you need a developer account.
Once you create an account, you use Python to define your data set and format it
based on the graph or chart you intend to create. The library then generates a custom URL that
can be used to view the data in chart form via a web browser.
Conclusion
From the perspective of a beginner, it may seem very difficult and overwhelming to
implement and configure a complex computer system. However, breaking down these complex
systems into more manageable pieces makes it easier to understand how these different parts
work and communicate with each other. This type of structured learning not only helps you
understand the material but also makes debugging issues a lot easier. While configuring and
installing our various systems, we encountered a variety of issues. Whether it was
environment variables not being set or jar files no longer compatible with the current
software, these issues were easier to debug because we were able to break down the different
parts and localize the error. Experiencing errors and bugs when setting up these complex systems is
when the learning truly begins. Having to break down the error messages and think about the
different moving parts helps you develop a deeper understanding of how these different aspects
work and interact as a whole.
References
The Apache Software Foundation. Apache Hadoop. https://hadoop.apache.org/
The Apache Software Foundation. Apache Hive. https://hive.apache.org/
The Apache Software Foundation. Apache Flume. https://flume.apache.org/
Cloudera. Cloudera Engineering Blog, Analyzing Twitter Data with Hadoop.
http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
rcongiu. Hive-JSON-Serde. https://github.com/rcongiu/Hive-JSON-Serde
quux00. hive-json-schema. https://github.com/quux00/hive-json-schema
Plotly, The Online Chart Maker. https://plot.ly/