CIS 612
Advanced Topics in Database
Big Data Project
Lawrence Ni, Priya Patil, James Tench
Abstract
Implementing a Hadoop-based system for processing big data and performing analytics is a task
that many others have completed in the past, and there is an ample amount of
documentation regarding the process. From the perspective of a beginner, or even someone
with little experience implementing a big data system from scratch, the process can be
overwhelming. Wading through the documentation, making mistakes along the way, and
correcting those mistakes are considered by many to be part of the learning process. This paper
shares our experience installing Hadoop on an Amazon Web Services cluster, collecting Twitter
data, and analyzing that data in a meaningful way. The goal is to highlight the areas where we
encountered trouble so the reader may benefit from our learning.
Introduction
The Hadoop-based installation was implemented by Lawrence Ni, Priya Patil, and James Tench
as a group, working on the project over a series of Sunday afternoons. Between meetings, the
individual members performed additional research to prepare for the following meeting.
The Hadoop installation on Amazon Web Services (AWS) consisted of four servers
hosted on micro EC2 instances. The cluster was set up with one NameNode and three
DataNodes. In a production implementation, multiple NameNodes would be configured to
account for machine failures. In addition to running Hadoop, the NameNode ran Hive as the
data warehouse used to query the data.
In addition to processing data on the AWS cluster, every step was first implemented and tested
on a local machine prior to running any job on the cluster. On our local machines we ran
MongoDB to query the JSON data easily. The team also implemented a custom Flume agent to
handle streaming data from Twitter's firehose.
AWS
Amazon Web Services offers various products that can be used in a cloud environment.
Running an entire cluster of machines in the cloud this way is commonly referred to as
infrastructure as a service. To get started with setting up a cloud infrastructure, you begin by
creating an account with AWS. AWS offers a free tier, which provides low-end machines. For
our implementation, these low-end machines served our needs.
After creating an account with AWS, the documentation for creating an EC2 instance is
the place to start. An EC2 instance is the standard type of machine that can be launched in the
cloud. The entire setup for AWS was as easy as following a wizard to launch the instances.
Configuration
After successfully launching four instances, getting the machines to run Hadoop requires
downloading the Hadoop files and configuring each node. This is the first spot where
the group encountered configuration issues. The trouble was minor and easy to resolve, and
mostly a matter of remembering the installation steps used for Hadoop in pseudo-distributed
mode. Hadoop communicates between nodes via SSH and must be able to do so without being
prompted for a password. AWS machines likewise communicate via SSH, and the digitally
signed key must be available. To remedy the communication problem, a copy of the PEM file
used locally was placed on each machine. Once the file was copied to each machine, a
connection entry was made in the ~/.ssh/config file with the IP address information for the other
nodes, similar to the example below.
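As an illustration, an entry of the following form on the NameNode lets it reach a DataNode without further prompts. The host alias, private IP address, login user, and key file name are placeholders, not the actual values from our cluster:

# ~/.ssh/config entry on the NameNode (values are placeholders)
Host datanode1
    HostName 172.31.20.11
    User ec2-user
    IdentityFile ~/.ssh/cluster-key.pem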
The next step after configuring the SSH connection settings was to set up each of
the Hadoop config files. Again, this process was straightforward; following the documentation
on the Apache Hadoop website was all that was needed to set up the configuration. The key
differences between installing on a cluster and installing in pseudo-distributed mode were
creating a slaves file, setting the replication factor, and adding the IP addresses of the
DataNodes, as sketched below.
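A minimal sketch of those cluster-specific pieces follows; the IP addresses and the hdfs:// URI are placeholders. The slaves file on the NameNode lists one DataNode per line, core-site.xml points every node at the NameNode, and hdfs-site.xml sets the replication factor:

# etc/hadoop/slaves (on the NameNode)
172.31.20.11
172.31.20.12
172.31.20.13

<!-- etc/hadoop/core-site.xml -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://172.31.20.10:9000</value>
</property>

<!-- etc/hadoop/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>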
Flume
The Twitter firehose API was chosen as the data source for our project. The firehose is a
live stream of "tweets" coming from Twitter. To connect to the API, it is necessary to go to
Twitter's developer page and register as a developer. Upon registration you may create an
"app" and obtain API keys for it. These keys are used to connect to, and download data from,
the various Twitter APIs. Because the data arrives as a stream (as opposed to responses from a
REST API), a method for moving the data from the stream into HDFS is needed. Flume
provides this capability.
Flume works by using sources, channels, and sinks. A source is where data enters the agent; in
our case it is the Twitter streaming API. A channel is the mechanism used to buffer data as it
moves toward permanent storage; for this project, memory is used as the channel. Finally, the
sink is where data ends up being stored; in our case, data is stored in HDFS. A minimal agent
wiring is sketched below.
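This is how such an agent might be wired in its properties file; the agent and component names are placeholders, and the source type points at the custom class discussed in the next paragraphs:

# flume agent wiring (names and paths are placeholders)
TwitterAgent.sources  = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks    = HDFS

TwitterAgent.sources.Twitter.type     = com.example.flume.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel

TwitterAgent.channels.MemChannel.type = memory

TwitterAgent.sinks.HDFS.type      = hdfs
TwitterAgent.sinks.HDFS.channel   = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://namenode:9000/user/flume/tweets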
Flume is also very well documented, and the documentation will guide you through the
majority of the process of creating a Flume agent. One area documented on the Flume website
references the Twitter API and warns the user that the code is experimental and subject to
change. This was the first area of configuring Flume where trouble was encountered. For the
most part, the Apache Flume example worked for downloading data and storing it in HDFS.
However, the Twitter API allows the data to be filtered via keywords passed with the API
request, and the default Apache implementation did not provide a way to pass keywords, so
there was no filter. To get around this problem, there is a well documented Java class from
Cloudera that supports a Flume agent with a filter condition. For our project we elected to copy
the Apache implementation and modify it by adding in the filter code from Cloudera. Once we
had this in place, Flume was streaming data from Twitter to HDFS.
After letting Flume run on a local machine for a few minutes, the program began throwing
exceptions, and the exceptions kept increasing. To solve this problem it was necessary to
modify the Flume agent config file so that the memory channel was drained to permanent
storage often enough. After modifying the transaction capacity setting, and some trial and error,
the Flume agent began running without exceptions. The key to getting the program to run
cleanly was to set the channel's transaction capacity higher than the sink's batch size, as
illustrated below. Once this was working as desired, the Flume agent was copied to the
NameNode on AWS, where it was launched and allowed to download data for several days.
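The settings involved look roughly like the following; the numbers are illustrative, and the point is simply that the channel's transactionCapacity must be at least as large as the HDFS sink's batchSize:

TwitterAgent.channels.MemChannel.capacity            = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
TwitterAgent.sinks.HDFS.hdfs.batchSize               = 100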
Flume Java code
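The custom source is not reproduced here verbatim; the sketch below shows the general shape of an Apache-style Flume source with the Cloudera keyword filter added, using the twitter4j library. The class name and property names are illustrative, and the twitter4j calls shown (TwitterObjectFactory, FilterQuery) are the 4.x names; older versions used DataObjectFactory instead.

import org.apache.flume.Context;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;
import twitter4j.FilterQuery;
import twitter4j.StallWarning;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterObjectFactory;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.conf.ConfigurationBuilder;

public class TwitterSource extends AbstractSource implements EventDrivenSource, Configurable {

  private TwitterStream stream;
  private String consumerKey, consumerSecret, accessToken, accessTokenSecret;
  private String[] keywords;

  @Override
  public void configure(Context context) {
    // OAuth credentials and the keyword filter come from the agent's properties file.
    consumerKey = context.getString("consumerKey");
    consumerSecret = context.getString("consumerSecret");
    accessToken = context.getString("accessToken");
    accessTokenSecret = context.getString("accessTokenSecret");
    keywords = context.getString("keywords", "").split(",");
  }

  @Override
  public void start() {
    final ChannelProcessor channel = getChannelProcessor();
    ConfigurationBuilder cb = new ConfigurationBuilder()
        .setOAuthConsumerKey(consumerKey)
        .setOAuthConsumerSecret(consumerSecret)
        .setOAuthAccessToken(accessToken)
        .setOAuthAccessTokenSecret(accessTokenSecret)
        .setJSONStoreEnabled(true);          // needed to recover the raw JSON below
    stream = new TwitterStreamFactory(cb.build()).getInstance();

    stream.addListener(new StatusListener() {
      @Override
      public void onStatus(Status status) {
        // Push each raw tweet into the Flume channel as one event.
        String json = TwitterObjectFactory.getRawJSON(status);
        channel.processEvent(EventBuilder.withBody(json.getBytes()));
      }
      @Override public void onDeletionNotice(StatusDeletionNotice n) { }
      @Override public void onTrackLimitationNotice(int i) { }
      @Override public void onScrubGeo(long userId, long upToStatusId) { }
      @Override public void onStallWarning(StallWarning w) { }
      @Override public void onException(Exception e) { }
    });

    // The filter condition missing from the Apache example: track only the keywords.
    stream.filter(new FilterQuery().track(keywords));
    super.start();
  }

  @Override
  public void stop() {
    stream.shutdown();
    super.stop();
  }
}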
MongoDB
The Twitter API sends data in JSON format. MongoDB handles JSON naturally because
it stores data in a binary JSON format called BSON. For these reasons, we used MongoDB on a
local machine to better understand the raw data. Sample files were copied from the AWS cluster
to a local machine and imported into MongoDB via the mongoimport command. Once the data
was loaded, the mongo query language was used to view the format of the tweets, test for valid
data, and review simple aggregations, as in the example below.
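For example (the database, collection, and file names are illustrative), a sample file can be loaded and inspected like this:

# import one-tweet-per-line JSON into a local database
mongoimport --db twitter --collection tweets --file sample_tweets.json

# from the mongo shell: look at one raw tweet, then count tweets per user
db.tweets.findOne()
db.tweets.aggregate([
  { $group: { _id: "$user.screen_name", tweets: { $sum: 1 } } },
  { $sort: { tweets: -1 } },
  { $limit: 10 }
])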
Realizing we wanted a way to process large amounts of data directly on HDFS, the
group decided that MongoDB would not be the best choice for direct manipulation of the data on
HDFS. For that reason, MongoDB usage was limited to analyzing and reviewing sample data.
MapReduce
The first attempt to process large queries on the Hadoop cluster involved writing a
MapReduce job. The JSONObject library created by Douglas Crockford was used to parse the
raw JSON and extract the components being aggregated. A MapReduce job for a single
summary metric was easily implemented by using the JSONObject library to extract
screen_name as the key and followers_count as the value.
Once again, the job was tested locally first, then run on the cluster. With about
3.6 GB of data, the cluster processed our count job in about 90 seconds. We did not consider
this bad performance for four low-end machines processing almost 4 GB of data.
Although the MapReduce job was not difficult to create in Java, it lacked the flexibility of
running various ad hoc queries at will. This led to the next phase of processing our data on the
cluster.
MapReduce code
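The job is summarized below as a hedged sketch rather than reproduced verbatim: the mapper extracts screen_name and followers_count with the org.json JSONObject library, and the reducer shown here simply keeps the largest followers_count seen for each user (the exact aggregation used in the project may have differed).

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.json.JSONObject;

public class FollowerCount {

  public static class TweetMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      try {
        // Each input line is one raw tweet in JSON form.
        JSONObject tweet = new JSONObject(value.toString());
        JSONObject user = tweet.getJSONObject("user");
        String screenName = user.getString("screen_name");
        long followers = user.getLong("followers_count");
        context.write(new Text(screenName), new LongWritable(followers));
      } catch (Exception e) {
        // Skip malformed or incomplete tweets rather than failing the job.
      }
    }
  }

  public static class MaxReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long max = 0;
      for (LongWritable v : values) {
        max = Math.max(max, v.get());
      }
      context.write(key, new LongWritable(max));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "follower count");
    job.setJarByClass(FollowerCount.class);
    job.setMapperClass(TweetMapper.class);
    job.setReducerClass(MaxReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}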
HIVE
Apache Hive, like the other products mentioned earlier, was also very well documented
and easy to install on the cluster by following the standard docs. Getting data into Hive proved
to be the challenge.
For Hive to process data it needs a method for serializing and deserializing the data when a
query is run. This is referred to as a SerDe. Finding a JSON SerDe was the easy part; we used
the Hive-JSON-Serde from user rcongiu on GitHub. The initial trouble with setting up the Hive
table was telling the SerDe what the format of the data would look like. Typically, a create table
statement needs to be generated that defines what each field looks like inside the nested JSON
document. During the development and implementation of the table, many of the data fields
that we expected to hold a value were returning null. This is where we learned that in order for
the SerDe to work properly, the table definition needed to be very precise. Because each tweet
from Twitter did not always contain complete data, our original implementation was failing.
To create a correct schema definition, another library, hive-json-schema by user quux00 on
GitHub, was used. This tool was very good at auto-generating a Hive schema when provided
with a single sample JSON document. After using the tool to generate the create table
statement, the data was tested again. Once again, queries returned null values for fields that
should have had values. This ended up being one of the most tedious areas of the project to
debug. After spending time researching and debugging, the problem was discovered: it once
again stemmed from Twitter data sometimes being incomplete. Because of this, the sample
tweet that was fed to the tool to generate the create table statement was itself not complete. To
correct the problem, a sample tweet was reconstructed with dummy data in any field that we
found to be missing, and the Twitter API was used to validate what each field should look like in
terms of data types and nested structures. After fixing a few typos, we finally got it right and
constructed a full tweet. Using this new tweet sample, a create table statement was generated
with the same tool, and queries began returning the expected values.
HIVE code
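A condensed sketch of the table definition and one sample query follows. The real create table statement generated by hive-json-schema covered every field of a tweet and was far longer; the jar path and HDFS location below are placeholders.

-- register the JSON SerDe (path is illustrative)
ADD JAR /home/hadoop/lib/json-serde-jar-with-dependencies.jar;

-- simplified subset of the generated schema
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  retweet_count INT,
  `user` STRUCT<screen_name: STRING, followers_count: INT, lang: STRING>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/flume/tweets';

-- sample aggregation: the most-followed accounts in the data set
SELECT `user`.screen_name, MAX(`user`.followers_count) AS followers
FROM tweets
GROUP BY `user`.screen_name
ORDER BY followers DESC
LIMIT 20;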
Python code
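The plotting script is likewise a hedged sketch rather than the original: it assumes a tab-separated summary file exported from Hive (hour of day and tweet count) and uses the hosted plotly.plotly API that was current at the time; the file name and credentials are placeholders. The Queries & Visualization section below describes how it was used.

# Sketch of a Plotly script for one of the summary files (assumptions noted above).
import plotly.plotly as py
from plotly.graph_objs import Bar, Figure, Layout

py.sign_in('your_username', 'your_api_key')   # Plotly developer account credentials

hours, counts = [], []
with open('tweets_per_hour.tsv') as f:        # Hive output: hour<TAB>tweet count
    for line in f:
        hour, count = line.strip().split('\t')
        hours.append(hour)
        counts.append(int(count))

figure = Figure(data=[Bar(x=hours, y=counts)],
                layout=Layout(title='Tweets per hour of day'))

# plot() uploads the figure and returns the shareable chart URL.
print(py.plot(figure, filename='tweets-per-hour', auto_open=False))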
Queries & Visualization
Now that we had Hive up and running, we generated some sample queries that
aggregated the data in various ways. Creating Hive queries is much like creating standard SQL
queries, and it was easy to use Java-style string manipulation to aid in processing the data.
After querying and aggregating the data in different ways, we moved the aggregated results
to summary files. The aggregated data included information about who tweeted, how often they
tweeted, and even the hours of the day when users were most actively sending tweets; one
such query is sketched below.
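For example, tweets per hour of day can be counted by slicing the fixed-width created_at string (for example "Wed Aug 27 13:08:45 +0000 2014", where the hour occupies characters 12-13); this is a sketch against the table defined earlier, not the exact query we ran.

SELECT substr(created_at, 12, 2) AS hour_utc,
       COUNT(*)                  AS tweets
FROM tweets
GROUP BY substr(created_at, 12, 2)
ORDER BY hour_utc;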
Watching Hive generate MapReduce jobs in the terminal window was fun the first one
or two times, but then we realized we should find a better way to present our data. The final
piece of software we used to process our data was Plotly, a Python library that
offers multiple graphing options. To use Plotly's hosted charts, you need a developer account.
Once you create an account, you use Python to define your data set and format it
based on the graph or chart you intend to create. The library then generates a custom URL that
can be used to view the data in chart form via a web browser.
Conclusion
From the perspective of a beginner, it may seem very difficult and overwhelming to
implement and configure a complex computer system. However, breaking down these complex
systems into more manageable pieces makes it easier to understand how these different parts
work and communicate with each other. This type of structured learning not only helps you
understand the material but also makes debugging issues a lot easier. While configuring and
installing our various systems, we encountered a variety of issues. Whether it was
environment variables not being set or jar files no longer compatible with the current
software, these issues were easier to debug because we were able to break down the different
parts and localize the error. Experiencing errors and bugs when setting up these complex systems is
when the learning truly begins. Having to break down the error messages and think about the
different moving parts helps you develop a deeper understanding of how these different aspects
work and interact as a whole.
References
The Apache Software Foundation. Apache Hadoop. https://hadoop.apache.org/
The Apache Software Foundation. Apache Hive. https://hive.apache.org/
The Apache Software Foundation. Apache Flume. https://flume.apache.org/
Cloudera. Cloudera Engineering Blog, Analyzing Twitter Data with Hadoop.
http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
rcongiu. Hive-JSON-Serde. https://github.com/rcongiu/Hive-JSON-Serde
quux00. hive-json-schema. https://github.com/quux00/hive-json-schema
Plotly, The Online Chart Maker. https://plot.ly/