An investigation into Spring XD to study methods of Big Data Analysis
1 Micheál Walsh MSc Cloud Computing
An investigation into Spring XD
to study
methods of Big Data Analysis
By
Micheál Walsh
16/04/2015
Acknowledgements
I would first like to thank my Mom (Maureen) and my Dad (Edward) for all their support in any endeavor
that I wish to accomplish in life.
I would like to thank all my friends who provided guidance and support both technically and emotionally
for the past 2 years. Especially Declan Tarrant, Keith Lee, John Ryan, Eoin Johnston, Steve Carter and
David Hallissey.
I would especially like to thank my project supervisor Donna O’Shea and Eugene Bell who were always
available to talk through any problems that were encountered.
I would like to thank the management at EMC who encourage employees to better themselves by
providing the financial support required to complete an MSc such as this.
I would also like to thank Brian Casey and Don ORegan for providing the facilities needed to complete
this project. Without their support the technical detail accomplished within this project would not have
been possible.
Table of Contents
Acknowledgements
Introduction
Architecture
Spring XD Servers
Stream components
Spring XD Message flow
Message Bus
Application Context Hierarchy
Building the Test System
XD Single mode System
XD Distributed System
Functionality of the DSL: Twitter Stream Example
Storing DSL commands
Preparing the VMs
Analysis
Analysis of Spring XD functionality
Unified
Distributed Runtime
Extensible
Data ingestion
Real time Analytics
Batch Processing
Data export
Apache Storm
Spark Streaming
Results
The system under test
The test process
Data Limit
Results 2MB to 50MB
Results of streaming 2MB files
Results of streaming 10MB files
Results of streaming 20MB files
Conclusion
Appendices
Java Code: CPU and Memory
Java Code: Start time of Spring XD Stream
Zookeeper
Redis
Rabbitmq
MySQL
Hadoop
References
Introduction
Data is widely reported to be growing exponentially year on year, and IT departments are being
tasked with organizing it. As more and more resources are placed at the disposal of IT architects,
expectations also mount to deliver a system that will ingest, organize and analyze any amount of data.
There is the added pressure of speed: this must be done before the data becomes irrelevant or out of
date. Just a few years ago this type of task was designed and programmed on a project-by-project basis,
which varied from company to company. With data increasing so rapidly, software developers needed to
change their approach, as it was becoming clear that existing solutions could not handle the
differing data sources or the volume of data. Data management and speed have been put forward as
the most important reasons for moving to the cloud. This is only true if the correct software architecture
is in place to meet the needs of the user. Most open source software solutions would need adjusting
by skilled programmers to meet the needs of a small to medium business. As this sector is cutting edge,
finding such skill sets is difficult, very expensive, or both.
Expense is something major data companies don’t lose sleep over. Large companies have pioneered
solutions to big data problems that best suit their own products. The Internet of Things is
something every large multinational big data company needs to be ready for [1]. That is why most
companies are developing their own solutions to best suit their needs. Some of these companies release
open source versions of their software with basic functionality. This is a good way to get more
customers to try their products and to increase word of mouth. Examples include Pivotal, Cloudera and
Hortonworks. A more feature-rich version of the same product is sold under a software
license. This exclusive version is marketed with a name like “GOLD” or “PLATINUM” and comes
with telephone support for any issues during installation or day-to-day running of the product [2]. However,
these licenses can be expensive and are outside the budget of smaller companies.
In response, many open source projects were set up [3]. The purpose of these open source projects
was to automate solutions to the most common big data architectural problems. A high level of
Spring/Java knowledge is still needed, but as a result of automation the setup and management become
more intuitive. Spring open source projects bring more reliable big data solutions to a wider user base
and in less time than before. Spring XD is one such project; it is designed to run on a single machine or to
scale across a cluster, with the ability to grow almost without limit [4].
As big data system design is still a niche market, few studies have been carried out into
projects like Spring XD. Because new releases of Spring XD appear periodically, system
testing is for the most part left to the beta user. Online forums are a good indication of whether a
software package is being used by the general public, and Spring XD appears to be growing in
popularity. But as the user base grows, so too do the variations in use cases, which ultimately leads to bug
discovery and lost time for the user. Every system is different, with differing inputs
(sources), requirements (analysis) and outputs (sinks). The answer to this problem is a set of
guidelines that will not be accurate for every system but can help steer architects in the right
direction when designing systems that use Spring XD.
Spring XD is a software tool that has been created to help solve common big data input/output
problems. In essence it moves data from point A to point B, with the option of analyzing or editing the data
en route. Spring XD supports common big data features such as horizontal scalability and real-time analytics.
It does this by building on big data projects such as Apache Zookeeper [5], Apache
Hadoop and Redis. These open source projects are known to be reliable from years of usage
and are used by Spring XD to coordinate data across VM clusters, with the goal of scaling up the data
being processed.
Architecture
Spring XD Servers
The two fundamental parts of Spring XD are the Admin and Container servers. Once the Admin server is
running, a shell console can be opened which sends commands via HTTP to that Admin server. Spring
XD has two flavors: single mode and distributed mode. In single mode the Admin and Container servers are
on a single machine. In distributed mode the Admin server can have Container servers on the same
machine, but Container servers can also be spun up on multiple machines within the cluster [11].
Once the Admin server and shell console are running, commands can be submitted via HTTP from the shell
console. These commands are written in a Domain Specific Language (DSL) and comprise the instructions for
the Admin server to build a data stream. Each stream is made up of modules, and each module has an input
and an output. For the Admin server to accept a stream as valid, two modules must be present: a
Source module and a Sink module. It is also possible to add multiple Processor modules between the
Source and Sink modules.
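The validity rule for a stream can be illustrated with a short sketch. This is not Spring XD's DSL parser: the function name `parse_stream` and the role labels are assumptions made for illustration, and module options, taps and named channels are ignored. It simply splits a pipe-delimited definition and treats the first module as the source, the last as the sink and anything in between as processors.

```python
def parse_stream(definition: str) -> dict:
    """Split a pipe-delimited stream definition into module roles.

    Illustrative only: the real DSL also carries module options,
    taps and named channels, none of which are handled here.
    """
    modules = [part.strip() for part in definition.split("|")]
    if len(modules) < 2 or any(not m for m in modules):
        raise ValueError("a stream needs at least a source and a sink")
    return {
        "source": modules[0],
        "processors": modules[1:-1],  # zero or more processors
        "sink": modules[-1],
    }


print(parse_stream("time | log"))
print(parse_stream("http | filter | transform | hdfs"))
```

A definition with a single module, such as `"time"`, is rejected with a `ValueError`, which mirrors the requirement that both a Source and a Sink be present.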
The Admin server treats the stream as a sequence of modules and gives each module a definition. Each
module definition is in turn assigned to a container within the cluster. Once the container accepts the
module, the module is deployed within that container. This creates a Spring Application Context for
each module, as shown in Figure 1. Zookeeper keeps track of where all the module definitions are and
of the state of each container. If, for example, a container stops
communicating with Zookeeper, its modules are reassigned to a different container on a different
machine.
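The placement and reassignment behaviour described here can be pictured with a toy model. The sketch below is an assumption-laden illustration, not Spring XD or Zookeeper code: the `Cluster` class, its method names and the container names are all hypothetical, and placement is a plain random choice among live containers.

```python
import random


class Cluster:
    """Toy model of module placement; not Spring XD's or ZooKeeper's code."""

    def __init__(self, containers, seed=None):
        self.live = set(containers)
        self.placement = {}  # module name -> container name
        self.rng = random.Random(seed)

    def deploy(self, modules):
        """Assign every module to a randomly chosen live container."""
        for module in modules:
            self.placement[module] = self.rng.choice(sorted(self.live))

    def container_lost(self, container):
        """Drop a container and move its modules to the survivors."""
        self.live.discard(container)
        if not self.live:
            raise RuntimeError("no containers left to host modules")
        for module, host in self.placement.items():
            if host == container:
                self.placement[module] = self.rng.choice(sorted(self.live))


cluster = Cluster(["vm2", "vm3", "vm4"], seed=1)
cluster.deploy(["time", "log"])
print("before:", cluster.placement)
cluster.container_lost(cluster.placement["time"])
print("after: ", cluster.placement)
```

After `container_lost` runs, no module remains on the container that disappeared, which is the property the text attributes to Zookeeper's tracking of container state.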
[Diagram labels: XD Admin; ZooKeeper; two XD Containers; Source (output, outbound adapter, Spring Application Context); Sink (input, inbound adapter, Spring Application Context); transport provider (Redis, RabbitMQ).]
Figure 1: Spring XD Architectural Components
Stream components
A Spring XD stream is made up of modules: sources and sinks are modules, as are XD processors and
jobs. Each stream must have at least a source module and a sink module, and can
have any number of processor modules (0 to n) in between. Modules are basic, reusable Spring Integration message flows.
[Diagram labels: Producer sends Message; Message Channel; Consumer receives Message.]
Figure 2: Spring Integration message flow process
A simple example is a stream that outputs the time every second. The
following DSL command creates it:
stream create --definition "time | log" --name ticktock --deploy
This DSL command has no processor module, only the minimum source and sink modules. The name of
the stream is defined as “ticktock”. The source module is the predefined Spring XD “time” source,
which takes the current time from a Java time method; its output is sent to the input of the sink
module as a string roughly every second. The sink module simply writes each string it
receives to the container console.
A Processor is a module that does something to the streaming data; for example it can transform, split or
filter data while it travels from source to sink. A stream can contain zero, one or many processor modules, and the
processor actions are carried out in the order in which they appear in the DSL stream definition command.
Each module has a data input and a data output. The output of one module can be the
input of another module. The Source module takes data from one of a predefined list of sources
supported by Spring XD, for example a file, an HTTP source or an external RabbitMQ message queue. The
Sink module is similar, except that it takes its input from either the output of the Source module or the
output of a processor module, and its output goes to one of a predefined list of destinations
supported by Spring XD. For modules to communicate with each other like this a messaging service is
required. Spring XD currently supports the open source projects Redis and RabbitMQ, with plans to add
Apache Kafka in a future release.
Spring XD Message flow
[Diagram labels: XD Shell; XD Admin; ZooKeeper; four XD Containers holding Modules 1 to 4; five VMs; Redis or RabbitMQ; Data Source; Data Sink (HDFS, console, file, etc.); stream submission of M1 to M4; numbered steps 1 to 8.]
Figure 3: An example of a Spring XD message flow system
The following steps cover the user's interaction with the system under test, the flow of messages
between modules and how the user controls the number of modules being deployed. The steps
assume that the test system, the Admin server and all Container servers are
running, and that the XD Shell is connected to the Admin server and ready to accept commands.
1. The user types the stream command into the XD shell and deploys the stream. The Stream used
in this example diagram has four modules. “Source|processor|processor|Sink” or
“M1|M2|M3|M4”
2. XD Admin Server creates each module in the stream and deploys them to be worked on by
containers.
3. Zookeeper takes over the process of finding a container to deploy each module. Zookeeper
randomly assigns the modules to containers that are available for work; it then keeps track of
where each module is and its place in the stream.
4. The messaging service (Redis or RabbitMQ) is used to communicate between modules. It is also
used to bring data from the Data Source and to the Data Sink location, so even though locations
or modules may be physically located on different VMs they can still communicate and move
data between each other. For step 4, Redis or RabbitMQ is used to fetch data from the data
Source specified in the stream command and pass it to module one.
5. Redis or RabbitMQ is used to pass data from Module one to Module two.
6. Redis or RabbitMQ is used to pass data from Module two to Module three.
7. Redis or RabbitMQ is used to pass data from Module three to Module four.
8. Redis or RabbitMQ is used to pass data from Module four to whatever data Sink was
specified in the stream command. For example, a commonly used Sink on Big Data systems is the
Hadoop Distributed File System (HDFS). On the system used for testing during this
project, an HDFS instance was installed on a separate VM in the cluster, and using it as
the Sink worked seamlessly.
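Steps 4 to 8 can be sketched as a chain of workers joined by queues. In the sketch below, in-process `queue.Queue` objects stand in for the Redis or RabbitMQ transport and threads stand in for containers on separate VMs; the function names and the four module bodies are hypothetical and chosen only to make the hop-by-hop flow visible.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the data


def run_module(func, inbox, outbox):
    """Read from inbox, apply func, write to outbox until STOP arrives."""
    while True:
        message = inbox.get()
        if message is STOP:
            outbox.put(STOP)  # pass the end marker down the chain
            return
        outbox.put(func(message))


def run_stream(funcs, data):
    """Chain funcs with one queue per hop and return what reaches the sink."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=run_module, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in data:  # step 4: data source -> module one
        queues[0].put(item)
    queues[0].put(STOP)
    results = []
    while True:  # step 8: last module -> data sink
        message = queues[-1].get()
        if message is STOP:
            break
        results.append(message)
    for t in threads:
        t.join()
    return results


# M1 | M2 | M3 | M4 with made-up module bodies
modules = [str.strip, str.upper, lambda s: s.replace(" ", "_"), lambda s: s + "\n"]
print(run_stream(modules, ["  first line ", " second line"]))
```

Because each module only ever touches its own inbox and outbox, a module does not need to know where its neighbours run, which is the point the steps make about modules on different VMs.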
Message Bus [6]
Stream plugins use the message bus to bind a module's input and output channels to the transport
of choice. When a module in a stream is deployed, the stream plugin invokes the message bus to bind
the module's output channel, which is typically a direct channel defined in the module's
Spring Integration application context. Within the module's XML file the user defines channels with
the IDs input and output; the flow between them is taken care of by Spring Integration code. The stream plugin
queries the module for a component of type message channel named input, creates an
adapter for it on the fly and binds it to a RabbitMQ or Redis queue. Stream plugins are also
responsible for binding “tap” points and named channels associated with the stream, so a
stream can be tapped at any point before every module. If the user taps the source, they get a copy of
whatever messages are incoming to the stream, which can be used for real-time analysis. It is also possible to “tap” the stream
after some transformation has happened. Tap points are named channels
in Spring XD, and a stream can also be created with a named channel, for example http > queue:foo, where foo is
the named channel. All of these get bound by the message bus to the transport.
The message bus also takes care of marshalling. When messages travel over a remote
transport such as RabbitMQ or Redis and the payload is a POJO, serialization may be required. The
message bus has an optimization: if the payload is already a byte array no
transformation takes place, whereas a Java object needs to be
serialized, and Kryo [7] can be used for this. Kryo is the default serialization tool in Spring XD. At run time
the Admin server also uses the message bus to launch a job: the Admin sends a message to the job channel,
which is polled for messages, and once the message is received the job is kicked off. The message bus is therefore a
component shared by the Admin server and the Container servers.
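The byte-array optimization can be sketched in a few lines. This is an illustration of the decision only: Python's `pickle` stands in for Kryo, and the function names `to_wire` and `from_wire` are hypothetical rather than part of any Spring XD API.

```python
import pickle


def to_wire(payload):
    """Return (bytes, was_serialized); byte payloads pass through untouched."""
    if isinstance(payload, (bytes, bytearray)):
        return bytes(payload), False  # already bytes, nothing to do
    return pickle.dumps(payload), True  # pickle stands in for Kryo


def from_wire(data, was_serialized):
    """Reverse to_wire on the receiving side of the transport."""
    return pickle.loads(data) if was_serialized else data


raw = b"already bytes"
obj = {"user": "example", "count": 3}
for payload in (raw, obj):
    data, flag = to_wire(payload)
    assert from_wire(data, flag) == payload
    print(type(payload).__name__, "serialized:", flag)
```

The flag travelling alongside the bytes is an assumption of this sketch; it is simply the smallest way to let the receiver know whether deserialization is needed.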
So it is clear that the bulk of the work for the user is in the setup and configuration of a system such as
this. But what is missing is a specification that outlines data delay between source and sink or the impact
on a machine as the load increases. This paper aims to give a set of data that can be used as a guide
when designing big data systems where Spring XD is to be utilized.
Application Context Hierarchy [8]
In Spring XD the application context itself is designed to be extensible. Spring XD is a runtime and not
a framework, so when you build with XD you are not building the application context yourself as in a
traditional Spring application; the application context is created using Spring Boot, which does the work under
the covers. For this to work there has to be some way of linking Spring configuration, located in a
highly visible place, to the application context, so that any kind of Spring application or Spring beans
can be added. This is done through a combination of component scanning and appliances that look for
certain types of components in specific locations; these then get added to the application context. For
example, if a GemFire XD cache were used, special modules could be defined and accepted by Spring
XD that could then interact with the new GemFire cache.
The focus is on Spring XD because no major study has been undertaken into its performance metrics, and the
purpose of this paper is to remedy this. All of the main elements of the Spring XD system
therefore needed to be pushed to breaking point, recording the variables that influenced where this
breaking point was and which external factors were relevant. Measurable data metrics include:
time for data to stream from Source to Sink
CPU usage as data is streaming
Memory usage as data is streaming
All of the measured data is subject to the system used for testing. These system variables include
CPU make and speed, how many VMs were used and how much memory was given to each VM.
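As a rough illustration of collecting these three metrics around a single operation, the sketch below times a task and records its CPU time and peak memory. It is not the measurement code used for the results: it observes only the current Python process, and `measure` and `copy_lines` are hypothetical names, with `copy_lines` standing in for streaming a file from source to sink.

```python
import time
import tracemalloc


def measure(task, *args):
    """Run task once and report wall time, CPU time and peak memory."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = task(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, {"wall_s": wall, "cpu_s": cpu, "peak_bytes": peak}


def copy_lines(lines):
    # Hypothetical stand-in for streaming a file from source to sink.
    return [line.upper() for line in lines]


_, stats = measure(copy_lines, ["log entry %d" % i for i in range(100_000)])
print(stats)
```

Repeating the call with inputs of different sizes gives one row of measurements per size, which is the shape of data the file-size comparisons rely on.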
Building the Test System
The Spring XD cluster was installed from the ground up, adding new components piece by piece. A
VMware ESX hypervisor was installed on a dedicated bare-metal server. A VM was spun
up and the first step was to install Spring XD. At this point tests were carried out on the Spring XD single
mode server. A Twitter stream test was carried out to ensure the system was functioning correctly.
XD Single mode System [9]
[Diagram labels: HTTP POST of data processing DSL; XD Admin; XD Container; modules.]
Figure 4: Spring XD Single node architecture
Spring XD single node is a version of XD that works within the confines of a single machine and is usually
used for testing purposes or for small workloads. Once the single mode Admin server is running, the XD
shell can be opened. The user has the option of starting one container instance that can deploy multiple
modules, or starting multiple container instances each holding one module. Once a container instance is
running the user can use the XD shell to send stream commands, written in the DSL, to the XD Admin server via
HTTP. A pluggable messaging service is not required for a single mode setup, as the default
in-memory store is used.
XD Distributed System [10]
[Diagram labels: HTTP POST of data processing DSL; XD Admin; XD Containers with modules; Zookeeper, Redis/RabbitMQ, MySQL and an XD Container; Hadoop and an XD Container; VM1 to VM5.]
Figure 5: Spring XD distributed architecture
The virtual machine cluster was set up by adding four more Red Hat Enterprise Linux VMs to the ESX
server. The following outlines the design of the cluster, which is also illustrated in Figure 5.
VM1: Spring XD Admin server
VM2: Spring XD Container
VM3: Spring XD Container
VM4: Services and Spring XD Container
o Redis / RabbitMQ
o Zookeeper
o SQL Datastore
VM5: Hadoop and Spring XD Container
Spring XD was installed on each VM, which enabled an XD Container server or XD Admin server to be
spun up on any of the VMs in the cluster. Once the services were installed on VM4, the
xd/config/servers.yml file needed to be updated on each VM. Once the system is up and running the
servers.yml file should be identical on every VM. It contains the IP address of the VM where each
application server (for example Redis or Zookeeper) is running, along with the port number and any other setup
information used by that specific application. This allows communication across the
cluster between the Admin server and each of the big data application servers required by Spring XD.
Once a stream is deployed in a distributed system, each module of the stream is assigned to a random
container within the cluster. The modules communicate via a pluggable message bus. Redis is
the default messaging service but RabbitMQ is recommended. The advantage of Redis is its speed, but
RabbitMQ is more reliable, with more handshaking of transactions and persistent queuing.
A major design advantage of Spring XD is that every module has its own application context. This allows
the user to use two different configurations while streaming the same information on two different
streams. This is the type of freedom required for big data sources that need to be fed into different
systems simultaneously.
A “tap” is a type of stream that examines an existing stream and copies specific data from it
to create another stream. This causes no ill effects to the original stream.
To demonstrate the flexibility of modules and streams, a more impressive example uses the
Twitter API as the source of the data being streamed. To do this, however, the user must provide Twitter
verification keys to gain access to the Twitter stream as a source module. It is important to keep in mind
that the structure of the DSL command is all-important in deploying a stream with a successful outcome.
Functionality of the DSL: Twitter Stream Example
The user needs the DSL to specify the structure of a stream. The following Twitter stream example is a
good demonstration of this. To begin with, an initial stream is constructed [11]:
xd:> stream create tweets --definition "twitterstream | log"
Next, three “taps” are created, each of which copies data from the initial stream:
xd:> stream create tweetlang --definition "tap:stream:tweets
xd:> stream create tweetcount --definition "tap:stream:tweets > aggregate-counter" --deploy
xd:> stream create tagcount --definition "tap:stream:tweets > field-value-counter --fieldName=entities.hashtags.text --name=hashtags" --deploy
Finally, the initial stream is deployed, which begins the feed of tweets from the Twitter source:
xd:> stream deploy tweets
The “--deploy” option could have been added to the initial “create” command, but it is preferable in
this instance to keep the two separate so that the taps are set up before the feed begins.
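The relationship between a primary stream and its taps can be illustrated with a minimal Python sketch. This is not Spring XD code: the Stream class, its tap and publish methods and the sample tweets are all hypothetical names invented for illustration. The sketch only shows the idea that each tap receives its own copy of every message, so whatever a tap does leaves the primary stream untouched.

```python
import copy
from collections import Counter


class Stream:
    """Hypothetical stand-in for a stream: messages flow to a sink, with optional taps."""

    def __init__(self, sink):
        self.sink = sink      # callable that receives every message
        self.taps = []        # callables that each receive a copy of every message

    def tap(self, tap_sink):
        """Register a secondary consumer, loosely analogous to 'tap:stream:tweets > ...'."""
        self.taps.append(tap_sink)

    def publish(self, message):
        # Each tap gets a deep copy, so nothing it does can alter the primary message.
        for tap_sink in self.taps:
            tap_sink(copy.deepcopy(message))
        self.sink(message)


# Primary stream: a stand-in for "twitterstream | log" that logs to a list.
logged = []
tweets = Stream(sink=logged.append)

# Taps are registered before any data flows, as in the example above.
tweet_count = Counter()
shouted = []


def count_tap(message):
    tweet_count["total"] += 1


def shout_tap(message):
    message["text"] = message["text"].upper()   # mutates only the tap's own copy
    shouted.append(message)


tweets.tap(count_tap)
tweets.tap(shout_tap)

# Feeding the primary stream with made-up tweets.
for tweet in [{"text": "hello"}, {"text": "big data"}, {"text": "spring xd"}]:
    tweets.publish(tweet)
```

After the loop, logged still holds the three tweets in their original lower-case form, while tweet_count and shouted hold the results produced by the two taps.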
Other streams were also experimented with. For example, an HDFS Sink was set up on VM5. This was used
to stream EMC log files (of differing sizes) from a file source on VM2 or VM3 to the HDFS location on
VM5. Once the files had streamed it was possible to manually check the Hadoop file system on VM5 to
ensure the stream was successful.
Storing DSL commands 12
Spring XD has two ways of storing DSL commands and server state information. The first is the
default, where this information is held in memory but not written to disk. The downside to
this is that if the server is shut down, all of the information is lost. The second is to use the Redis key-value store. For
the test system created for this project I also required a MySQL database that Redis connects to in order
to store the values to disk. The Redis properties are set in the redis.properties file, and the Admin and
Container Servers connect to Redis through the location of Redis entered in the servers.yml file located in
the Spring XD config folder. For a distributed system, Zookeeper is used as the centralized storage for
stream and job definitions. While Zookeeper is running, all definitions are kept in memory but are also
written to disk, so if Zookeeper is rebooted for whatever reason, all previous streams and jobs will
still be there. The Zookeeper properties are set in the zoo.cfg file, and again servers.yml is where the user
points Spring XD to the running instance of Zookeeper.
Preparing the VMs
The operating system used was Red Hat Enterprise Linux Server release 6.6 (Santiago). Java
version 7 or above should be installed. The hardware specifications for the ESX server were as follows:
CPUs available: 12 x 2.1 GHz
Processor type: Intel Xeon CPU E5-2620 V2 @ 2.1GHz
Memory Capacity available: 98230MB
Number of NICs available: 4
The hardware specifications for each individual VM on the ESX server were as follows:
CPUs available: 4 x 2.1 GHz
Processor type: Intel Xeon CPU E5-2620 V2 @ 2.1GHz
Memory Capacity available: 4096MB
Number of NICs available: 1
Java heap size: 64MB (default setting used during testing)
Maximum Java heap size possible on the VM: 1GB
Analysis
Analysis of Spring XD functionality
Spring XD has many excellent features and is defined by Pivotal in the following way: “A unified,
distributed and extensible service for data ingestion, real time analytics, batch processing and data
export”13. Breaking this definition down into its elements is a good way of unlocking the
feature-rich Spring XD system.
Unified14
There are currently many standalone Apache open source projects that tackle problems like data
ingestion, real time analysis, data streaming and data loading. Spring XD can be described as a unified
service because it packages many solutions in one service. Table 1 below lists projects (available
as open source or under a paid license from companies like Pivotal) whose functionality Spring XD has attempted to
combine in a single service. For data ingestion, loading and analysis these products differ and are
used in differing use cases, so they do not directly compete with Spring XD. However, it could also be argued
that there is crossover in most of these areas and that Spring XD is easier to use due to the automated
structure of the input, analysis, batch and output plugins.
Data Ingestion and Data Loading    Apache Sqoop, Apache Flume, Pivotal Data Loader
Real time Data Analysis            Apache Pig, Apache Hive and Apache Mahout
Data Streaming                     Apache Oozie, Spark Streaming and Apache Storm
Batch Processing                   Pivotal Data Loader, Apache Hive, Pivotal HAWQ, Apache Pig
Table 1: Products with partial Spring XD functionality
Distributed Runtime:
For industry-scale applications Spring XD runs as a distributed system. This brings a requirement to deploy
and undeploy modules when necessary, within a dynamic system that
can be scaled up or down by the user as required. The system must also know when a
container has to be spun up or down, so that if a container fails or stops functioning, the work that
was being done on it can be restarted on a different container. The Admin Server has a set amount of
intelligence prescribed by the user and can actively assign modules to specific containers. The user can
also specify the number of modules to be deployed on a specific container. The Admin Server can also
query the various containers in the system to understand the current state of a stream.
Spring XD also supports multiple Admin Servers. The leader Admin Server is elected at random and is referred to
as the supervisor. The supervisor makes all the decisions on which modules are deployed to which containers,
and it also handles every Container Server that is created or destroyed.
The standby Admin Servers provide redundancy so there is no single point of failure within the
system. If the supervisor fails, one of the other Admin Servers is elected as supervisor and picks
up where the last supervisor left off. The standby Admin Servers can also be useful for heavy workloads.
For example, if a load balancer is placed in front of three Admin Servers, the load balancer can send each
command to the Admin Server that is least stressed. The load arriving at the load balancer could be
coming from, say, a REST API with an unpredictable amount of data.
A distributed system such as Spring XD must, as an entity, be able to come to a consensus on decisions such as
whether a Container Server is dead or alive, or what to do if a general error is seen on the system.
Zookeeper is a tool set for building a highly available distributed coordination service, which is essential in
the construction of distributed systems. On a large scale distributed system Zookeeper requires an odd
number of Zookeeper Servers to be running, because any update to or read from Zookeeper requires a
quorum of servers. If five servers are set up, at least three of them need to be up and able to
see each other in order to make any progress on reads or writes. If the quorum is disrupted then no
reads or writes can occur15.
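The quorum arithmetic can be made concrete with a short Python sketch. It assumes a simple majority quorum, which matches the five-server example given here; the function names are illustrative and are not part of any Zookeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority of the ensemble."""
    if ensemble_size < 1:
        raise ValueError("ensemble must contain at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


# Compare odd and even ensemble sizes.
for servers in (3, 4, 5, 6):
    print(servers, quorum_size(servers), tolerated_failures(servers))
```

Under this assumption an ensemble of four servers tolerates no more failures than an ensemble of three, and six no more than five, which is one way of seeing why an odd number of servers is the usual choice.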
If updates are required to the system and two updates are sent one after the other, the first update
will be written to every part of the system before the second update is written to any part of the
system. This is called guaranteed ordered delivery. Spring XD uses Zookeeper as a storage system for
Stream and Job definitions. Zookeeper keeps these definitions in memory but also writes everything
to disk in case of system-wide failure.
Extensible:
Another ability of Zookeeper is to notify the user on a specified thread if any change to the system
occurs, such as a container failing. Spring XD uses this feature to be notified of any change in the number of
Containers, Streams or Jobs.
Ephemeral nodes are hugely important. The user can mark a node as ephemeral, which means the user is
notified whether the node has gone down or is still available for work. This is done through a type of heartbeat
registry, where each node must periodically send a signal to the master node telling it that it
is alive. If this does not happen, the master node presumes the node is dead and reassigns its work
elsewhere. Spring XD uses this feature of Zookeeper to track Container Servers, Modules and Admin
Servers. If the streaming load increases then more containers can be added. The supervisor will
initially spread the new Container Servers out across the system and then load balance, which means
moving workloads from Container Servers with high loads to the newly deployed servers so that each
Container Server has approximately the same workload.
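The heartbeat registry and the reassignment of work described above can be sketched in a few lines of Python. This is a simplified illustration and not the Zookeeper or Spring XD implementation: the HeartbeatRegistry class, the reassign function, the container names and the ten second timeout are all assumptions made for the example, and times are passed in explicitly so the behaviour is easy to follow.

```python
class HeartbeatRegistry:
    """Hypothetical registry: a node counts as alive while its last heartbeat is recent."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}   # node name -> time of the most recent heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> list:
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout_seconds)


def reassign(assignments: dict, dead: list, live: list) -> dict:
    """Move each module of a dead container to the least-loaded live container."""
    if not live:
        raise RuntimeError("no live containers left to take over the work")
    result = {node: list(mods) for node, mods in assignments.items() if node not in dead}
    for node in live:
        result.setdefault(node, [])
    for node in dead:
        for module in assignments.get(node, []):
            target = min(live, key=lambda name: len(result[name]))
            result[target].append(module)
    return result


registry = HeartbeatRegistry(timeout_seconds=10)
for name in ("container-1", "container-2", "container-3"):
    registry.heartbeat(name, now=0)
# container-2 goes silent; the other two keep reporting.
registry.heartbeat("container-1", now=12)
registry.heartbeat("container-3", now=12)

dead = registry.dead_nodes(now=15)
live = [name for name in registry.last_seen if name not in dead]
rebalanced = reassign(
    {"container-1": ["twitterstream"], "container-2": ["log", "aggregate-counter"], "container-3": []},
    dead,
    live,
)
```

Choosing the least-loaded live container for each orphaned module is one simple way of keeping the workloads approximately equal; the real supervisor may well use a different policy.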
Data ingestion16
One of the goals of Spring XD is to automate stream deployment so that, for the user, Spring XD saves
time, is easy to deploy and is reliable. To do this Spring XD uses Spring Integration adapters, which are
compatible with the most commonly used types of data source. Table 2 below lists and describes
these input sources. Within the Admin Shell, variations can be specified by the user in the DSL commands. For
example, if data is to be streamed from a file location, the user can specify whether the data should be serialized
or kept in the same format before reaching the output. The following option specifies
that the data be kept in plain text format: --outputType=text/plain. Custom sources can also be built
with relative ease; however, a good knowledge of Spring Integration is required for this.
Input Source     Description of what data is used as input to source module
HTTP             When data is posted to the specified HTTP Server
SFTP             Secure File Transfer Protocol is the protocol used to transfer files from a given local directory
Tail             When a file is appended to for example a log file of a running system, the data that is added is copied and is used as the input to source module
File             The File contents of a specified File directory
Mail             The incoming Mail of a specified Mail Server directory
Twitter Search   Receives data by continuously querying real time Twitter Server Streams
Twitter Stream   Ingests data from the Twitter streaming API
Gemfire Source   Receives data from a specified data Cache
Gemfire CQ       Receives data from a query operating on a specified cache source. Only receives data when the query result changes.
Syslog           The TCP protocol is used to ingest data from specified log files
TCP              This source acts as a Server and allows a remote connection to Spring XD and submit data via raw TCP sockets
TCP Client       This source acts as a Client and allows a remote connection to Spring XD and submit data via raw TCP sockets
Reactor IP       This source acts as a Server and allows a remote connection to Spring XD and submit data via raw TCP or UDP sockets
JMS              Receives messages from Java Message Service
RabbitMQ         Receives messages from RabbitMQ message queuing service
Time             A String format containing the time every so often
MQTT             Connects to MQTT server and receives telemetry messages
Stdout Capture   Combination of TCP and NETCAT command to capture output of a command
Kafka            Ingests data from Kafka topic configuration
JDBC             Ingests data from various databases
Table 2: Spring XD Stream Sources
Real time Analytics:
For large scale workload analysis on distributed systems, Spring XD would be used as a single entity within a
toolbox of products. Spring XD would only handle the data streaming. The reason for this is that there are
more sophisticated and specialized analysis tools on the market for high end analysis. However, each
system is different and has different requirements, so the level of analysis that Spring XD offers may be
sufficient and render the more advanced tools redundant. In Spring XD, analytics can be done in two
different but similar ways. In the first, analytics modules are added to a stream and placed between the
Source and the Sink. These processor modules transform the data or perform the analytics on the
primary stream, outputting just the result of the processing. The simplest type of analytics
supported by Spring XD uses counters and gauges to perform various types of aggregation.
In the second, a tap is used to create a secondary stream. The tap can be placed anywhere along the stream to
best suit the needs of the system, and the secondary stream can then have analytics applied to it. This is the
most widely used form of analytics in Spring XD and can be powerful when combined with a visual
representation of the streaming data. This can be done through the REST API together with a script written
in, say, Groovy. An example of this is the Twitter stream described in the “Functionality of the DSL:
Twitter Stream Example” section of this paper, where the number of tweets containing a certain hashtag
can be represented visually on screen and changes in real time. Figure 6 was taken from the
analytics dashboard that was installed as part of the investigation for this project.
Figure 6: Twitter Stream Analytics Dashboard
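The kind of aggregation performed by the counters behind a dashboard such as this can be sketched in Python. The sketch does not use Spring XD itself: the function names, the sample tweets and the timestamp field (seconds since the start of the feed) are assumptions made for illustration, while the entities.hashtags.text path mirrors the field name used in the tagcount tap earlier in this paper.

```python
from collections import Counter, defaultdict


def hashtag_counts(tweets):
    """Count occurrences of each hashtag, in the spirit of a field-value counter."""
    counts = Counter()
    for tweet in tweets:
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"]] += 1
    return counts


def counts_per_minute(tweets):
    """Total tweets per one-minute bucket, in the spirit of an aggregate counter."""
    buckets = defaultdict(int)
    for tweet in tweets:
        buckets[tweet["timestamp"] // 60] += 1
    return dict(buckets)


# Made-up tweets; the timestamp field is a hypothetical number of seconds.
sample = [
    {"timestamp": 5, "entities": {"hashtags": [{"text": "bigdata"}, {"text": "springxd"}]}},
    {"timestamp": 42, "entities": {"hashtags": [{"text": "bigdata"}]}},
    {"timestamp": 75, "entities": {"hashtags": []}},
]

tags = hashtag_counts(sample)
per_minute = counts_per_minute(sample)
```

A dashboard would then poll values such as tags and per_minute at intervals and redraw its charts as they change.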
Machine learning algorithms can also be applied in Spring XD via extensible class libraries.
Machine learning algorithms usually run as a batch process on data that is saved in relational or
non-relational storage. As the data streams in real time, the system is limited in the amount of processing
it can do, and this aspect of Spring XD is still being fleshed out.
Batch Processing:
Batch processing is an ordered sequence of tasks in which one task needs to be completed before
the next one commences. Every Big Data system is different so no two batch systems are likely to
be the same. One of the most used and user-friendly solutions is Spring Batch. Spring Batch attempts to
tackle this problem and does so successfully to a degree. The Spring projects in general remove the need for
boilerplate code and offer support through jar plugins that automate features. However, a high degree of
Spring and Java coding is still required. With this in mind it is true to say that most batch-oriented systems
are highly complex to implement and are not developer friendly. Hadoop would be an example of this,
where anything deviating from the out-of-the-box solution is extremely complex to implement. Spring
XD on the other hand has automated the process, with the ability to add and remove processes; all
the user needs to do is set up the infrastructure and then create and deploy the stream. The pipe structure
of the stream ensures that each process is carried out in sequence.
Data export:
As with data ingestion, Spring XD also automates the plugins for data output, or data streaming destinations.
Such a destination is referred to as the Sink. Every stream must end with a predefined Sink definition. Table 3 below
outlines all available predefined sink options.
Sink             Description of Sinks (Output Sources)
Log              Data gets output by application logger to the Container Console
File Sink        Data gets output to a file on the Container OS
Hadoop HDFS      Data is output to HDFS
HDFS Dataset     Output data is Java classes and is stored in that way on HDFS
JDBC             Data gets output to a relational database table
TCP Sink         Data is output via TCP protocol
Shell Sink       Complex sink that allows a process written in any language to modify the data
Mongo            Data is output to a Mongo collection
Mail             Output messages get sent as emails
RabbitMQ         Data is output as Rabbit messages
GemFire Server   Data is output to a GemFire Cache
Splunk Server    Data is output via TCP to a Splunk connector
MQTT Sink        Connects to MQTT Server and data gets published as telemetry messages
Dynamic Router   Routes messages to certain named channels based on outcomes of specified expressions or scripts
Null Sink        Data gets destroyed but not before analytics is run via a tap
Redis            Data gets output to Redis data store
Kafka Sink       Data gets output to Kafka topic configuration
Table 3: Spring XD Stream Sinks
Apache Storm:
Another open source project that streams data in real time is Apache Storm. Some of the key
characteristics of Apache Storm are that it is a highly scalable real time event stream processing platform,
is extremely fault tolerant, guarantees processing and is language agnostic. Ostensibly Storm has many of the
features of Spring XD, but with Storm the user is tasked with writing the processing logic. A plus, however,
is that the user can write this processing logic in any language. In Spring XD it is possible to write a custom
processor to transform the streaming data in a unique way, but this is unlikely to be needed as most
transformations are automated. What makes the stream transformation unique in Spring XD is the way
the processors are queued.
Like Spring XD, Apache Storm uses Zookeeper to scale up the cluster if needed as data input grows.17
Storm, like Spring XD, does not use Zookeeper for message passing, so the quantity of data stored on
Zookeeper is low, which is best practice for a Zookeeper cluster. As Storm is an Apache project it is built
to interact with the Apache family of products. This includes Apache YARN and Apache Ambari. Apache
YARN is installed on top of Hadoop as a resource manager and provides centralized management for
the HDFS cluster. This includes management of consistent operations, security and data governance18.
Figure 7 shows how Apache Storm would fit into an ecosystem such as this18.
Figure 7: Big Data environment
Apache Ambari is also used as a management tool for Storm. It is used in a different way to YARN, in that
Ambari visually represents all the servers running Storm and Hadoop, so on one screen the operator can
see an overview of the system. In fact any Apache project that is running on the cluster can be added to
Ambari for monitoring and control, which makes it a very powerful tool. Spring XD does not boast an
overall system monitor such as Ambari. Most projects, like Storm and Spring XD, also have dedicated
REST based UIs. In the case of Spring XD the REST API gives an overview of the Admin and Container Servers but
does not include the health of the Hadoop cluster.
An area where Storm falls down is the complexity of the setup, where the user is encouraged
to view and edit code19. There is an inherent advantage to knowing how something functions from a
coding perspective, but this limits the number of people who can get a Storm cluster to the point where it
is streaming data20. With Spring XD all of the Source connectivity is automated, with declarative
statements adjusting the format of the data that acts as input. Storm uses a Source connector called a
Spout. This Spout connection reads data from a queuing broker21 like Kafka or RabbitMQ. It is these
Spout implementations that are lacking in automation and variation. Spring XD also has support for
RabbitMQ and is planning support for Kafka as a Source input, and this comes ready to use out of the box.
Storm was designed so that the input source takes input only from messaging services, which is why
most of the queuing systems are supported21.
Storm, being an enterprise solution, has most of its use cases for Source connectivity involving HBase and
HDFS. Storm uses the concept of a Bolt connector for data output. The architectures of Storm and
Spring XD are fundamentally different, as can be seen in the function of a Storm Bolt. Any number of
streams can flow into a Bolt processor21. The data can then be transformed using many functions such as
filters, joins and aggregations, or by pulling in data from relational databases. Once transformed, the data can be
split or copied into any number of output streams.
Figure 8: An Apache Storm Topology
A topology is the third concept that is introduced with Storm. Figure 8 shows an example topology
setup22. It is made up of Spouts and Bolts, where once data is fed into an initial Spout it can be processed
many times by many different Bolts. In this setup the output of one Bolt can be used as the input to
another Bolt or Spout21. In this way a multi-stage computation can take place, but again there is no easy,
intuitive way of connecting these Bolts and Spouts to produce the output the user is looking for.
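The idea of wiring a Spout into a chain of Bolts can be illustrated with a small Python sketch. It does not use the Storm API: the spout and bolts are plain Python generators and functions with hypothetical names, and the word-count task is chosen only as a familiar example. What it is meant to show is that the wiring between stages has to be written out explicitly by the user.

```python
from collections import Counter


def sentence_spout():
    """Hypothetical spout: emits raw items from a source (here, a fixed list)."""
    yield from ["spring xd streams data", "storm streams data too"]


def split_bolt(sentences):
    """Bolt 1: turns each sentence into a stream of words."""
    for sentence in sentences:
        yield from sentence.split()


def filter_bolt(words, min_length):
    """Bolt 2: keeps only words of at least min_length characters."""
    return (word for word in words if len(word) >= min_length)


def count_bolt(words):
    """Bolt 3: aggregates the stream into word counts."""
    return Counter(words)


# The "topology" is the explicit wiring of the spout and bolts.
word_counts = count_bolt(filter_bolt(split_bolt(sentence_spout()), min_length=4))
```

By contrast, the equivalent arrangement in Spring XD would be expressed declaratively as a single piped stream definition.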
Like Spring XD, however, Storm has a “local mode” (called single-node mode in Spring XD) for testing smaller
simulations when developing larger systems21.
Spark Streaming
Spark Streaming is an add-on to the Spark framework that adds support for continuous data streaming.
The goal of the Spark Streaming infrastructure is to create a system that is fault tolerant, cost effective
and fast23. It was noticed that existing fault tolerant systems were slow to recover from a fault, and
this was the problem that Spark Streaming set out to solve. The answer was to run the streaming
computation as a series of very small batch jobs. The smaller the batch size, the lower the latency required
to recover.
Within the Spark framework the user can write programs that transform the data, and this is done across
Resilient Distributed Datasets (RDDs). RDDs are a fundamental architectural concept introduced
with Spark. Within this architecture the user can split data as they feel will best suit the needs of the
system. The data can be kept in memory or on disk as it is being processed. Parallel processing of data is
supported, in that if one process becomes too big for the memory of one machine the data will be
shared across the memory of multiple machines within a cluster. Like Spring XD, fault tolerance is supported through
integration with Apache Zookeeper. RDDs are periodically monitored and if one stops responding it will
be rebuilt on a different machine.
Once the data is received, Spark Streaming breaks it into small batches. Once it is in batches the user
can apply transformations such as map, filter and group by. This is similar to Spring XD in that
transformations are akin to putting together a stream of modules. The difference is that once a
stream is deployed in Spring XD each module will be carried out in order, whereas in Spark Streaming all
batch tasks will be carried out in parallel. Like Spring XD, nothing will happen until the user actions
(deploys) a command. Examples of Spark Streaming actions include count, collect and save.
Data can be saved to HDFS, with most storage types being supported. Before a task is actioned the transformations
just build up, and the RDDs plan how the tasks are going to be scheduled and run. Figure 9 shows the
architecture of Spark Streaming24. Once the action is deployed the executors carry out the task and
return a result. Like Storm, multiple languages are supported for implementing the executors, tasks
and actions. Examples include Scala, Python and Java.
Figure 9: Spark Streaming Architecture
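The micro-batch model described in this section, where transformations build up and nothing runs until an action is called, can be sketched in plain Python. This is not the Spark API: micro_batches and the LazyBatch class are hypothetical names, the log lines are made up, and the sketch processes batches one after another rather than in parallel.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Split an incoming iterable of records into lists of at most batch_size items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


class LazyBatch:
    """Hypothetical stand-in for a batch: transformations queue up until an action runs."""

    def __init__(self, data, steps=()):
        self.data = data
        self.steps = tuple(steps)

    def map(self, fn):
        return LazyBatch(self.data, self.steps + (("map", fn),))

    def filter(self, fn):
        return LazyBatch(self.data, self.steps + (("filter", fn),))

    def collect(self):
        """Action: only now are the queued transformations applied, in order."""
        result = list(self.data)
        for kind, fn in self.steps:
            if kind == "map":
                result = [fn(item) for item in result]
            else:
                result = [item for item in result if fn(item)]
        return result

    def count(self):
        """Action: number of items left after all queued transformations."""
        return len(self.collect())


lines = ["INFO start", "ERROR disk", "INFO ok", "ERROR net", "INFO done"]
errors_per_batch = [
    LazyBatch(batch).filter(lambda line: line.startswith("ERROR")).map(str.lower).count()
    for batch in micro_batches(lines, batch_size=2)
]
```

Because each batch is small and self-contained, a failed batch can simply be recomputed from its input, which is the intuition behind the fast recovery described above.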
Comparing these three systems is difficult as each use case is different. It really comes down to choosing
the right tool for the job. Table 4 shows a generalization of where each framework excels and the
areas in which each framework is not a market leader. Spark Streaming, for example, has been seen to be difficult
to maintain because of the existence of two stacks, one for streaming and the other for batch
processing25. The flexibility of such a system is high, but the learning curve when using a system such as this
is also high. Compare this to Spring XD, where the system is as useful in many ways. For example, multiple
copies of a stream can be created at any time from the primary stream. This is done in the form of “taps”,
which can then be transformed and saved for batch processing. Storm would also be seen as difficult to
put together due to its low level API, where the user needs to structure the Spouts and Bolts to feed
into one another and come up with the correct form of processing.
Spark Streaming also does not scale well, due to the fact that it is so flexible to program. When a small cluster is
doing many things well that is great, but when the load increases so too do the bugs. When the system is
doing so many things at once a lot of fine tuning is required as the system grows. However, if the
company implementing the system has the budget for the upkeep of a system such as this, then Spark
Streaming is well capable of scaling in such a flexible environment to process petabytes of data26.
                                                 Spring XD   Apache Storm   Spark Streaming
Does the system scale well?                      Yes         Yes            No
Does system have low latency?                    No          Yes            Yes
Flexibility level of the system?                 High        Low            High
Automation level of the system?                  High        Medium         Medium
Implementation difficulty level of system?       Low         High           High
Easy to Integrate for Batch processing.          Yes         No             Yes
Table 4: Compare and contrast Spring XD, Apache Storm and Spark Streaming
Results
The system under test
The purpose of this project was to analyze Spring XD, and this was done by varying parts of the system. The
messaging system used to carry information between modules is the main variable in the system, so this
was the variable tested. Redis was the first messaging service tested as it is the default and is very quick and easy to
install. Redis is an in-memory, high performance key-value datastore. One drawback of using Redis is that
the data held in the datastore is limited to the non-persistent memory available on the system. Another is that if
some of the data packets that the Redis messaging service is transporting across the network get
dropped, there is no way for Redis to rectify this27, so preventing data corruption is not something
Redis can do. However, Redis is fast and data integrity is not essential for many systems, so with this in
mind Redis is a valid option for testing.
For comparison RabbitMQ was also installed for testing purposes. RabbitMQ is not as easy as Redis to
install. First Erlang needs to be installed, and this requires all VMs to be running the same version of the OS.
In this case Red Hat Enterprise Linux 6 was used. Installing Erlang on machines that had a firewall was a
challenge and finding ways around this took a lot of trial and error. Once Erlang was installed RabbitMQ
was downloaded and installed. It is the little things that can hold up a project such as this. For example,
once RabbitMQ was installed it would not communicate with the VMs that had the Admin Server and
Container Server installed on them. Even after the communication port had been opened and set in the
RabbitMQ config files and in the servers.yml files of the Admin and Container Servers, RabbitMQ was still failing to
carry messages once the streams were deployed. The cause was a change in the servers.yml file between Spring XD
versions: the documentation said that the IP address of the machine running RabbitMQ needed to
be in the form “host:xx.xx.xx.xx”, whereas with the new version of Spring XD it needed to be in the form
“address:xx.xx.xx.xx”.
Once installed RabbitMQ is reliable and, unlike Redis, it does have message acknowledgements. This has
the tradeoff of slowing down the system. These message acknowledgements come in two forms:
transactions and publisher confirms. RabbitMQ also gives the user more control over who can use the
system, in the form of permissions for queues and exchanges. The user can also decide whether the
information for specific exchanges and queues is kept in memory or written to disk28.
The test process
As Redis is supposedly a fast messaging service and RabbitMQ supposedly a slow one, the
expectation was that a time calculation would bear out this difference. The stream used to calculate the
time took its data from a local file in a VM directory and streamed it to the container's
console as plain text. To calculate the time, the following steps were needed.
1. The Admin and Container Servers needed to be started.
2. The stream needed to be deployed. The following stream was used:
stream create --name sourcefilename --definition "file --outputType=text/plain | log" –deploy
The reason for using this stream definition as opposed to any other variation of stream is that a
stream definition had to be picked and file source was best suited to calculate time. Printing the
data to the console also suited for a time calculation but it also helped with visual confirmation
that the streaming had commenced.
3. A Java program was written to record the time when a file was created within the source folder of
the stream. The Spring XD “file” source works as follows. Any file stored in the folder
/tmp/xd/input/name_of_the_deployed_stream is streamed once the stream is
deployed. In this case the folder is empty when the stream is deployed, so Spring
XD polls the source folder, waiting for a file to be dropped into it. The Java program polls the
same folder, also waiting for a file to appear. A Linux command was used to copy in files that
had been created at specific sizes. The command was: “cp ../filesizes/2MB.txt .” where the dot at
the end stands for the current directory, so the file 2MB.txt is copied to the current directory.
4. Execute the Linux command. Once this command is executed the file gets created in the stream
source folder and the current time gets recorded. The data now stored in the source folder
starts to stream to the console of the Container server. Once the data stops streaming the time
is also recorded. Subtracting these two times gives the total time of the stream.
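Steps 3 and 4 can be sketched in Java as follows. This is a minimal illustration under assumptions, not the program used in this project: the class name StreamTimingSketch, its method names, and the poll and timeout parameters are all hypothetical. It polls a folder until a file appears and computes the elapsed time from two recorded timestamps.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.Optional;

/** Hypothetical helper: times a run from "file appears" to "stream finished". */
final class StreamTimingSketch {

    /** Polls dir until it contains at least one entry, or gives up after timeoutMillis. */
    static Optional<Path> waitForFirstFile(Path dir, long pollMillis, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                Iterator<Path> it = entries.iterator();
                if (it.hasNext()) {
                    return Optional.of(it.next()); // a file has been dropped into the folder
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (System.nanoTime() >= deadline) {
                return Optional.empty(); // nothing arrived in time
            }
            Thread.sleep(pollMillis);
        }
    }

    /** Total stream time: end timestamp minus start timestamp, in seconds. */
    static double elapsedSeconds(long startMillis, long endMillis) {
        return (endMillis - startMillis) / 1000.0;
    }
}
```

In use, waitForFirstFile would be pointed at the stream's source folder and System.currentTimeMillis() recorded when it returns; a second timestamp taken when the console output stops gives the end time.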
Data Limit:
Files with a lower size limit of 2MB and an upper limit of 50MB were chosen for the streaming. 2MB
was a good starting point: with anything smaller the user would not even notice the stream
taking place, as the time taken was insignificant. Anything above the 50MB file size causes the message
service to crash. For Redis an “OutOfMemoryError” for the “java heap space” is returned, as can be seen
in Figure 10. This occurs for any streamed file whose size is greater than 50MB.
Figure 10: Redis out of memory java heap space error
For RabbitMQ a similar error is seen in Figure 11, where the queue called “xdbus” throws the
“OutOfMemoryError” for the “java heap space”. The similar error occurs at the
same point in the testing because the memory allocated to the Spring XD application is
approximately 50MB. For the JVM it does not matter which messaging service is being used: the heap
space allocation stays the same [29]. Once a file of more than 50MB is put on the heap the system crashes, so
testing with the default settings is restricted to an upper file limit of 50MB.
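The heap ceiling described above can be inspected from inside any JVM. The sketch below is illustrative only: HeapLimitSketch and fitsInHeap are hypothetical names, and the comparison is deliberately crude because it ignores object overhead and everything else already living on the heap. It reads the configured maximum with Runtime.getRuntime().maxMemory() and checks whether a payload of a given size could fit at all.

```java
/** Hypothetical helper: compares a payload size with the JVM's maximum heap. */
final class HeapLimitSketch {

    static final long MB = 1024L * 1024L;

    /** True if fileBytes is no larger than maxHeapBytes (a necessary, not sufficient, condition). */
    static boolean fitsInHeap(long fileBytes, long maxHeapBytes) {
        return fileBytes <= maxHeapBytes;
    }

    public static void main(String[] args) {
        // Upper bound on the heap this JVM will attempt to use.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap (MB): " + maxHeap / MB);
        System.out.println("2MB file could fit:  " + fitsInHeap(2 * MB, maxHeap));
        System.out.println("60MB file could fit: " + fitsInHeap(60 * MB, maxHeap));
    }
}
```

Because the limit belongs to the JVM rather than to the message transport, the same check gives the same answer whichever messaging service is configured.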
Figure 11: RabbitMQ out of memory java heap space error
This queue can be seen in the RabbitMQ control UI where all unacknowledged messages are also
recorded. Figure 12 below was taken as part of this project where newfiletest1 is the name of the
stream source folder.
Figure 12: Image of RabbitMQ User Interface
Results 2MB to 50MB:
The following data sets were measured from the system under test. First a range of files from 2MB to
50MB were streamed using the stream format already described (file to Container server console).
Figure 13: 2MB to 50MB measuring time taken for data to stream from Source to Sink
As can be seen from Figure 13 the results are linear: as the size of the files increases, so
too does the time taken for the stream to complete. The assertion that RabbitMQ is a slower system
because of the handshaking that provides reliable delivery of data is also borne out in these results.
For the smaller files, between 2MB and 6MB, the times for Redis and RabbitMQ are
relatively even; the differences are negligible or too small to measure accurately. As the size of the data
increases, so too does the time difference between the two messaging services, from a 33% increase
in time when moving from Redis to RabbitMQ at 8MB to a 50% increase at 50MB. As the total time increases
with each jump in file size, the absolute difference between the messaging services also increases: the
50% increase at 50MB equals an eleven second difference overall, which is
significant. From these results it can be clearly seen, excluding a few anomalies, that RabbitMQ is in
fact slower than Redis.
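As a worked check on these figures, and assuming the 50% increase is measured relative to the Redis time, the two statements about the 50MB file imply approximate stream times for each service. These values are derived from the percentages quoted above, not read from Figure 13:

```latex
\begin{align*}
t_{\text{RabbitMQ}} &= 1.5\, t_{\text{Redis}} \\
t_{\text{RabbitMQ}} - t_{\text{Redis}} &= 11\ \text{s} \\
\Rightarrow\quad 0.5\, t_{\text{Redis}} &= 11\ \text{s}, \qquad
t_{\text{Redis}} \approx 22\ \text{s}, \qquad
t_{\text{RabbitMQ}} \approx 33\ \text{s}
\end{align*}
```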
[Figure 13 chart: time in seconds (y-axis, 0 to 35) against size of source data in MB (x-axis: 2, 4, 6, 8, 10, 20, 30, 40, 50); series: RabbitMQ and Redis]
The next test measured and compared the percentage CPU usage for Redis and RabbitMQ. Again, file
sizes from 2MB to 50MB were used. RabbitMQ is slower and has built-in handshaking, so it should use
more CPU than Redis, although the increase should be slight as it only reflects the extra messages passed back
and forth. The results for the two services are similar: Redis averages 8.9% overall CPU usage while
data is streaming, whereas RabbitMQ averages 9.9%. This increase
can be accounted for by the extra load that RabbitMQ has to carry.
With lower file sizes the Redis values are below the overall average, and they increase as the file size
increases. As the Java heap approaches its limit, CPU usage for Redis rises 30% above its 8.9% average, to 11.6%.
This spike in CPU usage shows the stress the system is under to load data onto the Java heap as it
reaches its limit. With RabbitMQ the increase from the average of 9.9% to a spike of 12.7% is 28%. Both
messaging services are therefore consistent with each other as they reach the limit of allowable memory
usage.
Figure 14: 2MB to 50MB measuring % CPU usage while data is streaming from Source to Sink
[Figure 14 chart: % CPU usage (y-axis, 0 to 14) against size of source data in MB (x-axis: 2, 4, 6, 8, 10, 20, 30, 40, 50); series: RabbitMQ and Redis]
The last category tested between the two messaging services was the percentage of system memory used
while data was streaming. Here again both services perform very similarly. For all tests up to
8MB each service uses a constant amount of memory: for RabbitMQ this averages 18.3% of total
memory, and for Redis the value is higher at 20.2% of total memory.
From here an increase can be seen for both message services. For RabbitMQ the increase is from 18.3%
to 21.2% and for Redis it is from 20.2% to 23%. In both cases this is a jump of approximately 3 percentage points
in memory usage. This is easily accounted for: the data being streamed is larger, so more of the
streamed data is held in memory, if only for a brief period.
As the system approaches the breaking point of 50MB the memory usage jumps dramatically. For
RabbitMQ the memory usage jumps to 31% of the overall system, up from a constant 18%
for lower file sizes, a relative increase of 72%. This can be explained
by the increase in handshaking needed to reliably transmit the data as the data size increases. For Redis
the story is similar: the jump is to 29% overall system usage, up from 20% for lower file sizes,
a relative increase of 45%. Memory usage again rises because the
data size increases, but not by as much as for RabbitMQ because there is no handshaking taking place.
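The percentage increases quoted above follow from the usual relative-change formula. A minimal Java sketch, in which the class and method names are hypothetical and the inputs are the rounded values from the text:

```java
/** Hypothetical helper: relative change between two readings. */
final class PercentIncreaseSketch {

    /** Increase from base to value, expressed as a percentage of base. */
    static double percentIncrease(double base, double value) {
        return (value - base) / base * 100.0;
    }

    public static void main(String[] args) {
        // RabbitMQ memory: 18% of system memory rising to 31% (about 72%).
        System.out.printf("RabbitMQ: %.0f%%%n", percentIncrease(18.0, 31.0));
        // Redis memory: 20% of system memory rising to 29% (about 45%).
        System.out.printf("Redis:    %.0f%%%n", percentIncrease(20.0, 29.0));
    }
}
```

A relative increase is not the same as a difference in percentage points: the rise from 18% to 31% is 13 points, but a 72% increase relative to the starting value.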
Figure 15: 2MB to 50MB measuring % memory usage while data is streaming from Source to Sink
[Figure 15 chart: % memory usage (y-axis, 0 to 35) against size of source data in MB (x-axis: 2, 4, 6, 8, 10, 20, 30, 40, 50); series: RabbitMQ and Redis]
Results of streaming 2MB files:
The next data set used for measurement was multiple 2MB files. Only 2MB files
were used, in quantities varying from 1 to 100. Again the time taken to stream these files is linear: as
the number of files increases, so too does the time taken to stream them all. Figure 16 shows a
smaller data set, but the linear relationship is still evident. As in Figure 13, for smaller data sizes no
real difference in time can be measured between the two messaging services under test. One surprising
outcome of this test was that there is no limit to the number of small files that can be streamed.
Figure 16: Measuring time taken for differing quantities of 2MB files to stream from Source to Sink
This is shown in Figure 17, where the source folder for the stream holds total sizes above
50MB, the limit for streaming a single file within this test system. For this test
a maximum folder size of 200MB was used (100 * 2MB), but this could be increased to a value
limited only by the size of the source folder, which in turn depends
on the hard drive of the OS it is running on. As this is really a hardware constraint
rather than a Spring XD limitation, it would not be unreasonable to say that the size of the ingestion
stream is limitless, provided the individual file size is kept low.
[Figure 16 chart: time in seconds (y-axis, 0 to 160) against quantity of 2MB files (x-axis: 1, 2, 3, 4, 5, 10, 20, 25, 30); series: RabbitMQ and Redis]
This is possible because the stream ingests one file at a time: each file within the source folder is
put into a queue and streamed in order, which puts very little stress on the system. Using multiple small
files, there is therefore no limit to the size of the folder that the stream can handle. This is in contrast to Figure 13,
which shows a similar linear relationship to Figure 17 but has an upper limit of 50MB because only one
source file is used.
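The one-file-at-a-time behaviour described above can be illustrated with a short Java sketch. How the Spring XD file source is implemented internally is not covered here, so this is a simplified model under assumptions rather than its actual code; SequentialFolderSketch, Summary and streamFolder are hypothetical names. Each file is read and handed on in turn, so the amount of data held in memory at any moment is bounded by the largest single file rather than by the folder size.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical model of streaming a folder one file at a time. */
final class SequentialFolderSketch {

    /** Total bytes processed, and the most bytes held in memory at any one time. */
    static final class Summary {
        final long totalBytes;
        final long peakBytesInMemory;

        Summary(long totalBytes, long peakBytesInMemory) {
            this.totalBytes = totalBytes;
            this.peakBytesInMemory = peakBytesInMemory;
        }
    }

    static Summary streamFolder(Path folder) {
        // Build the queue of files in a stable (sorted) order.
        List<Path> queue;
        try (Stream<Path> entries = Files.list(folder)) {
            queue = entries.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        long total = 0;
        long peak = 0;
        for (Path file : queue) {
            byte[] payload;
            try {
                payload = Files.readAllBytes(file); // only this one file is loaded
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            // A real stream would pass the payload to the sink at this point.
            total += payload.length;
            peak = Math.max(peak, payload.length);
            // payload is no longer referenced after this iteration and can be garbage collected.
        }
        return new Summary(total, peak);
    }
}
```

For a folder of one hundred 2MB files this model would report a total of about 200MB but a peak of about 2MB, which is consistent with the folder size being unconstrained while the single-file size is not.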
Figure 17: Measuring time taken for differing quantities of 2MB files to stream from Source to Sink
An analysis of the CPU usage when testing multiple 2MB files demonstrates how Spring XD handles
multiple small files in order, one after the other. Figure 18 shows this CPU analysis of the container VM
when dealing with different quantities of 2MB files. A spike in workload approximately every five seconds is
evidence that Spring XD systematically streams one file at a time until all files within the
source folder have been streamed. Figure 18 also shows that the CPU usage drops to zero
and stays there for approximately 3 seconds between spikes. During this 3 second period Spring
XD has loaded the file at the source and is outputting the stream to the container console, which in this
case is what the user sees as visual evidence that the stream is functioning. Once the file has been
streamed to the console the next file is taken, which accounts for the next spike in CPU usage.
[Figure 17 chart: time in seconds (y-axis, 0 to 600) against quantity of 2MB files (x-axis: 1, 2, 3, 4, 5, 10, 20, 25, 30, 40, 50, 60, 75, 100); series: RabbitMQ and Redis]
Figure 18: Measuring % CPU pattern as 2MB files are streaming with Redis as messaging service
Figure 18 shows the system using Redis for message transport. The same tests were carried out using
RabbitMQ, and the results are shown in Figure 19. Figure 18 and Figure 19 show very
similar CPU patterns, both with an average of 7% CPU usage. This demonstrates that both
messaging services put the OS under a similar amount of pressure during general system use. CPU load
should therefore not be a consideration when choosing one messaging service over the other
for a load such as this.
One aspect of the Redis analysis that does stand out is the random spikes in CPU usage. Two
spikes that jump to a value close to 14% CPU usage can clearly be seen in Figure 18. These spikes could
be accounted for by the internal functionality of Redis. Redis uses a database format called RDB. Redis
holds values in memory most of the time but also persists to disk when data becomes too large or at
predefined time intervals [30]. When Redis dumps the data to disk it does not modify the previously
persisted data set in place; instead it writes a complete new snapshot that replaces the previous version.
This causes large spikes in CPU usage, which could account for what can be seen in Figure 18.
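The write-then-replace behaviour can be sketched in Java. Redis itself is written in C and produces its snapshot in a background process, so the code below is only an illustration of the general pattern, with hypothetical names (SnapshotSketch, writeSnapshot): the new snapshot is written in full to a temporary file, which then replaces the previous one.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical illustration of replacing a snapshot file rather than editing it in place. */
final class SnapshotSketch {

    static void writeSnapshot(Path target, String contents) {
        // Write the complete new snapshot next to the old one.
        Path temp = target.resolveSibling(target.getFileName() + ".tmp");
        try {
            Files.write(temp, contents.getBytes(StandardCharsets.UTF_8));
            // Swap the new file into place; the old snapshot is never partially modified.
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the whole data set is rewritten on every snapshot, the cost grows with the amount of data held, which fits the suggestion that snapshotting shows up as intermittent bursts of work.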
These spikes are not seen when using the RabbitMQ messaging service, as shown in Figure 19.
However, initial CPU usage spikes when the Spring XD stream is started are evident. This could be put
down to RabbitMQ being a more robust system and thus needing more resources when establishing
connections on system start-up.
[Figure 18 chart: % CPU usage (y-axis, 0 to 16) against time in seconds (x-axis, 1 to 27); Redis series for 2, 3, 4, 5, 10, 20, 25, 30, 40, 50, 60, 75 and 100 2MB files]
Figure 19: Measuring % CPU pattern as 2MB files are streaming with RabbitMQ as messaging service
For memory usage very similar results can be seen for the two messaging services. Both services
use between 19% and 21% of overall system memory across the wide range of folder
sizes, as shown in Figure 20 below. As the folder size increases, the memory used does not increase
dramatically: the maximum increase is 2 percentage points, which could be described as
insignificant. This again is in contrast to the previous tests using larger file sizes. In Figure 15 it can be seen
that memory usage increases with file size, with a dramatic spike (which could be
crippling to a system) as the size approaches the Java heap threshold. No such spikes are
seen when using multiple smaller files, and thus the system is much more stable and reliable. This
stability is very much in evidence in Figure 18, Figure 19 and Figure 20.
[Figure 19 chart: % CPU usage (y-axis, 0 to 16) against time in seconds (x-axis, 1 to 26); RabbitMQ series for 2, 3, 4, 5, 10, 20, 25, 30, 40, 50, 75 and 100 2MB files]
Figure 20: Measuring % memory usage patterns as 2MB files are streaming
[Figure 20 chart: % memory usage (y-axis, 18 to 21) against number of 2MB files being streamed (x-axis: 2, 3, 4, 5, 10, 20, 25, 30, 40, 50, 60, 75, 100); series: RabbitMQ and Redis]
Results of streaming 10MB files:
Multiple 10MB files were the next data set used for measurement. Only 10MB files
were used, in quantities varying from 2 to 15. Each file added to the source folder yields a large
overall increase in the data being streamed. From the first round of tests it was evident that a single
10MB file was no trouble for a system such as this. What was being tested here was how the stream and the messaging
service handled a queue of these files. Again, from Figure 21 the linear
increase in time as the overall size of the source folder increases is evident. It could be
argued that as the size of the files being streamed increases, i.e. from Figure 16 (2MB files)
to Figure 21 (10MB files), the time difference between the two messaging services also increases.
The increase in the time it takes RabbitMQ to process the same amount of
data in Figure 21 is again significant. For example, at a quantity of 15 files there is a time difference of 6.2% and at 8 files this is
15%. As before, this can be put down to the significant handshaking that takes place for RabbitMQ and
does not take place with Redis.
Figure 21: Measuring time as 10MB files are streaming from Source to Sink
[Figure 21 chart: time in seconds (y-axis, 0 to 90) against number of 10MB files being streamed (x-axis: 2, 3, 4, 5, 8, 10, 15); series: RabbitMQ and Redis]
When performing these tests it was clear the system was under pressure. There were longer delays between
the time the system was initialized and the time data started to stream in the console. The data flow appeared
more erratic and less even. This pressure was most evident in the percentage
CPU usage data. There were no even spikes and hollows every 3 seconds as with the 2MB tests. Instead there was a
steady usage of between 3% and 12% CPU, punctuated by random spikes of up to
40% CPU usage every 4 to 5 seconds. Figure 22 shows the average for each test stream, with each test
showing similar averages for RabbitMQ and Redis. This workload should ideally be avoided to prevent
the unwanted spikes that could cause the system to crash.
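The gap between the averages plotted in Figure 22 and the spikes described above can be illustrated with a short Java sketch. The per-second readings below are hypothetical values chosen to match the described pattern (a low baseline with an occasional spike to 40%); they are not measurements from this project, and the class and method names are likewise illustrative.

```java
/** Hypothetical helper: summarises per-second CPU readings. */
final class CpuSummarySketch {

    static double average(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double sum = 0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    static double max(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double highest = samples[0];
        for (double s : samples) {
            highest = Math.max(highest, s);
        }
        return highest;
    }

    public static void main(String[] args) {
        // Ten hypothetical one-second readings of % CPU usage, with one spike.
        double[] samples = {3, 4, 5, 3, 40, 3, 4, 6, 3, 4};
        System.out.println("Average: " + average(samples)); // prints "Average: 7.5"
        System.out.println("Maximum: " + max(samples));     // prints "Maximum: 40.0"
    }
}
```

An average in single figures can therefore coexist with brief peaks several times higher, which is why the per-second pattern matters as well as the plotted average when judging how close a system is to its limit.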
Figure 22: Measuring % CPU usage as 10MB files are streaming from Source to Sink
[Figure 22 chart: average % CPU usage (y-axis, 0 to 8) against number of 10MB files being streamed (x-axis: 2, 3, 4, 5, 8, 10, 15); series: RabbitMQ and Redis]
For memory usage the system fluctuates between 20% and 27% of total memory. This is
in contrast to the memory levels seen for the 2MB tests, where the values stayed between 19% and 21%
even as the source folder size increased (Figure 20). This is another indication that the system is under
pressure to complete the stream. As seen with the single file tests in Figure 15, the memory limit for
the Java heap is around the 30% mark, which these values are just under. This test is
unlikely to crash like the single file test, as more than 50MB will not be loaded into the Java heap at any
one time. It is likely, however, that a value close to this limit will be, as Redis keeps all data in memory before
persisting it to disk. This causes the CPU spikes seen in Figure 18 and again could cause a system
running close to its limit to crash.
Figure 23: Measuring % memory usage as 10MB files are streaming from Source to Sink
[Figure 23 chart: % memory usage (y-axis, 0 to 30) against number of 10MB files being streamed (x-axis: 2, 3, 4, 5, 8, 10, 15); series: RabbitMQ and Redis]
Results of streaming 20MB files:
The final test was conducted using multiple 20MB files. As 20MB is a sizable file on its own,
multiples of it grow the source folder quite quickly. With that in mind, 15 was the maximum number of
files used during this test. Again RabbitMQ and Redis were tested against each other, with
results similar to those of the previous tests. RabbitMQ again takes longer to process the same
load, and the relationship between time and folder size is linear. Figure 24 demonstrates
this; however, the percentage time difference is not as large as in other tests. For example, with
5 files RabbitMQ showed only a 2.5% increase in time, and with 15 files a 1.7% increase.
This could just be down to small inaccuracies in testing. However, in Figure 13 the
single 20MB file showed a large time difference of 20% between RabbitMQ and Redis. This leads to
the conclusion that Redis slows down when dealing with multiple large files.
Figure 24: Measuring time as 20MB files are streaming from Source to Sink
[Figure 24 chart: time in seconds (y-axis, 0 to 140) against number of 20MB files streaming (x-axis: 2, 3, 5, 10, 15); series: RabbitMQ and Redis]
The percentage CPU usage also increased, if only by a small margin. This can be seen in Figure 25, where
each test gives the average overall CPU value for that test. This test is very similar to the 10MB file test,
with slightly higher values across the board. Another significant difference is that the per-second CPU values
are more erratic, with highs of greater than 30% a regular occurrence. The lows between
files are higher too, with 3% and 5% the usual values seen. This accounts for the higher overall average values seen
in Figure 25. Again, this does not appear to be the kind of load such a system is designed for. As the CPU
usage increased the system reached a critical state again. As 20MB files were streaming the system
struggled to output the data to the console and a “failed to flush writer” error was thrown on the Spring
XD container server. Figure 26 is a screen grab of this error. The error did not prevent the write to the console
from taking place, however, and the stream continued after the error was thrown. This shows the dynamism and
robustness of Spring XD.
Figure 25: Measuring % CPU usage as 20MB files are streaming from Source to Sink
[Figure 25 chart: average % CPU usage (y-axis, 0 to 10) against number of 20MB files streaming (x-axis: 2, 3, 5, 10, 15); series: RabbitMQ and Redis]
Figure 26: Spring XD Container Server Error
Figure 27 shows memory usage for the 20MB test. What is evident is a higher average memory usage
than in any other test conducted, which could also have contributed to the error in Figure 26. The only
tests that come close to these levels are the single file tests with a file size greater than 30MB (Figure 15),
which ultimately led to the Java heap error and system crash. Here the memory usage levels only
exceed the 30% mark as the source folder size approaches 200MB. This total size is still far greater than
is possible using a single file, as the memory gets cleared each time a new 20MB file is streamed. What may be
happening here is that memory is not being cleared fast enough, so that with every file streamed
the memory fills up until the system again crashes.
Figure 27: Measuring % memory usage as 20MB files are streaming from Source to Sink
[Figure 27 chart: % memory usage (y-axis, 0 to 35) against number of 20MB files streaming (x-axis: 2, 3, 5, 10, 15); series: RabbitMQ and Redis]
Conclusion
The initial aim of this project was to study and experiment with different methods of
analyzing a big data source. That goal was accomplished once the highly available Spring XD cluster was
set up and different stream tests were carried out. The four aspects of Spring XD (data ingestion, data
compute, data analytics and data export) were broken down and examined. Examples of these were the
Twitter stream example and streaming EMC log files to an HDFS stream sink that was set up on VM5
within the ESX cluster.
The project then evolved to analyze Spring XD itself, as it was evident that a set of guidelines could be
beneficial to anyone setting up a system such as this. My initial findings during my examination of Spring
XD were used to come up with a plan to create test streams. These tests were carried out in conjunction
with a Java program that was run to capture stream time, system CPU and system memory metrics. In
this way an overview was obtained of what type of load best suited the system and what load stressed or broke
the system.
The main variable within the system was the messaging service, so the only two messaging services
supported at the time of testing, RabbitMQ and Redis, were used. The same tests were carried
out for each messaging service and a comparison was made across different load types.
Of course every test system is different, and the specifications for the one used in this case are listed
under Preparing the VMs. The Java heap limit was left at its default, which was 64MB. This could have been
extended during testing to see what effect this would have, and users of Spring XD should be aware of it.
It will only become a problem if the source folders contain files greater than 30MB, as this is the point
at which the Spring XD container server starts to become stressed. The Spring XD container server reaches
its breaking point at anything greater than 50MB.
It was then found that a perfect load for Spring XD was multiple small files of 2MB. These files
streamed one after the other without difficulty until all had been streamed to the stream sink. The system did
not become stressed and no errors occurred. Further tests were carried out using larger file sizes of 10MB
and 20MB. It was clear that the system was becoming more stressed as the size of the files grew. This
leads to the conclusion that any quantity of files under a size of 10MB would provide the perfect load for a
system such as this. Streams with source files greater than 10MB will function correctly if absolute
limits are adhered to. However, systems with such loads could be in danger of CPU spikes or Java heap
errors, which could lead to data loss or data not being analyzed.
Comparing RabbitMQ and Redis, it is evident that RabbitMQ is the slower messaging service. The time
difference is very slight when using smaller files but grows as the file sizes grow. A decision on which
messaging service to use really comes down to system requirements. If speed is essential and data loss is
acceptable then Redis is the choice. If data integrity is essential then RabbitMQ is the only viable option,
as speed can be made up for in other ways but data loss cannot. For example, the system is horizontally
scalable, so time can be made up by spreading the load across multiple containers on multiple VMs.
An ideal workload of endless small files makes Spring XD fit for purpose. It is a tool that is well suited to
Big Data phenomena like the Internet of Things (IoT) or any system that requires the collection and
analysis of multiple small files.
Future Work
Spring XD version 1.0.0 was used during this project. As of March 27 2015 the latest version of Spring XD
is 1.1.1, which brings with it newly supported sinks and sources and a newly supported messaging service,
Apache Kafka. A further comparison of Kafka against Redis and RabbitMQ could therefore be
carried out.
There are many Spring XD sources and sinks that were not tested due to time constraints. A more
in-depth test plan could be prepared using more stream types. This could be used to compile a more
comprehensive set of data for time taken, % CPU and % memory usage on the system under test.
An enhancement that could also be carried out in the future is extending the Java heap size on the
Spring XD Container Server VMs. Repeating the tests carried out during this project would then help
build a better picture of the absolute maximum file size that a Spring XD container can
stream.
Appendices
Java Code: CPU and Memory
The following Java code was written to output to the command prompt the percentage CPU and
percentage memory being used at that instant. A timestamp is attached to each reading before it is
output to the screen. The “SIGAR” library is first added as a jar file and is used to capture system
information, in this case memory and CPU usage [31].
The full program is as follows:
// Application.java
package ie.cit.msc;

import org.hyperic.sigar.SigarException;

// Entry point: starts the polling thread that prints CPU and memory readings.
public class Application {

    public static void main(String[] args) throws SigarException {
        new Application(args);
    }

    public Application(String[] args) throws SigarException {
        CPUPoller poller = new CPUPoller();
        poller.start();
    }
}

package ie.cit.msc;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.hyperic.sigar.CpuPerc;
import org.hyperic.sigar.Sigar;
import org.hyperic.sigar.SigarException;
import org.hyperic.sigar.cmd.Shell;
import org.hyperic.sigar.Mem;
public class CPUPoller extends Thread {
private Sigar sigar;
private boolean finished = false;
private final SimpleDateFormat format = new SimpleDateFormat("HH:mm:ss:SSS");
![Page 52: An_investigation_into_Spring XD_to_study_methods_of_big_data_analysis](https://reader034.fdocuments.in/reader034/viewer/2022042820/55d03139bb61eb110e8b46b1/html5/thumbnails/52.jpg)
An investigation into Spring XD to study methods of Big Data Analysis
52 Micheál Walsh MSc Cloud Computing
    public CPUPoller() throws SigarException {
        sigar = new Shell().getSigar();
        org.hyperic.sigar.CpuInfo[] infos = sigar.getCpuInfoList();
        org.hyperic.sigar.Mem infoMem = sigar.getMem();
        org.hyperic.sigar.CpuInfo info = infos[0];
        System.out.println("Vendor........." + info.getVendor());
        System.out.println("Model.........." + info.getModel());
        System.out.println("Mhz............" + info.getMhz());
        System.out.println("Total CPUs....." + info.getTotalCores());
        System.out.println("Total Memory..." + infoMem.getRam() + "MB");
    }
    @Override
    public void run() {
        while (!finished) {
            CpuPerc cpu;
            Mem usedmemory;
            try {
                cpu = sigar.getCpuPerc();
                usedmemory = sigar.getMem();
                System.out.println("CPU used....." + format.format(new Date()) + " : " +
                        CpuPerc.format(cpu.getCombined()));
                String totalusedMem = String.format("%.1f",
                        usedmemory.getUsedPercent());
                System.out.println("Mem used....." + format.format(new Date()) + " : " +
                        totalusedMem + "%");
            } catch (SigarException e) {
                e.printStackTrace();
            }
            try {
                // wait one second between readings
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        super.run();
    }
    public void finish() {
        finished = true;
    }
}
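The program above never calls finish(), so the poller runs until the process is killed. The following is a minimal sketch of how a poller of this shape could be run for a fixed period and then stopped cleanly. It is not the code used in this project: the class name LoadPoller is hypothetical, and the JDK's OperatingSystemMXBean system load average is swapped in for the SIGAR readings so that the sketch runs without the SIGAR jar.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical stand-in for CPUPoller that needs no SIGAR jar on the classpath.
class LoadPoller extends Thread {
    // volatile so that finish(), called from another thread, is seen by run()
    private volatile boolean finished = false;
    private final long intervalMillis;

    LoadPoller(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Builds one line in the same "label + time : value" layout as the listing above.
    static String formatReading(String label, Date when, String value) {
        // SimpleDateFormat is not thread-safe, so a new instance is used per call
        return label + new SimpleDateFormat("HH:mm:ss:SSS").format(when) + " : " + value;
    }

    @Override
    public void run() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        while (!finished) {
            // a negative value means the platform does not report a load average
            double load = os.getSystemLoadAverage();
            System.out.println(formatReading("Load avg.....", new Date(),
                    String.format("%.2f", load)));
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // restore the interrupt flag and stop polling
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    void finish() {
        finished = true;
    }

    public static void main(String[] args) throws InterruptedException {
        LoadPoller poller = new LoadPoller(1000);
        poller.start();
        Thread.sleep(5000);   // sample for about five seconds
        poller.finish();      // ask the loop to exit after the current reading
        poller.join();        // wait for the thread to end
    }
}
```

Declaring the flag volatile is what makes the finish() call visible to the polling thread; without it the loop is not guaranteed to see the change.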
Java Code: Start time of Spring XD Stream
This program uses a while loop to poll the source folder of the stream until a file is created in it.
Once this happens the time is output to the screen and the application finishes.
The full program is as follows:
package ie.cit.msc;
import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Date;
public class Application {
    private static final String INPUT_DIR = "/tmp/xd/input/";
    private final SimpleDateFormat format = new SimpleDateFormat("EEE MMM dd HH:mm:ss:SSS zzz yyyy");
    private String nameOfStream = "filetofile1";
    private Date startTime;
    public static void main (String args[]) {
        new Application(args);
    }
    public Application(String[] args) {
        // the stream name is passed as the first command-line argument
        nameOfStream = args[0];
        File inputDirectory = new File(INPUT_DIR + nameOfStream);
        // busy-wait until the first file appears in the source folder
        while (! (inputDirectory.list() != null && inputDirectory.list().length > 0)) {
            // do nothing
        }
        startTime = new Date();
        System.out.println("Processing started at " + format.format(startTime));
        System.exit(0);
    }
}
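The loop above spins without pausing, which keeps one CPU core busy while it waits and could in principle affect the CPU readings taken at the same time. As a hedged alternative, not the code used for the measurements in this project, the sketch below sleeps briefly between checks and gives up after a timeout. The class name StartTimeWatcher, the 1 ms poll interval and the 60 second timeout are all assumptions chosen for illustration.

```java
import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical variant of the start-time program with a sleeping poll loop.
class StartTimeWatcher {

    /**
     * Polls the directory until it contains at least one entry or the timeout elapses.
     * Returns the time at which the first entry was seen, or null on timeout or interrupt.
     */
    static Date waitForFirstFile(File directory, long pollMillis, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() <= deadline) {
            String[] entries = directory.list();   // null if the directory does not exist yet
            if (entries != null && entries.length > 0) {
                return new Date();
            }
            try {
                Thread.sleep(pollMillis);          // yield the CPU between checks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // same default stream name and input folder as the listing above
        String stream = args.length > 0 ? args[0] : "filetofile1";
        File inputDirectory = new File("/tmp/xd/input/" + stream);
        Date start = waitForFirstFile(inputDirectory, 1, 60000);
        if (start == null) {
            System.out.println("No file appeared before the timeout");
        } else {
            SimpleDateFormat format = new SimpleDateFormat("EEE MMM dd HH:mm:ss:SSS zzz yyyy");
            System.out.println("Processing started at " + format.format(start));
        }
    }
}
```

The trade-off is resolution: the reported start time can lag the real one by up to the poll interval, so the interval should be small compared with the durations being measured.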
Preparing the VMs
The operating system used was Red Hat Enterprise Linux Server release 6.6 (Santiago).
Java version 7 or above should be installed:
download the Java JDK .tar.gz
download it to the /usr/java folder
unpack with tar zxvf jdk......tar.gz
check the current version with the command "java -version"
cd into /usr/bin
Back up the old java files by renaming them with the commands
mv java oldjava
mv javac oldjavac
then, still in /usr/bin (not /usr/java), use the following command to create a symbolic link to the
newly unpacked Java 7
sudo ln -s -v /usr/java/jdk1.7.0_60/bin/java java
then check that Java version 7 is available
java -version
The hosts file must also be edited:
cd into /etc
then edit the "hosts" file
after the following line
127.0.0.1 localhost.localdomain localhost
add these 2 lines
127.0.0.1 localhost
127.0.1.1 "name of VM"
where, for example, "qecorkc85nfvm2" would be used as the name of the VM.
The following steps add Java home and XD home to the global environment:
cd to /etc then edit the "environment" file
Add the following text to the empty file
export JAVA_HOME=/usr/java/jre1.7.0_67
export XD_HOME=/root/spring-xd-1.0.0.M4/xd
restart the VM for the changes to take effect
Zookeeper
Within the servers.yml file the following needs to be set in each VM in the cluster for Zookeeper to be
supported:
#Zookeeper properties
zk:
  client:
    connect: 10.73.18.165:2181
Zookeeper-3.4.6 was installed. The following commands could be used, provided a similar Linux OS and
version [32]:
wget zookeeper_link_address.tar.gz
tar xvzf zookeeper-3.4.6.tar.gz
yum install zookeeper-3.4.6
cd into zookeeper-3.4.6/bin and run the following command
./zkServer.sh start
Staying in zookeeper-3.4.6/bin, connect to the Zookeeper command line using the command
./zkCli.sh
Show all top-level nodes
ls /
Show all the information Zookeeper holds on Spring XD
ls /xd
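Before starting the Spring XD containers it can be useful to confirm from each VM that the Zookeeper address set in servers.yml is reachable. The sketch below is one possible check and was not part of this project. It assumes the server answers the "ruok" four-letter command with "imok", as Zookeeper 3.4.x does by default; the class name ZkProbe and the 2 second timeout are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical reachability check for the Zookeeper server named in servers.yml.
class ZkProbe {

    // Writes "ruok" and returns whatever the peer answers (at most four bytes).
    static String ask(InputStream in, OutputStream out) throws IOException {
        out.write("ruok".getBytes(StandardCharsets.US_ASCII));
        out.flush();
        byte[] buffer = new byte[4];
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break;   // peer closed the connection early
            }
            total += n;
        }
        return new String(buffer, 0, total, StandardCharsets.US_ASCII);
    }

    // Returns true only if a connection is made and the reply is exactly "imok".
    static boolean isOk(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis);
            return "imok".equals(ask(socket.getInputStream(), socket.getOutputStream()));
        } catch (IOException e) {
            return false;   // refused, timed out or unreachable
        }
    }

    public static void main(String[] args) {
        // address taken from the zk.client.connect value shown above
        boolean ok = isOk("10.73.18.165", 2181, 2000);
        System.out.println(ok ? "Zookeeper answered imok" : "No healthy answer from Zookeeper");
    }
}
```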
Redis
Within the servers.yml file the following needs to be set in each VM in the cluster for Redis to be supported:
# Redis properties
spring:
  redis:
    port: 6379
    host: 10.73.18.165
Redis 2.8.16 was installed. The following commands could be used, provided a similar Linux OS and
version:
yum update
yum install make gcc wget
now try going to redis/bin and running
./install-redis
If this does not install Redis the following could also be attempted:
cd to redis/dist
tar xvzf redis.tar.gz
cd to redis/deps
run make for all four dependencies in here
then cd .. back to redis and run
make install
To start the server after the install, cd into redis/src and type
./redis-server
Check that the installation was successful by listing the keys in the message queue:
cd into redis/src and type
./redis-cli
keys *
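The same check can be made from Java on a container VM, which confirms that the host and port set in servers.yml are reachable over the network and not only from the Redis VM itself. The following is a minimal sketch under the assumption that the server accepts an inline PING and replies "+PONG"; any other reply, for example an error from a password-protected server, is treated as a failure. The class name RedisPing is hypothetical and no Redis client library is used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical reachability check for the Redis server named in servers.yml.
class RedisPing {

    // Sends an inline PING and reports whether the first reply line is "+PONG".
    static boolean pong(InputStream in, OutputStream out) throws IOException {
        out.write("PING\r\n".getBytes(StandardCharsets.US_ASCII));
        out.flush();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.US_ASCII));
        return "+PONG".equals(reader.readLine());   // readLine() strips the trailing CRLF
    }

    static boolean ping(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis);
            return pong(socket.getInputStream(), socket.getOutputStream());
        } catch (IOException e) {
            return false;   // refused, timed out or unreachable
        }
    }

    public static void main(String[] args) {
        // host and port taken from the spring.redis values shown above
        System.out.println(ping("10.73.18.165", 6379, 2000) ? "Redis is reachable" : "Redis did not answer");
    }
}
```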
RabbitMQ [33]
Within the servers.yml file the following needs to be set in each VM in the cluster for RabbitMQ to be
supported:
# RabbitMQ properties
spring:
  rabbitmq:
    host: localhost
    port: 5672
    username: guest
    password: guest
    virtual_host: /
The Erlang package needs to be installed
wget http://packages.erlang-solutions.com/erlang-solutions-1.0-1.noarch.rpm
yum install erlang
Install VMware tools
cd vmware-tools-distrib/
./vmware-install.pl
Install the RabbitMQ server
rpm --import http://www.rabbitmq.com/rabbitmq-signing-key-public.asc
wget https://www.rabbitmq.com/releases/rabbitmq-server/v3.4.4/rabbitmq-server-generic-unix-3.4.4.tar.gz
tar xvzf rabbitmq-server-generic-unix-3.4.4.tar.gz
yum install rabbitmq-server-3.4.4-1.noarch.rpm
MySQL
Within the servers.yml file the following needs to be set in each VM in the cluster for MySQL to be
supported:
#Config for use with MySQL
spring:
  datasource:
    url: jdbc:mysql://10.73.18.165:3306/xdjobs
    username: xduser
    password: pass1
    driverClassName: com.mysql.jdbc.Driver
MySQL was installed. The following commands could be used, provided a similar Linux OS and version:
yum install mysql-server mysql
To start the MySQL database use the command
mysqld --user=root &
To set up a password for the root user use [34]
mysqladmin -u root password "enter password here"
To check that MySQL is installed correctly use
mysql -u xduser -p
"enter password here"
This opens the mysql command line. To view the available databases use
show databases;
xdjobs should be one of the databases, so change to this database using the command
use xdjobs;
To now see all the tables available inside the database use
show tables;
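The table listing can also be done through JDBC, which exercises the same URL, username and password that Spring XD reads from servers.yml. The sketch below is an illustration only and was not part of this project. It assumes the MySQL JDBC driver jar is on the classpath at run time; without it, or if the server cannot be reached, the method prints the error and returns an empty list. The class name XdJobsTables is hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical JDBC equivalent of "use xdjobs; show tables;".
class XdJobsTables {

    // Returns the table names of the database named in the URL, or an empty list on any SQL error.
    static List<String> list(String url, String user, String password) {
        List<String> tables = new ArrayList<String>();
        try (Connection connection = DriverManager.getConnection(url, user, password);
             ResultSet rs = connection.getMetaData()
                     .getTables(connection.getCatalog(), null, "%", new String[] {"TABLE"})) {
            while (rs.next()) {
                tables.add(rs.getString("TABLE_NAME"));
            }
        } catch (SQLException e) {
            // covers a missing driver, an unreachable server and a failed login
            System.err.println("Could not list tables: " + e.getMessage());
        }
        return tables;
    }

    public static void main(String[] args) {
        // values taken from the spring.datasource settings shown above
        List<String> tables = list("jdbc:mysql://10.73.18.165:3306/xdjobs", "xduser", "pass1");
        for (String table : tables) {
            System.out.println(table);
        }
    }
}
```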
Hadoop
Within the servers.yml file the following needs to be set in each VM in the cluster for Hadoop to be
supported:
# Hadoop properties
spring:
  hadoop:
    fsUri: hdfs://10.73.18.167:9000
Hadoop 2.4.1 was downloaded and unpacked, and the SSH public key was added to the authorized keys,
with the following commands:
wget http://mirror.nexcess.net/apache/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
tar -zxvf hadoop-2.4.1.tar.gz
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Make sure the /etc/environment file is pointing at the correct location for each given path
export JAVA_HOME=…/java/latest
export HADOOP_HOME=…/hadoop-2.4.1
export HADOOP_PREFIX=…/hadoop-2.4.1
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_PREFIX/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
Check that these are set with the echo command
echo $JAVA_HOME or echo $HADOOP_HOME
In hadoop-2.4.1/etc/hadoop edit the core-site.xml file [35]
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://10.73.18.167:9000</value>
  </property>
</configuration>
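The fs.defaultFS value in core-site.xml has to agree with the fsUri value set in servers.yml, otherwise Spring XD will write to a different address from the one the name node listens on. The sketch below shows one way this could be checked using only the JDK's XML parser. It is an illustration and was not part of this project; the class name SiteXml and the method name property are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader for Hadoop-style <configuration> files such as core-site.xml.
class SiteXml {

    // Returns the <value> of the <property> whose <name> matches, or null if it is absent.
    static String property(String xml, String wanted) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                if (wanted.equals(text(property, "name"))) {
                    return text(property, "value");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable configuration file", e);
        }
    }

    // Text of the first child element with the given tag, or null if there is none.
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String coreSite = "<configuration><property><name>fs.defaultFS</name>"
                + "<value>hdfs://10.73.18.167:9000</value></property></configuration>";
        String fsUri = "hdfs://10.73.18.167:9000";   // the value set in servers.yml
        System.out.println(fsUri.equals(property(coreSite, "fs.defaultFS"))
                ? "core-site.xml and servers.yml agree"
                : "core-site.xml and servers.yml differ");
    }
}
```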
In hadoop-2.4.1/etc/hadoop edit the hdfs-site.xml file
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data</value>
  </property>
</configuration>
In hadoop-2.4.1/etc/hadoop leave the mapred-site.xml and yarn-site.xml files empty as follows
<configuration>
</configuration>
Next format the name node with the following command
bin/hdfs namenode -format
Next start the Hadoop cluster
sbin/start-dfs.sh
If the cluster does not start the first time, the following may be wrong:
Check that the /etc/hosts file has the IP addresses of the local machine
127.0.0.1 localhost.localdomain localhost
127.0.0.1 localhost
127.0.1.1 xxxorkc85xxvm5
The following three commands check for corrupted files and disable safe mode
bin/hadoop fsck / -blocks -locations -files
bin/hdfs dfsadmin fsck
bin/hdfs dfsadmin -safemode leave
The user can also look at the files in the cluster
bin/hdfs dfs -ls /
Or add to the cluster manually
bin/hdfs dfs -mkdir /xd
References
[1] http://bigthink.com/think-tank/the-internet-of-things-meets-big-data
[2] http://pivotal.io/platform-as-a-service/press-release/cloud-foundry-foundation-adds-swisscom-to-roster
[3] https://spring.io/projects
[4] http://projects.spring.io/spring-xd/
[5] https://zookeeper.apache.org/
[6] http://www.slideshare.net/SpringCentral/spring-xd-guided-tour?related=1 (slide 11)
[7] https://github.com/EsotericSoftware/kryo/wiki/Documentation-for-Kryo-version-1.x
[8] http://www.slideshare.net/SpringCentral/spring-xd-guided-tour?related=1 (slide 12)
[9] http://docs.spring.io/spring-xd/docs/current/reference/html/#_start_the_runtime_and_the_xd_shell
[10] http://docs.spring.io/spring-xd/docs/0.1.x-SNAPSHOT/reference/html/running-distributed-mode.html
[11] http://www.infoq.com/articles/introducing-spring-xd
[12] Jason Bell, Machine Learning: Hands-On for Developers and Technical Professionals, pp. 201-202
[13] http://docs.spring.io/spring-xd/docs/0.1.x-SNAPSHOT/reference/html/architecture.html
[14] http://projects.spring.io/spring-xd/
[15] http://www.slideshare.net/SpringCentral/spring-xd-guided-tour?related=1 (slide 20)
[16] http://docs.spring.io/spring-xd/docs/current/reference/html/#sources
[17] https://storm.apache.org/documentation/Setting-up-a-Storm-cluster.html
[18] http://hortonworks.com/hadoop/yarn/
[19] https://nathanmarz.github.io/storm/doc/backtype/storm/spout/ISpout.html
[20] http://hortonworks.com/hadoop-tutorial/simulating-transporting-realtime-events-stream-apache-kafka/
[21] https://storm.apache.org/about/simple-api.html
[22] http://www.accenture.com/us-en/blogs/technology-blog/archive/2014/04/28/the-right-big-data-technology-for-smart-grid-distributed-stream-computing.aspx
[23] http://www.cs.duke.edu/~kmoses/cps516/dstream.html
[24] http://www.csdn.net/article/2014-01-27/2818282-Spark-Streaming-big-data
[25] http://stanford.edu/~rezab/sparkclass/slides/td_streaming.pdf
[26] http://databricks.com/blog/2014/08/14/mining-graph-data-with-spark-at-alibaba-taobao.html
[27] http://planetcassandra.org/redis-to-cassandra-migration/
[28] http://www.quora.com/What-are-the-advantages-and-disadvantages-of-Beanstalkd-as-a-work-queue
[29] https://plumbr.eu/outofmemoryerror/java-heap-space
[30] http://redis.io/topics/persistence
[31] http://lizhouwangnotes.blogspot.ie/2011/08/use-sigar-api-in-java-to-capture-system.html
[32] http://www.cloudera.com/content/cloudera/en/documentation/cdh4/v4-5-0/CDH4-Installation-Guide/cdh4ig_topic_21_3.html
[33] http://www.rabbitmq.com/install-rpm.html
[34] http://www.cyberciti.biz/faq/how-to-install-mysql-under-rhel/
[35] http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/