
An investigation into Spring XD to study methods of Big Data Analysis

Micheál Walsh, MSc Cloud Computing


By

Micheál Walsh

16/04/2015



Acknowledgements

I would first like to thank my Mom (Maureen) and my Dad (Edward) for all their support in any endeavor

that I wish to accomplish in life.

I would like to thank all my friends who provided guidance and support, both technical and emotional, over the past two years, especially Declan Tarrant, Keith Lee, John Ryan, Eoin Johnston, Steve Carter and David Hallissey.

I would especially like to thank my project supervisor Donna O’Shea, and Eugene Bell, who were always available to talk through any problems that were encountered.

I would like to thank the management at EMC who encourage employees to better themselves by

providing the financial support required to complete an MSc such as this.

I would also like to thank Brian Casey and Don ORegan for providing the facilities needed to complete

this project. Without their support the technical detail accomplished within this project would not have

been possible.



Table of Contents

Acknowledgements
Introduction
Architecture
  Spring XD Servers
  Stream components
  Spring XD Message flow
  Message Bus
  Application Context Hierarchy
  Building the Test System
    XD Single mode System
    XD Distributed System
    Functionality of the DSL: Twitter Stream Example
    Storing DSL commands
  Preparing the VMs
Analysis
  Analysis of Spring XD functionality
    Unified
    Distributed Runtime
    Extensible
    Data ingestion
    Real time Analytics
    Batch Processing
    Data export
  Apache Storm
  Spark Streaming
Results
  The system under test
  The test process
  Data Limit
  Results 2MB to 50MB
  Results of streaming 2MB files
  Results of streaming 10MB files
  Results of streaming 20MB files
Conclusion
Appendices
  Java Code: CPU and Memory
  Java Code: Start time of Spring XD Stream
  Zookeeper
  Redis
  Rabbitmq
  MySQL
  Hadoop
References



Introduction

Data volumes are growing exponentially year on year, and IT departments are being tasked with organizing this data. As more resources are placed at the disposal of IT architects, expectations also mount to deliver a system that will ingest, organize and analyze any amount of data. Speed adds further pressure, because this must be done before the data becomes irrelevant or out of date. Just a few years ago this type of system was designed and programmed on a project by project basis, and the approach varied from company to company. As data volumes increased so rapidly, software developers needed to change their approach, because it was becoming clear that existing solutions could not handle the differing data sources or the volume of data. Data management and speed have been put forward as the most important reasons for moving to the cloud. This is only true if the correct software architecture is in place to meet the needs of the user. Most open source software solutions need adjusting by skilled programmers to meet the needs of a small to medium business. As this sector is at the cutting edge, finding such skill sets is difficult, very expensive, or both.

Expense is something major data companies do not lose sleep over, and large companies have pioneered solutions to big data problems that best suit their own products. The Internet of Things is something every large multinational big data company needs to be ready for [1]. That is why most companies are developing their own solutions to best suit their needs. Some of these companies, for example Pivotal, Cloudera and Hortonworks, release open source versions of their software with basic functionality. This is a good way to get more customers to try their products and to increase word of mouth. A more feature rich version of the same product is sold under a software license. This exclusive version is marketed with a name like “GOLD” or “PLATINUM” and comes with telephone support for any issues during installation or day to day running of the product [2]. However, these licenses can be expensive and are outside the budgets of smaller companies.

In response, many open source projects were set up [3]. The purpose of these open source projects was to automate solutions to the most common big data architectural problems. A high level of Spring/Java knowledge is still needed, but as a result of automation the setup and management become more intuitive. Spring open source projects bring more reliable big data solutions to a wider user base, and in less time than ever before. Spring XD is one such project; it is designed to run on a single machine or to scale across a cluster, with the ability to grow almost limitlessly [4].



As big data system design is still a very niche market, not many studies have been carried out into projects like Spring XD. Because new releases of Spring XD appear periodically, system testing is for the most part left to the beta user. Online forums are a good indication of whether a software package is being used by the general public, and by that measure Spring XD appears to be growing steadily in popularity. But as the user base grows so too do the variations in use cases, which ultimately leads to bug discovery and lost time for the user. Every system is different, with differing inputs (sources), requirements (analysis) and outputs (sinks). The answer to this problem is a set of guidelines that will not be accurate for every system but can help steer architects in the right direction when designing systems that use Spring XD.

Spring XD is a software tool that has been created to help solve common big data input/output problems. In essence it moves data from point A to point B, with the option of analyzing or editing the data en route. Spring XD supports common big data features such as horizontal scalability and real time analytics. It does this by building on established open source projects such as Apache Zookeeper [5], Apache Hadoop and Redis within its infrastructure. These projects are known to be reliable from years of usage and are used by Spring XD to coordinate data across VM clusters, with the goal of scaling up the amount of data being processed.



Architecture

Spring XD Servers

The two fundamental parts of Spring XD are the Admin and Container servers. Once the Admin server is running, a shell console can be opened that sends commands via HTTP to that Admin server. Spring XD comes in two flavors: single mode and distributed mode. In single mode the Admin and Container servers run on a single machine. In distributed mode the Admin server can have Container servers on the same machine, but Container servers can also be spun up on multiple machines within the cluster [11].

Once the Admin server and shell console are running, commands can be submitted via HTTP from the shell console. These commands are written in a Domain Specific Language (DSL) and comprise the instructions for the Admin server to build a data stream. Each stream is made up of modules, and each module has an input and an output. For the Admin server to accept a stream as valid, two modules must be present: a Source module and a Sink module. It is also possible to add multiple Processor modules between the Source and Sink modules.
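The source and sink rule can be illustrated with a short Java sketch. It is not Spring XD's own parser: the class name StreamDefinitionSketch and its methods are hypothetical, and the sketch assumes a definition is a plain pipe-separated list of module names with no quoted pipes, options containing pipes, or named channels.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: this class is not part of Spring XD.
final class StreamDefinitionSketch {

    // Splits a definition such as "time | log" into trimmed module names.
    static List<String> modules(String definition) {
        List<String> result = new ArrayList<>();
        for (String part : definition.split("\\|")) {
            result.add(part.trim());
        }
        return result;
    }

    // Valid when there is at least a source (first) and a sink (last),
    // and no segment between pipes is blank.
    static boolean isValid(String definition) {
        List<String> names = modules(definition);
        return names.size() >= 2 && !names.contains("");
    }

    // Everything between the source and the sink is treated as a processor.
    static int processorCount(String definition) {
        return Math.max(0, modules(definition).size() - 2);
    }

    public static void main(String[] args) {
        System.out.println(isValid("time | log"));
        System.out.println(isValid("log"));
        System.out.println(processorCount("http | filter | transform | hdfs"));
    }
}
```

In this sketch the first module is taken to be the source and the last the sink, so a definition with a single module is rejected.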

The Admin server accepts the stream as a sequence of modules and gives each module a definition. Each module definition is in turn assigned to a container within the cluster. Once the container accepts the module, the module is deployed within that container. This creates a Spring Application Context for each module, as shown in Figure 1. Zookeeper keeps track of where all the module definitions are and of what state each container is in. If, for example, a container stops communicating with Zookeeper, its modules are reassigned to a different container on a different machine.



[Figure 1 diagram, labels: XD Admin; ZooKeeper; two XD Containers, one holding a Source (output, outbound adapter) and one holding a Sink (input, inbound adapter), each inside a Spring Application Context; Transport Provider (Redis, Rabbitmq).]

Figure 1: Spring XD Architectural Components

Stream components

A Spring XD stream is made up of modules. Sources and sinks are modules, as are XD processors and jobs. Each stream must have at least a source module and a sink module, but can also have 0-n processor modules in between. Modules are basic reusable Spring Integration message flows.

[Figure 2 diagram, labels: Producer sends Message; Message Channel; Consumer receives Message.]

Figure 2: Spring Integration message flow process

A simple example is a stream that outputs the time every second. The following DSL command creates and deploys it:

stream create --definition "time | log" --name ticktock --deploy

This DSL command has no processor module, only the minimum source and sink modules. The name of the stream is defined as “ticktock”. The source module is the predefined Spring XD “time” source, which takes the current time from a Java time method and sends it as a string to the input of the sink module approximately every second. Once the sink module receives this string, it simply writes it to the container console.

A Processor is a module that does something to the streaming data; for example, it can transform, split or filter data as it travels from source to sink. A stream can contain zero, one or many processor modules, and the processor actions are carried out in the order in which they appear in the DSL stream definition. Each module has a data input and a data output, and the output of one module can be the input of the next. The Source module takes data from one of a predefined list of Spring XD supported sources, for example a file, an HTTP source or an external RabbitMQ message queue. The Sink module is similar, except that it takes its input from the output of either the Source module or a processor module, and its output goes to one of a predefined list of Spring XD supported sinks. For modules to communicate with each other in this way a messaging service is required. Spring XD currently supports the open source projects Redis and RabbitMQ, with plans to add Apache Kafka in a future release.



Spring XD Message flow

[Figure 3 diagram, labels: XD Shell; stream submission M1 | M2 | M3 | M4; XD Admin; ZooKeeper; XD Containers with modules, spread across VMs; Data Source; Data Sink (HDFS, console, file, etc.); Redis or RabbitMQ; arrows numbered 1 to 8.]

Figure 3: An example of a Spring XD message flow system

The following steps cover the user's interaction with the system under test, the flow of messages between modules, and how the user controls the number of modules being deployed. The steps assume that the test system is up, that the Admin server and all Container servers are running, and that the XD Shell is connected to the Admin server and ready to accept commands.



1. The user types the stream command into the XD shell and deploys the stream. The stream used in this example diagram has four modules: “Source|processor|processor|Sink” or “M1|M2|M3|M4”.

2. The XD Admin Server creates each module in the stream and makes it available for deployment to a container.

3. Zookeeper takes over the process of finding a container for each module. Zookeeper randomly assigns the modules to containers that are available for work; it then keeps track of where each module is and its place in the stream.

4. The messaging service (Redis or RabbitMQ) is used to communicate between modules. It is also used to bring data from the Data Source and to the Data Sink location, so even though these locations or modules may be physically located on different VMs they can still communicate and move data between each other. In step 4 specifically, Redis or RabbitMQ is used to fetch data from the data Source specified in the stream command and pass it to Module one.

5. Redis or RabbitMQ is used to pass data from Module one to Module two.

6. Redis or RabbitMQ is used to pass data from Module two to Module three.

7. Redis or RabbitMQ is used to pass data from Module three to Module four.

8. Redis or RabbitMQ is used to pass data from Module four to whatever data Sink was specified in the stream command. A commonly used Sink on big data systems is the Hadoop Distributed File System (HDFS). On the system used for testing during this project, an HDFS instance was installed on a separate VM in the cluster, and using it as the Sink worked seamlessly.
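The hand-off between modules in steps 4 to 8 can be pictured with a small Java sketch. It is an illustration under loud assumptions and not Spring XD code: in-memory queues stand in for the Redis or RabbitMQ transport, threads stand in for containers, and the class name MessageFlowSketch, its methods and the two example processors (trim, upper-case) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

// Hypothetical illustration only: queues stand in for the transport,
// threads stand in for containers. This is not how Spring XD is implemented.
final class MessageFlowSketch {

    // Unique marker object that closes the stream; compared by identity.
    private static final String END = new String("END");

    // Starts one "module" that reads from in, applies step, and writes to out.
    static Thread startModule(BlockingQueue<String> in, BlockingQueue<String> out,
                              UnaryOperator<String> step) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String message = in.take();
                    if (message == END) {
                        out.put(END);      // pass the marker downstream and stop
                        return;
                    }
                    out.put(step.apply(message));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    // M1 (source) feeds M2 and M3 (processors); M4 (sink) collects the results.
    static List<String> run(List<String> sourceData) {
        BlockingQueue<String> q1 = new LinkedBlockingQueue<>(); // M1 -> M2
        BlockingQueue<String> q2 = new LinkedBlockingQueue<>(); // M2 -> M3
        BlockingQueue<String> q3 = new LinkedBlockingQueue<>(); // M3 -> M4
        startModule(q1, q2, String::trim);         // M2
        startModule(q2, q3, String::toUpperCase);  // M3
        try {
            for (String line : sourceData) {       // M1
                q1.put(line);
            }
            q1.put(END);
            List<String> sink = new ArrayList<>(); // M4
            for (String m = q3.take(); m != END; m = q3.take()) {
                sink.add(m);
            }
            return sink;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while streaming", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(" alpha ", " beta ")));
    }
}
```

Because each stage only knows the queue it reads from and the queue it writes to, the stages could in principle run on different machines, which is the property the external transport provides in the real system.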



Message Bus [6]

Stream plugins use the message bus to bind a module's input and output channels to the transport of choice. When a module is deployed as part of a stream, the stream plugin invokes the message bus to bind the module's output channel, which is typically a direct channel defined in the module's application context as part of its Spring Integration configuration. Within the module's XML file the user defines a channel with the ID “input” and a channel with the ID “output”; the flow between them is handled under the covers by Spring Integration code. The stream plugin queries the module for a component of type message channel named “input”, creates an adapter for it on the fly and binds it to a RabbitMQ or Redis queue.

Stream plugins are also responsible for binding “tap” points and named channels associated with the stream, so a stream can be tapped at any point before any module. If the user taps the source, they receive a copy of every message entering the stream, which can be used for real time analysis. It is also possible to tap the stream after some transformation has happened. Tap points are in fact named channels in Spring XD, and a stream can itself be created with a named channel, for example http > queue:foo, where foo is the named channel. All of these are bound by the message bus to the transport.

The message bus also takes care of marshalling. When a message travels over a remote transport such as RabbitMQ or Redis and the payload is a POJO, it may need to be serialized. The message bus includes an optimization here: if the data is already a byte array, no transformation takes place; if the data is a Java object, it must be serialized, and Kryo [7] can be used for this. Kryo is the default serialization tool in Spring XD. At run time the Admin server also uses the message bus to launch jobs: the Admin sends a message to the job channel, which is polling for messages, and once the message is received the job is kicked off. The message bus is therefore a component shared by the Admin Server and the Container Servers.
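The byte-array shortcut described above can be sketched in Java as follows. This is an assumption-laden illustration rather than the message bus implementation: standard Java serialization is used in place of Kryo so that the example has no external dependencies, and the class name PayloadEncodingSketch and its method are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical illustration only. Java serialization replaces Kryo here
// purely to keep the sketch dependency-free.
final class PayloadEncodingSketch {

    // Returns the bytes that would be handed to the remote transport.
    static byte[] toTransportBytes(Object payload) {
        if (payload instanceof byte[]) {
            return (byte[]) payload;       // already bytes: pass through untouched
        }
        if (!(payload instanceof Serializable)) {
            throw new IllegalArgumentException("payload cannot be serialized");
        }
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(payload);  // object payload: serialize first
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = {1, 2, 3};
        System.out.println(toTransportBytes(raw) == raw);
        System.out.println(toTransportBytes("hello").length > 0);
    }
}
```

The point of the shortcut is that payloads which are already bytes, such as file contents, avoid a serialization step on every hop between modules.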

It is clear that the bulk of the work for the user lies in the setup and configuration of a system such as this. What is missing is a specification that outlines the data delay between source and sink, or the impact on a machine as the load increases. This paper aims to provide a set of data that can be used as a guide when designing big data systems in which Spring XD is to be used.



Application Context Hierarchy [8]

In Spring XD the application context itself is designed to be extensible. Spring XD is a runtime and not a framework, so when building with XD the user does not build the application context directly as in a traditional Spring application. Instead the application context is created using Spring Boot, which does the work under the covers. For this to work there has to be some way of linking Spring configuration, located in a highly visible place, to the application context; in this way any kind of Spring application or Spring beans can be added. This is done through a combination of component scanning and appliances that look for certain types of components in specific locations, which then get added to the application context. For example, if a GemFire XD cache were used, special modules could be defined and accepted by Spring XD that could then interact with the new GemFire cache.

The focus is on Spring XD because no major study has been undertaken into its performance metrics, and the purpose of this paper is to remedy this. All of the main elements of the Spring XD system therefore needed to be pushed to breaking point, recording the variables that influenced where this breaking point was and which external factors were relevant. The measurable data metrics include:

Time for data to stream from Source to Sink

CPU usage as data is streaming

Memory usage as data is streaming

All of the measured data is subject to the system that is used for testing. The system variables include the CPU make and speed, how many VMs were used and how much memory was given to each VM.
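Metrics of this kind can be sampled from inside a JVM using the standard java.lang.management API. The following is a minimal sketch and not the measurement code used for the tests in this paper; the class name ResourceSampleSketch and its methods are hypothetical, and the load average is only a rough proxy for CPU usage (it is reported as a negative value on platforms where it is unavailable).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical illustration only: not the code from the appendices.
final class ResourceSampleSketch {

    // Heap currently used by this JVM, in bytes.
    static long usedHeapBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return memory.getHeapMemoryUsage().getUsed();
    }

    // System load average over the last minute; negative if not available.
    static double systemLoadAverage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return os.getSystemLoadAverage();
    }

    // Converts two System.nanoTime() readings into elapsed milliseconds.
    static long elapsedMillis(long startNanos, long endNanos) {
        return (endNanos - startNanos) / 1_000_000L;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(100);                 // stands in for the stream under test
        long end = System.nanoTime();
        System.out.println("elapsed ms: " + elapsedMillis(start, end));
        System.out.println("used heap bytes: " + usedHeapBytes());
        System.out.println("load average: " + systemLoadAverage());
    }
}
```

Sampling before, during and after a stream runs gives the three kinds of figures listed above: elapsed time, a CPU indicator and memory usage.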

Building the Test System

The Spring XD cluster was installed from the ground up, adding new components piece by piece. A VMware ESX hypervisor was installed on a dedicated bare metal server. A VM was spun up and the first step was to install Spring XD. At this point tests were carried out on a Spring XD single mode server, and a Twitter stream test was run to ensure the system was functioning correctly.



XD Single mode System [9]

[Figure 4 diagram, labels: HTTP POST of Data Processing DSL; XD Admin; XD Container; Modules.]

Figure 4: Spring XD Single node architecture

Spring XD single node is a version of XD that works within the confines of a single machine and is usually used for testing or for small workloads. Once the single mode Admin server is running, the XD shell can be opened. The user has the option of starting one container instance that deploys multiple modules, or starting multiple container instances that each hold one module. Once a container instance is running, the user can use the XD shell to send stream commands, written in the DSL, to the XD Admin server via HTTP. A pluggable messaging service is not required for a single mode setup because the default memory store is used.



XD Distributed System [10]

[Figure 5 diagram, labels: HTTP POST of Data Processing DSL; XD Admin (VM1); XD Containers with Modules (VM2, VM3); Zookeeper, Redis/Rabbitmq, MySQL and an XD Container (VM4); Hadoop and an XD Container (VM5).]

Figure 5: Spring XD distributed architecture

The Virtual Machine cluster was set up by adding four more Red Hat Enterprise Linux VMs to the ESX Server. The following outlines the design of the cluster, which is also illustrated in Figure 5.

VM1: Spring XD Admin server

VM2: Spring XD Container

VM3: Spring XD Container

VM4: Services and Spring XD Container

o Redis / RabbitMQ

o Zookeeper

o SQL Datastore

VM5: Hadoop and Spring XD Container

Spring XD was installed on each VM, which allowed an XD Container server or XD Admin server to be spun up on any of the VMs in the cluster. Once the services were installed on VM4, the xd/config/servers.yml file needed to be updated on each VM. Once the system is up and running the servers.yml file should be identical on every VM. It contained the IP address of the VM on which each application server (for example Redis or Zookeeper) was running, along with the port number and any other setup information used by that specific application. This allowed communication across the cluster between the Admin server and each of the big data application servers required by Spring XD.



Once a stream is deployed in a distributed system, each module of the stream is assigned to a random container within the cluster. The modules communicate via a pluggable message bus. Redis is the default messaging service but RabbitMQ is recommended: the advantage of Redis is its speed, but RabbitMQ is more reliable, with greater handshaking of transactions and persistent queuing.

A major design advantage of Spring XD is that every module has its own application context. This allows the user to apply two different configurations while streaming the same information on two different streams. This is the type of freedom required for big data sources that need to be fed into different systems simultaneously.

A “Tap” is a type of stream that listens to an existing stream and copies specific data from it to create another stream. This causes no ill effects to the original stream.
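The idea of a tap can be shown with a small Java sketch in the style of a wire tap. It is only an illustration: the class name TapSketch and its methods are hypothetical, and Spring XD itself implements taps through named channels on the message bus rather than through in-process callbacks like these.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration only: an in-process stand-in for a tapped stream.
final class TapSketch {

    private final Consumer<String> primarySink;
    private final List<Consumer<String>> taps = new ArrayList<>();

    TapSketch(Consumer<String> primarySink) {
        this.primarySink = primarySink;
    }

    // Registers a listener that receives a copy of every message.
    void addTap(Consumer<String> tap) {
        taps.add(tap);
    }

    // Delivers to the original sink first, then copies to each tap.
    void send(String message) {
        primarySink.accept(message);
        for (Consumer<String> tap : taps) {
            try {
                tap.accept(message);
            } catch (RuntimeException e) {
                // A failing tap is ignored so the original stream is unaffected.
            }
        }
    }

    public static void main(String[] args) {
        List<String> logged = new ArrayList<>();
        List<String> counted = new ArrayList<>();
        TapSketch stream = new TapSketch(logged::add);
        stream.addTap(counted::add);
        stream.send("tweet-1");
        System.out.println(logged + " " + counted);
    }
}
```

The original sink is served before any tap, and a tap that throws does not stop delivery, which mirrors the claim that tapping causes no ill effects to the original stream.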

To demonstrate the flexibility of modules and streams, a more impressive example uses the Twitter API as the source of the data being streamed. To do this, however, the user must provide Twitter verification keys to gain access to the Twitter stream as a source module. It is important to keep in mind that the structure of the DSL command is all important in deploying a stream with a successful outcome.

Functionality of the DSL: Twitter Stream Example

The user needs the DSL to specify the structure of a stream, and the following Twitter stream example is a good demonstration of this [11]. To begin with, an initial stream is constructed:

xd:> stream create tweets --definition "twitterstream | log"

Next, three “taps” are created, each of which copies data from the initial stream:

xd:> stream create tweetlang --definition "tap:stream:tweets

xd:> stream create tweetcount --definition "tap:stream:tweets > aggregate-counter" --deploy

xd:> stream create tagcount --definition "tap:stream:tweets > field-value-counter --fieldName=entities.hashtags.text --name=hashtags" --deploy

Finally, the initial stream is deployed, which begins the feed of tweets from the Twitter source:

xd:> stream deploy tweets

The “--deploy” option could have been added to the initial “create” command, but it is preferable in this instance to keep them separate so that the taps are set up before the feed begins.

Other streams were also experimented with. For example, an HDFS Sink was set up on VM5 and used to stream EMC log files (of differing sizes) from a file source on VM2 or VM3 to the HDFS location on VM5. Once the files had streamed, it was possible to manually check the Hadoop file system on VM5 to ensure the stream was successful.

Storing DSL commands [12]

Spring XD has a couple of ways of storing DSL commands and server state information. The first is the default, in which this information is held in memory but not written to disk. The downside is that if the server is shut down all of this information is lost. The second is to use the Redis key-value store. For the test system created for this project a MySQL database was also required, which Redis connects to in order to store the values to disk. The Redis properties are set in the redis.properties file, and the Admin and Container servers connect to Redis using the Redis location entered in the servers.yml file in the Spring XD config folder.

In a distributed system Zookeeper is used as the centralized storage for stream and job definitions. While Zookeeper is running all definitions are kept in memory but are also written to disk, so if Zookeeper is rebooted for whatever reason all previous streams and jobs will still be there. The Zookeeper properties are set in the zoo.cfg file, and again servers.yml is where the user points Spring XD to the running instance of Zookeeper.

Preparing the VMs

The operating system used was Red Hat Enterprise Linux Server release 6.6 (Santiago). Java version 7 or above should be installed. The hardware specifications for the ESX server were as follows:

CPUs available: 12 x 2.1 GHz

Processor type: Intel Xeon CPU E5-2620 V2 @ 2.1GHz

Memory Capacity available: 98230MB

Number of NICs available: 4

The hardware specifications for each individual VM on the ESX server were as follows

CPUs available: 4 x 2.1 GHz

Processor type: Intel Xeon CPU E5-2620 V2 @ 2.1GHz

Memory Capacity available: 4096MB

Number of NICs available: 1

Java heap size 64MB (default setting used during testing)

Max java heap size possible on system VM 1GB



Analysis

Analysis of Spring XD functionality

Spring XD has many excellent features and is defined by Pivotal in the following way: “A unified, distributed and extensible service for data ingestion, real time analytics, batch processing and data export” [13]. Breaking this definition down element by element is a good way of exploring Spring XD's feature rich system.

Unified [14]

There are currently many standalone Apache open source projects that tackle problems like data ingestion, real time analysis, data streaming and data loading. Spring XD can be described as a unified service because it packages many solutions in one service. Table 1 lists projects (available as open source or under a paid license from companies like Pivotal) whose functionality Spring XD has attempted to combine in a single service. For data ingestion, loading and analysis the products are different and are used in differing use cases, so they do not directly compete with Spring XD. However, it could also be argued that there is crossover in most of these areas and that Spring XD is easier to use due to the automated structure of its input, analysis, batch and output plugins.

Data Ingestion and Data Loading | Apache Sqoop, Apache Flume, Pivotal Data Loader
Real time Data Analysis         | Apache Pig, Apache Hive and Apache Mahout
Data Streaming                  | Apache Oozie, Spark Streaming and Apache Storm
Batch Processing                | Pivotal Data Loader, Apache Hive, Pivotal HAWQ, Apache Pig

Table 1: Products with partial Spring XD functionality

Distributed Runtime:

For industry scale applications Spring XD runs as a distributed system. This requires the ability to deploy and undeploy modules when necessary, under the control of a dynamic system that the user can scale up or down as required. The system must also know when a container has to be spun up or down, so that if a container fails or stops functioning the work that was being done on it can be restarted on a different container. The Admin Server has a set amount of intelligence prescribed by the user and can actively assign modules to specific containers. The user can also specify the number of modules to be deployed on a specific container. The Admin Server can also query the various containers in the system to understand the current state of a stream.

Spring XD also supports multiple Admin Servers. The leader Admin Server is elected at random and is referred to as the supervisor. The supervisor decides which modules are deployed on which containers, and it also deals with all Container Servers that are created or destroyed.

The standby Admin Servers provide redundancy so that there is no single point of failure within the system. If the supervisor fails, one of the standby Admin Servers is elected as supervisor and picks up where the last supervisor left off. The standby Admin Servers can also be useful for heavy workloads. For example, if a load balancer is placed in front of 3 Admin Servers, the load balancer can send each command to the Admin Server that is least stressed. The load arriving at the load balancer could come from, say, a REST API with an unpredictable amount of data.

A distributed system like Spring XD must, as an entity, be able to reach consensus on decisions such as whether a Container Server is dead or alive, or what to do if a general error is seen on the system. Zookeeper is a toolset for building a highly available distributed coordination service, which is essential in the construction of distributed systems. On a large scale distributed system Zookeeper is run on an odd number of servers, because progress requires a quorum (a majority) of servers. If 5 servers are set up, for example, at least 3 of them need to be up and able to see each other in order to make any progress on reads or writes. If the quorum is disrupted then no reads or writes can occur15.
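The quorum rule described above can be illustrated with a short sketch. The Python below does not use any Zookeeper API; the function names `quorum_size` and `tolerated_failures` are hypothetical, and the sketch assumes that a quorum is a strict majority of the ensemble, which is consistent with the 3-of-5 example in the text.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority still remains."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Compare odd and even ensemble sizes (assumed majority rule).
    for n in (3, 4, 5, 6):
        print(n, quorum_size(n), tolerated_failures(n))
```

Under this assumption 5 servers need 3 for a quorum and tolerate 2 failures, while 6 servers need 4 and still tolerate only 2, which is one way of seeing why an odd number of servers is preferred.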

If two updates are sent to the system one after the other, the first update will be written to every part of the system before the second update is written to any part of the system. This is called guaranteed ordered delivery. Spring XD uses Zookeeper as a storage system for Stream and Job definitions. Zookeeper keeps these definitions in memory but also writes everything to disk in case of a system wide failure.

Extensible:

Another ability of Zookeeper is to notify the user, on a specified thread, of any change to the system, such as a container failing. Spring XD uses this feature to be notified of any change in the number of Containers, Streams or Jobs.



Ephemeral nodes are hugely important. The user can mark a node as ephemeral, which means the user is notified if the node goes down and knows whether it is still available for work. This is done through a type of heartbeat registry, in which each node must periodically send a signal to the master node telling it that the node is alive. If this does not happen, the master node presumes the node is dead and reassigns its work elsewhere. Spring XD uses this feature of Zookeeper to track Container Servers, Modules and Admin Servers. If the streaming load increases, more containers can be added. The supervisor initially spreads the new Container Servers out across the system and then load balances, which means moving workloads from Container Servers with high loads to the newly deployed servers so that each Container Server has approximately the same workload.
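The heartbeat mechanism described above can be sketched as follows. This is a minimal single-process illustration and not Zookeeper's or Spring XD's implementation: the class `HeartbeatRegistry`, the function `reassign`, the timeout value and the container names are all hypothetical, and time is passed in explicitly so that the behaviour is easy to follow.

```python
from typing import Dict, List


class HeartbeatRegistry:
    """Tracks the last heartbeat of each node (illustrative only)."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def beat(self, node: str, now: float) -> None:
        """Record that `node` signalled it was alive at time `now`."""
        self._last_seen[node] = now

    def dead_nodes(self, now: float) -> List[str]:
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


def reassign(work: Dict[str, List[str]], dead: List[str]) -> Dict[str, List[str]]:
    """Move modules from dead containers to the least loaded live container."""
    live = {node: list(mods) for node, mods in work.items() if node not in dead}
    if not live:
        raise RuntimeError("no live containers left")
    for node in dead:
        for module in work.get(node, []):
            # Pick the live container with the fewest modules (ties by name).
            target = min(live, key=lambda n: (len(live[n]), n))
            live[target].append(module)
    return live


if __name__ == "__main__":
    registry = HeartbeatRegistry(timeout_seconds=10)
    registry.beat("container1", now=0)
    registry.beat("container2", now=0)
    registry.beat("container1", now=8)   # container2 stays silent
    dead = registry.dead_nodes(now=12)
    print(dead)
    print(reassign({"container1": ["http"], "container2": ["log"]}, dead))
```

The sketch reassigns work to the least loaded live container, which mirrors the load balancing goal described in the text; how the supervisor actually chooses a target container is not specified here.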

Data ingestion16

One of the goals of Spring XD is to automate stream deployment, so that using Spring XD saves the user time and is easy to deploy and reliable. To do this Spring XD uses Spring Integration adapters, which are compatible with the most commonly used types of data source. Table 2 below lists and describes these input sources. Within the Admin Shell DSL commands the user can specify variations. For example, if data is to be streamed from a file location the user can specify whether the data should be serialized or kept in the same format before reaching the output. The following option specifies that the data be kept in plain text format: --outputType=text/plain. Custom sources can also be built with relative ease, although a good knowledge of Spring Integration is required for this.

Input Source      Description of what data is used as input to source module
HTTP              When data is posted to the specified HTTP Server
SFTP              Secure File Transfer Protocol is the protocol used to transfer files from a given local directory
Tail              When a file is appended to for example a log file of a running system, the data that is added is copied and is used as the input to source module
File              The File contents of a specified File directory
Mail              The incoming Mail of a specified Mail Server directory
Twitter Search    Receives data by continuously querying real time Twitter Server Streams
Twitter Stream    Ingests data from the Twitter streaming API
Gemfire Source    Receives data from a specified data Cache
Gemfire CQ        Receives data from a query operating on a specified cache source. Only receives data when the query result changes.
Syslog            The TCP protocol is used to ingest data from specified log files
TCP               This source acts as a Server and allows a remote connection to Spring XD and submit data via raw TCP sockets
TCP Client        This source acts as a Client and allows a remote connection to Spring XD and submit data via raw TCP sockets
Reactor IP        This source acts as a Server and allows a remote connection to Spring XD and submit data via raw TCP or UDP sockets
JMS               Receives messages from Java Message Service
RabbitMQ          Receives messages from RabbitMQ message queuing service
Time              A String format containing the time every so often
MQTT              Connects to MQTT server and receives telemetry messages
Stdout Capture    Combination of TCP and NETCAT command to capture output of a command
Kafka             Ingests data from Kafka topic configuration
JDBC              Ingests data from various databases

Table 2: Spring XD Stream Sources

Real time Analytics:

In large scale distributed systems for workload analysis, Spring XD would be used as a single entity within a toolbox of products and would handle only the data streaming. The reason for this is that there are more sophisticated and specialized tools on the market for high end analysis. However, each system is different and has different requirements, so the level of analysis that Spring XD offers may be sufficient and render the more advanced tools redundant. In Spring XD analytics can be done in two different but similar ways. In the first, analytics modules are added to a stream and placed between the Source and the Sink. These processor modules transform the data or perform the analytics on the primary stream, outputting just the result of the processing. The simplest type of analytics supported by Spring XD uses counters and gauges to perform various types of aggregation. In the second, a tap is used to create a secondary stream. The tap can be placed anywhere along the stream to best suit the needs of the system, and analytics can then be applied to the secondary stream. This is the most widely used form of analytics in Spring XD and can be powerful when combined with a visual representation of the streaming data. This can be done through the REST API together with a script written in, say, Groovy. An example is the Twitter stream described in the “Functionality of the DSL: Twitter Stream Example” section of this paper, where the number of tweets containing a certain hashtag is represented visually on screen and changes in real time. Figure 6 was taken from the analytics dashboard that was installed as part of the investigation for this project.

Figure 6: Twitter Stream Analytics Dashboard
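The idea of a tap feeding a counter can be sketched in a few lines. The Python below is a conceptual illustration only and does not use the Spring XD DSL or any Spring XD API; the names `tap` and `make_hashtag_counter` and the sample tweets are hypothetical. It shows that the primary stream is delivered unchanged while a copy of each message is analysed on the side.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator


def tap(stream: Iterable[str], side_consumer: Callable[[str], None]) -> Iterator[str]:
    """Pass each message through unchanged, copying it to a side consumer."""
    for message in stream:
        side_consumer(message)
        yield message


def make_hashtag_counter(counts: Counter) -> Callable[[str], None]:
    """Build a consumer that counts the hashtags found in each message."""
    def consume(message: str) -> None:
        for word in message.split():
            if word.startswith("#"):
                counts[word.lower()] += 1
    return consume


if __name__ == "__main__":
    tweets = ["Trying #SpringXD today", "#springxd and #bigdata", "no tags here"]
    counts: Counter = Counter()
    # Primary stream: source -> tap -> sink (the sink here just collects).
    delivered = list(tap(tweets, make_hashtag_counter(counts)))
    print(delivered == tweets)
    print(counts.most_common())
```

A dashboard such as the one in Figure 6 would then poll the counter values and redraw them as they change.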

Machine learning algorithms can also be applied in Spring XD via extensible class libraries. Machine learning algorithms usually run as a batch process on data that is saved in a relational or non-relational data store. Because the data streams in real time, the system is limited in the amount of processing it can do, and this aspect of Spring XD is still being fleshed out.

Batch Processing:

Batch processing is the ordered sequence that is put in place when one task needs to be completed before the next one commences. Every Big Data system is different, so no two batch systems are likely to be the same. One of the most used and user friendly solutions is Spring Batch, which attempts to tackle this problem and does so successfully to a degree. Spring projects in general remove the need for boilerplate code and provide support through jar plugins that automate features. However, a high degree of Spring and Java coding is still required. With this in mind it is fair to say that most batch orientated systems are highly complex to implement and are not developer friendly. Hadoop is an example of this, where anything that deviates from the out of the box solution is extremely complex to implement. Spring XD, on the other hand, has automated the process with the ability to add and remove processes. All the user needs to do is set up the infrastructure and then create and deploy the stream. The pipe structure of the stream ensures that each process is carried out in sequence.

Data export:

As with data ingestion, Spring XD also provides automated plugins for data output, that is, for data streaming destinations. Such a destination is referred to as the Sink. Every stream must end with a predefined Sink definition. Table 3 below outlines all available predefined sink options.

Sink              Description of Sinks (Output Sources)
Log               Data gets output by application logger to the Container Console
File Sink         Data gets output to a file on the Container OS
Hadoop HDFS       Data is output to HDFS
HDFS Dataset      Output data is Java classes and is stored in that way on HDFS
JDBC              Data gets output to a relational database table
TCP Sink          Data is output via TCP protocol
Shell Sink        Complex sink that allows a process written in any language to modify the data
Mongo             Data is output to a Mongo collection
Mail              Output messages get sent as emails
RabbitMQ          Data is output as Rabbit messages
GemFire Server    Data is output to a GemFire Cache
Splunk Server     Data is output via TCP to a Splunk connector
MQTT Sink         Connects to MQTT Server and data gets published as telemetry messages
Dynamic Router    Routes messages to certain named channels based on outcomes of specified expressions or scripts
Null Sink         Data gets destroyed but not before analytics is run via a tap
Redis             Data gets output to Redis data store
Kafka Sink        Data gets output to Kafka topic configuration

Table 3: Spring XD Stream Sinks

Apache Storm:

Another open source project that streams data in real time is Apache Storm. Key characteristics of Apache Storm include a highly scalable real time event stream processing platform, extreme fault tolerance, guaranteed processing and language agnosticism. Ostensibly Storm has many of the features of Spring XD, but with Storm the user is tasked with writing the processing logic. A plus, however, is that the user can write this processing logic in any language. In Spring XD it is possible to write a custom processor to transform the streaming data in a unique way, but this is unlikely to be needed as most transformations are automated. What makes the stream transformation unique in Spring XD is the way the processors are queued.

Like Spring XD, Apache Storm uses Zookeeper to scale up the cluster if needed as data input grows17. Storm, like Spring XD, does not use Zookeeper for message passing, so the quantity of data stored on Zookeeper is low, which is best practice for a Zookeeper cluster. As Storm is an Apache project it is built to interact with the Apache family of products, including Apache YARN and Apache Ambari. Apache YARN is installed on top of Hadoop as a resource manager and provides centralized management for the HDFS cluster. This includes management of consistent operations, security and data governance18. Figure 7 shows how Apache Storm would fit into an ecosystem such as this18.

Figure 7: Big Data environment

Apache Ambari is also used as a management tool for Storm. It is used in a different way to YARN, in that Ambari visually represents all the servers running Storm and Hadoop so that the operator can see an overview of the system on one screen. In fact any Apache project that is running on the cluster can be added to Ambari for monitoring and control, which makes it a very powerful tool. Spring XD does not boast an overall system monitor such as Ambari. Most projects, including Storm and Spring XD, also have dedicated REST based UIs. In the case of Spring XD the REST API gives an overview of the Admin and Container Servers but does not include the health of the Hadoop cluster.

An area where Storm falls down is the complexity of its setup, where the user is encouraged to view and edit code19. There is an inherent advantage in knowing how something functions from a coding perspective, but this limits the number of people who can get a Storm cluster to the point where it is streaming data20. With Spring XD all of the Source connectivity is automated, with declarative statements adjusting the format of the data that acts as input. Storm uses a Source connector called a Spout, which reads data from a queuing broker21 such as Kafka or RabbitMQ. It is these Spout implementations that lack automation and variation. Spring XD also has support for RabbitMQ and is planning support for Kafka as a Source input, and this comes ready to use out of the box. Storm was designed so that the input source takes input only from messaging services, which is why most of the queuing systems are supported21.

As Storm is an enterprise solution, most of its use cases for Source connectivity involve HBase and HDFS. Storm uses the concept of a Bolt connector for data output. The architectures of Storm and Spring XD are fundamentally different, as can be seen in the function of a Storm Bolt. Any number of streams can flow into a Bolt processor21. The data can then be transformed using functions such as filters, joins and aggregations, or combined with input data from relational databases. Once transformed, the data can be split or copied into any number of output streams.

Figure 8: An Apache Storm Topology

A topology is the third concept introduced with Storm. Figure 8 shows an example topology setup22. It is made up of Spouts and Bolts, where once data is fed into an initial Spout it can be processed many times by many different Bolts. In this setup the output of a Spout or a Bolt can be used as the input to another Bolt21. In this way a multi stage computation can take place, but again there is no easy, intuitive way of connecting these Bolts and Spouts to produce the output the user is looking for.

Like Spring XD, however, Storm has a “local mode” (called single node mode in Spring XD) for testing smaller simulations when developing larger systems21.
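The way Spouts and Bolts feed into one another can be illustrated with a single-process sketch. The Python below does not use Storm's API and runs nothing in a distributed fashion; the functions `sentence_spout`, `split_bolt`, `filter_bolt`, `count_bolt` and `run_topology` are hypothetical names chosen for illustration. It only shows the shape of a multi stage computation in which the user wires each stage to the next by hand.

```python
from typing import Dict, Iterable, Iterator


def sentence_spout(lines: Iterable[str]) -> Iterator[str]:
    """Spout: the source of the topology, emitting one tuple per line."""
    for line in lines:
        yield line


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Bolt: splits each sentence into words."""
    for sentence in sentences:
        for word in sentence.split():
            yield word


def filter_bolt(words: Iterable[str], min_length: int) -> Iterator[str]:
    """Bolt: drops words shorter than min_length."""
    for word in words:
        if len(word) >= min_length:
            yield word


def count_bolt(words: Iterable[str]) -> Dict[str, int]:
    """Final bolt: aggregates a count per word."""
    counts: Dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def run_topology(lines: Iterable[str]) -> Dict[str, int]:
    """Wire the stages by hand: spout -> split -> filter -> count."""
    words = split_bolt(sentence_spout(lines))
    return count_bolt(filter_bolt(words, min_length=4))


if __name__ == "__main__":
    print(run_topology(["storm bolt storm", "a spout"]))
```

In Spring XD the equivalent wiring is expressed declaratively with the pipe symbol in a stream definition, which is the contrast the text draws.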



Spark Streaming

Spark Streaming is an add on to the Spark framework that adds support for continuous data streaming. The goal of the Spark Streaming infrastructure is to create a system that is fault tolerant, cost effective and fast23. It was noticed that existing fault tolerant systems were slow to recover from a fault, and this was the problem that Spark Streaming set out to solve. The answer was to run the streaming computation as a series of very small batch jobs. The smaller the batch size, the lower the latency of recovery.

Within the Spark framework the user can write programs that transform the data, and this is done across Resilient Distributed Datasets (RDDs). RDDs are a fundamental architectural concept introduced with Spark. Within this architecture the user can split the data as they feel will best suit the needs of the system. The data can be kept in memory or on disk while it is being processed. Parallel processing of data is supported, in that if a dataset becomes too big for the memory of one machine the data is shared across the memory of multiple machines within a cluster. As with Spring XD, fault tolerance is supported through integration with Apache Zookeeper. RDDs are periodically monitored, and if one stops responding it is rebuilt on a different machine.

Once the data is received, Spark Streaming breaks it into small batches. On these batches the user can apply transformations such as map, filter and group by. This is similar to Spring XD, in that chaining transformations is akin to putting together a stream of modules. The difference is that once a stream is deployed in Spring XD each module is carried out in order, whereas in Spark Streaming all batch tasks are carried out in parallel. As in Spring XD, nothing happens until the user actions (deploys) a command. Examples of Spark Streaming actions include count, collect and save. Results can be saved to HDFS, with most storage types being supported. Before a task is actioned the transformations simply build up, and the RDDs plan how the tasks are going to be scheduled and run. Figure 9 shows the architecture of Spark Streaming24. Once the action is deployed the executors carry out the task and return a result. Like Storm, Spark Streaming supports multiple languages for implementing the executors, tasks and actions. Examples include Scala, Python and Java.



Figure 9: Spark Streaming Architecture
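The two ideas in this section, small batches and transformations that only run once an action is called, can be sketched in plain Python. This is not the Spark API: the function `micro_batches` and the class `LazyBatch` are hypothetical stand-ins written for illustration, and everything runs in a single process.

```python
from typing import Callable, Iterable, Iterator, List


def micro_batches(records: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut an incoming stream into small fixed-size batches."""
    batch: List[int] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


class LazyBatch:
    """Queues transformations and runs them only when an action is called."""

    def __init__(self, data: List[int]) -> None:
        self._data = data
        self._steps: List[Callable[[Iterable[int]], Iterable[int]]] = []

    def map(self, fn: Callable[[int], int]) -> "LazyBatch":
        self._steps.append(lambda items: (fn(x) for x in items))
        return self

    def filter(self, predicate: Callable[[int], bool]) -> "LazyBatch":
        self._steps.append(lambda items: (x for x in items if predicate(x)))
        return self

    def collect(self) -> List[int]:
        """Action: apply the queued transformations and return the result."""
        items: Iterable[int] = self._data
        for step in self._steps:
            items = step(items)
        return list(items)

    def count(self) -> int:
        """Action: number of records left after the transformations."""
        return len(self.collect())


if __name__ == "__main__":
    for batch in micro_batches(range(1, 8), batch_size=3):
        plan = LazyBatch(batch).map(lambda x: x * 10).filter(lambda x: x > 15)
        # Nothing has been computed yet; collect() triggers the work.
        print(batch, plan.collect())
```

The smaller `batch_size` is, the less work has to be repeated if a single batch is lost, which is the recovery argument made in the text.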

Comparing these three systems is difficult because each use case is different. It really comes down to choosing the right tool for the job. Table 4 gives a generalization of where each framework excels and where each framework is not a market leader. Spark Streaming, for example, has been seen to be difficult to maintain because of the existence of 2 stacks, one for streaming and the other for batch processing25. The flexibility of such a system is high, but so is the learning curve. Spring XD, by comparison, is as useful in many ways. For example, multiple copies of a stream can be created at any time from the primary stream. This is done in the form of “taps”, which can then be transformed and saved for batch processing. Storm would also be seen as difficult to put together because of its low level API, where the user needs to structure the Spouts and Bolts to feed into one another and arrive at the correct form of processing.

Spark Streaming also does not scale well, because it is so flexible to program. When a small cluster is doing many things well that is great, but as the load increases so too do the bugs. When the system is doing so many things at once, a lot of fine tuning is required as the system grows. However, if the company implementing the system has the budget for the upkeep of a system such as this, then Spark Streaming is well capable of scaling in such a flexible environment to process petabytes of data26.



Criterion                                      Spring XD   Apache Storm   Spark Streaming
Does the system scale well?                    Yes         Yes            No
Does system have low latency?                  No          Yes            Yes
Flexibility level of the system?               High        Low            High
Automation level of the system?                High        Medium         Medium
Implementation difficulty level of system?     Low         High           High
Easy to Integrate for Batch processing.        Yes         No             Yes

Table 4: Compare and contrast Spring XD, Apache Storm and Spark Streaming



Results

The system under test

The purpose of this project was to analyze Spring XD, and this was done by varying the system under test. The messaging system used to carry information between modules is the main variable in the system, so this was the variable chosen. Redis was the first messaging service tested, as it is the default and is very quick and easy to install. Redis is a high performance in-memory key-value datastore. One drawback of Redis is that the data held in the datastore is limited by the non-persistent memory available on the system. Another is that if some of the data packets that the Redis messaging service is transporting are dropped on the network, Redis has no way to rectify this27, so Redis cannot prevent data corruption. However, Redis is fast and data integrity is not essential for many systems, so with this in mind Redis is a valid option for testing.

For comparison RabbitMQ was also installed for testing purposes. RabbitMQ is not as easy to install as Redis. First Erlang needs to be installed, and this requires all VMs to be running the same version of the OS. In this case Red Hat Enterprise Linux 6 was used. Installing Erlang on machines that had a firewall was a challenge, and finding ways around this took a lot of trial and error. Once Erlang was installed, RabbitMQ was downloaded and installed. It is the little things that can hold up a project such as this. For example, once RabbitMQ was installed it would not communicate with the VMs that had the Admin Server and Container Server installed on them. Even after the communication port had been opened and set in the RabbitMQ config files and in the servers.yml files of the Admin and Container Servers, RabbitMQ still failed to carry messages once the streams were deployed. The cause was a change in the servers.yml file between Spring XD versions. The documentation said that the IP address of the machine running RabbitMQ needed to be in the form “host:xx.xx.xx.xx”, whereas with the new version of Spring XD it needed to be in the form “address:xx.xx.xx.xx”.

Once installed, RabbitMQ is reliable and, unlike Redis, it has message acknowledgements. The trade-off is that these slow the system down. Message acknowledgements come in two forms: transactions and publisher confirms. RabbitMQ also gives the user more control over who can use the system, in the form of permissions for queues and exchanges. The user can also decide whether the information for specific exchanges and queues is kept in memory or written to disk28.



The test process

As Redis is supposedly a fast messaging service and RabbitMQ supposedly a slow one, the expectation was that a timing measurement would bear out this difference. The stream used for the measurement read from a local file in a VM directory and streamed the data to the container's console as plain text. The following steps were needed to measure the time.

1. The Admin and Container Servers needed to be started.

2. The stream needed to be deployed. The following stream was used:

stream create --name sourcefilename --definition "file --outputType=text/plain | log" --deploy

A stream definition had to be chosen, and this one was used because the file source was best suited to a time calculation. Printing the data to the console also suited the time calculation, and it helped with visual confirmation that the streaming had commenced.

3. A Java script was written to record the time at which a file was created within the source folder of the stream. The Spring XD “file” source works as follows. Any file stored in the folder location /tmp/xd/input/name_of_the_deployed_stream will be streamed once the stream is deployed. In this case the folder is empty when the stream is deployed, and Spring XD polls the source folder waiting for a file to be dropped into it. The Java script polls the same folder, also waiting for a file to be dropped into it. A Linux command was used to copy in files that had been created at certain sizes. The command was “cp ../filesizes/2MB.txt .”, where the dot at the end stands for the current directory, so the command copies the file 2MB.txt into the current directory.

4. Execute the Linux command. Once this command is executed the file is created in the stream source folder and the current time is recorded. The data now stored in the source folder starts to stream to the console of the Container Server. Once the data stops streaming the time is recorded again. Subtracting the first time from the second gives the total time of the stream.
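The polling and timing in steps 3 and 4 can be sketched as follows. The original measurement used a Java script whose source is not reproduced in this paper, so the Python below is only an assumed equivalent: the function names `wait_for_file` and `stream_duration`, the polling interval and the timeout are all hypothetical. How the end of streaming was detected is not described in the text, so the end time is simply passed in.

```python
import os
import tempfile
import time
from typing import Tuple


def wait_for_file(folder: str, poll_interval: float = 0.05,
                  timeout: float = 5.0) -> Tuple[str, float]:
    """Poll `folder` until a file appears; return its path and the detection time."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = sorted(os.listdir(folder))
        if names:
            return os.path.join(folder, names[0]), time.monotonic()
        time.sleep(poll_interval)
    raise TimeoutError(f"no file appeared in {folder} within {timeout} s")


def stream_duration(start_time: float, end_time: float) -> float:
    """Total stream time: end of streaming minus file detection."""
    if end_time < start_time:
        raise ValueError("end_time must not precede start_time")
    return end_time - start_time


if __name__ == "__main__":
    # A temporary folder stands in for /tmp/xd/input/<stream name>.
    with tempfile.TemporaryDirectory() as source_folder:
        with open(os.path.join(source_folder, "2MB.txt"), "w") as handle:
            handle.write("sample data\n")
        path, started = wait_for_file(source_folder)
        finished = time.monotonic()  # placeholder for "data stopped streaming"
        print(os.path.basename(path), stream_duration(started, finished) >= 0)
```

A monotonic clock is used in the sketch so that the subtraction is not affected by adjustments to the system clock during a run.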

Data Limit:

Files ranging in size from 2MB to 50MB were chosen for the streaming. 2MB was a good starting point because with anything smaller the time taken was so insignificant that the user would not even notice the stream taking place. Above a file size of 50MB the message service crashes. For Redis an “OutOfMemoryError” for the “java heap space” is returned, as can be seen in Figure 10. This occurs for any file being streamed whose size is greater than 50MB.

Figure 10: Redis out of memory java heap space error

For RabbitMQ a similar error is seen in Figure 11, where the queue called “xdbus” throws the “OutOfMemoryError” for the “java heap space”. The similar error occurs at the same point in the testing because the memory allocated to the Spring XD application is approximately 50MB. For the JVM it does not matter which messaging service is being used, as the heap space allocation stays the same29. Once a file larger than 50MB is put on the heap the system crashes, so testing of the default system is restricted to an upper file size limit of 50MB.

Figure 11: RabbitMQ out of memory java heap space error



This queue can be seen in the RabbitMQ control UI where all unacknowledged messages are also

recorded. Figure 12 below was taken as part of this project where newfiletest1 is the name of the

stream source folder.

Figure 12: Image of RabbitMQ User Interface



Results 2MB to 50MB:

The following data sets were measured from the system under test. First a range of files from 2MB to

50MB were streamed using the stream format already described (file to Container server console).

Figure 13: 2MB to 50MB measuring time taken for data to stream from Source to Sink

As can be seen from Figure 13 the results are linear: as the size of the files increases, so too does the time taken for the stream to complete. The assertion that RabbitMQ is a slower system, because of the handshaking that provides reliable delivery of data, is also borne out in these results. For the smaller data transfers between 2MB and 6MB the times for Redis and RabbitMQ are relatively even, because the differences are negligible or too small to measure accurately. But as the size of the data increases, so too does the time difference between the two messaging services, from a 33% increase in time when moving from Redis to RabbitMQ at 8MB to a 50% increase at 50MB. As the total time increases with each jump in file size, the absolute difference between the messaging services also increases, with the 50% increase at 50MB equalling an eleven second difference overall. This can only be described as significant. From these results it can clearly be seen that, excluding a few anomalies, RabbitMQ is in fact slower than Redis.
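The eleven second figure can be related to the 50% figure with a short calculation. This is a back-of-envelope reading of the numbers quoted in the text and not additional measured data; the symbols $t_{\text{Redis}}$ and $t_{\text{RabbitMQ}}$ are introduced here for the streaming times at 50MB, and the percentage is assumed to be taken relative to the Redis time.

```latex
\[
\frac{t_{\text{RabbitMQ}} - t_{\text{Redis}}}{t_{\text{Redis}}} = 0.50,
\qquad
t_{\text{RabbitMQ}} - t_{\text{Redis}} = 11\ \text{s}
\;\Longrightarrow\;
t_{\text{Redis}} = \frac{11\ \text{s}}{0.50} = 22\ \text{s},
\qquad
t_{\text{RabbitMQ}} = 22\ \text{s} + 11\ \text{s} = 33\ \text{s}.
\]
```

These implied values of roughly 22 and 33 seconds fall within the 0 to 35 second range of the Figure 13 time axis.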

[Figure 13 chart: time in seconds (y-axis, 0 to 35) against size of source data in MB (x-axis: 2, 4, 6, 8, 10, 20, 30, 40, 50) for RabbitMQ and Redis]



The next test measured and compared the percentage CPU usage for Redis and RabbitMQ. Again file sizes from 2MB to 50MB were used. RabbitMQ is slower and has built in handshaking, so it should use more CPU than Redis. The increase should be slight, as it comes only from the message service passing extra messages back and forth. The two tests were relatively similar: Redis averaged 8.9% overall CPU usage while data was streaming, whereas RabbitMQ averaged 9.9%. This increase can be accounted for by the extra load that RabbitMQ has to carry.

With lower file sizes the Redis values are below the overall average, and they increase as the file size increases. As the Java heap approaches its limit, Redis CPU usage rises 30% above its average of 8.9% to a value of 11.6%. This spike in CPU usage shows the stress the system is under when loading data onto the Java heap as it reaches its limit. With RabbitMQ the increase from the average value of 9.9% to the spike value of 12.7% is 28%. Both messaging services are therefore consistent with each other as they reach the limits of allowable memory usage.
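The relative increases quoted for the CPU spikes can be checked with a small helper. The function name `percent_increase` is hypothetical; the input values are the average and peak CPU percentages quoted in the text.

```python
def percent_increase(baseline: float, value: float) -> float:
    """Relative increase of `value` over `baseline`, as a percentage."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Average and peak % CPU usage figures quoted in the text.
    print(f"Redis:    {percent_increase(8.9, 11.6):.1f}%")
    print(f"RabbitMQ: {percent_increase(9.9, 12.7):.1f}%")
```

Rounded to whole numbers these come to the 30% and 28% stated above.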

Figure 14: 2MB to 50MB measuring % CPU usage while data is streaming from Source to Sink

[Figure 14 chart: % CPU usage (y-axis, 0 to 14) against file size (x-axis ticks: 2, 4, 6, 8, 10, 20, 30, 40, 50) for RabbitMQ and Redis]

The last category tested between the 2 messaging services was the percentage of system memory used while data was streaming. Here again both services perform very similarly. In all tests up to 8MB each service uses a constant amount of memory. For RabbitMQ this averages 18.3% of total memory used. For Redis the value is higher, at 20.2% of total memory used.

From here an increase can be seen for both message services. For RabbitMQ the increase is from 18.3% to 21.2%, and for Redis it is from 20.2% to 23%. In both cases this is a jump of approximately 3 percentage points in memory usage. This is easily accounted for: the data being streamed is larger, so more of the streamed data is being stored in memory, if only for a brief period.

As the system approaches the breaking point of 50MB the memory usage jumps dramatically. For RabbitMQ memory usage jumps to 31% of overall system memory, up from a constant 18% for lower file sizes, which is a relative increase of 72%. This can be explained by the increase in handshaking needed to transmit the data reliably as the data size increases. For Redis the story is similar, with a jump to 29% of overall system memory, up from 20% for lower file sizes. This works out to be a 45% increase for Redis, which again rises because the data size increases, but not by as much as for RabbitMQ because no handshaking is taking place.
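The memory figures mix two kinds of change, so it may help to set the arithmetic out. The first line below gives the relative increases at the 50MB limit; the second gives the earlier, smaller rise both as an absolute difference in percentage points and as a relative increase. All input values are those quoted in the text, and the relative figures in brackets are derived here rather than reported in the original.

```latex
\[
\text{RabbitMQ: } \frac{31 - 18}{18} \times 100\% \approx 72\%,
\qquad
\text{Redis: } \frac{29 - 20}{20} \times 100\% = 45\%.
\]
\[
\text{RabbitMQ: } 21.2 - 18.3 = 2.9 \text{ points } \left(\tfrac{2.9}{18.3} \approx 16\%\right),
\qquad
\text{Redis: } 23 - 20.2 = 2.8 \text{ points } \left(\tfrac{2.8}{20.2} \approx 14\%\right).
\]
```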

Figure 15: 2MB to 50MB measuring % memory usage while data is streaming from Source to Sink

[Figure 15 chart: % memory usage (y-axis, 0 to 35) against file size (x-axis ticks: 2, 4, 6, 8, 10, 20, 30, 40, 50) for RabbitMQ and Redis]



Results of streaming 2MB files:

The next data set measured on the test system consisted of multiple 2MB files. Only 2MB files were used, in quantities varying from 1 to 100. Again the time taken to stream these files is linear: as the number of files increases, so too does the time taken to stream all the files. Figure 16 shows a smaller data set, but this linear relationship is still evident. As in Figure 13, for smaller data sizes no real difference in time can be measured between the two messaging services under test. One surprising outcome of this test was that there is no limit to the number of small files that can be streamed.

Figure 16: Measuring time taken for differing quantities of 2MB files to stream from Source to Sink

This is shown in Figure 17, where the source folder for the stream holds total amounts of data above 50MB, the limit for a single file streamed within this test system. For this test a maximum folder size of 200MB was used (100 * 2MB), but this could be increased to any value that the source folder can hold. The size of the source folder is limited only by the hard drive of the OS it is running on. As this is really a hardware constraint rather than a Spring XD limitation, it would not be unreasonable to say that the size of the ingestion stream is limitless, provided the individual file size is kept low.

[Figure 16 chart: time in seconds (y-axis, 0 to 160) against quantity of 2MB files (x-axis: 1, 2, 3, 4, 5, 10, 20, 25, 30) for RabbitMQ and Redis]



The reason this is possible is that the stream ingests one file at a time: each file within the source folder is put into a queue and streamed in order. This puts very little stress on the system, so with multiple small files no limit was found on the size of the folder that the stream can handle. This is in contrast to Figure 13, which shows a similar linear relationship to Figure 17 but has an upper limit of 50MB because only one source file is used.
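The one-file-at-a-time behaviour described above can be illustrated with a small sketch. This is not Spring XD code: the class name `SequentialIngestSketch`, the `Stats` holder and the use of a sorted in-memory queue are all assumptions made for illustration. It only shows why the peak memory needed depends on the largest single file rather than on the total folder size.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;

// Hypothetical illustration, not part of Spring XD.
final class SequentialIngestSketch {

    /** Totals for one pass over the source folder. */
    static final class Stats {
        int filesStreamed;
        long totalBytes;
        long peakBytesInMemory;
    }

    static Stats ingest(Path sourceFolder) throws IOException {
        // Collect the regular files in the folder and queue them in name order.
        List<Path> found = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(sourceFolder)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    found.add(entry);
                }
            }
        }
        Collections.sort(found);
        Queue<Path> queue = new ArrayDeque<>(found);

        Stats stats = new Stats();
        while (!queue.isEmpty()) {
            // Only the current file is held in memory at any time.
            byte[] payload = Files.readAllBytes(queue.remove());
            stats.filesStreamed++;
            stats.totalBytes += payload.length;
            stats.peakBytesInMemory = Math.max(stats.peakBytesInMemory, payload.length);
            // A real sink would receive the payload here; after this
            // iteration the array is eligible for garbage collection.
        }
        return stats;
    }
}
```

With 100 files of 2MB each, `totalBytes` would be about 200MB while `peakBytesInMemory` stays at about 2MB, which matches the pattern reported for Figure 17.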

Figure 17: Measuring time taken for differing quantities of 2MB files to stream from Source to Sink

An analysis of the CPU usage when testing multiple 2MB files demonstrates how Spring XD handles multiple small files in order, one after the other. Figure 18 shows this CPU analysis of the container VM when dealing with different quantities of 2MB files. A spike in work load approximately every five seconds is evidence that Spring XD systematically streams one file at a time until all files within the source folder have been streamed. From Figure 18 it can be seen that the CPU usage drops to zero and stays there for approximately 3 seconds between spikes. During this 3 second period Spring XD has loaded the file to the source and is outputting the stream to the container console, which in this case is what the user sees as visual evidence that the stream is functioning. Once the file has been streamed to the console the next file is taken, which accounts for the next spike in CPU usage.

[Figure 17 chart. Y-axis: Time in Secs (0 to 600). X-axis: Quantity of 2MB files (1, 2, 3, 4, 5, 10, 20, 25, 30, 40, 50, 60, 75, 100). Series: RabbitMQ, Redis.]



Figure 18: Measuring % CPU pattern as 2MB files are streaming with Redis as messaging service

Figure 18 shows the system using Redis for message transport. The same tests were carried out using RabbitMQ, and the results are shown in Figure 19. Figure 18 and Figure 19 show very similar CPU patterns, both averaging 7% CPU usage. This demonstrates that during general system use both messaging services put the OS under a similar amount of pressure. CPU usage should therefore not be a consideration when choosing one messaging service over the other for a load such as this.

One aspect of the Redis analysis that does stand out is the random spikes in CPU usage that occur. Two spikes that jump to a value close to 14% CPU usage can clearly be seen in Figure 18. These spikes could be accounted for by the internal functionality of Redis. Redis keeps its data set in memory and, using a persistence format called RDB, saves a snapshot of it to disk once a configured time interval and number of changes have been reached [30]. Each save does not modify the previously persisted snapshot in place; instead a complete new snapshot is written, which then replaces the previous version. Writing out the whole data set in this way can cause large spikes in CPU usage, which could account for what can be seen in Figure 18.

[Figure 18 chart. Y-axis: % CPU usage (0 to 16). X-axis: Time in Secs (1 to 27). Series, Redis: 2, 3, 4, 5, 10, 20, 25, 30, 40, 50, 60, 75 and 100 * 2MB files.]

These spikes are not seen when using the RabbitMQ messaging service, as shown in Figure 19. However, initial CPU usage spikes when the Spring XD stream is started are evident. This could be put down to RabbitMQ being a more robust system and thus needing more resources when establishing connections on system start-up.

Figure 19: Measuring % CPU pattern as 2MB files are streaming with RabbitMQ as messaging service

For memory usage very similar results can be seen between the two messaging services. Both services use between 19% and 21% of overall system memory throughout the wide range of folder sizes. This is shown in Figure 20 below. As the folder size increases the memory used does not increase dramatically; the maximum increase is 2%, which could be described as insignificant. This again is in contrast to previous tests using larger file sizes. In Figure 15 it can be seen that as the file size increases so too does the memory usage, with a dramatic spike (which could be crippling to a system) seen as the size approaches the Java heap threshold. No such spikes are seen when using multiple smaller files, and thus the system is much more stable and reliable. This stability is very much in evidence in Figure 18, Figure 19 and Figure 20.

[Figure 19 chart. Y-axis: % CPU usage (0 to 16). X-axis: Time in Secs (1 to 26). Series, RabbitMQ: 2, 3, 4, 5, 10, 20, 25, 30, 40, 50, 75 and 100 * 2MB files.]



Figure 20: Measuring % memory usage patterns as 2MB files are streaming

[Figure 20 chart, titled "Memory usage: streaming 2MB files". Y-axis: % Memory usage (18 to 21). X-axis: Number of 2MB files being streamed (2, 3, 4, 5, 10, 20, 25, 30, 40, 50, 60, 75, 100). Series: RabbitMQ, Redis.]




Multiple 10MB files were the next data set used for measurement on the test system. Only 10MB files were used, in quantities varying from 2 to 15. Each file added to the source folder yields a large overall increase in the data being streamed. From the first round of tests it was evident that a single 10MB file was no trouble for a system such as this; what was being tested here was how the stream and the messaging service handled these files together in the queue. Again, from Figure 21, the linear increase in time as the overall size of the source folder increases is evident. It could be argued that as the size of the files being streamed increases, i.e. from Figure 16 (2MB files) to Figure 21 (10MB files), the time difference between the two messaging services also increases. In Figure 21 the additional time RabbitMQ takes to process the same amount of data is again significant. For example, at a quantity of 15 there is a time difference of 6.2% and at 8 this is 15%. As before, this can be put down to the significant handshaking that takes place for RabbitMQ but not for Redis.
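The percentage differences quoted above can be reproduced with simple arithmetic. The sketch below assumes the percentage is taken relative to the faster (Redis) time, which the text does not state explicitly; the timings in `main` are hypothetical values chosen for illustration and are not measurements from the test system.

```java
// Hypothetical helper for comparing two stream times.
final class TimeDifferenceSketch {

    /** Percentage by which slowerSecs exceeds fasterSecs. */
    static double percentSlower(double fasterSecs, double slowerSecs) {
        return (slowerSecs - fasterSecs) / fasterSecs * 100.0;
    }

    public static void main(String[] args) {
        // Illustrative timings only, not taken from Figure 21.
        double redisSecs = 40.0;
        double rabbitSecs = 46.0;
        System.out.printf("RabbitMQ is %.1f%% slower%n",
                percentSlower(redisSecs, rabbitSecs));
    }
}
```

If the baseline were instead the slower time, the resulting percentages would be slightly smaller, so the choice of baseline should be stated alongside any figures reported.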

Figure 21: Measuring time as 10MB files are streaming from Source to Sink

[Figure 21 chart. Y-axis: Time in Secs (0 to 90). X-axis values: 2, 3, 4, 5, 8, 10, 15. Series: RabbitMQ, Redis.]



When performing these tests it was clear the system was under pressure. There were longer delays between the system being initialized and data starting to stream in the console. The data also appeared more erratic and did not seem to flow as evenly. This pressure was most evident in the percentage CPU usage data. There were no even spikes and hollows every 3 seconds as with the 2MB tests; rather, there was a steady usage of between 3% and 12% CPU, with random spikes of anything up to 40% CPU usage every 4 to 5 seconds. Figure 22 shows the average of each test stream, with each test showing similar averages for RabbitMQ and Redis. This work load should ideally be avoided to prevent unwanted spikes that could cause the system to crash.

Figure 22: Measuring % CPU usage as 10MB files are streaming from Source to Sink

[Figure 22 chart. Y-axis: % CPU usage (0 to 8). X-axis values: 2, 3, 4, 5, 8, 10, 15. Series: RabbitMQ, Redis.]

For memory usage the system fluctuates between 20% and 27% of total memory. This is in contrast to the memory levels seen for the 2MB tests, where the values stayed between 19% and 21% even as the source folder size increased (Figure 20). This is another indication that the system is under pressure to complete the stream. As seen with the single file tests in Figure 15, the memory limit for the Java heap is around the 30% mark, which these values are just under. This test is unlikely to crash like the single file test, as more than 50MB won't be loaded into the Java heap at any one time. It is likely, however, that a value close to this limit will be reached, as Redis keeps all data in memory before persisting it to disk. This causes CPU spikes like those seen in Figure 18 and again could cause a system running close to its limit to crash.

Figure 23: Measuring % memory usage as 10MB files are streaming from Source to Sink

[Figure 23 chart. Y-axis: % memory usage (0 to 30). X-axis values: 2, 3, 4, 5, 8, 10, 15. Series: RabbitMQ, Redis.]




The final test conducted used multiple 20MB files. As 20MB is a sizable file on its own, multiples of it grow the source folder quite quickly. With that in mind, 15 was the maximum number of files used during this test. Again RabbitMQ and Redis were tested against each other, with some results similar to previous tests. Again RabbitMQ seems to take longer to process the same load, and the relationship between time and file growth is linear. Figure 24 demonstrates this; however, the percentage time difference does not seem to be as large as with other tests. For example, using a quantity of 5 files RabbitMQ only showed a 2.5% increase in time, and for 15 files a 1.7% increase in time. This could just be down to small inaccuracies in testing. However, if Figure 13 is examined, the single 20MB file also showed a large time variation of 20% between RabbitMQ and Redis. This leads to the conclusion that Redis slows down when dealing with multiple large files.

Figure 24: Measuring time as 20MB files are streaming from Source to Sink

[Figure 24 chart. Y-axis: Time in Secs (0 to 140). X-axis: Number of 20MB files streaming (2, 3, 5, 10, 15). Series: RabbitMQ, Redis.]

The percentage CPU usage also increased, if only by a small margin. This can be seen in Figure 25, where each test gives the average overall CPU value for that test. This test is very similar to the 10MB file test, with slightly higher values across the board. Another significant difference is that the per-second CPU values are more erratic, with highs of greater than 30% a regular occurrence. The lows between files are higher too, with 3% and 5% the usual values seen. This accounts for the higher overall average values seen in Figure 25. Again, this suggests that this is not the load such a system is designed for. As the CPU usage increased the system reached a critical state again. As 20MB files were streaming the system struggled to output the data to the console and a “failed to flush writer” error was thrown on the Spring XD container server. Figure 26 is a screen grab of this error. This did not prevent the write to the console taking place, however, and the stream continued after the error was thrown. This shows the dynamism and robustness of Spring XD.

Figure 25: Measuring % CPU usage as 20MB files are streaming from Source to Sink

[Figure 25 chart. Y-axis: % CPU usage (0 to 10). X-axis values: 2, 3, 5, 10, 15. Series: RabbitMQ, Redis.]



Figure 26: Spring XD Container Server Error

Figure 27 shows memory usage for the 20MB test. What is evident is a higher average memory usage than in any other test conducted, which could also have contributed to the error in Figure 26. The only tests that come close to these levels are the single file tests with a file size greater than 30MB (Figure 15), which ultimately led to the Java heap error and system crash. Here the memory usage levels only exceed the 30% mark as the source folder size approaches 200MB. This total size is still far greater than is possible with a single file, as the memory gets cleared each time a new 20MB file is streamed. What may be happening here is that memory isn’t getting cleared fast enough, so that with every file that gets streamed the memory fills up until the system again crashes.



Figure 27: Measuring % memory usage as 20MB files are streaming from Source to Sink

[Figure 27 chart. Y-axis: % Memory usage (0 to 35). X-axis values: 2, 3, 5, 10, 15. Series: RabbitMQ, Redis.]



Conclusion

The initial aim when starting this project was to study and experiment with different methods of analyzing a big data source. That goal was accomplished once the highly available Spring XD cluster was set up and different stream tests were carried out. The four aspects of Spring XD (data ingestion, data compute, data analytics and data export) were broken down and examined. Examples of these were the Twitter stream example and streaming EMC log files to an HDFS stream sink that was set up on VM5 within the ESX cluster.

The project then evolved to analyze Spring XD itself, as it was evident that a set of guidelines could be beneficial to anyone setting up a system such as this. My initial findings during my examination of Spring XD were used to come up with a plan to create test streams. These tests were carried out in conjunction with a Java program that was run to capture stream time, system CPU and system memory metrics. In this way an overview was gained of what type of load best suited the system and what load stressed or broke it.

The main variable within the system was the messaging service, so the only two messaging services supported at the time of test, RabbitMQ and Redis, were used. The same tests were carried out for each messaging service and a comparison was made across different load types.

Of course every test system is different, and the specifications for the one used in this case are listed under Preparing the VMs. The Java heap limit was left at its default, which was 64MB. This could have been extended during testing to see what effect it would have, and users of Spring XD should be aware of it. It will only become a problem if the source folders contain files greater than 30MB, as this is the point at which the Spring XD container server starts to become stressed. The Spring XD container server will reach its breaking point at anything greater than 50MB.
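The heap limit discussed above can be inspected from inside any JVM with the standard `Runtime.maxMemory()` call, and raised with the standard `-Xmx` launch option (for example `java -Xmx256m ...`). The sketch below is illustrative only: the class name and the 0.5 "comfort" fraction are assumptions chosen to mirror the observation that stress began at roughly half of the 64MB heap, and how the option is passed to a Spring XD container is not covered here.

```java
// Hypothetical helper for reasoning about file size against heap size.
final class HeapHeadroomSketch {

    /** Maximum heap the current JVM will attempt to use, in bytes. */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** True if fileBytes is at most the given fraction of the heap. */
    static boolean fitsComfortably(long fileBytes, long heapBytes, double fraction) {
        return fileBytes <= (long) (heapBytes * fraction);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println("Max heap of this JVM: " + maxHeapBytes() / mb + " MB");

        // 64MB is the default heap reported in the text; 0.5 is an assumed threshold.
        long heap = 64 * mb;
        System.out.println("30MB file fits: " + fitsComfortably(30 * mb, heap, 0.5));
        System.out.println("50MB file fits: " + fitsComfortably(50 * mb, heap, 0.5));
    }
}
```

Running the same check after raising the heap would give a first estimate of how far the single file limit might move, which could then be confirmed by repeating the stream tests.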

It was then found that an ideal load for Spring XD was multiple small files of 2MB. These files streamed one after the other without difficulty until all had reached the stream sink. The system did not get stressed and no errors occurred. Further tests were carried out using larger file sizes of 10MB and 20MB, and it was clear that the system became more stressed as the size of the files grew. This leads to the conclusion that any quantity of files under a size of 10MB would provide the ideal load for a system such as this. Streams with source files greater than 10MB will function correctly if the absolute limits are adhered to. However, systems with such loads could be in danger of CPU spikes or Java heap errors, which could lead to data loss or data not being analyzed.

Comparing RabbitMQ and Redis, it is evident that RabbitMQ is the slower messaging service; the time difference is very slight when using smaller files but grows as the file sizes grow. A decision on which messaging service to use really comes down to system requirements. If speed is essential and data loss is acceptable then Redis is the choice. If data integrity is essential then RabbitMQ is the only viable option, as speed can be made up for in other ways but data loss cannot. For example, the system is horizontally scalable, so time can be made up by spreading the load across multiple containers on multiple VMs.

An ideal work load of endless small files makes Spring XD fit for purpose. This is a tool well suited to Big Data phenomena like the Internet of Things (IoT) or any system that requires the collection and analysis of multiple small files.

Future Work

Spring XD version 1.0.0 was used during this project. As of March 27 2015 the latest version of Spring XD is 1.1.1, which brings with it newly supported sinks and sources and a newly supported messaging service, Apache Kafka. A further comparison against Redis and RabbitMQ could therefore be carried out.

There are many Spring XD sources and sinks that were not tested due to time constraints, so a more in-depth test plan could be prepared using more stream types. This could be used to compile a more comprehensive set of data for time taken, % CPU and % memory usage on the system under test.

A further enhancement that could be carried out in the future is extending the Java heap size on the Spring XD container server VMs. Repeating the tests carried out during this project would then help build a better picture of the absolute maximum file size that a Spring XD container can stream.



Appendices

Java Code: CPU and Memory

The following Java code was written to output to the command prompt the percentage CPU and percentage memory being used at that instant. A timestamp is attached to each reading before it is output to the screen. The “SIGAR” library is first added as a jar file and is used to capture system information, in this case memory and CPU usage [31].

The full program is as follows:

// File: Application.java
package ie.cit.msc;

import org.hyperic.sigar.SigarException;

/** Entry point: starts the poller thread that prints CPU and memory usage. */
public class Application {

    public static void main(String[] args) throws SigarException {
        new Application(args);
    }

    public Application(String[] args) throws SigarException {
        CPUPoller poller = new CPUPoller();
        poller.start();
    }
}

// File: CPUPoller.java
package ie.cit.msc;

import java.text.SimpleDateFormat;
import java.util.Date;

import org.hyperic.sigar.CpuInfo;
import org.hyperic.sigar.CpuPerc;
import org.hyperic.sigar.Mem;
import org.hyperic.sigar.Sigar;
import org.hyperic.sigar.SigarException;
import org.hyperic.sigar.cmd.Shell;

/** Prints the combined CPU usage and the used memory percentage once per second. */
public class CPUPoller extends Thread {

    private final Sigar sigar;
    // volatile so that a finish() call from another thread is seen by run()
    private volatile boolean finished = false;
    private final SimpleDateFormat format = new SimpleDateFormat("HH:mm:ss:SSS");

    public CPUPoller() throws SigarException {
        sigar = new Shell().getSigar();
        CpuInfo info = sigar.getCpuInfoList()[0];
        Mem infoMem = sigar.getMem();
        // Print the static hardware details once at start-up
        System.out.println("Vendor........." + info.getVendor());
        System.out.println("Model.........." + info.getModel());
        System.out.println("Mhz............" + info.getMhz());
        System.out.println("Total CPUs....." + info.getTotalCores());
        System.out.println("Total Memory..." + infoMem.getRam() + "MB");
    }

    @Override
    public void run() {
        while (!finished) {
            try {
                CpuPerc cpu = sigar.getCpuPerc();
                Mem usedMemory = sigar.getMem();
                System.out.println("CPU used....." + format.format(new Date()) + " : "
                        + CpuPerc.format(cpu.getCombined()));
                String totalUsedMem = String.format("%.1f", usedMemory.getUsedPercent());
                System.out.println("Mem used....." + format.format(new Date()) + " : "
                        + totalUsedMem + "%");
            } catch (SigarException e) {
                e.printStackTrace();
            }
            try {
                Thread.sleep(1000); // one reading per second
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    /** Stops the polling loop after the current iteration. */
    public void finish() {
        finished = true;
    }
}
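The averages plotted in the CPU and memory figures can be derived from the poller's console output. The sketch below is an assumption about how that could be done rather than the method used in the project: it expects lines shaped like those printed by the poller ("CPU used....." or "Mem used....." followed by a timestamp, " : " and a percentage), and the sample lines in `main` are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical post-processing helper for captured poller output.
final class PollerLogAverageSketch {

    /** Averages the percentages on all lines that start with the given prefix. */
    static double average(List<String> lines, String prefix) {
        double sum = 0.0;
        int count = 0;
        for (String line : lines) {
            int sep = line.lastIndexOf(" : ");
            if (!line.startsWith(prefix) || sep < 0 || !line.endsWith("%")) {
                continue; // not a reading of the requested kind
            }
            // Text between " : " and the trailing '%'; assumes '.' as decimal separator.
            String value = line.substring(sep + 3, line.length() - 1).trim();
            sum += Double.parseDouble(value);
            count++;
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        // Invented sample lines in the poller's output format.
        List<String> captured = Arrays.asList(
                "CPU used.....12:00:01:123 : 6.0%",
                "Mem used.....12:00:01:124 : 20.1%",
                "CPU used.....12:00:02:125 : 8.0%",
                "Mem used.....12:00:02:126 : 20.3%");
        System.out.println("Average CPU %: " + average(captured, "CPU used"));
        System.out.println("Average Mem %: " + average(captured, "Mem used"));
    }
}
```

Redirecting the poller's output to a file for each test run and feeding its lines to `average` would give one data point per test.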

Java Code: Start time of Spring XD Stream

A while loop polls the source folder until a file is created in it. Once this happens the time is output to the screen and the application finishes.

The full program is as follows:

package ie.cit.msc;

import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Date;

/** Prints the time at which the first file appears in a stream's input folder. */
public class Application {

    private static final String INPUT_DIR = "/tmp/xd/input/";
    private final SimpleDateFormat format =
            new SimpleDateFormat("EEE MMM dd HH:mm:ss:SSS zzz yyyy");
    private String nameOfStream = "filetofile1";
    private Date startTime;

    public static void main(String[] args) {
        new Application(args);
    }

    public Application(String[] args) {
        // Use the stream name from the command line, otherwise keep the default
        if (args.length > 0) {
            nameOfStream = args[0];
        }
        File inputDirectory = new File(INPUT_DIR + nameOfStream);
        // Busy-wait until the folder exists and holds at least one file
        String[] files = inputDirectory.list();
        while (files == null || files.length == 0) {
            files = inputDirectory.list();
        }
        startTime = new Date();
        System.out.println("Processing started at " + format.format(startTime));
        System.exit(0);
    }
}
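The program above busy-waits, which keeps one CPU core occupied while it polls. An alternative, sketched below and not what was used in this project, is the `java.nio.file.WatchService` API available from Java 7, which blocks until the file system reports a new entry. The class and method names are hypothetical. Note that on platforms where the watch service is implemented by polling, notification can lag by several seconds, so the busy-wait may still give the more precise start time.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

// Hypothetical alternative to the busy-wait loop.
final class FirstFileWatcherSketch {

    /** Returns the name of the first file seen in dir, or null on timeout. */
    static String awaitFirstFile(Path dir, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        try (WatchService watcher = dir.getFileSystem().newWatchService()) {
            // Register first so a file created during the check below is not missed.
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

            // Files already present raise no event, so look for them directly.
            try (DirectoryStream<Path> existing = Files.newDirectoryStream(dir)) {
                for (Path p : existing) {
                    return p.getFileName().toString();
                }
            }

            // Block until something is created or the timeout elapses.
            WatchKey key = watcher.poll(timeout, unit);
            if (key == null) {
                return null;
            }
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                    return event.context().toString();
                }
            }
            return null;
        }
    }
}
```

A caller would record `new Date()` immediately after `awaitFirstFile` returns a non-null name, in the same way the original program does after its loop exits.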

Preparing the VMs

The operating system used was Red Hat Enterprise Linux Server release 6.6 (Santiago).

Java version 7 or above should be installed:

Download the Java JDK .tar.gz to the /usr/java folder
Unpack it with
    tar zxvf jdk......tar.gz
Check the currently active version with
    java -version
cd into /usr/bin and rename the old java binaries with the commands
    mv java oldjava
    mv javac oldjavac
Then, still in /usr/bin (not /usr/java), create a symbolic link to the newly unpacked Java 7
    sudo ln -s -v /usr/java/jdk1.7.0_60/bin/java java
Then check that Java version 7 is available
    java -version

The hosts file must also be edited:

cd into /etc and edit the “hosts” file
After the following line
    127.0.0.1 localhost.localdomain localhost
add these 2 lines
    127.0.0.1 localhost
    127.0.1.1 “name of VM”
where, for example, “qecorkc85nfvm2” would be used as the name of the VM

The following steps add JAVA_HOME and XD_HOME to the global environment:

cd into /etc and edit the “environment” file
Add the following text to the empty file
    export JAVA_HOME=/usr/java/jre1.7.0_67
    export XD_HOME=/root/spring-xd-1.0.0.M4/xd
Restart the VM for the changes to take effect

Zookeeper

Within the servers.yml file the following needs to be set on each VM in the cluster for Zookeeper to be supported:

    #Zookeeper properties
    zk:
      client:
        connect: 10.73.18.165:2181

Zookeeper-3.4.6 was installed; the following commands could be used given a similar Linux OS and version [32]:

    wget zookeeper_link_address.tar.gz
    tar xvzf zookeeper-3.4.6.tar.gz
    yum install zookeeper-3.4.6

cd into zookeeper-3.4.6/bin and run the following command
    ./zkServer.sh start
Staying in zookeeper-3.4.6/bin, connect to the Zookeeper command line using the command
    ./zkCli.sh
Show all running nodes
    ls /
Show all information Zookeeper has on Spring XD
    ls /xd

Redis

Within the servers.yml file the following needs to be set on each VM in the cluster for Redis to be supported:

    # Redis properties
    spring:
      redis:
        port: 6379
        host: 10.73.18.165

Redis 2.8.16 was installed; the following commands could be used given a similar Linux OS and version:

    yum update
    yum install make gcc wget

Now try going to redis/bin and running
    ./install-redis
If this doesn’t install, the following could also be attempted. In redis/dist unpack the archive
    tar xvzf redis.tar.gz
cd to redis/deps and make all four files in there, then cd .. back to redis and run
    make install
To start the server after install, cd into redis/src and type
    ./redis-server
Check that the installation was successful by checking the keys in the message queue: cd into redis/src and type
    ./redis-cli
    keys *

RabbitMQ [33]

Within the servers.yml file the following needs to be set on each VM in the cluster for RabbitMQ to be supported:

    # RabbitMQ properties
    spring:
      rabbitmq:
        host: localhost
        port: 5672
        username: guest
        password: guest
        virtual_host: /

The Erlang package needs to be installed
    wget http://packages.erlang-solutions.com/erlang-solutions-1.0-1.noarch.rpm
    yum install erlang
Install VMware tools
    cd vmware-tools-distrib/
    ./vmware-install.pl
Install the RabbitMQ server
    rpm --import http://www.rabbitmq.com/rabbitmq-signing-key-public.asc
    wget https://www.rabbitmq.com/releases/rabbitmq-server/v3.4.4/rabbitmq-server-generic-unix-3.4.4.tar.gz
    tar xvzf rabbitmq-server-generic-unix-3.4.4.tar.gz
    yum install rabbitmq-server-3.4.4-1.noarch.rpm

MySQL

Within the servers.yml file the following needs to be set on each VM in the cluster for MySQL to be supported:

    #Config for use with MySQL
    spring:
      datasource:
        url: jdbc:mysql://10.73.18.165:3306/xdjobs
        username: xduser
        password: pass1
        driverClassName: com.mysql.jdbc.Driver

MySQL was installed; the following commands could be used given a similar Linux OS and version:
    yum install mysql-server mysql
To start the MySQL database use the command
    mysqld --user=root &
To set up a password for the root user use [34]
    mysqladmin -u root password "enter password here"
To check that MySQL is installed correctly use the following command and enter the password when prompted
    mysql -u xduser -p
This opens the mysql command line. To view the available databases use
    show databases;
xdjobs should be one of the databases, so change to this DB using the command
    use xdjobs;
To now see all tables available inside the database use
    show tables;

Hadoop

    # Hadoop properties
    spring:
      hadoop:
        fsUri: hdfs://10.73.18.167:9000

    wget http://mirror.nexcess.net/apache/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
    tar -zxvf hadoop-2.4.1.tar.gz
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Make sure the /etc/environment file is pointing at the correct location for each given path
    export JAVA_HOME=…/java/latest
    export HADOOP_HOME=…/hadoop-2.4.1
    export HADOOP_PREFIX=…/hadoop-2.4.1
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_PREFIX/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
Check that these are set with the echo command
    echo $JAVA_HOME
    echo $HADOOP_HOME

In hadoop-2.4.1/etc/hadoop edit the core-site.xml file [35]

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://10.73.18.167:9000</value>
      </property>
    </configuration>

In hadoop-2.4.1/etc/hadoop edit the hdfs-site.xml file

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data</value>
      </property>
    </configuration>

In hadoop-2.4.1/etc/hadoop leave the mapred-site.xml and yarn-site.xml files empty as follows

    <configuration>
    </configuration>

Next format the name node with the following command
    bin/hdfs namenode -format
Next start the Hadoop cluster
    sbin/start-dfs.sh

If the cluster does not start first time, things that may be wrong are as follows.
Check that the /etc/hosts file has the IP addresses of the local machine
    127.0.0.1 localhost.localdomain localhost
    127.0.0.1 localhost
    127.0.1.1 xxxorkc85xxvm5
The following three commands check for corrupted files and disable safe mode
    bin/hadoop fsck / -blocks -locations -files
    bin/hdfs dfsadmin fsck
    bin/hdfs dfsadmin -safemode leave
The user can also look at the files in the cluster
    bin/hdfs dfs -ls /
Or add to the cluster manually
    bin/hdfs dfs -mkdir /xd
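Once the services in this appendix are running, a quick way to confirm that each one is reachable from a Spring XD VM is to attempt a TCP connection to its host and port. The sketch below is a hypothetical helper, not part of the project. The hosts and ports in `main` are copied from the servers.yml fragments in this appendix, except MySQL's port, which is taken from the JDBC URL. A successful connection only shows that something is listening, not that the service is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical connectivity check for the services configured in servers.yml.
final class ServicePortCheckSketch {

    /** True if a TCP connection to host:port succeeds within the timeout. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out or host unknown
        }
    }

    public static void main(String[] args) {
        // Name, host and port as they appear in this appendix.
        String[][] services = {
            {"Zookeeper", "10.73.18.165", "2181"},
            {"Redis", "10.73.18.165", "6379"},
            {"RabbitMQ", "localhost", "5672"},
            {"MySQL", "10.73.18.165", "3306"},
            {"HDFS name node", "10.73.18.167", "9000"},
        };
        for (String[] s : services) {
            boolean up = isReachable(s[1], Integer.parseInt(s[2]), 2000);
            System.out.println(s[0] + " " + s[1] + ":" + s[2] + " -> "
                    + (up ? "reachable" : "unreachable"));
        }
    }
}
```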



References

1 http://bigthink.com/think-tank/the-internet-of-things-meets-big-data
2 http://pivotal.io/platform-as-a-service/press-release/cloud-foundry-foundation-adds-swisscom-to-roster
3 https://spring.io/projects
4 http://projects.spring.io/spring-xd/
5 https://zookeeper.apache.org/
6 http://www.slideshare.net/SpringCentral/spring-xd-guided-tour?related=1 (slide 11)
7 https://github.com/EsotericSoftware/kryo/wiki/Documentation-for-Kryo-version-1.x
8 http://www.slideshare.net/SpringCentral/spring-xd-guided-tour?related=1 (slide 12)
9 http://docs.spring.io/spring-xd/docs/current/reference/html/#_start_the_runtime_and_the_xd_shell
10 http://docs.spring.io/spring-xd/docs/0.1.x-SNAPSHOT/reference/html/running-distributed-mode.html
11 http://www.infoq.com/articles/introducing-spring-xd
12 Jason Bell, Machine Learning: Hands-On for Developers and Technical Professionals, pages 201-202
13 http://docs.spring.io/spring-xd/docs/0.1.x-SNAPSHOT/reference/html/architecture.html
14 http://projects.spring.io/spring-xd/
15 http://www.slideshare.net/SpringCentral/spring-xd-guided-tour?related=1 (slide 20)
16 http://docs.spring.io/spring-xd/docs/current/reference/html/#sources
17 https://storm.apache.org/documentation/Setting-up-a-Storm-cluster.html
18 http://hortonworks.com/hadoop/yarn/
19 https://nathanmarz.github.io/storm/doc/backtype/storm/spout/ISpout.html
20 http://hortonworks.com/hadoop-tutorial/simulating-transporting-realtime-events-stream-apache-kafka/
21 https://storm.apache.org/about/simple-api.html
22 http://www.accenture.com/us-en/blogs/technology-blog/archive/2014/04/28/the-right-big-data-technology-for-smart-grid-distributed-stream-computing.aspx
23 http://www.cs.duke.edu/~kmoses/cps516/dstream.html
24 http://www.csdn.net/article/2014-01-27/2818282-Spark-Streaming-big-data
25 http://stanford.edu/~rezab/sparkclass/slides/td_streaming.pdf
26 http://databricks.com/blog/2014/08/14/mining-graph-data-with-spark-at-alibaba-taobao.html
27 http://planetcassandra.org/redis-to-cassandra-migration/
28 http://www.quora.com/What-are-the-advantages-and-disadvantages-of-Beanstalkd-as-a-work-queue
29 https://plumbr.eu/outofmemoryerror/java-heap-space
30 http://redis.io/topics/persistence
31 http://lizhouwangnotes.blogspot.ie/2011/08/use-sigar-api-in-java-to-capture-system.html
32 http://www.cloudera.com/content/cloudera/en/documentation/cdh4/v4-5-0/CDH4-Installation-Guide/cdh4ig_topic_21_3.html
33 http://www.rabbitmq.com/install-rpm.html
34 http://www.cyberciti.biz/faq/how-to-install-mysql-under-rhel/
35 http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/
