
Visvesvaraya Technological University “Jnana Sangama”, Santhibastawad Road, Machhe, Belgaum-14

2016-2017

A Dissertation Report On

“Big Data Analytics Framework to identify

Agriculture/Aquaculture Diseases and recommendation

of a solution.”

Submitted in partial fulfillment of the requirements for the award of degree of

BACHELOR OF ENGINEERING

in

COMPUTER SCIENCE AND ENGINEERING

by

ANANYA. R (1IC13CS002)

KIRAN VASISHTA. T. S (1IC13CS009)

SHILPARANI. J (1IC13CS027)

Under the guidance of

Mrs. Rekha.M.S

Assistant Professor

Department of CSE

ICEAS, Bangalore

Department of Computer Science and Engineering

IMPACT COLLEGE OF ENGINEERING AND APPLIED SCIENCE

SAHAKAR NAGAR, BANGALORE-560092

2016-2017


IMPACT COLLEGE OF ENGINEERING AND APPLIED SCIENCE

SAHAKAR NAGAR, BANGALORE-560092

CERTIFICATE

This is to certify that the project entitled "Big Data Analytics Framework to identify Agriculture/Aquaculture Diseases and recommendation of a solution" is a bonafide work carried out by Ms. ANANYA R (1IC13CS002), Mr. KIRAN VASISHTA T S (1IC13CS009), and Ms. SHILPARANI J (1IC13CS027) in partial fulfillment for the award of Bachelor of Engineering in Computer Science and Engineering of Visvesvaraya Technological University, Belgaum, during the year 2016-2017. The project report has been approved as it satisfies the academic requirements in respect of the project work prescribed for the said degree.

Signature of Internal Guide Signature of HOD Signature of Principal

Mrs. Rekha M. S Mrs. Neenu Rana Dr. Narayan Singh

Assistant Professor Professor & HOD Principal

Department of CSE Department of CSE ICEAS, Bangalore

ICEAS, Bangalore. ICEAS, Bangalore.

Internal Examiner External Examiner

Name: __________________ Name: __________________

Signature: _______________ Signature: _______________

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


ACKNOWLEDGEMENTS

It gives me immense pleasure to convey my sincere thanks to all those who have supported me with their guidance and encouragement in submitting my project titled "Big Data Analytics Framework to identify Agriculture/Aquaculture Diseases and recommendation of a solution".

I wish to express my gratitude to Dr. NARAYAN SINGH, PRINCIPAL for his

encouragement and also for providing all the facilities for accomplishing this project.

I extend my sincere gratitude to Mrs NEENU RANA, Head of Department of Computer

Science and Engineering for her support and advice.

I also extend my sincere gratitude to Mrs. REKHA M. S, Assistant Professor, Department of Computer Science and Engineering, for her guidance, support and continuous encouragement.

I express my gratitude to management of IMPACT COLLEGE OF ENGINEERING

AND APPLIED SCIENCES, BANGALORE for providing me an opportunity to fulfil

my cherished goal of taking up the project work as a part of the undergraduate program.

I also thank all those who have helped me directly or indirectly by way of time, resources, and moral and technical support.

ANANYA.R(1IC13CS002)

KIRAN VASISHTA.T. S(1IC13CS009)

SHILPARANI.J(1IC13CS027)


ABSTRACT

With advances in technology, data in many fields is transforming into big data, and rapid advancements in technology have caused agricultural data to enter the era of big data as well. Traditional tools and techniques are unable to store and analyze this massive amount of data; a parallel storage and analysis paradigm is required, and big data analytics provides such a solution. In this project, a big data analytics framework for agriculture and aquaculture is developed that identifies a disease based on symptom similarity and recommends a solution for the disease with the highest similarity. Hadoop and Hive are used to achieve this objective. Data is collected from laboratory reports, websites and similar sources, and is then cleansed, i.e. the important information is extracted from the unstructured, redundant data. In the next step the data is normalized, i.e. features are extracted from the cleaned data. The normalized data is uploaded to HDFS and saved in a file format supported by Hive. HiveQL, a SQL-like query language, is used to analyze the data: it finds the disease name based on the crop/fish symptoms entered and proposes a solution based on evidence from historical data. The result is useful for recommending the solution that is most commonly applied or has the highest symptom similarity.


CONTENTS

CHAPTER Nos. TITLE PAGE Nos.

1 Introduction 1

1.1 Motivation 1

1.2 Objective 2

1.3 Methodology 2

1.4 Existing System 4

1.5 Proposed System 5

2 Literature Survey 6

2.1 Software Description 8

2.1.1 Java Technology 8

2.1.2 IntelliJ IDEA 9

2.1.3 IntelliJ IDEA Platform 10

2.1.4 IntelliJ IDEA IDE 12

3 System Analysis 13

3.1 Functional Requirements 13

3.2 Non Functional Requirements 14

3.3 System Requirements 14

3.3.1 Hardware Requirement Specification 14


3.3.2 Software Requirement Specification 14

4 System Design 15

4.1 System Architecture 17

4.2 Use-Case Diagram 21

4.3 Dataflow Diagram 22

4.4 Sequence Diagrams 25

5 Implementation 27

5.1 MapReduce Algorithm 28

5.2 Partitioner 31

5.3 Combiner 32

6 Testing 36

6.1 Introduction to Testing 36

6.1.1 Functional and Non Functional Testing 37

6.1.2 Compatibility Testing 37

6.1.3 Verification and Validation 37

6.2 Testing Methodologies 40

6.3 Testing Levels 40

6.3.1 Unit Testing 40

6.3.2 Integration Testing 41

6.3.3 System Testing 41

6.4 Unit Testing of Main Modules 41


6.4.1 Unit Testing for User 41

7 Results 43

7.1 Snapshots of Browsing HDFS 43

7.2 Snapshots of Hadoop and Hive Interfaces 47

7.3 Snapshots of the project deployed on IntelliJ IDEA 50

7.4 Snapshots of Web Application/Framework 52

8 Conclusion And Future Work 54

References


List of Figures

Figure Nos. Title Page Nos.

1 3V’s of Big Data 6

2 Cloud Computing 7

3 System Architecture 17

4 Hadoop Architecture 18

5 HDFS Architecture 20

6 Use-Case Diagram 21

7 Data Flow diagram 22

8 Host and Symptom DFD 23

9 Splitting Data DFD 23

10 Searching keyword-DFD 24

11 Main sequence diagram 25

12 Login failure sequence diagram 26

13 MapReduce 27

14 MapReduce Architecture 28

15 Working of MapReduce Classes 29

16 Working of MapReduce 30

17 Combiner 32

18 Example of MapReduce 35


19 Overview of HDFS setup 42

20 Summary of HDFS setup 44

21 NameNode status of HDFS 45

22 Datanode information of HDFS 46

23 Hadoop and Hive initializations 47

24 Processes running on Hadoop user 47

25 Hive tables 48

26 Hosttable rows 48

27 Query Execution on Hive table 49

28 Project on IntelliJ 50

29 Core module of project 50

30 Web module of project Testing 51

31 Front web page of the application 52

32 Searching for required data 52

33 Result 53


List of Tables

Table Nos. Title Page Nos.

1 Input, output for key-value pairs 31

2 Unit Test Case 1 41

3 Unit Test Case 2 42

4 Unit Test Case 3 42

5 Unit Test Case 4 42


Chapter 1

INTRODUCTION

Big data is a term used to describe the explosive growth of data. The data may be in the form of file systems or databases that cannot be processed by traditional software techniques and database systems. The main aim of this project is to develop a recommendation system that identifies agricultural crop diseases and aquaculture fish diseases and provides solutions for them. With the help of big data analytics, researchers can easily make decisions from historical data, and applying big data analytics to agriculture and aquaculture would be a significant innovation. Agriculture and aquaculture data is increasing day by day at an astonishing rate; the solution is to use big data analytics, and Hadoop and its tools are used for the analysis of such data.

Apache Hadoop is an open-source software framework used for distributed

storage and processing of big data sets using the MapReduce programming model.

It consists of computer clusters built from commodity hardware. The core of Apache

Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS),

and a processing part which is a MapReduce programming model. Hadoop splits files into

large blocks and distributes them across nodes in a cluster. It then transfers packaged

code into nodes to process the data in parallel. This approach takes advantage of data

locality, where nodes manipulate the data they have access to. This allows the dataset to

be processed faster and more efficiently than it would be in a more

conventional supercomputer architecture that relies on a parallel file system where

computation and data are distributed via high-speed networking.

1.1 Motivation

In the present scenario there is a huge dependency on agricultural data. As we are in the 21st century, the current generation is already dominated by digital systems. Hence the motivation for this project: to support agriculture, along with aquaculture, with a disease identification and recommendation system.


1.2 Objective

The objective of this project is to develop a web application that takes from the user the symptom of a crop or fish suspected of an infectious disease, queries the large dataset using the Hadoop and Hive tools, and recommends, from the historical data, what disease it indicates and what the prevention for it might be.

1.3 Methodology

The primary motive of generating results from the collected data is to serve researchers by giving solutions for various crop diseases. It was not an easy task to develop a new framework that identifies a disease and recommends a solution based on symptom similarity. The framework provides the solution based on historical data, which is collected from various sources.

This model is basically a recommendation system. Recommendation systems use historical data or knowledge of the product; many e-commerce companies use them to drive sales (e.g. Amazon.in). In the proposed model, the recommendation system is applied to the agriculture domain.

First, data is collected from various sources, e.g. lab reports and agriculture websites. The collected data is known as raw data because it contains irregularities and unwanted information, so it is unformatted and needs formatting and confirmation. This data is stored on HDFS. The NameNode of HDFS keeps track of how files are broken into blocks and which nodes store those blocks. Clients communicate directly with the DataNodes to process the local files corresponding to the blocks.

Feasibility Study

Feasibility studies aim to objectively and rationally uncover the strengths and weaknesses

of the existing system and the proposed venture, the resources required to carry through,

and ultimately the prospects for success. A detailed feasibility study was conducted to

know the technical and financial feasibility of the project and it was found that the project

is feasible to design, develop, use and maintain in all respects.


Requirement Analysis and Project Planning

The requirements of this project were analysed in detail, including the system requirements specification and the software and hardware requirements. The project plan was developed with the help of the requirements gathered in this phase.

Design

After successful analysis of the system requirement, design of the project started where

various design constraints were analysed. The design phase consists of various modules

to be developed for generation of data of historic and live nature which is a need for the

testing of the product. A functional design methodology and a top-down strategy are used

in this design phase. The flow diagrams along with the activity diagrams are indicated to

show the flow of control at various stages of the project.

Coding

The design of the system developed during the design phase is converted into code using

the Java environment and Perl as and when required at different stages of the project. The

coding is done according to the design strategy which aligns with the functional

requirements that are categorized as in the previous step.

Testing

The program is tested by executing it with a set of test cases in different setup environments as well as on standalone systems. The output of the program for the test cases is then evaluated to determine whether the program performs as expected. An incremental testing strategy has been used to ensure functional testing: first, the main parts of the project were tested independently; then these parts were combined into subsystems, which were tested separately. As various other modules were integrated into the system, testing was carried out to ensure the performance and behaviour of the web application.


1.4 Existing system

At present there are many agricultural websites and apps which help in the cultivation and

crop yielding techniques. Some of the existing systems are as follows:

MySmartFarm: MySmartFarm is a one-stop software solution for all of a farmer's data and technology. Hosted in the cloud, driven by statistics and powered by intelligent models and machine learning, it is designed for easy real-time 'anywhere' access, to empower farmers with scientific advice, optimizing decisions and saving time and money.

aWhere: aWhere’s Agricultural Intelligence platform provides users with accurate global

weather information for all agricultural needs. In order to accomplish this, aWhere

employs 13,000 ground weather stations around the world to collect specific data to

create a continuous weather map for the planet’s surface. By collecting and organizing

this data, aWhere is able to create a valuable network of more than 1.5 million virtual

weather stations that provide hourly forecasts as well as 10+ years of historical data. In

order to ensure that aWhere’s customers are getting the most precise data possible,

aWhere excludes any exhaustive outliers in aWhere’s data sets. With the exclusion of

these outliers, and the use of other algorithm-based methods, aWhere is able to provide

the most accurate and up-to-date weather information on the planet.

Phenonet: Phenonet collects, processes and visualizes sensor data from the field in near

real-time. It is helping plant scientists and farmers identify the best crop varieties to

increase yield and efficiency of digital agriculture.

Farmlogs: Managing your nitrogen efficiently is one of the key ways to drive higher

yields and higher profit. With FarmLogs, you'll have the tools you need to make nitrogen

management easier and more efficient.

Datafloq: Datafloq offers information, insights, knowledge and opportunities to drive

innovation through data. You can read high-quality articles, find big data and technology

vendors, post jobs, connect with talent, find or publish events and register for our online

training.


1.5 Proposed System

The primary motive of generating results from the collected data is to serve researchers by giving solutions for various crop diseases. It was not an easy task to develop a new framework that identifies a disease and recommends a solution based on symptom similarity. The framework provides the solution based on historical data, which is collected from various sources.

This model is basically a recommendation system. Recommendation systems use historical data or knowledge of the product; many e-commerce companies use them to drive sales (e.g. Amazon.in). In the proposed model, the recommendation system is applied to the agriculture domain.


Chapter 2

LITERATURE SURVEY

Laney D, Meta Group Inc. Application Delivery Strategies, February 2001: 3D Data

Management: Controlling Data Volume, Velocity and Variety [1] addresses current

business conditions and mediums that are pushing traditional data management practices to their

limits, giving rise to novel, more formalized approaches.

Fig.1 3V’s of Big Data

Xue-Wen Chen, Xiatong Lin, IEEE Access 2014 May: Big Data Deep Learning

Challenges and Perspective [2] discusses that with the sheer size of data available today,

big data brings big opportunities and transformative potential for various sectors; on the

other hand, it also presents unprecedented challenges to harnessing data and information.

As the data keeps getting bigger, deep learning is coming to play a key role in providing

big data predictive analytics solutions. In this paper, we provide a brief overview of deep

learning, and highlight current research efforts and the challenges to big data, as well as

the future trends.

Marx V, Nature 2013, January: Biology- The Big Challenges Of Big Data [3]

introduces that in cloud computing, large data sets are processed on remote Internet

servers, rather than on researchers’ local computers. Large files with the big data problem

in the local systems are passed through the security firewalls and sent or mounted on the

systems of data centers which store the data on the cloud platform as shown in the figure.


Fig.2 Cloud Computing

David B. Lobell, Vol. 143 (2013): The satellite data for crop yield gap analysis [4] reviews the various approaches people have used in the past to identify crop yield and its variations. One approach defined in this paper is the analysis of satellite images in combination with other factors, such as weather and land conditions, to determine crop yield; this approach involves communication with the satellite and carries a cost factor. Another approach is to use data from soil management sensors and weather information to predict field crop yield, and we also work on such data. The advantage of the satellite image analysis approach is that it is much faster than other methods, as the communication happens in real time and provides realistic results.

J. Ben Schafer, Joseph A. Konstan, Kluwer Academic Publishers, Manufactured in the Netherlands, 2001: E-Commerce Recommendation Applications: Data Mining and Knowledge Discovery [5]. The primary focus area is to help the consumer choose the product

which he/she is looking for much quicker by analyzing his search history and what he/she

is interested in. It also helps the e-commerce sites to recommend products to consumers

while they are looking for specific products. This helps improve sales and reduce overall buying time online. This analysis is done based either on predefined rules provided by experts or on data mined from the behavior of the consumer while shopping on the sites.

It provides a feeling that "the business knows the consumer best". The accuracy of this

recommendation improves as there is more interaction of the system with the consumer as

it's a self-learning system.


2.1 Software Description

2.1.1 Java Technology

Java is a general purpose, concurrent, class based, object oriented computer programming

language that is specifically designed to have as few implementation dependencies as

possible. It is intended to let application developers "write once, run anywhere" (WORA),

meaning that code that runs on one platform does not need to be recompiled to run on

another. Java applications are typically compiled to byte code (class file) that can run on

any Java virtual machine (JVM) regardless of computer architecture. Java is, as of 2012,

one of the most popular programming languages in use, particularly for client-server web

applications, with a reported 10 million users. Java was originally developed by James

Gosling at Sun Microsystems (which has since merged into Oracle Corporation) and

released in 1995 as a core component of Sun Microsystems' Java platform. The language

derives much of its syntax from C and C++, but it has fewer low-level facilities than

either of them.

James Gosling, Mike Sheridan, and Patrick Naughton initiated the Java language

project in June 1991. Java was originally designed for interactive television, but it was too

advanced for the digital cable television industry at the time. The language was initially

called Oak after an oak tree that stood outside Gosling's office; it went by the

name Green later, and was later renamed Java, from Java coffee, said to be consumed in

large quantities by the language's creators. Gosling aimed to implement a virtual

machine and a language that had a familiar C/C++ style of notation.

Sun Microsystems released the first public implementation as Java 1.0 in 1995. It

promised "Write Once, Run Anywhere" (WORA), providing no-cost run-times on

popular platforms. Fairly secure and featuring configurable security, it allowed network-

and file-access restrictions. Major web browsers soon incorporated the ability to run Java

applets within web pages, and Java quickly became popular. With the advent of Java

2 (released initially as J2SE 1.2 in December 1998 – 1999), new versions had multiple

configurations built for different types of platforms. For example, J2EE targeted

enterprise applications and the greatly stripped-down version J2ME for mobile

applications (Mobile Java). J2SE designated the Standard Edition. In 2006, for marketing

purposes, Sun renamed new J2 versions as Java EE, Java ME, and Java SE, respectively.


In 1997, Sun Microsystems approached the ISO/IEC JTC1 standards body and

later the Ecma International to formalize Java, but it soon withdrew from the

process. Java remains a de facto standard, controlled through the Java Community

Process. At one time, Sun made most of its Java implementations available without

charge, despite their proprietary software status. Sun generated revenue from Java

through the selling of licenses for specialized products such as the Java Enterprise

System. Sun distinguishes between its Software Development Kit (SDK) and Runtime

Environment (JRE) (a subset of the SDK); the primary distinction involves the JRE's lack

of the compiler, utility programs, and header files.

On November 13, 2006, Sun released much of Java as free and open source

software, (FOSS), under the terms of the GNU General Public License (GPL). On May 8,

2007, Sun finished the process, making all of Java's core code available under free

software/open-source distribution terms, aside from a small portion of code to which Sun

did not hold the copyright.

Sun's vice-president Rich Green said that Sun's ideal role with regards to Java was

as an "evangelist." Following Oracle Corporation's acquisition of Sun Microsystems in

2009–2010, Oracle has described itself as the "steward of Java technology with a

relentless commitment to fostering a community of participation and transparency". This

did not prevent Oracle, however, from filing a lawsuit against Google shortly afterwards for using Java inside the Android SDK. Java software runs

on laptops to data centers, game consoles to scientific supercomputers.

There are 930 million Java Runtime Environment downloads each year and 3

billion mobile phones run Java. On April 2, 2010, James Gosling resigned from Oracle.

There were five primary goals in the creation of the Java language:

1. It should be "simple, object-oriented and familiar"

2. It should be "robust and secure"

3. It should be "architecture-neutral and portable"

4. It should execute with "high performance"

5. It should be "interpreted, threaded, and dynamic"


2.1.2 IntelliJ IDEA

IntelliJ IDEA is a Java integrated development environment (IDE) for developing

computer software. It is developed by JetBrains (formerly known as IntelliJ), and is

available as an Apache 2 Licensed community edition, and in a proprietary commercial

edition. Both can be used for commercial development.

The first version of IntelliJ IDEA was released in January 2001, and was one of

the first available Java IDEs with advanced code navigation and code

refactoring capabilities integrated.

In a 2010 InfoWorld report, IntelliJ received the highest test center score out of

the four top Java programming tools: Eclipse, IntelliJ IDEA, NetBeans and JDeveloper.

In December 2014, Google announced version 1.0 of Android Studio, an open source IDE

for Android apps, based on the open source community edition of IntelliJ IDEA. Other

development environments based on IntelliJ's framework

include AppCode, CLion, PhpStorm, PyCharm, RubyMine, WebStorm, and MPS.

2.1.3 IntelliJ IDEA Platform

IntelliJ supports plugins through which one can add additional functionality to the IDE.

One can download and install plugins either from IntelliJ's plugin repository website or

through IDE's inbuilt plugin search and install feature. Currently IntelliJ IDEA

Community edition has 1495 plugins available, whereas the Ultimate edition has 1626

plugins available.

The Community and Ultimate editions differ in their support for various programming

languages like:

Java

Clojure

Dart

Erlang

Go

Groovy


Haxe

Perl

Scala

XML/XSL

Kotlin

ActionScript/MXML

CoffeeScript

Haskell

HTML/XHTML/CSS

JavaScript

Lua

PHP

Python

Ruby/JRuby

SQL

TypeScript

Community Edition supports the following technologies and frameworks:

Android

Ant

Gradle

JavaFX

JUnit

Maven

SBT

TestNG

Ultimate Edition supports the following technologies and frameworks:

Django


EJB

FreeMarker

Google App Engine

Google Web Toolkit

Grails

Hibernate/JPA

Java ME MIDP/CLDC

JBoss Seam

JSF

JSP

Jelastic

Node.js

OSGi

Play

Ruby on Rails

Spring

Struts 2

Struts

Tapestry

Velocity

Web services

2.1.4 IntelliJ IDEA IDE

IntelliJ IDE provides certain features like code completion by analyzing the context, code

navigation where one can jump to a class or declaration in the code directly, code

refactoring and providing options to fix inconsistencies via suggestions.

The IDE provides for integration with build/packaging tools like grunt, bower, gradle,

and SBT. It supports version control systems like GIT, Mercurial, Perforce, and SVN.

Databases like Microsoft SQL Server, ORACLE, PostgreSQL, and MySQL can be

accessed directly from the IDE.


Chapter 3

SYSTEM ANALYSIS

System analysis is the phase or step of the systems approach to problem solving using

computers. It is a process of gathering and interpreting facts, diagnosis of problems and

using the information to recommend improvements to the existing system.

The proposed system for the project entitled “Big Data Analytics Framework to

identify Agriculture/Aquaculture Diseases and recommendation of a solution.”

includes:

To make a web application that handles the big data problems of agriculture and

aquaculture diseases.

To build a framework that is user friendly.

The data of agriculture and aquaculture is maintained in the form of a Hive database. The Hive Query Language (HiveQL) is used to query Hive.

HADOOP technology is used to handle Big Data.

The farmer/researcher inputs the hostname and the symptom of the infected crop or fish.

The user receives the output on the same web page with the details of the infecting disease, the locations where it can affect the host, and the various steps to prevent further damage.

3.1 Functional Requirements

The definition for a functional requirement specifies what the system should do. A

requirement specifies a function that a system or component must be able to perform.

Functional requirements specify specific behavior or functions. The functional

requirements are those that refer to the functionality of the system. The functional

requirements of the project are given below:

The real-time monitor should display the details of all the data that is flowing in and out

of the system.

The application takes the data from the client and interacts with the server and

displays the output on the screen.

The web application is deployed on the cloud and is accessible through the World Wide Web.


3.2 Non-Functional Requirements

The definition for a non-functional requirement specifies how the system should behave:

A non-functional requirement is a statement of how a system must behave; it is a

constraint upon the systems behavior. Non-functional requirements specify all the

remaining requirements not covered by the functional requirements. They specify criteria

that judge the operation of a system, rather than specific behaviors. Non-Functional

Requirements in Software Engineering presents a systematic and pragmatic approach to

"building quality into" software systems. Systems must exhibit software quality attributes,

such as accuracy, performance, security and modifiability.

The non-functional requirements are:

The application must be easy to operate without needing much knowledge of the algorithm or the code.

It should provide an easy interface to add some more features for other

applications.

3.3 System Requirements

3.3.1 Hardware Requirement Specification

SYSTEMS: 3 systems for multi-node clustering in Hadoop.

PROCESSOR: Intel® Core™ i3-2330M CPU @2.20 GHz.

HARDDISK: 40 GB or more.

RAM: 256 MB or more.

3.3.2 Software Requirement Specification

OPERATING SYSTEM: Windows XP or later.

LANGUAGE USED: Java (JDK 1.8 or later).

TOOLS USED: Apache Hadoop 2.8, Apache Tomcat 8.

IDE USED: IntelliJ IDEA (2015 or later)


Chapter 4

SYSTEM DESIGN

The primary motive of generating results from the collected data is to serve researchers by giving solutions for various crop diseases. It was not an easy task to develop a new framework that identifies a disease and recommends a solution based on symptom similarity. The framework provides the solution based on historical data, which is collected from various sources.

This model is basically a recommendation system. Recommendation systems use historical data or knowledge of the product; many e-commerce companies use them to drive sales (e.g. Amazon.in). In the proposed model, the recommendation system is applied to the agriculture domain.

First, data is collected from various sources, e.g. lab reports and agriculture websites. The collected data is known as raw data because it contains irregularities and unwanted information, so it is unformatted and needs formatting and confirmation. This data is stored on HDFS. The NameNode of HDFS keeps track of how files are broken into blocks and which nodes store those blocks. Clients communicate directly with the DataNodes to process the local files corresponding to the blocks.

Data sources are:

Laboratory test reports:

These are a crucial source of data for researchers. The tests conducted include soil, water, manure and plant analysis.

Agriculture/Aquaculture information websites:

These websites act like mentors for farmers. They give information related to agricultural economic entities, commonly used pesticides, etc. Agriculture information websites tell farmers which crop to plant, where and when, and suggest solutions to various problems related to crops. Through these sites farmers gain knowledge about new techniques and tools.


Agriculture/Aquaculture department reports:

Using these reports, decision making is easy for the crops of a particular area. These reports are important because they provide information regarding a particular field in a geographical area.

The data collected from the above sources is stored on the Hadoop Distributed File System in the form of text files. The collected data is unstructured and contains irrelevant data.

First, unimportant data is removed and relevant data is extracted from the collected data. Features are then selected and extracted from the relevant data and saved into a text file in the Hive data warehouse. Hive is an open-source data warehousing tool used to query the data in the distributed environment. To extract data from the Hadoop system, Hive provides a SQL-like interface termed HiveQL (Hive Query Language).

A query is submitted to the distributed environment in one of three ways:

By using command line interface

Application programming interface

Web user interface

The Thrift server is used as an interface when the client and server use different languages. HiveQL extracts data from the Hive data warehouse and saves the query results into a text file stored on HDFS. The text file is then submitted to the distributed environment to identify the crop disease name based on the similarity of the crop disease symptoms. In this process, after splitting, the text file is submitted to the mapper to calculate pair-based symptom similarity; pair-based similarity ignores spelling mistakes and word ordering, which increases the efficiency of the recommendation system (a small sketch of such a measure is given below).
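The report does not list the exact similarity routine, so the following is only a minimal sketch, assuming that "pair-based" similarity means comparing the character bigrams (letter pairs) of two symptom strings with the Sorensen-Dice coefficient; the class name and the sample strings are illustrative.

import java.util.HashMap;
import java.util.Map;

public class PairSimilarity {

    // Collect the two-character pairs (bigrams) of a string, ignoring case and whitespace.
    private static Map<String, Integer> bigrams(String s) {
        String t = s.toLowerCase().replaceAll("\\s+", "");
        Map<String, Integer> pairs = new HashMap<>();
        for (int i = 0; i < t.length() - 1; i++) {
            pairs.merge(t.substring(i, i + 2), 1, Integer::sum);
        }
        return pairs;
    }

    // Dice similarity: 2 * shared pairs / total pairs, a value between 0 and 1.
    public static double similarity(String a, String b) {
        Map<String, Integer> pa = bigrams(a);
        Map<String, Integer> pb = bigrams(b);
        int shared = 0;
        int total = 0;
        for (Map.Entry<String, Integer> e : pa.entrySet()) {
            shared += Math.min(e.getValue(), pb.getOrDefault(e.getKey(), 0));
            total += e.getValue();
        }
        for (int count : pb.values()) {
            total += count;
        }
        return total == 0 ? 0.0 : (2.0 * shared) / total;
    }

    public static void main(String[] args) {
        // A misspelled, reordered query still scores well against the stored symptom.
        System.out.println(similarity("yellow leaves", "leaves yellowish"));
    }
}

Because the score is built from letter pairs rather than whole words, small spelling mistakes and changes in word order only lower the score slightly instead of breaking the match entirely.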


4.1 System Architecture

Fig.3 System Architecture

As explained in the proposed system, the data is collected from various data sources and

this data is called as raw data. Raw data is then cleansed by the cleaning process which

removes the unwanted entries from the data. The required data is written into a .csv file.

This file is then stored in the HDFS (Hadoop Distributed File System) of Apache Hadoop, for example as sketched below.
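As one illustration of this step, the following sketch copies the cleansed .csv file into HDFS through the Hadoop FileSystem API; the NameNode address and the file paths are placeholders, not the project's actual values.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; the real address depends on the cluster setup.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // Copy the cleansed, normalized .csv from the local disk into HDFS.
        fs.copyFromLocalFile(new Path("/tmp/cleaned_host_symptom.csv"),
                             new Path("/user/hadoop/askagri/cleaned_host_symptom.csv"));
        fs.close();
    }
}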


Fig.4 Hadoop Architecture

Apache Hadoop is an open-source software framework for storage and large-scale

processing of data-sets on clusters of commodity hardware. There are mainly five

building blocks inside this runtime environment.

The cluster is the set of host machines (nodes). Nodes may be partitioned

in racks. This is the hardware part of the infrastructure.

The YARN Infrastructure (Yet Another Resource Negotiator) is the framework

responsible for providing the computational resources (e.g., CPUs, memory, etc.)

needed for application executions. Two important elements are:

The Resource Manager (one per cluster) is the master. It knows where

the slaves are located (Rack Awareness) and how many resources they

have. It runs several services; the most important is the Resource

Scheduler which decides how to assign the resources.

The Node Manager (many per cluster) is the slave of the infrastructure.

When it starts, it announces itself to the Resource Manager.

Periodically, it sends a heartbeat to the Resource Manager. Each Node

Manager offers some resources to the cluster. Its resource capacity is the amount of memory and the number of cores. At run-time, the


Resource Scheduler will decide how to use this capacity: a Container is

a fraction of the NM capacity and it is used by the client for running a

program.

The HDFS Federation is the framework responsible for providing permanent,

reliable and distributed storage. This is typically used for storing inputs and output

(but not intermediate ones).

Other storage solutions can be used as alternatives; for instance, Amazon uses the Simple Storage Service (S3).

The MapReduce Framework is the software layer implementing the MapReduce

paradigm.

Hadoop File System was developed using distributed file system design. It is run

on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant

and designed using low-cost hardware.

HDFS holds very large amount of data and provides easier access. To store such

huge data, the files are stored across multiple machines. These files are stored in

redundant fashion to rescue the system from possible data losses in case of failure.

HDFS also makes applications available for parallel processing.

Features of HDFS

It is suitable for the distributed storage and processing.

Hadoop provides a command interface to interact with HDFS.

The built-in servers of namenode and datanode help users to easily check the

status of cluster.

Streaming access to file system data.

HDFS provides file permissions and authentication.

HDFS Architecture

HDFS follows the master-slave architecture and it has the following elements.


Fig.5 HDFS Architecture

Namenode

The namenode is the commodity hardware that contains the GNU/Linux operating

system and the namenode software. It is software that can be run on commodity

hardware. The system having the namenode acts as the master server and it does the

following tasks:

Manages the file system namespace.

Regulates client’s access to files.

It also executes file system operations such as renaming, closing, and opening

files and directories.

Datanode

The datanode is a commodity hardware having the GNU/Linux operating system and

datanode software. For every node (Commodity hardware/System) in a cluster, there will

be a datanode. These nodes manage the data storage of their system.

Datanodes perform read-write operations on the file systems, as per client request.

They also perform operations such as block creation, deletion, and replication

according to the instructions of the namenode.


Block

Generally the user data is stored in the files of HDFS. The file in a file system will be

divided into one or more segments and/or stored in individual data nodes. These file

segments are called as blocks. In other words, the minimum amount of data that HDFS

can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration, for example as sketched below.
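As a hedged illustration of changing the block size, the sketch below sets it to 128 MB for files written through one Configuration object; the property name dfs.blocksize applies to Hadoop 2.x, and the file path is only an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set the block size to 128 MB for files created through this configuration.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/askagri/large_input.txt"));
        out.writeUTF("example record");
        out.close();
        fs.close();
    }
}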

After receiving the data from the UI, the data is sent to Hive to be checked against the data already stored in the Hive database. This connection is made through the Hive driver classes and the Thrift driver, written in Java with the help of JDBC/ODBC drivers. The data from Hive is further managed by the shuffling, mapping and reducing algorithms inside Hadoop. The final result is sent back from Hadoop to the UI.
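A hedged sketch of this connection is given below: it queries Hive over JDBC from Java using the HiveServer2 driver. The connection URL, the table name hosttable and its column names are assumptions made for illustration, not the project's actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HiveLookup {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, table and columns are assumptions.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT disease, solution FROM hosttable WHERE host = ? AND symptoms LIKE ?")) {
            ps.setString(1, "paddy");
            ps.setString(2, "%yellow leaves%");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("disease") + " -> " + rs.getString("solution"));
                }
            }
        }
    }
}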

4.2 Use-Case Diagram

Fig.6 Use-Case Diagram

In the use-case diagram of this project there is one actor and the backend, which is the server. The actor interacts with the backend through the framework that has been built, that is, the web


application. The web application's name is ASK-Agri. The symptom and the host are entered on the web application. The data entered is filtered and sent to the backend of the system. At the backend are the Hadoop system and the Hive tools. The data on crop diseases already collected is in the Hive table. The data to be searched for in the Hive table is queried using HiveQL. The result is sent back to the backend, which in turn returns the data to the web application, where it is shown to the user through the UI.

4.3 Dataflow Diagram

Fig.7 Data Flow diagram

The user enters the data by logging in to the web application. The host name and the symptom are entered. This data is read and split into keywords by the splitter algorithms in Hadoop. Each keyword is then passed to the matching algorithm, which matches it against the data. The matched keyword is searched for in the existing database, and the corresponding row of the database is selected and returned to the UI, where it is shown as output.


Host and symptom-DFD

Fig.8 Host and Symptom DFD

The data from the user is of two forms: the hostname and the symptom. This is given or

typed on the UI of the system. This data is sent through a Java servlet to the Hadoop system, along the lines of the sketch below.
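The following is only a sketch of such a servlet, written against the Servlet 3.x API bundled with Apache Tomcat 8; the servlet path and the form field names host and symptom are hypothetical.

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/search")
public class SymptomServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String host = req.getParameter("host");        // e.g. a crop or fish name
        String symptom = req.getParameter("symptom");  // free-text symptom description
        // In the real framework the keyword would be extracted here and handed to the
        // Hadoop/Hive backend; a plain-text reply stands in for that step in this sketch.
        resp.setContentType("text/plain");
        resp.getWriter().println("Searching diseases for host=" + host
                + ", symptom=" + symptom);
    }
}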

Splitting Data-DFD

Fig.9 Splitting Data DFD


The input given is split into chunks of data in the Hadoop system. Each chunk of data is

stored in datanodes. The details of which datanode holds what type of data are stored on

the namenode.

Searching keyword-DFD

Fig.10 Searching keyword-DFD

After the keyword is matched, it is sent through the Hive driver class written in Java to the Hive database, where the Hive tables are present.


4.4 Sequence Diagrams

Fig.11 Main sequence diagram

The main sequence diagram of the project consists of four entities: User, ASK-Agri, Hadoop and Database.

The user entity starts its activity by logging into the ASK-Agri framework. After the login is successful, the hostname and symptom are entered.

ASK-Agri then generates the keyword from the entered data, and that keyword is sent to the Hadoop system. The Hadoop system checks for the details of the disease in the already collected data stored in the database. If the search is successful, the retrieved data is sent back to Hadoop and then to the framework. The framework then displays the result back to the user.

During the login process the sequence flow is between the user, the ASK-Agri framework and the login database.

The user starts an activity by requesting the login page from the web application. If the request is granted, the page is sent and the user enters credentials such as a username and password. The credentials of the registered user are searched for in the login database.


If the user is found, the login is successful. If the user data is not available, a login error occurs.

Fig.12 Login failure sequence diagram


Chapter 5

IMPLEMENTATION

The objective of the implementation step is to create the code, test it for the required output and debug the errors that occur during execution of the program. System implementation involves testing the tool created on the setup and verifying that the data is generated in the central manager database.

Why MapReduce?

Traditional Enterprise Systems normally have a centralized server to store and process

data. The following illustration depicts a schematic view of a traditional enterprise

system. The traditional model is certainly not suitable for processing huge volumes of scalable data, which cannot be accommodated by standard database servers. Moreover, the

centralized system creates too much of a bottleneck while processing multiple files

simultaneously.

Fig.13 MapReduce

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce

divides a task into small parts and assigns them to many computers. Later, the results are

collected at one place and integrated to form the result dataset.


Fig.14 MapReduce Architecture

How MapReduce Works?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

The Map task takes a set of data and converts it into another set of data, where

individual elements are broken down into tuples (key-value pairs).

The Reduce task takes the output from the Map as an input and combines those

data tuples (key-value pairs) into a smaller set of tuples.

The reduce task is always performed after the map job.

5.1 MapReduce Algorithm

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

The map task is done by means of Mapper Class

The reduce task is done by means of Reducer Class.

Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class

is used as input by Reducer class, which in turn searches matching pairs and reduces

them.
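As an illustration of these two classes, the sketch below counts symptom keywords in the word-count style used throughout this chapter; the class names and the input/output paths passed on the command line are illustrative, not the project's actual code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SymptomCount {

    // Map: tokenize each record into keywords and emit (keyword, 1).
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts for each keyword; frequent keywords indicate the
    // strongest symptom matches in the historical data.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "symptom keyword count");
        job.setJarByClass(SymptomCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional combiner (see Section 5.3)
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}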


Fig.15 Working of MapReduce Classes

MapReduce implements various mathematical algorithms to divide a task into

small parts and assign them to multiple systems. In technical terms, MapReduce

algorithm helps in sending the Map & Reduce tasks to appropriate servers in a cluster.

These mathematical algorithms may include the following −

Sorting

Searching

Indexing

TF-IDF

Sorting Algorithm

Sorting is one of the basic MapReduce algorithms to process and analyze data.

MapReduce implements sorting algorithm to automatically sort the output key-value

pairs from the mapper by their keys.

Sorting methods are implemented in the mapper class itself.

In the Shuffle and Sort phase, after tokenizing the values in the mapper class,

the Context class (provided by the framework) collects the matching valued keys as a

collection.

To collect similar key-value pairs (intermediate keys), the Mapper class takes the

help of RawComparator class to sort the key-value pairs.

The set of intermediate key-value pairs for a given Reducer is automatically

sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are

presented to the Reducer.


Searching Algorithm

Searching plays an important role in MapReduce algorithm. It helps in the combiner

phase (optional) and in the Reducer phase.

Generally, the MapReduce paradigm is based on sending the map-reduce program to the computers where the actual data resides.

During a MapReduce job, Hadoop sends Map and Reduce tasks to appropriate

servers in the cluster.

The framework manages all the details of data-passing like issuing tasks,

verifying task completion, and copying data around the cluster between the

nodes.

Most of the computing takes place on nodes with data on local disks, which reduces the network traffic.

After completing a given task, the cluster collects and reduces the data to form an

appropriate result, and sends it back to the Hadoop server.

Fig.16 Working of MapReduce

Inputs and Outputs (Java Perspective)

The MapReduce framework operates on key-value pairs; that is, the framework views the input to the job as a set of key-value pairs and produces a set of key-value pairs as the output of the job, conceivably of different types.


The key and value classes have to be serializable by the framework and hence are required to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

Both the input and output format of a MapReduce job are in the form of key-value pairs

(Input) <k1, v1> -> map -> <k2, v2>-> reduce -> <k3, v3> (Output).

Phase      Input               Output
Map        <k1, v1>            list(<k2, v2>)
Reduce     <k2, list(v2)>      list(<k3, v3>)

Table 1. Input and output key-value pairs
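As a hedged illustration of the type flow in Table 1, the skeleton below shows how the generic parameters of the Mapper and Reducer classes correspond to <k1, v1>, <k2, v2> and <k3, v3>; the concrete Writable types are assumed here for a word-count style job and are not taken from the project.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Skeleton only: the generic parameters are <k1, v1, k2, v2> for the Mapper
// and <k2, v2, k3, v3> for the Reducer, matching Table 1.
public class TypeFlowSketch {

    // <k1, v1> = byte offset of a line, the line text; emits list(<k2, v2>)
    public static class SketchMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
    }

    // <k2, list(v2)> in; emits list(<k3, v3>)
    public static class SketchReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
    }
}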

5.2 Partitioner

A partitioner works like a condition in processing an input dataset. The partition phase

takes place after the Map phase and before the Reduce phase.

The number of partitioners is equal to the number of reducers. That means a

partitioner will divide the data according to the number of reducers. Therefore, the data

passed from a single partitioner is processed by a single Reducer.

A partitioner partitions the key-value pairs of the intermediate Map outputs. It partitions the data using a user-defined condition, which works like a hash function. The total number of partitions is the same as the number of Reducer tasks for the job. Let us take an example to understand how the partitioner works.

Map Tasks

The map task accepts key-value pairs as input; here, the input is the text data held in a text file.


Partitioner Task

The partitioner task accepts the key-value pairs from the map task as its input. Partition

implies dividing the data into segments.

Reduce Tasks

The number of partitioner tasks is equal to the number of reducer tasks.
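A minimal sketch of a user-defined partitioner is shown below; the class name and key type are assumptions, and the hash-based condition is essentially what Hadoop's default HashPartitioner already does. It would be registered with job.setPartitionerClass() and only takes effect when the job is configured with more than one reduce task.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: routes each intermediate key to one of the reducers.
// The returned partition number must lie between 0 and numReduceTasks - 1.
public class HostnamePartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;                                            // map-only edge case
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}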

5.3 Combiner

A Combiner, also known as a semi-reducer, is an optional class that operates by

accepting the inputs from the Map class and thereafter passing the output key-value pairs

to the Reducer class.

The main function of a Combiner is to summarize the map output records with

the same key. The output (key-value collection) of the combiner will be sent over the

network to the actual Reducer task as input.

The Combiner class is used in between the Map class and the Reduce class to reduce the volume of data transferred between Map and Reduce. Usually, the output of the map task is large, so the volume of data transferred to the reduce task is high.

The following MapReduce task diagram shows the COMBINER PHASE.

Fig.17 Combiner


How Combiner Works?

Here is a brief summary on how MapReduce Combiner works −

Step 1: A combiner does not have a predefined interface and it must implement the

Reducer interface’s reduce() method.

Step 2: A combiner operates on each map output key. It must have the same output key-

value types as the Reducer class.

Step 3: A combiner can produce summary information from a large dataset because it

replaces the original Map output.

Although the Combiner is optional, it helps in segregating the data into multiple groups for the Reduce phase, which makes the data easier to process.

The important phases of the MapReduce program with Combiner are discussed below.

Record Reader

This is the first phase of MapReduce where the Record Reader reads every line from the

input text file as text and yields output as key-value pairs.

Input − Line by line text from the input file.

Output − Forms the key-value pairs.

Map Phase

The Map phase takes input from the Record Reader, processes it, and produces the

output as another set of key-value pairs.

Input − The key-value pairs produced by the Record Reader.

The Map phase reads each key-value pair, divides each word from the value using StringTokenizer, treats each word as a key, and uses the count of that word as the value.
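A representative sketch of such a Mapper class and its map() function is given below; it follows the standard word-count pattern and is shown for illustration rather than as the project's exact class.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Splits every input line into words with StringTokenizer and emits (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);          // intermediate <k2, v2> pair
        }
    }
}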

Combiner Phase

The Combiner phase takes each key-value pair from the Map phase, processes it, and

produces the output as key-value collection pairs.


Input − The key-value pairs produced by the Map phase.

The Combiner phase reads each key-value pair, combines the common words as keys and their values as a collection. Usually, the code and operation of a Combiner are similar to those of a Reducer.

Reducer Phase

The Reducer phase takes each key-value collection pair from the Combiner phase,

processes it, and passes the output as key-value pairs. Note that the Combiner

functionality here is the same as that of the Reducer.

Input − The key-value collection pairs produced by the Combiner phase.

The Reducer phase reads each key-value collection pair, aggregates the values for each key, and emits the final key-value pairs.
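A matching Reducer sketch, again in the word-count style and purely illustrative, is given below; because it only sums counts, the same class could also be registered as the Combiner.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the values collected for each key and writes the final pair.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();                 // aggregate all v2 values for this key
        }
        result.set(sum);
        context.write(key, result);             // final <k3, v3> pair
    }
}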

Record Writer

This is the last phase of MapReduce where the Record Writer writes every key-value

pair from the Reducer phase and sends the output as text.

Input − Each key-value pair from the Reducer phase along with the Output format.

Output − It gives you the key-value pairs in text format.
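To tie the phases together, a hedged sketch of a driver class is shown below; it reuses the illustrative Mapper and Reducer classes above, registers the Reducer as the optional Combiner, and takes placeholder HDFS input and output paths as command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver wiring together the input format (Record Reader), Mapper,
// Combiner, Reducer and output format (Record Writer).
public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count sketch");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional semi-reducer
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}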


Fig.18 Example of MapReduce


Chapter 6

TESTING

6.1 Introduction to Testing

Software testing is an investigation conducted to provide stakeholders with information

about the quality of the product or service under test. Software testing can also provide an

objective, independent view of the software to allow the business to appreciate and

understand the risks of software implementation. Test techniques include, but are not

limited to, the process of executing a program or application with the intent of

finding software bugs (errors or other defects).

Software testing can be stated as the process of validating and verifying that a software program/application/product meets the requirements that guided its design and development, works as expected, and can be implemented with the same characteristics.

Software testing, depending on the testing method employed, can be implemented

at any time in the development process. However, most of the test effort traditionally occurs after the requirements have been defined and the coding process has been completed, even though it has been shown that fixing a bug is less expensive when it is found earlier in the development process. In Agile approaches, conversely, most of the test effort is on-going. As such, the methodology of the test is governed by the software development methodology adopted.

Different software development models will focus the test effort at different points

in the development process. Newer development models, such as Agile, often employ test

driven development and place an increased portion of the testing in the hands of the

developer, before it reaches a formal team of testers. In a more traditional model, most of

the test execution occurs after the requirements have been defined and the coding process

has been completed.

Based on various parameters there are different methods of testing. A few

commonly used ones are as follows:


6.1.1 Functional and Non-functional testing

Functional testing refers to activities that verify a specific action or function of the code.

These are usually found in the code requirements documentation, although some

development methodologies work from use cases or user stories. Functional tests tend to

answer the question of "can the user do this" or "does this particular feature work."

Non-functional testing refers to aspects of the software that may not be related to a

specific function or user action, such as scalability or other performance, behaviour under

certain constraints, or security. Testing will determine the breaking point, the point at which extremes of scalability or performance lead to unstable execution. Non-functional

requirements tend to be those that reflect the quality of the product, particularly in the

context of the suitability perspective of its users.

6.1.2 Compatibility testing

A common cause of software failure (real or perceived) is a lack of its compatibility with

other application software, operating systems (or operating system versions, old or new),

or target environments that differ greatly from the original (such as

a terminal or GUI application intended to be run on the desktop now being required to

become a web application, which must render in a web browser). For example, in the case

of a lack of backward compatibility, this can occur because the programmers develop and

test software only on the latest version of the target environment, which not all users may

be running. This results in an unintended consequence that the latest work may not

function on earlier versions of the target environment, or on older hardware that earlier

versions of the target environment was capable of using. Sometimes such issues can be

fixed by proactively abstracting operating system functionality into a separate

program module or library.

6.1.3 Verification and Validation

Validation testing is a concern which overlaps with integration testing. Ensuring that the

application fulfils its specification is a major criterion for the construction of an

integration test. Validation testing also overlaps to a large extent with System Testing,

where the application is tested with respect to its typical working environment.

Consequently for many processes no clear division between validation and system testing


can be made. Specific tests which can be performed in either or both stages include the

following.

Regression Testing: Where this version of the software is tested with the

automated test harness used with previous versions to ensure that the required

features of the previous version are still working in the new version.

Recovery Testing: Where the software is deliberately interrupted in a number of ways, for example by switching the power off, to ensure that the appropriate techniques for restoring any lost data will function.

Security Testing: Where unauthorized attempts to operate the software, or parts of it, are made; it might also include attempts to obtain access to the data, or to harm the software installation or even the system software. As with all types of security, a sufficiently determined attacker will be able to obtain unauthorized access, and the best that can be achieved is to make this process as difficult as possible.

Stress Testing: Where abnormal demands are made upon the software by increasing the rate at which it is asked to accept data, or the rate at which it is asked to produce information. More complex tests may attempt to create very large data sets or cause the software to make excessive demands on the operating system.

Performance Testing: Where the performance requirements, if any, are checked. These may include the size of the software when installed, the amount of main memory and/or secondary storage it requires, the demands made on the operating system when running within normal limits, and the response time.

Usability Testing: The process of usability measurement was introduced in the

previous chapter. Even if usability prototypes have been tested whilst the

application was constructed, a validation test of the finished product will always

be required.

Alpha and Beta Testing: This is where the software is released to the actual end users. An initial release, the alpha release, might be made to selected users who would be expected to report bugs and other detailed observations back to the production team. Once the application changes necessitated by the alpha phase have been made, a beta release can be made to a larger, more representative set of users, before the final release is made to all users.


The final process should be a software audit, where the complete software project is checked to ensure that it meets production management requirements. This ensures that all required documentation has been produced, is in the correct format, and is of acceptable quality. The purpose of this review is firstly to assure the quality of the production process and, by implication, of the product, before the next phase commences. A formal hand-over from the development team at the end of the audit will mark the transition between the two phases.

Top-down testing can proceed in a depth-first or a breadth-first manner. For depth-first integration, each module is tested in increasing detail, replacing more and more levels of detail with actual code rather than stubs. Alternatively, breadth-first integration would proceed by refining all the modules at the same level of control throughout the application. In practice, a combination of the two techniques would be used. At the initial stages all the modules might be only partly functional, possibly being implemented only to deal with non-erroneous data. These would be tested in a breadth-first manner, but over a period of time each would be replaced with successive refinements closer to the full functionality. This allows depth-first testing of a module to be performed simultaneously with breadth-first testing of all the modules.

The other major category of integration testing is bottom-up integration testing, where an individual module is tested from a test harness. Once a set of individual modules has been tested, they are combined into a collection of modules, known as builds, which are then tested by a second test harness. This process can continue until the build consists of the entire application. In practice, a combination of top-down and bottom-up testing would be used: in a large software project being developed by a number of sub-teams, or a smaller project where different modules are built by individuals, the sub-teams or individuals would conduct bottom-up testing of the modules which they are constructing before releasing them to an integration team, which would assemble them together for top-down testing.

Validation ensures that the product actually meets the user's needs, and that the

specifications were correct in the first place, while verification ensures that the

product has been built according to the requirements and design specifications. Validation

ensures that ‘you built the right thing’. Verification ensures that ‘you built it right’.

Validation confirms that the product, as provided, will fulfil its intended use.


6.2 Testing Methodologies

Software testing methods are traditionally divided into white and black-box testing. These

two approaches are used to describe the point of view that a test engineer takes when

designing test cases.

White-box testing is when the tester has access to the internal data structures and

algorithms including the code that implements these.

Black-box testing treats the software as a "black box"—without any knowledge of

internal implementation.

Grey-box testing involves having knowledge of internal data structures and algorithms

for purposes of designing tests, while executing those tests at the user, or black-box level.

6.3 Testing Levels

Tests are frequently grouped by where they are added in the software development

process, or by the level of specificity of the test. The main levels during the development

process are unit, integration, and system testing, which are distinguished by the test target, without implying a specific process model.

6.3.1 Unit testing

Unit testing, also known as component testing, refers to tests that verify the functionality

of a specific section of code, usually at the function level. In an object-oriented

environment, this is usually at the class level, and the minimal unit tests include the

constructors and destructors.

These types of tests are usually written by developers as they work on code

(white-box style), to ensure that the specific function is working as expected. One

function might have multiple tests, to catch corner cases or other branches in the code.

Unit testing alone cannot verify the functionality of a piece of software, but rather is used

to assure that the building blocks the software uses work independently of each other.


6.3.2 Integration testing

Integration testing is any type of software testing that seeks to verify the interfaces

between components against a software design. Software components may be integrated

in an iterative way or all together ("big bang"). Normally the former is considered a better

practice since it allows interface issues to be localised more quickly and fixed. Integration

testing works to expose defects in the interfaces and interaction between integrated

components (modules). Progressively larger groups of tested software components

corresponding to elements of the architectural design are integrated and tested until the

software works as a system.

6.3.3 System testing

System testing tests a completely integrated system to verify that it meets its

requirements.

6.4 Unit Testing of Main Modules

Here different modules are tested independently and their functionality is checked. The following

tables show the details about the unit test cases and the results obtained.

6.4.1 Unit testing for user

Test Case ID: Unit Test Case 1
Description: Unit testing when the Hostname is entered correctly and the Symptom is not entered.
Input: A Hostname that is already present in the Hive database.
Expected Output: Error prompting the user to fill the empty field.
Actual Output: Got the expected output.
Remarks: Test passed.

Table 2. Unit Test Case 1


Test Case ID: Unit Test Case 2
Description: Unit testing when neither the Hostname nor the Symptom is entered.
Input: No input given.
Expected Output: Error prompting the user to fill the empty fields.
Actual Output: Got the expected output.
Remarks: Test passed.

Table 3. Unit Test Case 2

Test Case ID: Unit Test Case 3
Description: Unit testing when both the Hostname and the Symptom are entered but are not available in the database.
Input: Hostname and Symptom of the infected crop/fish.
Expected Output: No output; the same page is displayed.
Actual Output: Got the expected output.
Remarks: Test passed.

Table 4. Unit Test Case 3

Test Case ID: Unit Test Case 4
Description: Unit testing when a valid Hostname and Symptom are given.
Input: Hostname and Symptom of the infected crop/fish.
Expected Output: Disease and details of the virus infecting the crop/fish.
Actual Output: Got the expected output.
Remarks: Test passed.

Table 5. Unit Test Case 4
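The test cases above were exercised manually through the web form. As a hypothetical sketch of how they could be automated, the JUnit 4 class below defines a tiny stand-in validation helper (not the project's real code) and checks the behaviour described in Unit Test Cases 1, 2 and 4.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical, self-contained sketch: the validate() helper only mimics the
// form's empty-field check and is not the application's actual logic.
public class SearchFormValidatorTest {

    static String validate(String hostname, String symptom) {
        if (hostname == null || hostname.trim().isEmpty()
                || symptom == null || symptom.trim().isEmpty()) {
            return "Error: enter the empty fields";     // Unit Test Cases 1 and 2
        }
        return "OK";                                    // Unit Test Case 4 proceeds to the lookup
    }

    @Test
    public void hostnameWithoutSymptomIsRejected() {    // Unit Test Case 1
        assertEquals("Error: enter the empty fields", validate("paddy", ""));
    }

    @Test
    public void emptyFormIsRejected() {                 // Unit Test Case 2
        assertEquals("Error: enter the empty fields", validate("", ""));
    }

    @Test
    public void validInputIsAccepted() {                // Unit Test Case 4
        assertEquals("OK", validate("paddy", "yellowing leaves"));
    }
}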


Chapter 7

RESULTS

7.1 Snapshot of browsing HDFS:

Fig.19 Overview of HDFS setup

An overview of HDFS showing when the session was started, its version, and some ID information.


Fig.20 Summary of HDFS setup

Summary of the setup which shows the status of various nodes.


Fig.21 NameNode status of HDFS

NameNode Status information.


Fig.22 Datanode information of HDFS

DataNode informations like how many datanades are created and how many are under

process are shon in this snapshot.


7.2 Snapshots of the Hadoop and hive interfaces:

Fig.23 Hadoop and Hive initializations

Hadoop and Hive tools are initialised.

Fig.24 Processes running on Hadoop user

Hadoop processes which are running at the current time are shown.


Fig.25 Hive tables

The Hive tables present in the database are listed using the show tables; command.

Fig.26 Hosttable rows

The number of rows present in the Hosttable is shown in this screenshot.


Fig.27 Query Execution on Hive table

The query that retrieves the details of a disease for a given hostname and symptom is shown in this snapshot.
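A hedged sketch of how such a lookup could be issued from Java through the HiveServer2 JDBC driver is given below; the connection URL, credentials, table name and column names are assumptions made for illustration and do not reflect the project's exact schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Illustrative Hive lookup: selects disease details for a hostname and symptom.
public class HiveDiseaseQuery {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hadoop", "");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT disease, solution FROM hosttable "
                     + "WHERE hostname = ? AND symptom = ?")) {

            stmt.setString(1, args[0]);   // hostname entered by the user
            stmt.setString(2, args[1]);   // symptom entered by the user

            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("disease")
                            + " : " + rs.getString("solution"));
                }
            }
        }
    }
}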


7.3 Snapshots of the project deployed on IntelliJ IDEA:

Fig.28 Project on IntelliJ

The project, named AgriHelper, is a web application. It is divided into two modules: Core and Web.

Fig.29 Core module of project

The snapshot shows the various Java programs written under the Core module.


Fig.30 Web module of project

The Web module contains the HTML, CSS, and JavaScript files related to the front-end development of the web application.


7.4 Snapshots of the web application/framework:

Fig.31 Front web page of the application

The basic home page of the web application is shown in the snapshot.

Fig.32 Searching for required data

In this snapshot, the data is entered and the search button is clicked, while a notification indicating that the application is searching and waiting for results is shown.


Fig.33 Result

The result for the entered data is shown.


Chapter 8

Conclusion and Future Work

Using Hadoop and Hive tools, a big data analytics framework has been developed that handles agriculture crop and aquaculture fish disease problems. The developed web application is useful for farmers and researchers, recommending a solution based on symptoms with high similarity. The developed big data analytics framework is location specific.

The recommended solutions are collected from various government institutions such as GKVK and IIHR. This application is useful for researchers working on crop virus diseases and general fish diseases.

Further, this project can be enhanced into a product by converting the web application into an Android app.


REFERENCES

[1] Laney D. 3D Data Management: Controlling Data Volume, Velocity and Variety. META Group Inc., Application Delivery Strategies. 2001 February; ADS(6), 1-4.

[2] Xue-Wen Chen, Xiaotong Lin. Big Data Deep Learning: Challenges and Perspectives. IEEE Access. 2014 May; 2(1), 514-522.

[3] Marx V. Biology: The Big Challenges of Big Data. Nature. 2013; 498(7453), 255-260.

[4] David B. Lobell. The Use of Satellite Data for Crop Yield Gap Analysis. Vol. 143 (2013).

[5] J. Ben Schafer, Joseph A. Konstan. E-Commerce Recommendation Applications: Data Mining and Knowledge Discovery. Kluwer Academic Publishers, The Netherlands. 2001; 115-153.

[6] Haoran Zhang, Xuyang Wei, Tengfei Zou, Zhongliang Li, Guocai Yang. Agriculture Big Data: Research Status, Challenges and Countermeasures. Proceedings of Computer and Computing Technologies in Agriculture, China, 2014 September, 137-143.

[7] Mysmartfarm. 2014. Available online at http://Mysmart.Farm/.

[8] AwhereWeather. http://www.Awhere.Com/Products/Weather-Awhere. Date accessed: 18/05/2016.

[9] Phenonet: http://www.csiro.au/en/Research/D61/Areas/Robotics-and-autonomous-systems/Internet-of-Things/Phenonet. Date accessed: 18/05/2016.
Farmlogs: https://www.Farmlogs.Com/Farm-Management-Features/. Date accessed: 18/05/2016.

[10] Datafloq. https://datafloq.com/read/john-deere-revolutionizing-farming-big-data/511. Date accessed: 18/05/2016.

[11] Farmeron: https://www.Farmeron.Com/Dairyfeatures.Aspx. Date accessed: 18/05/2016.

[12] C.L. Philip Chen, Chun-Yang Zhang. Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data. Vol. 275 (2014), 314-347.

[13] Javier Andreu-Perez, Carmen C. Y. Poon, Robert D. Merrifield, Stephen T. C. Wong, Guang-Zhong Yang. Big Data for Health. July 2015; 19(4).

[14] Er. Rupinder Kaur, Raghu Garg, Dr. Himanshu Aggarwal. Big Data Analytics Framework to Identify Crop Disease and Recommendation of a Solution.