
IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 5, NO. 6, DECEMBER 2012

Cloud Computing Enabled Web Processing Service for Earth Observation Data Processing

Zeqiang Chen, Nengcheng Chen, Chao Yang, and Liping Di, Senior Member, IEEE

Abstract—The OpenGIS Web Processing Service (WPS) can process both simple and complex geospatial tasks, including Earth Observation tasks. As the requirements of Earth Observation data, algorithms, calculation models, and daily life become increasingly complicated, WPS needs to provide high-performance, service-oriented computing capability. This paper proposes a cloud computing enabled WPS framework for Earth Observation data processing. It consists of a client layer and a WPS layer, which is further divided into a WPS server layer and a cloud computing layer. The cloud computing environment is based on the open-source software Apache Hadoop. The three layers of the proposed cloud computing enabled WPS are outlined, followed by a workflow that processes a user's task using these three layers. Technological implementation details are then explained. An experiment processing Moderate Resolution Imaging Spectroradiometer (MODIS) data shows that WPS can be enabled in a cloud computing environment.

Index Terms—Apache Hadoop, cloud computing, earth observation data processing, service oriented architecture, web processing service.

Manuscript received May 29, 2011; revised September 23, 2011; accepted May 17, 2012. This work was supported in part by the National Basic Research Program of China under Grant 2011CB707101, by the National Nature Science Foundation of China under Grants 41023001, 41171315, and 41021061, by the Program for New Century Excellent Talents in University under Grant NCET-11-0394, and by the Chinese 863 Program under Grant 2012AA121401.

Z. Chen and C. Yang are with the State Key Lab for Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan 430079, China.

N. Chen is with the State Key Lab for Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan 430079, China (corresponding author, e-mail: [email protected]).

L. Di is with the Center for Spatial Information Science and Systems (CSISS), George Mason University, Fairfax, VA 22032 USA.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSTARS.2012.2205372

I. INTRODUCTION

Earth Observation (EO) most often refers to satellite imagery or satellite remote sensing [1], applied to the atmosphere, land, and ocean. The data obtained from Earth Observation, named Earth Observation Data (EOD), is widely used in many fields of social life and scientific research, such as climate, weather, agriculture, ecosystems, biodiversity, water, and disaster migration, forecasting, and reduction. For this reason, EOD and EOD Processing have been studied extensively in recent decades. EOD and many EOD Processings have these properties:

1) Large volumes of EOD are retrieved every day, in various data types. Hundreds of satellites observe the earth every day, directed by NASA (National Aeronautics and Space Administration) and NOAA (National Oceanic and Atmospheric Administration) in the USA, in addition to the satellites of Europe, China, Japan, and others. Petabytes of EOD are obtained daily. Meanwhile, there are many data types for EOD, such as Landsat, SPOT, IKONOS, QuickBird, AVHRR, EOS, and ALOS, and also many data formats, such as ASCII, HDF, PICT, and TIFF/GeoTIFF.

2) There are many EOD Processing models and methods. Because of the abundance of EOD data types, data formats, and data applications, many EOD Processing models and methods have been studied and used, even for the same purpose. For example, several vegetation condition indices for crop condition monitoring are based just on the Normalized Difference Vegetation Index (NDVI), such as the Mean Vegetation Condition Index (MVCI), NDVI Ratio to Previous Year (RPNDVI), NDVI Ratio to Median (RMNDVI), and Vegetation Condition Index (VCI) [2].

3) The requirements of EOD and its processing are growing and diverse. EOD is used in many fields of society, and will be used in ever more aspects of social life. Different users may process the same EOD for different purposes and with different methods.

4) Stand-alone and Web-based EOD Processing servers both exist in practice. A stand-alone server can only support EOD Processing within a local environment, whereas a Web-based server can provide a collaborative environment among multiple Web-connected servers. The EOD Processing capacity of a stand-alone server is limited by its physical storage capacity and Central Processing Unit (CPU) speed, two notorious bottlenecks. Web-based EOD Processing can cluster many servers on the Web for distributed and high-performance processing.

Given the above properties and increasingly diversified demands, EOD Processing may confront the following difficulties:

1) Some EOD Processings are hard to invoke, operate, access, and manage via the Web. For historical reasons, before the evolution of web storage technology, some EOD centers saved and managed EOD only in a local environment, and users now find such data hard to handle across different servers. Meanwhile, with these large-volume data distributed among several locations, users face a technological problem: how to optimize distributed computing and high-performance computing at the same time.

2) Some EOD Processings are not easily shared and interpreted, because many comply only with national or local organizations' standards and need wider recognition.

3) Some EOD Processings are not flexible and extensible enough. Many EOD Processing systems suit only particular models or methods, and are not flexible and extensible enough to handle diverse EOD data types and data formats, diverse EOD Processing models and methods, and diverse user requirements.

4) Some EOD Processing falls short in performance. Because of the huge volumes of data, the calculation and processing of EOD can create undesired time delays; the computing performance cannot meet time requirements, especially for real-time or near real-time applications such as emergency rescue and disaster relief.

In this article, an EOD Processing that can overcome all four difficulties above should be Web-based, compatible with international standards, and flexible/extensible. To obtain such EOD Processing, Web service technology, the Open Geospatial Consortium (OGC) Web Processing Service (WPS) specification, and cloud computing technology may provide a new way of thinking and realization.

Web service technology is widely used. The W3C defines a Web service as follows: "A Web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP-messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards." [3] A Web service is self-describing, reusable, and highly portable. Its advantages are apparent: it is easily constructed, rapid, low-cost, secure, and reliable. Web service technology is useful in many aspects of daily life and scientific research.

The OGC has developed many Web service interface standards for geospatial information sharing and interoperation, including dozens of standards such as the Web Map Service (WMS), Web Feature Service (WFS), Web Coverage Service (WCS), and WPS [4]. WPS defines a standardized interface that facilitates the publishing of geospatial processes, and the discovery of and binding to those processes by clients. WMS, WFS, and WCS emphasize data services, while WPS focuses on data/task processing services.

Though there is little consensus on the definition of Cloud computing [5], [6], a key feature of Cloud computing is that it is Internet-scale, service-oriented, and high-performance computing. Foster et al. [7] compared Grids and Clouds from a variety of perspectives (architecture, business models, programming models, applications, etc.) and concluded that some advantages of cloud computing are paying for software by consumption, Internet-scale computing, and service-oriented computing. As services, three main levels of Cloud services are generally agreed upon in an X-as-a-Service manner (X is short for Infrastructure, Platform, or Software) [5], [8]. Meanwhile, Cloud computing evolved out of Grid computing and relies on Grid computing as its backbone and infrastructure [5], so it has high-performance computing capabilities and is applied in some of those areas as well [9], [10]. As Cloud computing has the above features and market successes, many software frameworks for Cloud environments have been developed. Apache Hadoop is one of these frameworks. Apache Hadoop is open-source software for reliable, scalable, distributed computing, and its software library is a framework that allows the distributed processing of large data sets across clusters of computers using a simple programming model [11]. It can be deployed for a Cloud environment [12].

The objective of this paper is to propose a cloud computing enabled WPS framework for Earth Observation data processing. The main work of this paper is to couple WPS with Apache Hadoop, and to evaluate the feasibility of implementing a WPS framework in the cloud computing environment using Apache Hadoop for EOD Processing. The advance of this framework for EOD Processing, compared with the processings mentioned above, is that it covers the good properties (Web-based, international standard compatible, and flexible/extensible) at the same time, overcoming the difficulties illustrated above. Hadoop is used to establish the Cloud environment because it is free, open-source, and widely and successfully applied by hundreds of thousands of institutions and companies [13].

The rest of the paper is organized as follows: related work is overviewed in Section II. Section III introduces WPS and Hadoop. Sections IV and V separately present the framework and workflow of cloud computing enabled WPS. In Section VI, experiments are done to show that the proposed framework and workflow are feasible. Finally, Section VII summarizes the conclusions and identifies future work.

II. RELATED WORK

EOD Processing has been an ongoing object of study in recent decades. As early as 1993, the European EOD Processing and interpretation services were analyzed, covering the sector and the conditions for its development [14]. Later, obtaining EOD, low-level processing, and EOD center management were addressed [15]–[17]. Automated processing of EOD was developed in the Terrestrial Observation and Prediction System, using a planner-based agent to automatically generate and execute data-flow programs that process the EOD [18]. The processing had to be written in a specified language, DPADL, which the developer had to know; this may cause confusion for many users. Automatic Earth observation data service based on reusable geo-processing workflow and real-time EOD processing has also been addressed [19], [20], but with less focus on the performance of the system under huge volumes of data. To meet the storage and computational requirements of EOD, some research efforts were placed on grid technologies. Aloisio et al. [21] showed how grid technologies and high performance computing can be used dynamically and efficiently for on-demand processing, in a grid-enabled, high performance digital library of remote sensing images. Aloisio et al. [22] studied a grid computing environment for EOD Processing and integrated jobs as a workflow. EOforge was a generic open framework for EOD Processing systems [23]. Granell et al. used distributed geoprocessing services for managing EOD [24]. Gorgan et al. developed an EOD Processing platform for EO application development [25], [26]. Cloud computing for satellite data processing on high end compute clusters based on Hadoop was evaluated [27]. Some of these efforts were not standard-based or not flexible enough. Recently, WPS has facilitated EO Processing by providing standard interfaces [28]–[30].

WPS has been developed and utilized [31]–[35]. 52n WPS [36] is a good WPS framework. It has a pluggable architecture for processing and data encoding. A Web process can be added into 52n WPS using a plug-in method in which the process must implement an abstract process interface. Currently, 52n WPS is mainly deployed on stand-alone servers. It has been further evaluated in grid/cloud computing [34], [37]. Deegree WPS [38] fully supports the OpenGIS WPS 1.0.0 specification. It provides a configurable basis to plug in processes, and uses a process provider layer to integrate popular geo-processing frameworks such as Sextante [39], FME [40], and GRASS [41]. It is deployed mainly in a stand-alone computing environment. WPS for EOD Processing is not seen much in the literature.

These WPS frameworks are utilized mainly in a stand-alone computing environment. Beyond this, WPS also needs to be implemented in a distributed computing environment. Much processing of EOD is complex. WPS can process both simple and complex EOD, for the processes of WPS can be any algorithm, calculation, or model that operates on spatially referenced data, including EOD [4]. Given the large amounts of EOD, increasingly complex calculation models, and multiple heterogeneous processing resources and tasks, high-performance implementations of WPS processing are badly needed. Several high-performance computing methods and technologies, such as grid computing, distributed computing, and parallel computing [34], [35], have been evaluated for WPS implementations, while little research has been performed on enabling WPS in a cloud computing environment. Amazon's Elastic Compute Cloud (EC2) [42], Google's AppEngine [43], and Microsoft's Azure Platform [44] support cloud computing features at the platform, software package, and service levels. However, those cloud computing services are commercial and not completely free and open source. Instead, the Apache Hadoop core provides an open-source framework for cloud computing, as well as a distributed file system [12]. The goal of the Apache Hadoop project is to develop open-source software for reliable, scalable, distributed computing [11]. Hundreds of institutions and companies, including giant software companies such as IBM, Yahoo, and Facebook, are either directly using Hadoop or constructing services on top of Hadoop [13]. We expect that enabling a free and open-source WPS computing platform on Hadoop is very promising.

III. WPS AND APACHE HADOOP

A. Overview of WPS

WPS is one of the OGC implementation specifications. It provides standard interfaces to process tasks submitted by a client on the Web, for function sharing and interoperation. There are three mandatory operations in WPS: GetCapabilities, DescribeProcess, and Execute.

1) The GetCapabilities operation allows clients to retrieve service metadata from a server. It is a common operation in OGC Web Services (OWS). The request parameters of this operation are version, service, and request. The response document contains service identification, service provider, operations metadata, and process offerings.

2) The DescribeProcess operation allows clients to request the description of one or more processes that can be executed by the Execute operation. A key request parameter is the process Identifier, which identifies a process in WPS. The response document shows the input parameters and the output format of the process. Knowing these parameters, the user knows how to request and execute a task.

3) The Execute operation allows clients to run a specified process using the input parameters described in the DescribeProcess response document and to obtain the output results. The Execute operation is the core part of WPS.

The general steps for handling a task with these operations are as follows. First, use GetCapabilities to obtain the metadata of a WPS and the processes with which the WPS can deal. Then choose one of these processes and request the DescribeProcess operation to see what its inputs and outputs are. Finally, run the Execute operation with the required inputs.
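To make these steps concrete, the following is a minimal Java sketch issuing the two discovery requests through the KVP (HTTP GET) binding of WPS 1.0.0. The endpoint URL is hypothetical; the process identifier is one used later in Section VI; Execute is normally sent as an XML POST body, so it is only noted here.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Minimal WPS 1.0.0 KVP client sketch (endpoint is hypothetical).
    public class WpsKvpClient {
        static String get(String url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) body.append(line).append('\n');
            }
            return body.toString();  // the XML response document
        }

        public static void main(String[] args) throws Exception {
            String base = "http://example.org/wps";  // hypothetical WPS endpoint
            // Step 1: service metadata and the list of offered processes.
            System.out.println(get(base + "?service=WPS&request=GetCapabilities"));
            // Step 2: input/output description of one process.
            System.out.println(get(base + "?service=WPS&version=1.0.0"
                    + "&request=DescribeProcess&identifier=edu.gmu.csiss.wps.DailyNDVI"));
            // Step 3: Execute is sent as an XML POST request; see Section V.
        }
    }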

B. Overview of Apache Hadoop

Apache Hadoop contains the subprojects Hadoop Common, the Hadoop Distributed File System (HDFS), and MapReduce.

Hadoop Common is a set of utilities that support the other Hadoop subprojects, including FileSystem, Remote Procedure Call (RPC), and serialization libraries.

HDFS is a distributed file system for large computer server clusters. It provides high-throughput access to application data even on low-cost computer clusters. In HDFS, there are two types of node (a node is a Java Virtual Machine or a computer) for managing and storing data, called NameNode and DataNode respectively. The NameNode serves as both a directory namespace manager and an "inode table" for HDFS. Generally speaking, an HDFS deployment runs either a single NameNode or also a secondary backup/failsafe NameNode. DataNodes are the actual places where data is stored, in sets of blocks. A data block has a configurable size with a default of 64 megabytes; a data set larger than that size is partitioned into blocks. In an HDFS deployment, there are often thousands of DataNodes. The organization of the NameNode and DataNodes is a master-slave structure. When working, a client uses ClientProtocol to interact with the NameNode for HDFS service. The NameNode implements the DatanodeProtocol interface, and DataNodes use DatanodeProtocol to communicate with the NameNode. DataNodes spend their lives in an endless loop of reporting their conditions or asking the NameNode for something to do, with what is called the heartbeat method. The heartbeat method is the remote method called by the DataNodes that periodically lets the NameNode know that a DataNode is still alive. DataNodes can delete blocks or copy blocks to/from other DataNodes.

Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of computing nodes.


Fig. 1. Hadoop MapReduce workflow.

Dean and Ghemawat [45] first introduced MapReduce. Its principle is that processing a task involves two core steps. One is the map phase, in which a key/value pair is processed to generate a set of intermediate key/value pairs; the other is the reduce phase, in which all intermediate values associated with the same intermediate key are processed. The purpose of the map phase is to divide a task into several independent subtasks that run in parallel, and the reduce phase collects all the subtask results to obtain the result of the whole task. In Hadoop, MapReduce runs on HDFS. It has two important trackers, named JobTracker and TaskTracker, that process and manage a task cooperatively. The JobTracker is responsible for starting, tracking, and invoking the tasks submitted by the client. A TaskTracker is responsible for managing local data, processing data, collecting results, and reporting status and task progress to the JobTracker. TaskTrackers are deployed on DataNodes, but the JobTracker can be deployed on the NameNode or on a different server. As with the NameNode and DataNodes, there is one JobTracker and there are many TaskTrackers. Fig. 1 shows how Hadoop MapReduce processes a job.

When the JobTracker receives a job (a user may also call it a task), it dispatches the job to several TaskTrackers. Each TaskTracker splits the needed data into pieces. The Map function receives a key/value pair and generates a set of intermediate key/value pairs. The intermediate key/value pairs with the same key are combined and partitioned according to the number of reduce tasks/output files. Shuffle fetches the relevant partition of the output of all the Mappers as the input of reduce, and then Sort groups the inputs by key, since different Mappers may emit the same key. After Shuffle and Sort, all the key/value pairs are distinct, having the form <key, list(values)>. Finally, all the pairs are reduced and the result is output.
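In Hadoop's Java API, the two phases become two small classes. The following pair is an illustrative sketch for the daily-NDVI job of Section VI, assuming the input is already split into <day, HDF file name> pairs; the per-file NDVI computation itself is elided.

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Illustrative map/reduce pair: keys are day strings, values are file names.
    public class NdviMapReduce {

        public static class NdviMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text day, Text hdfFile, Context ctx)
                    throws IOException, InterruptedException {
                // Compute one file's NDVI from its NIR and IR bands (elided)
                // and emit the intermediate pair <day, ndvi file>.
                String ndviFile = "ndvi_" + hdfFile.toString();
                ctx.write(day, new Text(ndviFile));
            }
        }

        public static class NdviReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text day, Iterable<Text> ndviFiles, Context ctx)
                    throws IOException, InterruptedException {
                // After shuffle and sort, one day's NDVI files arrive together
                // as <day, list(files)>; combining them into that day's NDVI
                // is elided here.
                StringBuilder combined = new StringBuilder();
                for (Text f : ndviFiles) combined.append(f).append(';');
                ctx.write(day, new Text(combined.toString()));
            }
        }
    }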

Hadoop MapReduce is a complex programming framework, but it is very simple for a user to write a workable MapReduce program. The main steps (MAPREDUCESTEPS) in the program are as follows:

MapReduce main steps {
Step 1: configure the parameters.
Step 2: create a job and set a job name.
Step 3: set the jar by class name, to copy all the program code the job needs and dispatch it to the other TaskTrackers.
Step 4: set the map, combine, partition, and reduce classes.
Step 5: set the input/output key/value classes of the map and reduce functions.
Step 6: set the input and output paths of the job.
Step 7: wait for the completion of the job.
}

In an application, the user needs only to develop the classes shown in steps 3 to 6.
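A driver that walks through these seven steps might look like the following sketch, written against the org.apache.hadoop.mapreduce API of the Hadoop 0.21 line used in this paper. NdviMapReduce is the illustrative class above, and the paths "daily" and "dailyndvi" are the ones used in Section VI.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Illustrative driver following the seven MAPREDUCESTEPS.
    public class NdviDriver {
        static boolean runJob(Configuration conf, String in, String out)
                throws Exception {
            Job job = new Job(conf, "daily-ndvi");                  // Steps 1-2
            job.setJarByClass(NdviDriver.class);                    // Step 3
            job.setMapperClass(NdviMapReduce.NdviMapper.class);     // Step 4
            job.setReducerClass(NdviMapReduce.NdviReducer.class);
            job.setInputFormatClass(KeyValueTextInputFormat.class); // <day, file> input
            job.setOutputKeyClass(Text.class);                      // Step 5
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(in));        // Step 6
            FileOutputFormat.setOutputPath(job, new Path(out));
            return job.waitForCompletion(true);                     // Step 7
        }

        public static void main(String[] args) throws Exception {
            System.exit(runJob(new Configuration(), "daily", "dailyndvi") ? 0 : 1);
        }
    }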

C. Performance of Hadoop

The performance of Hadoop has been tested and reported. In 2008, "One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general purpose (Daytona) terabyte sort benchmark. The sort benchmark, which was created in 1998 by Jim Gray, specifies the input data (10 billion 100 byte records), which must be completely sorted and written to disk" [46]. The cluster statistics were 910 nodes, 2 quad-core Xeons @ 2.0 GHz per node, 4 SATA disks per node, 8 GB RAM per node, Red Hat Enterprise Linux Server, and Sun Java JDK 1.6. There were 1800 maps and 1800 reduces in this sort. In 2008, there were over 13,000 Hadoop nodes at Yahoo; now the number is over 40,000, and hundreds of thousands of jobs run in a month.

Some groups [47]–[49] have tested, analyzed, evaluated, or simulated the performance of Hadoop MapReduce, or proposed their own methods. These groups tested the effect of changing parameters on the performance of MapReduce; good parameter tuning improves performance. HDFS is highly fault-tolerant, designed to be deployed on low-cost hardware, with high-throughput access to large data sets [50].

In fact, the PoweredBy [13] website lists hundreds of users of Hadoop, with clusters from several to thousands of nodes and data from KB to TB size. All those applications are a good affirmation of Hadoop.

These applications show that Hadoop is suitable for high-performance computing; it can run on a cluster with thousands of servers and process vast amounts of data.

IV. DESIGN

As mentioned in Section I, a smart EO Processing system is needed, with four obvious properties: Web-based, international standard compatible, flexible/extensible, and high-performance computing. To achieve this system, Web service and Cloud computing technologies, as well as the WPS specification, are integrated. There are at least two things to be considered in the architecture of this integrated system. The first is that it should be a Service Oriented Architecture (SOA), making the components loosely coupled and machine and platform independent to allow worldwide use on the Internet. The other is a slight change to the internal structure of WPS. On a stand-alone server, WPS receives requests and processes tasks in the same logic layer. In this architecture, however, WPS is split into two distributed parts: a WPS request and response server, as the WPS server layer, and a Cloud environment layer, which is responsible for processing tasks. The key point is seamlessly coupling a flexible WPS with Hadoop for EO Processing.

Fig. 2. Architecture of cloud computing enabled WPS.

Fig. 2 shows the system architecture of a WPS platform in a cloud computing environment, which consists of two layers: Client and WPS. The WPS layer is the layer providing the Web Processing Service. This layer can be further divided into two layers: the WPS server and Cloud Computing.

The client layer is the application layer, where users distributed all over the Web use WPS.

The WPS server layer is deployed on a single WPS server or a WPS server cluster. The WPS server in this layer is a Web service server that receives a WPS standard request and returns its standard response, but does not do any processing. One WPS server can provide one kind of EOD Processing service or several; meanwhile, one type of EOD Processing service can be deployed on one WPS server or even on several across the whole cluster.

The Cloud Computing layer is the cloud computing environment for executing all EOD Processing tasks submitted to the WPS server(s). A client submits a task to a WPS server; the server only verifies that all the input parameters of the EOD Processing are correct, and then it submits the task to the cloud computing layer for processing. A WPS provider can use a different cloud computing environment, such as Hadoop, Amazon's EC2, or Google's cloud. In this paper, Hadoop is chosen as the cloud computing environment, because Hadoop is an open-source yet well-performing framework successfully used by many, can be deployed on low-cost computers with modest hardware requirements, and offers MapReduce, a good parallel programming model suitable for WPS. Choosing Hadoop as the WPS processing environment does not rule out other choices. Applying WPS on more high-performance computing/Cloud environments is our target, for two reasons: different environments may have particular advantages in different applications, and WPS needs to be developed and studied on more application frameworks.

In this architecture, one critical issue is how to make WPS flexible enough for diverse EO Processing; the other is how to integrate WPS with Hadoop. For the former, WPS needs to support different kinds of EO processing algorithms. For the latter, WPS should invoke Hadoop as if running itself: the parameter input, program management, status control, and result retrieval of Hadoop must all be doable from WPS. To address these two problems, a rewrite method and the WPS interaction with Hadoop are introduced in Section V.
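The layer split can be summarized as a small contract between the WPS server layer and whatever processing environment sits below it. The interface below is an illustrative sketch, not code from this framework; its method names echo the dataSubmit, run, GetStatus, and GetResult vocabulary of Section V.

    // Illustrative boundary between the WPS server layer and a processing
    // backend. Hadoop is the backend in this paper, but EC2 or another
    // cloud could satisfy the same contract.
    public interface ProcessingBackend {
        /** Copy input EOD into the backend's storage (e.g., HDFS) if absent. */
        void dataSubmit(String inputPath) throws Exception;
        /** Run the named process and return a task ID for status polling. */
        String run(String processId, String inputPath, String outputPath)
                throws Exception;
        /** Report task status: accepted, running, finished, or failed. */
        String getStatus(String taskId) throws Exception;
        /** Fetch the result data once the task has finished. */
        byte[] getResult(String taskId) throws Exception;
    }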

V. IMPLEMENTATION

In the architecture of cloud computing enabled WPS, the Client requests WPS; the WPS server is responsible for collecting the requests/tasks and then sending the job to the cloud computing layer; the cloud computing layer runs the job. The key implementation of this architecture is the client interacting with the WPS server and the WPS server interacting with the cloud computing layer. These two interactions are reflected in the concrete implementation of cloud computing enabled WPS, as shown in Fig. 3.

Fig. 3. Implementation of cloud computing enabled WPS.

Fig. 3 shows a WPS using a WPS server and a Hadoop cluster. The WPS server implements the three WPS interface operations GetCapabilities, DescribeProcess, and Execute; it also provides GetResult and GetStatus operations for users to obtain the results of a task and its status. All those operations are exposed to users through the Web Services Description Language (WSDL). The Hadoop cluster is deployed for cloud computing; NameNode, DataNode, JobTracker, and TaskTracker are also deployed.

To explain the concrete implementation of each detail, assume a user wants to use WPS to process a task called TASK_A. The steps shown in Fig. 3 occur in the system for TASK_A.

Steps 1 and 2 are request/response. A user requests the GetCapabilities operation to obtain the metadata that this WPS server can provide. From the GetCapabilities response document, the user can find the process service ID. Then the user uses this ID to invoke the DescribeProcess operation to find the input parameters and the output result format. Generally speaking, the GetCapabilities and DescribeProcess operations prepare for the Execute operation. Knowing the input and output of a task, a user can invoke the Execute operation to run the task.


In these operations, WSDL is used to expose the operations as Web service methods. WSDL provides a model and XML format for describing services, and it allows separation of the description of the abstract function offered by a service from the description of its details. In general, a WSDL instance is divided into seven parts: Types, Message, Operation, portType, Binding, Port, and Service. Fig. 4 shows the uniform WSDL of WPS. In this WSDL instance, there are three kinds of portTypes (HTTP GET, POST, and SOAP) and their corresponding bindings, several operations, and a service. All the inputs and outputs of the operations are pre-defined as messages, which are a component of WSDL. Users can invoke those operations by parsing the WSDL.

Steps 3, 4, and 5 execute a task using the Execute operation. The Execute operation has two responsibilities: one is to submit input data to Hadoop through a data submitter, and the other is to submit a MapReduce task using a MapReduce submitter. The first step is to ensure that all the EOD are in HDFS, the only place that Hadoop can access. The data submitter acts like a Hadoop HDFS client, and uses the Application Programming Interface (API) of the Hadoop file system to interact with Hadoop HDFS. A Java class named DistributedFileSystem is used to tell the NameNode when data will be written to Hadoop HDFS. The NameNode then records the written EOD information. A Java class named DFSOutputStream is returned by DistributedFileSystem, and the data submitter uses DFSOutputStream to write the EOD to Hadoop HDFS. If the EOD is already in Hadoop HDFS, there is no need to invoke the data submitter. The implementation method of the data submitter is the dataSubmit method. After all the data has been prepared, the MapReduce submitter submits a task to the JobTracker to run it. The programming for writing and submitting a job to the JobTracker was introduced in Section III.

To help users develop a WPS process conveniently and easily, this paper introduces a rewrite method. A rewrite method rewrites some of the defined methods and classes for a task. Fig. 5 shows the rewrite methods used in some components when processing a task, for a flexible WPS.

Fig. 5. Rewrite method for flexible WPS.

In Fig. 5, the Execute operation request includes the process ID and the input and output parameters. Using the process ID, a specific Parser can be chosen for the process. This Parser parses the process ID and the input and output parameters. Then the MapReduce submitter executes the task. Finally, a Generator encapsulates the response. A process ID is associated with a type of task, a Parser, and a Generator; these relationships can be configured in a configuration file, which is created at initialization. The Parser and Generator must be rewritten in accordance with the type of task. The MapReduce submitter is a conceptual aggregate of the rewrite methods and classes needed for submitting a task to Hadoop. The methods are dataSubmit, setParameters, MapReduce, and run; the classes are the MapReduce classes. The meaning and operation of these methods and classes are:

1) The dataSubmit method is rewritten when EOD must be copied to HDFS, and is how the data submitter is implemented.

2) The setParameters method is rewritten to set the input and output parameters, or other parameters, in the Hadoop configuration. The purpose of this rewrite is to deliver the input and output parameters to the MapReduce classes and methods. The setParameters method has fixed input parameters: the Hadoop configuration interface and the input/output parameters from the client request.

3) The MapReduce classes are those needed in the MAPREDUCESTEPS for running a MapReduce program. Some parameters used in these classes may be set by the setParameters method.

4) The MapReduce methods are the map and reduce methods. Rewriting them is mandatory. They are the core methods of the MapReduce process; both extend the MapReduceBase class and implement the Mapper and Reducer interfaces, respectively. They are also needed in the MAPREDUCESTEPS.

5) The run method is used to run a MapReduce process. Its input parameters are the client's input parameters, and its output parameters are the client's output parameters. In this method, the setParameters method, the dataSubmit method, and the MAPREDUCESTEPS are invoked in sequence (a sketch follows this list).
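Under these definitions, a rewritten process could be sketched as follows. The setParameters and dataSubmit bodies are task-specific, the class and property names are illustrative, and the MAPREDUCESTEPS are delegated to the NdviDriver sketch of Section III.

    import org.apache.hadoop.conf.Configuration;

    // Illustrative rewritten process: run() invokes setParameters,
    // dataSubmit, and then the MAPREDUCESTEPS in sequence.
    public class DailyNdviProcess {

        void setParameters(Configuration conf, String in, String out) {
            // Deliver the client's Execute parameters to the MapReduce
            // classes, which can read them back from the job configuration.
            conf.set("ndvi.input", in);
            conf.set("ndvi.output", out);
        }

        void dataSubmit(String in) throws Exception {
            DataSubmitter.dataSubmit(in, "daily"); // copy EOD to HDFS if absent
        }

        public boolean run(String inputPath, String outputPath) throws Exception {
            Configuration conf = new Configuration();
            setParameters(conf, inputPath, outputPath);
            dataSubmit(inputPath);
            // MAPREDUCESTEPS: build, submit, and wait for the job (Section III).
            return NdviDriver.runJob(conf, inputPath, outputPath);
        }
    }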

Step 6 is the JobTracker dispatching the task to TaskTrackers to run. The relation between the JobTracker and the TaskTrackers was explained in Section III.

Steps 7, 8, and 9 obtain the status information of the task. Because WPS processing tasks usually take a long time, an asynchronous mechanism is suitable for avoiding long waits and reducing network congestion. Generally speaking, there are two asynchronous communication mechanisms: pull and push. In the pull mechanism, the client sends a request and then periodically checks whether the task is completed; this mechanism may impose a high processing overhead on the server. The push mechanism notifies the client when the result is ready, and causes less traffic on the Web. Pull is specified in WPS 1.0, while both push and pull are taken advantage of in WPS 2.0, which is under discussion. When the user submits an Execute operation, an immediate response is returned to the client, with a status checking operation called GetStatus and the task ID available to the user. The user can check the task status at any time before it completes. When the task is finished, the result URL is shown to the user in the status response. The GetStatus operation is a wrapper, which can look up a task's status by its task ID.
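On the client side, the pull mechanism amounts to a polling loop, as in this sketch; getStatus stands in for a generated WSDL stub (here the illustrative ProcessingBackend interface from Section IV), and the status strings are illustrative.

    // Illustrative pull-style polling against the GetStatus operation.
    public class StatusPoller {
        public static String waitUntilFinished(ProcessingBackend wps, String taskId)
                throws Exception {
            while (true) {
                String status = wps.getStatus(taskId); // pull: ask periodically
                if (status.startsWith("finished")) {
                    return status;                     // carries the result URL
                }
                if (status.startsWith("failed")) {
                    throw new IllegalStateException(status);
                }
                Thread.sleep(30000); // EOD tasks run long, so poll sparingly
            }
        }
    }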

Results are obtained and read in steps 10 to 13. The GetResult operation is invoked once the task has finished; it works as a client interface for HDFS, with the HDFS programming API being called to read the data. The Java class DistributedFileSystem handles the interaction with the NameNode, returning the results from a DataNode in an FSDataInputStream object. The read method of FSDataInputStream can be used to read all the result data. Finally, WPS returns the result to the client.
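The read path of GetResult can be sketched with the same FileSystem API, whose open method returns the FSDataInputStream mentioned above; the class name and result path are illustrative.

    import java.io.ByteArrayOutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative GetResult reader: stream a finished task's output
    // from HDFS back to the WPS server.
    public class ResultReader {
        public static byte[] getResult(String resultPath) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (FSDataInputStream in = fs.open(new Path(resultPath))) {
                byte[] buf = new byte[65536];
                int n;
                while ((n = in.read(buf)) > 0) out.write(buf, 0, n); // all blocks
            }
            return out.toByteArray(); // WPS returns this to the client
        }
    }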

VI. EXPERIMENT

To test the feasibility of cloud computing enabled WPS for EOD Processing, an experiment was carried out. The overall performance of the framework is essentially determined by the performance of Hadoop, since the WPS server is only a mediator that receives task assignments from the client and forwards them to Hadoop; the real processing takes place only in Hadoop. The WPS server receives an XML document via the SOAP/HTTP protocol and parses it, which requires at most a few seconds. A very complex task, such as processing a high volume of EOD, may take a long time to finish on Apache Hadoop, sometimes up to several days. The performance of Hadoop has been tested, with the results as stated in Section III. The key concern of this experiment was whether both the client interaction with the WPS server and the WPS server interaction with Hadoop (hereinafter called the two interactions) were smooth, without any problems. To demonstrate the interactions, an experimental scenario called NDVIWPS, a WPS project to calculate the Normalized Difference Vegetation Index (NDVI) using MODIS data, was carried out.

A. Experimental Background

NDVIWPS is the initial preparation work for a project named the Vegetation Condition Project (VCP), sponsored by the National Agriculture Statistics Service (NASS) of the United States Department of Agriculture (USDA). The intent of VCP is to develop a web-based, service-oriented, automated system that overcomes the shortcomings of some existing systems. The VCP is used to evaluate, monitor, and forecast the vegetation condition of the 48 continental states of America. The vegetation condition is evaluated by scientific statistical indices such as the Normalized Difference Vegetation Index (NDVI) and the Vegetation Condition Index (VCI), calculated as (1) and (2), respectively:

$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{IR}}{\mathrm{NIR} + \mathrm{IR}} \tag{1}$$

$$\mathrm{VCI} = \frac{\mathrm{NDVI} - \mathrm{NDVI}_{\min}}{\mathrm{NDVI}_{\max} - \mathrm{NDVI}_{\min}} \times 100 \tag{2}$$

In (1), NIR is the near-infrared band value and IR is the infrared band value of a pixel. In (2), NDVI_min is the minimum NDVI and NDVI_max is the maximum NDVI of a pixel during a specific period. At present, the VCP data is sourced from MODIS, namely the "MODIS/Terra Surface Reflectance Daily L2G Global 250m SIN Grid v005" product. This dataset provides daily coverage of the 48 continental states in 25 HDF data files, about 2 gigabytes per day. All the needed MODIS data from the year 2000 to 2011 adds up to roughly 100,000 HDF files totaling more than 8 terabytes. Given so many files of such size, providing a web service that computes with high performance and publishes results effectively is indeed a challenge. These problems have motivated the tests of cloud computing enabled WPS.
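A worked per-pixel instance of (1) and (2), with illustrative reflectance values:

    // Worked example of equations (1) and (2) for a single pixel.
    public class VegetationIndices {
        static double ndvi(double nir, double ir) {
            return (nir - ir) / (nir + ir);                      // equation (1)
        }
        static double vci(double ndvi, double ndviMin, double ndviMax) {
            return (ndvi - ndviMin) / (ndviMax - ndviMin) * 100; // equation (2)
        }
        public static void main(String[] args) {
            double n = ndvi(0.45, 0.15);           // (0.30 / 0.60) = 0.5
            System.out.println(n);
            System.out.println(vci(n, 0.2, 0.8));  // (0.3 / 0.6) * 100 = 50
        }
    }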

B. Experimental Instance

A WPS server and Apache Hadoop version 0.21.0 were deployed on a single-node computer and run in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process. The WPS server is based on 52N WPS, with some improvements: 1) actual GetResult and GetStatus SOAP/HTTP Web service interfaces were added; 2) the Hadoop processing interface was added and the rewrite method improved. In this instance, three tests were run: calculating daily NDVI from 25 HDF files per day over 7 days (from 05/04/2010 to 05/10/2010), calculating weekly NDVI from those 7 days, and transforming the daily and weekly NDVI from the SIN coordinate system to the Latitude/Longitude and Albers coordinate systems. The two interactions of these three tests are detailed below.


Fig. 4. The uniform WSDL of WPS.

1) Client Interaction With the WPS Server: The first step of client interaction with the WPS server is for a user to retrieve the WSDL to learn the operations, bindings, and services, as depicted in Fig. 4. Then the GetCapabilities operation is requested to obtain the identifiers of the processes, for example "edu.gmu.csiss.wps.DailyNDVI", "edu.gmu.csiss.wps.WeeklyNDVI", and "edu.gmu.csiss.wps.CoordinateConvert". Third, these identifiers are used to request the DescribeProcess operation, to obtain the input and output parameters of each process's Execute operation separately. Here, the input parameters are the input file paths, while the output parameters are the output paths of the result files.

Knowing the inputs and outputs of a process, the Execute operation can now be carried out. An asynchronous mechanism is used for this operation: an instantaneous response is returned to the client. An example response from calculating daily NDVI can be seen in Fig. 6. The statusLocation points to the URL where the task status can be retrieved. Fig. 7 shows the process.

Fig. 7. Status information.

After submitting a task to WPS, the user may invoke GetStatus with the task ID to learn the status of the specific job. When a job is finished, the user may request the result, which will be read from HDFS as described in the next subsection.


Fig. 6. Instantaneous response.

2) WPS Server Interacting With Hadoop: Invoking the Execute operation submits a MapReduce job; before the submission, the input data either already resides in HDFS or is copied to HDFS. When a file is uploaded to HDFS, the create method of DistributedFileSystem is invoked. The parameters of the create method are the path of the file, the file system permission, the data buffer size, the number of replications, the data block size, and a progress callback. The path is the input path, the number of replications may be set to 1, and all the other parameters take their default values. The return value is an FSDataOutputStream, whose write method is then invoked; the data is automatically copied to HDFS. Reading data from HDFS is the reverse process: first, a LocatedBlocks object is obtained from the NameNode by invoking RPC, and then an FSDataInputStream object is created. The read method of FSDataInputStream is used to read the data.
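The upload and read calls just described can be sketched as follows, with the create() parameters spelled out (the API also takes an overwrite flag not listed above); the values and class name are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    // Illustrative upload/read pair showing the explicit create() parameters.
    public class HdfsUploadRead {
        public static void upload(FileSystem fs, Path dst, byte[] data)
                throws Exception {
            FSDataOutputStream out = fs.create(
                    dst,
                    FsPermission.getDefault(), // file system permission
                    true,                      // overwrite
                    4096,                      // data buffer size
                    (short) 1,                 // number of replications
                    64L * 1024 * 1024,         // data block size (64 MB default)
                    null);                     // progress callback
            out.write(data);                   // data flows into HDFS blocks
            out.close();
        }

        public static int readFirst(FileSystem fs, Path src, byte[] buf)
                throws Exception {
            try (FSDataInputStream in = fs.open(src)) { // NameNode returns block
                return in.read(buf);                    // locations; read follows
            }
        }
    }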

The MapReduce of the three experiments follows the method described in Section III. Figs. 8–12 show their MapReduce processes.

Fig. 8. Daily NDVI MapReduce.

Fig. 9. Weekly NDVI MapReduce.

Fig. 8 is the daily NDVI MapReduce process. The input path is "daily" and the output path is "dailyndvi"; both paths are relative to the HDFS working directory. All HDF files in the input path are split into <key, value> pairs. The key is the name of the day in string form, and the value is the HDF file name. The date of an HDF file is implied by its name; for example,

“MOD09GQ.A2010124.H06V03.005.20100126173551.hdf”

belongs to 2010.05.04, because "124" after "A2010" is the Julian day: the 124th day of 2010 is 05/04/2010. The map process calculates the NDVI of every HDF file from its NIR and IR bands, and the prefix "ndvi_" is attached to the NDVI file name. After the map process, the NDVI files with the same key are collected in a list for the reduce process. In Fig. 8, the seven days (from 05/04/2010 to 05/10/2010) are distinguished and listed together. The reduce process computes each day's NDVI. Finally, all reduce results are sent to the output path (dailyndvi). Calculating the weekly NDVI has a similar MapReduce process, shown in Fig. 9. In the weekly NDVI MapReduce process, "2010.05.04_2010.05.10" represents a week, and the map process in Fig. 9 directly adds every NDVI file to the week list to which it belongs.
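Deriving the calendar date from the "A2010124" token takes only a few lines; the class name is illustrative.

    import java.text.SimpleDateFormat;
    import java.util.Date;

    // Illustrative parser for the year/Julian-day token in MODIS file names.
    public class ModisDate {
        static Date dateOf(String hdfName) throws Exception {
            int i = hdfName.indexOf(".A") + 2;            // start of "2010124"
            String yyyyddd = hdfName.substring(i, i + 7);
            return new SimpleDateFormat("yyyyDDD").parse(yyyyddd); // DDD = day of year
        }
        public static void main(String[] args) throws Exception {
            // Prints May 4, 2010, i.e., Julian day 124 of 2010.
            System.out.println(dateOf(
                "MOD09GQ.A2010124.H06V03.005.20100126173551.hdf"));
        }
    }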

and weekly NDVI. They only convert their coordinate systemfrom that of HDF format to that of GeoTIFF format, and that iswhy only the map process needs to perform the transformationtask while the reduce process does not get to use it.

Fig. 12. Images during the MapReduces.

In Fig. 12, from left to right, each part of the image represents one stage of the three MapReduces. In the first part, the NDVI of all the original HDF files is calculated; the daily NDVI is then combined in the second part; in the third part, the weekly NDVI is obtained from the 7 days' NDVI, where each pixel of the weekly NDVI is the maximum value of that pixel over the 7 days. The fourth part is the image in the Albers coordinate system, converted from the SIN coordinate system.

VII. DISCUSSION AND CONCLUSION

This paper presents the feasibility of enabling WPS within a cloud computing environment for EOD Processing using the Apache Hadoop platform. The system created uses a uniform WSDL to describe the WPS operations, employs a process interface to interact with Hadoop, and implements an asynchronous computing scenario. A WPS service that calculates NDVI from multiple files and converts from the SIN coordinate system to the Albers coordinate system was implemented in this study. The experimental results show that WPS services can be developed on Apache Hadoop. Some significant advantages of this approach are as follows:

1) It is a standard, flexible, and high-performance web system design for EOD Processing. It follows the OGC WPS specification, so it is standard. It uses the rewrite method to integrate different EOD Processings, covering different data types, data formats, calculation models, and calculation methods; different users can apply different EOD Processings for different applications. Thus, it is flexible and extensible. The Apache Hadoop framework is a high-performance framework, so the system has potential high-performance capability, though performance was not tested in the experiment; this will be future work.

2) It introduces a multi-layered method coupling WPS with Hadoop. Cloud computing enabled WPS can be implemented in a multi-layered way, with WPS divided into two layers: the standard service interface layer (the WPS server layer), which interacts with the client by parsing standard requests and returning standard responses, and the logic layer (the cloud computing layer), which performs the logical computing. A cloud computing application always deploys or copies its program to the computing nodes; with cloud computing enabled WPS, the whole WPS need not be deployed or copied to each computing node, only the computing program logic. A multi-layered method facilitates WPS deployment, configuration, and management. In this paper, just one WPS node and a Hadoop cloud computing environment are used. WPS need only provide an interface to interact with Hadoop; it does not need to change anything inside Hadoop. The interface rewrite method, flexible enough for tasks to invoke Hadoop, is explained in Section V, and the three experiments processing different tasks in Section VI prove that this method works. Furthermore, a multi-layered method benefits Hadoop applications in Web services: Hadoop is mainly an application program and cannot be used directly as a Web service, and the WPS in this paper is a good example of a Web service application of Hadoop.

3) It realizes a flexible method for WPS to submit a MapReduce task. The key process of cloud computing enabled WPS is the MapReduce process. For a task submitted to WPS, a critical step is rewriting the Map and Reduce processes. As mentioned in Section VI, the MapReduce processes of the three experiments are not completely the same; different tasks have different MapReduce processes, and some processing tasks need only the map process or the reduce process. For example, the coordinate transform needs only the map process. MapReduce is also key to computing performance: there are dozens of parameters in the Hadoop configuration for MapReduce, and how they affect the performance of WPS is complex and is not addressed in this paper. This question is left for future work.

The experiment shows the feasibility of coupling WPS with Hadoop. This WPS framework can be used more broadly for EOD processing and other applications. One of the advantages of the MapReduce model is parallel computing: ideally, EOD processing that can be partitioned into parallel execution steps is well suited to the MapReduce model, and the experimental instance is an example. In fact, EOD processing that executes only in sequential steps can also use the MapReduce model; the difference is that the former may show better performance. In reality, many EOD processings involve huge volumes of data and sub-process steps that are amenable to parallel computing. The challenges are how to design the parallel computing model and how to set the correct parameters.

Future work will focus on: (1) configuring the optimal parameters of Hadoop for different types of EOD Processing. One purpose is to validate the performance of Hadoop, MapReduce, and HDFS, to provide a robust and high-performance environment for WPS processing, especially for complex tasks; the other is to evaluate and compare the advantages and disadvantages of Hadoop against other possibilities, such as classical batch processing and serial computation. (2) Extending the use of WPS for EOD Processing and other applications, and finding out its advantages and disadvantages.

REFERENCES

[1] P. Dana, P. Silviu, N. Marian, F. Marc, and Z. Daniela, "Earth observation data processing in distributed systems," Informatica, vol. 34, no. 4, pp. 463–476, Oct. 2010.

[2] Z. Yang, L. Di, G. Yu, and Z. Chen, "Vegetation condition indices for crop vegetation condition monitoring," in Proc. 2011 IEEE Int. Geoscience and Remote Sensing Symp. (IGARSS), Vancouver, BC, Canada, Jul. 24–29, 2011.

[3] The Web Services Glossary, W3C, 2004 [Online]. Available: http://www.w3.org/TR/ws-gloss/

[4] OGC Web Processing Service Specification (Version 1.0.0), OGC Standard, 2007, 87 pp.

[5] K. Stanoevska-Slabeva, T. Wozniak, and S. Ristol, Grid and Cloud Computing: A Business Perspective on Technology and Applications. Heidelberg, Dordrecht, London, New York: Springer, 2010.

[6] J. Geelan, "Twenty one experts define cloud computing," Virtualization, Aug. 2008 [Online]. Available: http://virtualization.sys-con.com/node/612375

[7] I. Foster, Y. Zhao, I. Raicu, and S. Lu, "Cloud computing and grid computing 360-degree compared," in Proc. IEEE Grid Computing Environments Workshop, Austin, TX, Nov. 12–16, 2008.

[8] The NIST Definition of Cloud Computing (Draft), Recommendations of the National Institute of Standards and Technology, 2011, vol. 7.

[9] J. Ekanayake and G. Fox, "Cloud computing," Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 34, no. 1, pp. 20–38, 2010.

[10] C. Vecchiola, S. Pandey, and R. Buyya, "High-performance cloud computing: A view of scientific applications," in 10th Int. Symp. Pervasive Systems, Algorithms, and Networks (ISPAN), Kaohsiung, Taiwan, Dec. 14–16, 2009.

[11] Hadoop Website, Apache, 2011 [Online]. Available: http://hadoop.apache.org/

[12] J. Venner, Pro Hadoop. New York: Springer-Verlag, 2009.

[13] Hadoop Wiki Website, Apache, 2011 [Online]. Available: http://wiki.apache.org/hadoop/PoweredBy#F


Fig. 8. Daily NDVI MapReduce.

Fig. 9. Weekly NDVI MapReduce.

Fig. 10. Weekly NDVI coordinate transform.

[14] A. Bounfour and E. F. Lambin, “The European Earth observation data processing and interpretation services: Analysis of the sector and conditions for its development,” Int. J. Remote Sens., vol. 14, no. 4, pp. 635–654, Feb. 1993.
[15] W. Cudlip, D. R. Mantripp, C. L. Wrench, H. D. Griffiths, D. V. Sheehan, M. Lester, R. P. Leigh, and T. R. Robinson, “Corrections for altimeter low-level processing at the Earth Observation Data Centre,” Int. J. Remote Sens., vol. 15, no. 4, pp. 889–914, Feb. 1994.


Fig. 11. Daily NDVI coordinate transform.

Fig. 12. Images during the MapReduces.

[16] M. S. Hutchins, H. K. Wilson, D. Bass, and P. Waggett, “Overview of the development of the earth observation data centre and its processing and archiving facilities,” Int. J. Remote Sens., vol. 15, no. 4, pp. 741–759, Feb. 1994.
[17] R. L. Weaver and V. J. Troisi, “Remote sensing data availability from the Earth Observation System (EOS) via the Distributed Active Archive Center (DAAC) at NSIDC,” in Proc. 1996 Int. Geoscience and Remote Sensing Symp. (IGARSS’96), New York, May 27–31, 1996.
[18] K. Golden, W. Pang, R. Nemani, and P. Votava, “Automating the processing of earth observation data,” in 7th Int. Symp. Artificial Intelligence, Robotics and Automation for Space, Nara, Japan, May 19–23, 2003.
[19] N. Chen, L. Di, J. Gong, and G. Yu, “Automatic on-demand data feed service for AutoChem based on reusable geo-processing workflow,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. (JSTARS), vol. 3, no. 4, pp. 418–426, Dec. 2010.
[20] N. Chen, Z. Chen, L. Di, and J. Gong, “An efficient method for near-real-time on-demand retrieval of remote sensing observation,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. (JSTARS), vol. 4, no. 3, pp. 615–625, Sep. 2011.
[21] G. Aloisio and M. Cafaro, “A dynamic Earth observation system,” Parallel Computing, vol. 29, no. 10, pp. 1357–1362, Oct. 2003.
[22] G. Aloisio, M. Cafaro, G. Carteni, I. Epicoco, G. Quarta, and S. Raolil, “GridFlow for earth observation data processing,” in Proc. 2005 Int. Conf. Grid Computing and Applications, Las Vegas, NV, Jun. 20–23, 2005.
[23] C. Gomez, L. M. Gonzalez, and J. Prieto, “EOforge: Generic open framework for earth observation data processing systems,” Emerging and Future Technologies for Space Based Operations Support to NATO Military Operations, pp. P5-1–P5-10, 2006.
[24] C. Granell, L. Diaz, and M. Gould, “Managing Earth Observation data with distributed geoprocessing services,” in Proc. 2007 IEEE Int. Geoscience and Remote Sensing Symp. (IGARSS’07), Piscataway, NJ, Jul. 23–28, 2007.
[25] D. Gorgan, V. Bacu, T. Stefanut, D. Rodila, and D. Mihon, “Grid based satellite image processing platform for Earth observation application development,” in 2009 IEEE Int. Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS 2009), Piscataway, NJ, Sep. 21–23, 2009.
[26] D. Gorgan, V. Bacu, T. Stefanut, D. Rodila, and D. Mihon, “Earth Observation application development based on the Grid oriented ESIP satellite image processing platform,” Computer Standards and Interfaces, doi: 10.1016/j.csi.2011.02.002.
[27] N. Golpayegani and M. Halem, “Cloud computing for satellite data processing on high end compute clusters,” in 2009 IEEE Int. Conf. Cloud Computing, Bangalore, India, Sep. 21–25, 2009.
[28] C. Granell, L. Diaz, and M. Gould, “Managing Earth observation data with distributed geoprocessing services,” in Proc. IEEE Int. Geoscience and Remote Sensing Symp. (IGARSS’07), Barcelona, Spain, Jul. 23–27, 2007.
[29] R. Gerlach, C. Schmullius, and S. Nativi, “Establishing a Web Processing Service for online analysis of Earth observation time series data,” Geophysical Research Abstracts, vol. 10, p. EGU2008-A-09593, 2008.
[30] S. Falke, E. Dorner, B. Dunn, and D. Sullivan, “Processing services in earth observation sensor web information architectures,” in Proc. Earth Science Technology Conf. 2008, NASA, College Park, MD, Jun. 24–26, 2008.
[31] T. Foerster and J. E. Stoter, “Establishing an OGC web processing service for generalization processes,” in ICA Workshop of the Commission on Map Generalization and Multiple Representation, Portland, OR, Jun. 25, 2006.
[32] Deegree Website, Deegree, 2011 [Online]. Available: http://www.deegree.org/
[33] B. Schaeffer, “Towards a Transactional Web Processing Service (WPS-T),” in GI-Days, Münster, Germany, Jun. 16–18, 2008.
[34] B. Baranski, “Grid computing enabled Web Processing Service,” in GI-Days, Münster, Germany, Jun. 16–18, 2008.


[35] S. Lanig, A. Schilling, B. Stollberg, and A. Zipf, “Towards standards-based processing of digital elevation models for grid computing through web processing service (WPS),” in ICCSA 2008, Perugia, Italy, Jun. 30–Jul. 3, 2008.
[36] 52 North WPS Website, 52 North, 2011 [Online]. Available: http://52north.org/communities/geoprocessing/wps/index.html
[37] B. Schaeffer, “Behind the buzz of cloud computing – 52North Open Source Geoprocessing Software in the Google Cloud,” in Abstract, FOSS4G 2009, Free and Open Source Software for Geospatial Conf., Sydney, Australia, Oct. 20–23, 2009.
[38] Deegree WPS Website, Deegree, 2011 [Online]. Available: http://wiki.deegree.org/deegreeWiki/deegree3/ProcessingService
[39] Sextante Website, Sextante, 2011 [Online]. Available: http://www.osor.eu/studies/sextante-a-geographic-information-system-for-the-spanish-region-of-extremadura
[40] FME Website, FME, 2011 [Online]. Available: http://www.safe.com/
[41] GRASS Website, GRASS, 2004 [Online]. Available: http://grass.fbk.eu/
[42] Amazon Website, Amazon, 2011 [Online]. Available: http://aws.amazon.com/ec2
[43] Google Website, Google, 2011 [Online]. Available: http://code.google.com/appengine
[44] Microsoft Website, Microsoft, 2011 [Online]. Available: http://www.microsoft.com/windowsazure
[45] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in Proc. 6th Conf. Operating Systems Design & Implementation, San Francisco, CA, Dec. 6–8, 2004.
[46] Yahoo Website, Yahoo, 2008 [Online]. Available: http://developer.yahoo.com/blogs/hadoop/posts/2008/07/apache_hadoop_wins_terabyte_sort_benchmark/
[47] G. Wang, A. R. Butt, P. Pandey, and K. Gupta, “Using realistic simulation for performance analysis of MapReduce setups,” in Proc. LSAP, Munich, Germany, Jun. 11–13, 2009.
[48] G. Wang, A. R. Butt, P. Pandey, and K. Gupta, “A simulation approach to evaluating design decisions in MapReduce setup,” in Int. Symp. Modelling, Analysis and Simulation of Computer and Telecommunication Systems, London, U.K., Sep. 21–23, 2009.
[49] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, “An analysis of traces from a production MapReduce cluster,” in Proc. 10th IEEE/ACM Int. Conf. Cluster, Cloud and Grid Computing, Melbourne, Australia, May 17–20, 2010.
[50] D. Borthakur, “The Hadoop Distributed File System: Architecture and Design,” 2007 [Online]. Available: http://hadoop.apache.org/common/docs/r0.18.0/hdfs_design.pdf

Zeqiang Chen received the B.Sc. degree in geography from Huazhong Normal University, China, in 2006, and the M.S. and Ph.D. degrees in geographical information systems from Wuhan University, China, in 2008 and 2012, respectively.
He was a Research Associate at the Center for Spatial Information Science and Systems (CSISS), George Mason University, Fairfax, VA. His current research interests include Sensor Web and Smart City.

Nengcheng Chen received the B.Sc. degree in geodesy from Wuhan Technical University of Surveying and Mapping, China, in 1997, the M.S. degree in geographical information systems from Wuhan University in 2000, and the Ph.D. degree in photogrammetry and remote sensing from Wuhan University in 2003.
He was a post-doctoral research associate with the Center for Spatial Information Science and Systems, George Mason University, Greenbelt, MD, from 2006 to 2008. Currently, he is a Professor of geographic information science at the State Key Lab for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, Hubei, China. His research interests include Smart Planet, Sensor Web, Semantic Web, Digital Antarctica, Smart City, and Web GIS.
Prof. Chen is a member of the International Association of Chinese Professionals in Geographic Information Sciences (CPGIS). He was the chair of the 2010 CPGIS Young Scholar Summer Workshop.

Chao Yang received the B.S. and M.S. degrees from East China Institute of Technology in 2006 and 2009, respectively. Currently, he is pursuing the Ph.D. degree at the State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, China.
Since 2010, he has been a Research Associate at the Center for Spatial Information Science and Systems (CSISS), George Mason University, Fairfax, VA. His research topics include sensor web, satellite geometry calibration, cloud computing, and SOA technology.

Liping Di received the B.Sc. degree in remote sensing from Zhejiang University, China, in 1982, the M.S. degree in remote sensing/computer applications from the Chinese Academy of Science, Beijing, in 1985, and the Ph.D. degree in geography from the University of Nebraska-Lincoln in 1991.
He was a Research Scientist at the Chinese Academy of Science from 1985 to 1986 and at the NOAA National Geophysical Data Center from 1991 to 1994. He served as a Principal Scientist from 1994 to 1997 and a Chief Scientist from 1997 to 2000 at Raytheon ITSS. Currently, he is a Professor of geographic information science and the Director of the Center for Spatial Information Science and Systems, George Mason University, Fairfax, VA. His research interests include remote sensing, geographic information science and standards, spatial data infrastructure, global climate and environment changes, and advanced Earth observation technology.

Page 14: Zeqiang Chen, Nengcheng Chen, Chao Yang, and Liping Di ...swe.whu.edu.cn/cnc_web/paper/19.pdfproposed cloud computing enabled WPS are outlined, followed by aworkflow that processes

IEEE

Pro

of

Prin

t Ver

sion

IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 5, NO. 6, DECEMBER 2012 1

Cloud Computing Enabled Web Processing Servicefor Earth Observation Data ProcessingZeqiang Chen, Nengcheng Chen, Chao Yang, and Liping Di, Senior Member, IEEE

Abstract—The OpenGIS Web Processing Service (WPS) canprocess both simple and complex geospatial tasks including EarthObservation tasks. As the requirements of Earth Observationdata, algorithms, calculation models, and daily life become in-creasingly complicated; WPS needs to provide high-performanceservice-oriented computing capability. This paper proposes acloud computing enabled WPS framework for Earth Observationdata processing. It consists of a client layer and a WPS layer,which further consists of a WPS server layer and a cloud com-puting layer. The cloud computing environment is based on theopen-source software Apache Hadoop. The three layers of theproposed cloud computing enabled WPS are outlined, followed bya workflow that processes a user’s task using these three layers.Then technological implementation details are explained. Anexperiment processing Moderate Resolution Imaging Spectro-radiometer (MODIS) data shows that WPS can be enabled in acloud computing environment.

Index Terms—Apache Hadoop, cloud computing, earth observa-tion data processing, service oriented architecture, web processingservice.

I. INTRODUCTION

E ARTH Observation (EO) is most often referring to satel-lite imagery or satellite remote sensing [1], which applied

upon atmosphere, land, and ocean. The obtained data of EarthObservation named Earth Observation Data (EOD) is widelyused in some fields of society lives and scientific researchsuch as climate, weather, agriculture, ecosystems, biodiversity,water, and disasters migration, forecasting or reduction. Forthis reason, in recent decades, EOD and EOD Processing havebeen fairly studied. EOD and many of EOD Processings havethese properties:1) Large volumes of EOD retrieved every day and they arein various data types. There are hundreds of satellites ob-serving the earth every day directed by NASA (National

Manuscript received May 29, 2011; revised September 23, 2011; acceptedMay 17, 2012. This work was supported in part by the National Basic Re-search program of China under Grant 2011CB707101, by the National Na-ture Science Foundation of China program under Grant (41023001, 41171315and 41021061), by the Program for New Century Excellent Talents in Univer-sity under Grant NCET-11-0394, and by the Chinese 863 program under Grant2012AA121401.Z. Chen and C. Yang are with the State Key Lab for Information Engineering

in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University,Wuhan 430079, China.N. Chen is with the State Key Lab for Information Engineering in Surveying,

Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan430079, China (corresponding author, e-mail: [email protected]).L. Di is with the Center for Spatial Information Science and Systems (CSISS),

George Mason University, Fairfax, VA 22032 USA.Color versions of one or more of the figures in this paper are available online

at http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/JSTARS.2012.2205372

Aeronautics and Space Administration) and NOAA (Na-tional Oceanic and Atmospheric Administration) in USA,except the satellites of Europe, China, Japan and others.The EOD is obtained over petabyte daily. Meanwhile,there are plenty of data types for EOD like Landsat, SPOT,IKONOS, QuickBird, AVHRR, EOS, ALOS, and also lotsof data formats, like ASCII, HDF, PICT, TIFF/GeoTIFF.

2) There are many EOD Processing models and methods.Because of abound of EOD data types, data formats, anddata applications, event if they are the same, many EODProcessing models and methods are studied and used. Forexample, there are several vegetation condition indicesfor crop condition monitoring just based on NormalizedDifference Vegetation Index (NDVI), as Mean VegetationCondition Index (MVCI), NDVI Ratio to Previous Year(RPNDVI), NDVI Ratio to Median (RMNDVI), Vegeta-tion Condition Index (VCI) [2].

3) The requirements of EOD and its processing are increasingand diverse. EOD ismade use of inmany society fields, andwill be used in more and more aspects in society lives. Forthe same EOD, different users may process it for differentpurposes and with different methods.

4) Stand-alone and Web-based EOD Processing servers bothexist in application. A Stand-alone server can only sup-port an EOD Processing within a local environment. Viceversa, a Web-based server can provide a collaborating en-vironment between multiple Web-connected servers. Thecapacity of EOD Processing for a stand-alone server islimited by its physical storage capacity and Central Pro-cessingUnit (CPU) calculating speed – these two notoriousbottlenecks. Web-based EOD Processing can cluster manyservers on Web for distributed and high-performance pro-cessing.

Though having the above properties, with the developmentof requirements, the demands of EOD Processing have becomeincreasingly diversified. Thus EOD Processing maybe confrontthe following difficulties:1) Difficulties emerged for some EOD Processings to beinvoked, operated, accessed and managed via Web en-vironment. For historic reasons, before the evolution ofweb storage technology, some EOD centers only save andmanage EOD in a local environment, and right now usersfind them hard to handle among different servers. Mean-while, with these large-volume data distributed amongseveral locations, users have to encounter a technologicalproblem – how to optimize distributed computing andhigh-performance computing in the same time.

2) Some EOD Processings fail to be easily shared and inter-preted. This is because many EOD Processings are just

1939-1404/$31.00 © 2012 IEEE

Page 15: Zeqiang Chen, Nengcheng Chen, Chao Yang, and Liping Di ...swe.whu.edu.cn/cnc_web/paper/19.pdfproposed cloud computing enabled WPS are outlined, followed by aworkflow that processes

IEEE

Pro

of

Prin

t Ver

sion

2 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 5, NO. 6, DECEMBER 2012

complied with national or local organizations’ standards,and need more widely recognition.

3) Some EOD Processings are not flexible and extensiveenough. Lot of EOD Processing systems are only suitableto special models or methods, but not flexible and exten-sive enough to process diverse EOD data types and dataformats, EOD Processing models and methods, and therequirements of users.

4) Some processing fails in its performance. Due to the hugesize of data, the calculation and processing of EOD cancreate undesired time delays. The volumes of EOD arehuge, and the calculating performance can not meet timerequirement, especially for real-time or near real-time ap-plication, like emergency rescues and disaster relief work.

In this article, an EOD processing which can overcome all thefour types of difficulties above should be a Web-based, inter-national standard compatible, and flexible/extensive EOD pro-cessing. To obtain such EOD processing, Web service tech-nology, Open Geospatial Consortium (OGC) Web ProcessingService (WPS) specification, and cloud computing technologymay provide us a new way of thinking and realizing.Web service technology is widely used. The W3C defines

Web service as follow: “A Web service is a software system de-signed to support interoperable machine-to-machine interactionover a network. It has an interface described in a machine-pro-cessable format (specifically WSDL). Other systems interactwith the Web service in a manner prescribed by its descrip-tion using SOAP-messages, typically conveyed using HTTPwith an XML serialization in conjunction with other Web-re-lated standards.” [3] A Web service is self-describing, reusable,and highly portable. Its advantages are apparent: it is easily con-structed, rapid, low-cost, secure, and reliable. Web service tech-nology is a useful technology in many aspects of daily life andscience research.The OGC has developed many Web service interface stan-

dards for geospatial information sharing and interoperation. Ithas developed dozens of standards such as Web Map Service(WMS), Web Feature Service (WFS), Web Coverage Service(WCS), and WPS [4]. WPS defines a standardized interfacethat facilitates the publishing of geospatial processes, and thediscovery of and binding to those processes by clients. WMS,WFS, and WCS emphasize data services, while WPS focuseson data/task processing services.Though there is little consensus on the definition of Cloud

computing [5], [6], a good feature of Cloud computing is: itis an Internet-scale computing, service-oriented computingand high-performance computing. Foster et al. [7] comparedGrids and Clouds from a variety of perspectives: architecture,business models, programming models, applications, etc., andreached a conclusion that some advantages of cloud computingare: paying the cost for software by consumption, Internet-scalecomputing, and service-oriented computing. As services, threemain levels of Cloud services are generally agreed upon in anX-as-a-Service manner (X is short for Infrastructure, Platform,or Software) [5], [8]. Meanwhile, Cloud computing is evolvedout of Grid computing and relies on Grid computing as itbackbone and infrastructure [5], so it has a high performancecomputing capabilities, and is also applied in some aspects [9],

[10]. As Cloud computing has above feature and successes inmarket, lots of software frameworks for Cloud environment aredeveloped. The Apache Hadoop is one of these frameworks.The Apache Hadoop is open-source software for reliable,scalable, distributed computing, and its software library is aframework that allows for the distributed processing of largedata sets across clusters of computers using a simple program-ming model [11]. It can be deployed for a Cloud environment[12].The objective of this paper is to propose a cloud computing

enabled WPS framework for Earth Observation data pro-cessing. The main work of this paper is to couple WPS withApache Hadoop, and evaluate the feasibility of implementinga WPS framework in the cloud computing environment byusing Apache Hadoop for EOD Processing. The advance of thisframework for EOD Processing compared with the Processingsmentioned above is it covers good properties (Web-based,international standard compatible, and flexible/extensive) atthe same time to overcome the difficulties illuminated above.Use Hadoop to establish a Cloud environment for it is free,open-source, and widely and successfully applied by hundredsof thousands of institutions and companies [13].The rest of the paper is organized as follows: related work

is overviewed in Section II. Section III introduces WPS andHadoop. Sections IV and V separately presents the frameworkand workflow of cloud computing enabled WPS. In Section VI,experiments are done to show that the proposed framework andworkflow is feasible. Finally, Section VII summarizes the con-clusions and identifies future work.

II. RELATED WORK

EOD Processing is an ongoing object of study in recentdecades. As early as 1993, the European EOD Processing andinterpretation services were analyzed the sector and conditionsfor its development [14]. Later, obtaining EOD, low-level pro-cessing, and EOD center managing were concerned [15]–[17].Automating the processing of EOD was developed in Terres-trial Observation and Prediction System using a planner-basedagent to automatically generate and execute data-flow pro-grams to processing the EOD [18]. The processing should bewritten in a specified DPADL language, and the developershould know it. This may cause confusion for many users.Automatic Earth observation data service based on reusablegeo-processing workflow and real-time EOD processing hadbeen addressed [19], [20], but it less focused the performanceof this system under huge volumes of data. To meet the storageand computational requirements of EOD, some research effortswere placed on grid technologies. Aloisio et al. [21] showedhow grid technologies and high performance computing can beused dynamically and efficiently for on-demand processing. Itwas a grid-enabled, high performance digital library of remotesensing image. Aloisio et al. [22] studied grid computing envi-ronment for EOD Processing and integrated jobs as a workflow.EOforge was a generic open framework for EOD Processingsystems [23]. Granell et al. used distributed geoprocessingservices for managing EOD [24]. Gorgan et al. developed EODProcessing platform for EO application development [25],[26]. Clouding computing for satellite data processing on high

Page 16: Zeqiang Chen, Nengcheng Chen, Chao Yang, and Liping Di ...swe.whu.edu.cn/cnc_web/paper/19.pdfproposed cloud computing enabled WPS are outlined, followed by aworkflow that processes

IEEE

Pro

of

Prin

t Ver

sion

CHEN et al.: CLOUD COMPUTING ENABLED WEB PROCESSING SERVICE FOR EARTH OBSERVATION DATA PROCESSING 3

end compute clusters based on Hadoop was evaluated [27].Some of these were not standard-based or not flexible enough.Recently, WPS facilitate EO Processing by providing standardinterfaces [28]–[30].WPS has been developed and utilized [31]–[35]. 52n WPS

[36] is a good WPS framework. It has a pluggable architec-ture for processing and data encoding. A Web process can beadded into 52n WPS using a plug-in method that the processmust implement as an abstract process interface. Now, 52nWPSis mainly deployed on stand-alone servers. It has been furtherevaluated in the grid/cloud computing [34], [37]. Deegree WPS[38] fully supports OpenGIS WPS 1.0.0 specification. It pro-vides a configurable basis to plug in processes. It uses a processprovider layer to integrate popular geo-processing frameworkssuch as Sextante [39], FME [40], and GRASS [41]. It is de-ployed mainly in a stand-alone computing environment. WPSfor EOD Processing is not seen much in literature.These WPS frameworks are utilized mainly in a stand-alone

computing environment. Besides this utility, WPS also needsto be implemented in a distributed computing environment.Actually, much processing of EOD is complex. WPS canprocess both simple and complex EOD, for the processes ofWPS can be any algorithm, calculation, or model that op-erates on spatially referenced data include EOD [4]. Giventhe large amounts of EOD, increasingly complex calculationmodels, and untrustworthy multiple heterogeneous processingresources and tasks, the high-performance implementations ofWPS processing are badly needed. Several high-performancecomputing and technological methods, such as grid computing,distributed computing, and parallel computing [34], [35] havebeen evaluated for WPS implementations, while little researchhas been performed on enabling WPS in a cloud computingenvironment. The Amazon’s Elastic Computing Cloud (EC2)[42], Google’s AppEngine [43] and Microsoft’s Azure Plat-form [44] support cloud computing features at platforms,software packages and services levels. However, those cloudcomputing services are commercial and not completely freeand open source. Instead, the Apache Hadoop core providesan open-source framework for cloud computing, as well as adistributed file system [12]. The goal of the Apache Hadoopproject is to develop open-source software for reliable, scal-able, distributed computing [11]. Hundreds of institutions andcompanies, including such giant software companies, such asIBM, Yahoo, and Facebook are either directly using Hadoopor constructing services on top of Hadoop [13]. We expect thatenabling a free and open-source WPS computing platform onHadoop is very promising.

III. WPS AND APACHE HADOOP

A. Overview of WPS

WPS is one of the OGC implementation specifications. Itprovides standard interfaces to process tasks submitted by aclient on theWeb for function sharing and interoperation. Thereare there mandatory operations in WPS: GetCapabilities, De-scribeProcess, and Execute.1) The GetCapabilities operation allows clients to retrieveservice metadata from a server. It is a common operation in

OGC Web Service (OWS). The request parameters in thisoperation are version, service, and request. The responsedocument contains service identification, service provider,operations metadata, and process offerings.

2) The DescribeProcess operation allows clients to requestthe description of one or more processes that can be ex-ecuted by the Execute operation. A key request parameteris process Identifier, which is the identifier of a process inWPS. The response document shows the input parametersand the output format of the process. Knowing these pa-rameters, the user knows how to request and execute a task.

3) The Execute operation allows clients to run a specifiedprocess using the input parameters described by the De-scribeProcess response document and obtain the output re-sults. The Execute operation is the core part of WPS.

The general steps of handling a task combining these oper-ations are as follows. First, use GetCapabilities to obtain themetadata of a WPS and the processes with which the WPScan deal. Then choose one of these processes and request De-scribeProcess operation to see what its inputs and outputs. Fi-nally execute the Execute operation according to the inputs re-quired.

B. Overview of Apache Hadoop

Apache Hadoop contains subprojects: Hadoop Common,Hadoop Distributed File System (HDFS), and MapReduce.Hadoop Common is a set of utilities that support the Hadoop

subprojects, including FileSystem, Remote Procedure Call(RPC), and serialization libraries.HDFS is a distributed file system in a mass computer server

cluster. It provides high throughput access to application dataeven in a bargain-priced computer cluster. In HDFS, there aretwo types of node (a node is a Java Virtual Machine or a com-puter) for managing and storing data called NameNode andDataNode respectively. NameNode serves as both a directorynamespace manager and “inode table” for HDFS. Generallyspeaking, there is either a single NameNode running or thereis also a secondary backup/failsafe NameNode in an HDFS de-ployment. DataNode is the actual place where data is stored ina set of blocks. A data block has a configured size with a de-fault of 64 megabytes. A data set larger than that size will bepartitioned into blocks. In an HDFS deployment, there are al-ways thousands of DataNodes. The organization of NameNodeand DataNodes is a master-slave structure. When working, theclient uses ClientProtocol to interact with NameNode for HDFSservice. NameNode implements the DatanodeProtocol interfaceand DataNode uses DatanodeProtocol to communicate with Na-meNode. DataNodes spend their lives in an endless loop ofreporting their conditions or asking the NameNode for some-thing to do with what is called the heartbeat method. The heart-beat method is the remote method called by the DataNodes thatperiodically lets the NameNode know that DataNode is stillalive. DataNodes can delete blocks or copy blocks to/from otherDataNodes.Hadoop MapReduce is a programming model and software

framework for writing applications that rapidly process vastamounts of data in parallel on large clusters of computing nodes.

Page 17: Zeqiang Chen, Nengcheng Chen, Chao Yang, and Liping Di ...swe.whu.edu.cn/cnc_web/paper/19.pdfproposed cloud computing enabled WPS are outlined, followed by aworkflow that processes

IEEE

Pro

of

Prin

t Ver

sion

4 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 5, NO. 6, DECEMBER 2012

Fig. 1. Hadoop MapReduce workflow.

Dean and Ghemawat [45] first introduced MapReduce. Its prin-ciple is that when it processes a task there are two core steps.One is the map phase, in which a key/value pair is processed togenerate a set of intermediate key/value pairs, and the other isreduce phase, in which all intermediate values associated withthe same intermediate key are processed. The purpose of themap phase is to divide a task into several independent subtasksto run in parallel, and the reduce phase collects all the subtaskresults to obtain the result of the whole task. In Hadoop,MapRe-duce is running on HDFS. It has two important Trackers namedJobTracker and TaskTracker to process and manage a task co-operatively. JobTracker is responsible for startup, tracking, andinvoking the tasks submitted by the client. TaskTracker is re-sponsible for managing local data, processing data, collectingthe results, reporting status, and the processing of a task to Job-Tracker. TaskTracker is deployed on DataNode, but JobTrackercan be deployed on NameNode or a different server. Like Na-meNode and DataNodes, there are always a JobTracker andmany TaskTrackers. Fig. 1 shows howHadoopMapReduce pro-cesses a job.When JobTracker receives a job (also a user may call it a

task), it will dispatch the job into several TaskTrackers. Task-Tracker splits the needed data into pieces. The Map functionreceives a key/value pair to generate a set of intermediate key/value pairs. The intermediate key/value pairs with the same keyare combined and partitioned them according the number ofreduce tasks/output files. Shuffle fetches the relevant partitionof the output of all the Mappers as the inputs of reduce, andthen Sort separates the inputs by keys, since different Mappersmay have the same key as output. After Shuffle and Sort, all thekey/value pairs are different, having form. Fi-nally, all the pairs are reduced and the result output.Hadoop MapReduce is a complex programming framework,

but it is very simple to write applicable MapReduce program-ming for a user. The steps (MAPREDUCESTEPS) in the pro-gram are as follows:MapReduce main steps{Step 1: configure the parameters.Step 2: create a job and set a job name.Step 3: set jar by class name to copy all the programming

code the jobs needs and dispatch it to other Task-Trackers.

Step 4: set map, combine, partition, and reduce class.

Step 5: set the input/output key/value class of the map andreduce functions.

Step 6: set the input and output paths of the job.Step 7: wait for the completion of the job.}In an application, the user need only develop the classes

shown in step 3 to step 6.

C. Performance of Hadoop

The performance of Hadoop has been tested and reported. In2008, “One of Yahoo’s Hadoop clusters sorted 1 terabyte of datain 209 seconds, which beat the previous record of 297 seconds inthe annual general purpose (Daytona) terabyte sort benchmark.The sort benchmark, which was created in 1998 by Jim Gray,specifies the input data (10 billion 100 byte records), whichmustbe completely sorted andwritten to disk” [46]. The cluster statis-tics were 910 nodes, 2 quad core Xeons @ 2.0 GHz per a node,4 SATA disks per node, 8G RAM per node, Red Hat EnterpriseLinux Server, and Sun Java JDK 1.6. There were 1800 mapsand 1800 reduces in this sort. In 1998, there were over 13,000Hadoop nodes in Yahoo, but now the number is over 40,000.Hundreds of thousands of jobs now run in a month.Some groups [47]–[49] have tested/analyzed/evaluated/sim-

ulated the performance of Hadoop MapReduce or proposedtheir methods. These groups tested the effect on the perfor-mance of MapReduce when the parameters changed. Goodparameter tuning improves performance. HDFS is highlyfault-tolerant, designed to be deployed on low-cost hardware,with high throughput access to large data sets [50].In fact, the PoweredBy [13] website lists hundreds of users

of Hadoop, with several to thousands of nodes and data fromKB to TB size. All those applications are a good affirmation ofHadoop.The applications show that Hadoop is suitable for high-per-

formance computing; it can run in a cluster with thousands ofservers, and process vast amounts of data.

IV. DESIGN

As mentioned in Section I, a smart EO Processing system isneeded with four obvious properties: Web-based, internationalstandard compatible, flexible/extensive, and high performance

Page 18: Zeqiang Chen, Nengcheng Chen, Chao Yang, and Liping Di ...swe.whu.edu.cn/cnc_web/paper/19.pdfproposed cloud computing enabled WPS are outlined, followed by aworkflow that processes

IEEE

Pro

of

Prin

t Ver

sion

CHEN et al.: CLOUD COMPUTING ENABLED WEB PROCESSING SERVICE FOR EARTH OBSERVATION DATA PROCESSING 5

Fig. 2. Architecture of cloud computing enabled WPS.

computing. To achieve this system,Web service and Cloud com-puting technologies as well as WPS specification are integrated.There are at least two things to be considered in the architec-ture of this integrated system: the first one is it should be a Ser-vice Oriented Architecture (SOA) to make the components areloosing couple and machine and platform independent to allowworldwide use on the Internet. The other one is changing a bitthe internal structure of WPS. In a stand-alone server, WPS re-ceives request and processes a task in the same logic layer. Butin this architecture, the WPS is spit into two distributed parts.One is WPS request and response server as a WPS server layer;and the other is Cloud environment layer which is response forprocess task. The key point of this seamlessly is coupling a flex-ible WPS with Hadoop for EO Processing.Fig. 2 shows the system architecture of a WPS platform in

a cloud computing environment, which consists of two layers:Client and WPS. The WPS layer is a layer providing WebProcessing Service. This layer can be further divided into twolayers: the WPS server and Cloud Computing.The client layer is the application layer, where users dis-

tributed all over the Web use WPS.The WPS server layer is deployed on only one WPS server

or a WPS server cluster. The WPS server in this layer is a Webservice server that receives a WPS standard request and returnsits standard response, but does not do any processing. OneWPSserver can provide one kind of EOD Processing service or sev-eral; meanwhile, one type of EOD Processing service can bedeployed in a WPS server or even several in the whole cluster.The Cloud Computing layer is the cloud computing envi-

ronment for executing all EOD Processing tasks submitted toWPS server(s). A client submits a task to a WPS server, theserver only verifies all the input parameters of the EOD Pro-cessing are correct, and then it submits the task to the cloudcomputing layer to process. AWPS provider can use a differentcloud computing environment, such as Hadoop, Amazon’s EC2,and Google cloud. In this paper, Hadoop is considered to be thecloud computing environment, for the fact that Hadoop is anopen source and yet good-performance framework successfullyused by many as being deployed on low-cost computers andrequiring little for hardware, and MapReduce in Hadoop is agood parallel programming model suitable for WPS. ChoosingHadoop as WPS processing environment never means no otherchoice. Appling WPS on more high-performance computing/Cloud environment is our target. It has two reasons: the first

one is different environment may have especial advantages indifferent application, and the other one is WPS needing devel-oped and studied on more application framework.In this architecture, one of the critical issues is how to make

WPS flexible enough for diverse EO Processing. The other ishow to integrate WPS with Hadoop. For the former, WPS needssupport different kinds of EO processing algorithms. For thelatter, WPS invokes Hadoop likes running itself. The parameterinput, programmanagement, status control, and result obtainingof Hadoop are able to do by WPS. In order to try these twoproblems, a rewrite method and WPS interaction with Hadoopare introduced, and see them in Section V.

V. IMPLEMENTATION

In the architecture of cloud computing enabled WPS, theClient requests WPS; the WPS server is responsible for col-lecting the requests/tasks and then sending the job to a cloudcomputing layer; the cloud computing layer runs the job. Thekey implementation of this architecture is the client interactingwith the WPS server and the WPS server interacting with thecloud computing layer. These two interactions are reflected inthe concrete implementation of cloud computing enabled WPS,as shown in Fig. 3.Fig. 3 shows aWPS using aWPS server and a Hadoop cluster.

The WPS server implements three WPS interface operationsGetCapabilities, DescribeProcess, and Execute; also, it providesGetResult and GetStatus operations for users to obtain the re-sults of a task and its status. All those operations are exposed tousers by the Web Services Description Language (WSDL). TheHadoop cluster is deployed for cloud computing. NameNode,DataNode, JobTracker, and TaskTracker are also deployed.In order to explain the concrete implementation of each detail

above, assume a user wants to use WPS to process a task calledTASK_A. Several steps as shown in Fig. 3 occur in the systemfor TASK_A.Steps 1 and 2 are request/response. A user requests theGetCa-

pabilities operation to obtain the metadata that this WPS servercan provide. From this GetCapabilities response document, theuser can find the process service ID. Then the user uses this ID toinvoke the DescribeProcess operation to find the input parame-ters and the output result format. Generally speaking, theGetCa-pabilities operation and the DescribeProcess operation are usedfor the Execute operation. Knowing what the input and outputof a task, a user can invoke the Execute operation to run the task.

Page 19: Zeqiang Chen, Nengcheng Chen, Chao Yang, and Liping Di ...swe.whu.edu.cn/cnc_web/paper/19.pdfproposed cloud computing enabled WPS are outlined, followed by aworkflow that processes

IEEE

Pro

of

Prin

t Ver

sion

6 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 5, NO. 6, DECEMBER 2012

In these operations,WSDL is used to explore these operationsas Web service methods. WSDL provides a model and XMLformat for describing services, and it allows separation of thedescription of the abstract function offered by a service fromdescription of its details. In general, aWSDL instance is dividedinto seven parts: Types,Message, Operation, portType, Binding,Port, and Service. Fig. 4 shows the uniform WSDL of WPS.In this WSDL instance, there are three kinds of portTypes

(HTTP GET, POST, and SOAP) and their corresponding bind-ings, several operations, and a service. All the input and outputof the operations are pre-defined as a message that is a compo-nent of WSDL. Users can invoke those operations by parsingWSDL.Steps 3, 4 and 5 execute a task using the Execute operation.

The Execute operation has two responsibilities: one is to submitinput data to Hadoop through a data submitter, and the otheris to submit a MapReduce task using a MapReduce submitter.The first step is to ensure that all the EOD are in HDFS, theonly place that Hadoop can access. The data submitter is like aHadoop HDFS client, and uses the Application ProgrammingInterface (API) of the Hadoop file system to interact withHadoop HDFS. A Java class named DistributedFileSystemis used to tell the NameNode when data will be written toHadoop HDFS. Then, NameNode records the written EODinformation. A Java class named DFSOutputStream is returnedby DistributedFileSystem. The data submitter uses DFSOut-putStream to write the EOD to Hadoop HDFS. If the EOD isactually in Hadoop HDFS, there is no need to invoke the datasubmitter. The implementation method of the data submitter isthe dataSubmit method. After all the data has been prepared,the MapReduce submitter submits a task to JobTracker to runthe task. The programming for writing and submitting a job toJobTracker was introduced in Section II.To help users develop a WPS process conveniently and

easily, this paper introduces a rewrite method. A rewritemethod rewrites some of the defined methods and classes for atask. Fig. 5 shows rewrite methods used in some componentsin processing a task for a flexible WPS.In Fig. 5, the Execute operation request includes the process

ID, input, and output parameters. Using the process ID, a spe-cial Parser can be chosen for the process. This Parser is usedfor parsing process ID, input, and output parameters. Then theMapReduce submitter executes a task. Finally, Generator en-capsulates the response. A process ID is associated with a typeof task, a Parser, and a Generator. The relationships can be con-figured in a configurable file, which is created at initiation. TheParser and Generator must be rewritten in accordance with thetype of task. The MapReducer submitter is a conceptual ag-gregate of rewrite methods and classes needed for submittinga task to Hadoop. The methods are dataSubmit, setParameters,MapReduce, and run; the classes are MapReduce classes. Themeaning and operation of these methods and classes are:1) The dataSubmit method is a rewrite when EOD must becopied to HDFS, and is the data submitter implemented.

2) The setParameters method is a rewrite for setting the inputand output parameters or other parameters to the configu-ration of Hadoop. The purpose of the rewrite method is todeliver input and output parameters to MapReduce classes

and methods. The setParameters method has fixed inputparameters which are Hadoop configuration interface andinput/output parameters from client request.

3) The MapReduce classes are those needed in MAPRE-DUCESTEPS for running a MapReduce program. Someparameters used in these classes may set by the setParam-eters method.

4) The MapReduce methods are map and reduce methods.Rewriting should bemandatory. They are the core methodsof the MapReduce process, and both extend the MapRe-duceBase class and separately implement the Mapper in-terface and Reducer interfaces. They are also needed inMAPREDUCESTEPS.

5) A run method is the method used to run a MapReduceprocess. The input parameters of this method are theclient’s input parameters, and output parameters arethe client’s output parameters. In this method, the set-Parameters method, the dataSubmit methods, and theMAPREDUCESTEPS are invoked in sequence.

Step 6 is the JobTracker, which dispatches the task to Task-Trackers to run. The relation between JobTracker and Task-Trackers was explained in Section II.Steps 7, 8, and 9 are obtaining the status information of the

task. Because WPS processing tasks usually take a long time,an asynchronous mechanism is suitable for avoiding long-termwaits and reducing network congestion. General speaking, thereare two asynchronous communication mechanisms: pull andpush. In the pull mechanism, the client sends a request and thenperiodically checks whether a task is completed. This mecha-nism may have a high processing overhead on the server. Thepush mechanism is to notify the client when the result is ready.The pushmechanism causes less traffic on theWeb. Pull is spec-ified inWPS 1.0 while both push and pull are taken advantage inWPS 2.0 which is in discussion. When the user submits an Exe-cute operation, an immediate response is returned to the client.There is a status checking operation called GetStatus and thetask ID available to the user. The user can check the task statusat any time before it completes. When the task is finished, theresult URL is shown to the user in the status response. The Get-Status operation is a wrapper, which can invoke a task statuswith the task ID.Results are obtained and read in steps 10 to 13. The GetRe-

sult operation will be invoked once the task has been finished– it works as a client interface for HDFS with HDFS program-ming API being called to read data. The Java Class Distributed-FileSystem handles the interaction with NameNode in returningthe results of a DataNode in an FSDataInoutStream object. Theread method of FSDataInputStream can be used to read all theresult data. Finally, WPS returns the result to the client.

VI. EXPERIMENT

To test the feasibility of cloud computing enabled WPS forEOD Processing, an experiment has been done. Seemingly, thewhole performance of the framework is determined by the per-formance of Hadoop, since the WPS server is only a mediathat receives task assignments from client and forwards themto Hadoop, and the real processing only takes place in Hadoop.The WPS server receives an XML document via the SOAP/

Page 20: Zeqiang Chen, Nengcheng Chen, Chao Yang, and Liping Di ...swe.whu.edu.cn/cnc_web/paper/19.pdfproposed cloud computing enabled WPS are outlined, followed by aworkflow that processes

IEEE

Pro

of

Prin

t Ver

sion

CHEN et al.: CLOUD COMPUTING ENABLED WEB PROCESSING SERVICE FOR EARTH OBSERVATION DATA PROCESSING 7

Fig. 3. Implementation of cloud computing enabled WPS.

HTTP protocol and parses it. This process requires at most afew seconds. For a very complex task, such as processing EODat a high volume, it may take a long time to finish on ApacheHadoop which sometimes mean up to several days. The perfor-mance of Hadoop has been tested, and the results are as stated inSection III. The key point with which this experiment was con-cerned was whether both the client interaction with the WPSserver and the WPS server interaction with Hadoop (hereinaftercalled the two interactions) were smooth without any problems.To demonstrate the interactions, an experimental scenario calledNDVIWPS which was a WPS project to calculate NormalizedDifference Vegetation Index (NDVI) using MODIS data wascarried out.

A. Experimental Background

NDVIWPS is the initial preparation work on a project namedVegetation Condition Project (VCP) sponsored by the NationalAgriculture Statistics Service (NASS) of the United States De-partment of Agriculture (USDA). The intent of VCP is to de-velop a web-based, service-oriented, automated system to over-come the shortcomings of some existing systems. The VCP isused to evaluate, monitor, and forecast the vegetation conditionof the 48 continental states of America. The vegetation condi-tion is evaluated by some scientific statistics indexes such asthe Normalized Difference Vegetation Index (NDVI), and theVegetation Condition Index (VCI). The calculated equations forNDVI and VCI are separate as (1) and (2), respectively.

(1)

(2)

In (1), NIR is the near-infrared band data and IR is the infraredband data of a pixel. In (2), NDVImin is the minimumNDVI andNDVImax is the maximum NDVI for a pixel during a specifictime. The present source of VCP data is the Moderate Resolu-tion Imaging Spectroradiometer (MODIS). At present, the VCPdata is sourced from MODIS, namely the “MODIS/Terra Sur-face Reflectance Daily L2G Global 250m SIN Grid v005”. Thisdataset provides daily coverage of 48 continental states in 25HDF data files which is about 2 gigabytes. So to speak, all theneededMODIS data from year 2000 to 2011 will be added up to100,000 HDF files with their sizes over than 8 terabytes. Givenso many files and such a large size, providing a web service thatis to compute with high performance and publish results effec-tively, is indeed a challenge. These problems have motivatedtests of cloud computing enabled WPS.

B. Experimental Instance

A WPS server and Apache Hadoop version 0.21.0 aredeployed in a single-node computer and run in a pseudo-dis-tributed mode where each Hadoop daemon runs in a separateJava process. The WPS server is based on 52N WPS, butwith some improvements: 1) adding the actual GetResult andGetStatus SOAP/HTTP Web service interfaces; 2) addingthe Hadoop processing interface and improving the rewritemethod. In this instance, calculating daily NDVI from 25 HDFfiles in 7 days (from 05/04/2010 to 05/10/2010), calculatingweekly NDVI from 7 days (from 05/04/2010 to 05/10/2010),and transforming these daily and weekly NDVI from the SINcoordinate system to Latitude/Longitude coordinate systemand Albers coordinate system, are tested. The Two Interactionsof these three tests are detailed in Section VI.

Page 21: Zeqiang Chen, Nengcheng Chen, Chao Yang, and Liping Di ...swe.whu.edu.cn/cnc_web/paper/19.pdfproposed cloud computing enabled WPS are outlined, followed by aworkflow that processes

IEEE

Pro

of

Prin

t Ver

sion

8 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 5, NO. 6, DECEMBER 2012

Fig. 4. The uniform WSDL of WPS.

Fig. 5. Rewrite method for flexible WPS.

1) Client Interaction With the WPS Server: The first step ofclient interaction with the WPS server is for a user to retrievethe WSDL to know the operations, bindings, and services,depicted in Fig. 4. Then the getCapabilities operation is re-quested to obtain the identifiers of processes, for example“edu.gmu.csiss.wps.DailyNDVI”, “edu.gmu.csiss.wps.Week-lyNDVI”, and “edu.gmu.csiss.wps.CoordinateConvert”. Third,using these identifiers request describeProcess operation to ob-tain the input and output parameters of their execute operationseparately. Here, the input parameters will be the input filespaths, while the output parameters will be the output paths ofresult files.

After knowing the inputs and outputs of process, Execute op-eration can be now carried out. An asynchronous mechanism isused for this operation. An instantaneous response is returnedto the client; an example response can be seen in Fig. 6 whichis the result from calculating daily NDVI. The statusLocationpoints to the URL where the task status can be retrieved. Fig. 7shows the process.After submitting a task to WPS, the user may invoke Get-

Status filling the information for task ID of this task to knowthe status of the specific job. When a job is finished, the usermay request for the result. The result will be read from HDFSas described in next section.

Page 22: Zeqiang Chen, Nengcheng Chen, Chao Yang, and Liping Di ...swe.whu.edu.cn/cnc_web/paper/19.pdfproposed cloud computing enabled WPS are outlined, followed by aworkflow that processes

IEEE

Pro

of

Prin

t Ver

sion

CHEN et al.: CLOUD COMPUTING ENABLED WEB PROCESSING SERVICE FOR EARTH OBSERVATION DATA PROCESSING 9

Fig. 6. Instantaneous response.

2) WPS Server Interacting With Hadoop: Invoking an Ex-ecute operation will submit a MapReduce job, before the sub-mitting either the input data has already resided in HDFS or hasbeen copied to HDFS. When a file is uploaded to HDFS, thecreate method of DistributedFileSystem is invoked. The param-eters of create method are the path of the file, file system per-mission, data buffer size, number of replication, data block size,and progress. The path is the input path, the number of repli-cations may be set to 1, and all the other parameters have thedefault values. The return parameter is FSDataOutputStream.The create method of FSDataOutputStream is then invoked.The data will be automatically copied to HDFS. Reading datafrom HFDS is the reverse process. First, a LocatedBlocks Ob-ject is obtained fromNameNode by invoking RPC, and then FS-DataInputStreamObject created. The read method of FSDataIn-putStream is used to read data.The MapReduce of the three experiments follows the method

mentioned in Section III. Figs. 8–12 show their MapReduceprocess.Fig. 8 is the daily NDVI MapReduce process. The input path

is “daily” and output path is “dailyndvi”; both these paths arerelative to the HDFS working directory. All HDF files in theinput path are split into format. The key is the nameof the day in string form, and the value is the HDF file name.The date of a HDF file is implied by its name; for example,

“MOD09GQ.A2010124.H06V03.005.20100126173551.hdf”

is included in 2010.05.04, for “124” after “A2010” is the Julianday, and “124” means the 124th day of 2010. The 124th dayof 2010 is 05/04/2010. Map process calculates the NDVI ofevery DHF file with its NIR and IR bands. Prefix “ndvi_” isattached the NDVI file name. After map process, the NDVIfiles with the same key are collected in a list for the reduceprocess. In Fig. 8, the seven (from 05/04/2010 to 05/10/2010)days are distinguished and listed together. The reduce processcalculates every day’s NDVI. Finally, all reduce results aresent to the output path (dailyndvi). Calculating the weeklyNDVI has a similar MapReduce process shown in Fig. 9. In theweekly NDVI MapReduce process, “2010.05.04_2010.05.10”represents a week. The map process in Fig. 9 directly addsevery NDVI file to the week list to where it belongs.Figs. 10 and 11 show the coordinate transformation of daily

Figs. 10 and 11 show the coordinate transformation of the weekly and daily NDVI, respectively. These steps only convert the coordinate system of the HDF format to that of the GeoTIFF format, which is why only the map process performs the transformation task and the reduce process is not needed.

In Fig. 12, from left to right, each part of the image represents one stage of the MapReduce processes. In the first part, the NDVI of every original HDF file is calculated; in the second part, the daily NDVI is composed; in the third part, the weekly NDVI is obtained from the 7 days' NDVI, where each pixel in the weekly NDVI is the maximum value of that pixel over the 7 days. The fourth part is the image in the Albers coordinate system, converted from the SIN coordinate system.
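The pixel-wise maximum compositing used for the weekly NDVI can be expressed in a few lines; the following sketch assumes the seven daily images are already aligned arrays of equal size, with I/O elided.

public class MaxComposite {
    // dailyNdvi[day][pixel] -> weekly[pixel]: each output pixel is the
    // maximum of that pixel across all daily images.
    static float[] composite(float[][] dailyNdvi) {
        float[] weekly = dailyNdvi[0].clone();
        for (int d = 1; d < dailyNdvi.length; d++) {
            for (int p = 0; p < weekly.length; p++) {
                weekly[p] = Math.max(weekly[p], dailyNdvi[d][p]);
            }
        }
        return weekly;
    }
}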

VII. DISCUSSION AND CONCLUSION

This paper demonstrates the feasibility of enabling WPS within a cloud-computing environment for EOD Processing using the Apache Hadoop platform. The system uses uniform WSDL to describe the WPS operations, employs a process interface to interact with Hadoop, and implements an asynchronous computing scenario. A WPS service that calculates NDVI from multiple files and converts the SIN coordinate system into the Albers coordinate system is implemented in this study. The experimental results show that WPS services can be developed on Apache Hadoop. Some significant advantages of this approach are as follows:

1) It provides a standard, flexible, and high-performance web system for EOD Processing. It follows the OGC WPS specification, so it is standard. It uses a rewrite method to integrate different EOD Processings covering different data types, data formats, calculation models, and calculation methods, and different users can apply different EOD Processings to different applications, so it is flexible and extensible. The Apache Hadoop framework is a high-performance framework, so the system has potentially high-performance capability, although performance is not tested in the experiment; this is left for future work.

2) It introduces a multiple-layered method coupling WPS with Hadoop. Cloud computing enabled WPS can be implemented in a multiple-layered fashion, as WPS is divided into two layers: the standard service interface layer (WPS server layer), which interacts with the client to parse standard requests and return standard responses, and the logic layer (cloud computing layer), which performs the logical computing. A cloud computing application usually requires deploying or copying programs to the computing nodes; with cloud computing enabled WPS, however, the whole WPS need not be deployed or copied to each computing node, only the computing program logic. A multiple-layered method thus facilitates WPS deployment, configuration, and management.


Fig. 7. Status information.

In this paper, just a WPS node and a Hadoop cloud computing environment are used. WPS need only provide an interface to interact with Hadoop; it does not need to alter anything in Hadoop. The interface rewrite method, flexible enough for tasks to invoke Hadoop, is explained in Section V, and the three experiments processing different tasks in Section VI prove that this method works. Furthermore, a multiple-layered method benefits Hadoop applications in Web services: Hadoop is mainly an application framework and cannot be used directly as a Web service, and the WPS in this paper is a good example of a Web service application of Hadoop.

3) It realizes a flexible method for WPS to submit a MapReduce task. The key process of cloud computing enabled WPS is the MapReduce process. For a task submitted to WPS, a critical step is rewriting the map and reduce processes. As mentioned in Section VI, the MapReduce processes of the three experiments are not completely the same; different tasks have different MapReduce processes, and some processing tasks need only the map process or the reduce process. For example, the coordinate transform needs only the map process, as the sketch below illustrates. MapReduce is also a key factor in computing performance: there are dozens of parameters in the Hadoop configuration for MapReduce, and how these parameters affect the performance of WPS is complex and is not addressed in this paper; this question is left for future work.
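For instance, a map-only Hadoop job can be configured by setting the number of reduce tasks to zero, which matches the coordinate-transform case; the class names, paths, and the pass-through mapper below are illustrative only.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CoordinateTransformJob {
    // Hypothetical mapper: would reproject the file named by `value`
    // from the SIN to the Albers coordinate system, then emit the
    // new file name; here it simply passes the pair through.
    public static class CoordinateTransformMapper
            extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sin-to-albers");
        job.setJarByClass(CoordinateTransformJob.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapperClass(CoordinateTransformMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0); // map-only: no shuffle, no reduce phase
        FileInputFormat.addInputPath(job, new Path("dailyndvi"));
        FileOutputFormat.setOutputPath(job, new Path("dailyndvi_albers"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}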

The experiments show the feasibility of coupling WPS with Hadoop, and this WPS framework can be used more broadly for EOD processing and other applications. One of the advantages of the MapReduce model is parallel computing. Ideally, an EOD processing task that can be partitioned into steps executed in parallel is well suited to the MapReduce model; the experimental instance is an example. In fact, an EOD processing task that can only be executed in sequential steps can also use the MapReduce model; the difference is that the former is likely to show better performance. In reality, many EOD processings involve huge volumes of data and include sub-process steps that are amenable to parallel computing. The challenges are how to design the parallel computing model and how to set the correct parameters.

Future work will focus on: (1) configuring the optimal parameters of Hadoop for different types of EOD Processings. One purpose is to validate the performance of Hadoop, MapReduce, and HDFS, so as to provide a robust and high-performance environment for WPS processing, especially for complex tasks; the other is to evaluate and compare the advantages and disadvantages of Hadoop against other approaches, such as classical batch processing and serial computation. (2) Extending the use of WPS for EOD Processing and other applications, and finding out its advantages and disadvantages.

REFERENCES

[1] P. Dana, P. Silviu, N. Marian, F. Marc, and Z. Daniela, “Earth observation data processing in distributed systems,” Informatica, vol. 34, no. 4, pp. 463–476, Oct. 2010.

[2] Z. Yang, L. Di, G. Yu, and Z. Chen, “Vegetation condition indices for crop vegetation condition monitoring,” in Proc. 2011 IEEE Int. Geoscience and Remote Sensing Symp. (IGARSS), Vancouver, BC, Canada, Jul. 24–29, 2011.

[3] The Web Services Glossary, W3C, 2004 [Online]. Available: http://www.w3.org/TR/ws-gloss/

[4] OGC Web Processing Service Specification (Version 1.0.0), OGC Standard, 2007, 87 pp.

[5] K. Stanoevska-Slabeva, T. Wozniak, and S. Ristol, Grid and Cloud Computing: A Business Perspective on Technology and Applications. Heidelberg, Germany: Springer, 2010.

[6] J. Geelan, “Twenty one experts define cloud computing,” Virtualization, Aug. 2008 [Online]. Available: http://virtualization.sys-con.com/node/612375

[7] I. Foster, Y. Zhao, I. Raicu, and S. Lu, “Cloud computing and grid computing 360-degree compared,” in Proc. IEEE Grid Computing Environments Workshop, Austin, TX, Nov. 12–16, 2008.

[8] The NIST Definition of Cloud Computing (Draft), Recommendations of the National Institute of Standards and Technology, 2011, vol. 7.

[9] J. Ekanayake and G. Fox, “Cloud computing,” Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 34, no. 1, pp. 20–38, 2010.

[10] C. Vecchiola, S. Pandey, and R. Buyya, “High-performance cloud computing: A view of scientific applications,” in 10th Int. Symp. Pervasive Systems, Algorithms, and Networks (ISPAN), Kaohsiung, Taiwan, Dec. 14–16, 2009.

[11] Hadoop Website, Apache, 2011 [Online]. Available: http://hadoop.apache.org/

[12] J. Venner, Pro Hadoop. New York: Springer-Verlag, 2009, 442 pp.

[13] Hadoop Wiki Website, Apache, 2011 [Online]. Available: http://wiki.apache.org/hadoop/PoweredBy#F


Fig. 8. Daily NDVI MapReduce.

Fig. 9. Weekly NDVI MapReduce.

Fig. 10. Weekly NDVI coordinate transform.

[14] A. Bounfour and E. F. Lambin, “The European Earth observation data processing and interpretation services: Analysis of the sector and conditions for its development,” Int. J. Remote Sens., vol. 14, no. 4, pp. 635–654, Feb. 1993.

[15] W. Cudlip, D. R. Mantripp, C. L. Wrench, H. D. Griffiths, D. V. Sheehan, M. Lester, R. P. Leigh, and T. R. Robinson, “Corrections for altimeter low-level processing at the Earth Observation Data Centre,” Int. J. Remote Sens., vol. 15, no. 4, pp. 889–914, Feb. 1994.


Fig. 11. Daily NDVI coordinate transform.

Fig. 12. Images during the MapReduces.

[16] M. S. Hutchins, H. K. Wilson, D. Bass, and P. Waggett, “Overview of the development of the earth observation data centre and its processing and archiving facilities,” Int. J. Remote Sens., vol. 15, no. 4, pp. 741–759, Feb. 1994.

[17] R. L. Weaver and V. J. Troisi, “Remote sensing data availability from the Earth Observation System (EOS) via the Distributed Active Archive Center (DAAC) at NSIDC,” in Proc. 1996 Int. Geoscience and Remote Sensing Symp. (IGARSS’96), New York, May 27–31, 1996.

[18] K. Golden, W. Pang, R. Nemani, and P. Votava, “Automating the processing of earth observation data,” in 7th Int. Symp. Artificial Intelligence, Robotics and Automation for Space, Nara, Japan, May 19–23, 2003.

[19] N. Chen, L. Di, J. Gong, and G. Yu, “Automatic on-demand data feed service for autochem based on reusable geo-processing workflow,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. (JSTARS), vol. 3, no. 4, pp. 418–426, Dec. 2010.

[20] N. Chen, Z. Chen, L. Di, and J. Gong, “An efficient method for near-real-time on-demand retrieval of remote sensing observation,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. (JSTARS), vol. 4, no. 3, pp. 615–625, Sep. 2011.

[21] G. Aloisio and M. Cafaro, “A dynamic Earth observation system,” Parallel Computing, vol. 29, no. 10, pp. 1357–1362, Oct. 2003.

[22] G. Aloisio, M. Cafaro, G. Carteni, I. Epicoco, G. Quarta, and S. Raolil, “GridFlow for earth observation data processing,” in Proc. 2005 Int. Conf. Grid Computing and Applications, Las Vegas, NV, Jun. 20–23, 2005.

[23] C. Gomez, L. M. Gonzalez, and J. Prieto, “EOforge: Generic open framework for earth observation data processing systems,” Emerging and Future Technologies for Space Based Operations Support to NATO Military Operations, pp. P5-1–P5-10, 2006.

[24] C. Granell, L. Diaz, and M. Gould, “Managing Earth Observation data with distributed geoprocessing services,” in Proc. 2007 IEEE Int. Geoscience and Remote Sensing Symp. (IGARSS’07), Piscataway, NJ, Jul. 23–28, 2007.

[25] D. Gorgan, V. Bacu, T. Stefanut, D. Rodila, and D. Mihon, “Grid based satellite image processing platform for Earth observation application development,” in 2009 IEEE Int. Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS 2009), Piscataway, NJ, Sep. 21–23, 2009.

[26] D. Gorgan, V. Bacu, T. Stefanut, D. Rodila, and D. Mihon, “Earth Observation application development based on the Grid oriented ESIP satellite image processing platform,” Computer Standards and Interfaces, doi: 10.1016/j.csi.2011.02.002.

[27] N. Golpayegani and M. Halem, “Cloud computing for satellite data processing on high end compute clusters,” in 2009 IEEE Int. Conf. Cloud Computing, Bangalore, India, Sep. 21–25, 2009.

[28] C. Granell, L. Diaz, and M. Gould, “Managing Earth observation data with distributed geoprocessing services,” in Proc. IEEE Int. Geoscience and Remote Sensing Symp. (IGARSS’07), Barcelona, Spain, Jul. 23–27, 2007.

[29] R. Gerlach, C. Schmullius, and S. Nativi, “Establishing a Web Processing Service for online analysis of Earth observation time series data,” Geophysical Research Abstracts, vol. 10, p. EGU2008-A-09593, 2008.

[30] S. Falke, E. Dorner, B. Dunn, and D. Sullivan, “Processing services in earth observation sensor web information architectures,” in Proc. Earth Science Technology Conf. 2008, NASA, College Park, MD, Jun. 24–26, 2008.

[31] T. Foerster and J. E. Stoter, “Establishing an OGC web processing service for generalization processes,” in ICA Workshop of the Commission on Map Generalization and Multiple Representation, Portland, OR, Jun. 25, 2006.

[32] Deegree Website, Deegree, 2011 [Online]. Available: http://www.deegree.org/

[33] B. Schaeffer, “Towards a Transactional Web Processing Service (WPS-T),” in GI-Days, Münster, Germany, Jun. 16–18, 2008.

[34] B. Baranski, “Grid computing enabled Web Processing Service,” in GI-Days, Münster, Germany, Jun. 16–18, 2008.


[35] S. Lanig, A. Schilling, B. Stollberg, and A. Zipf, “Towards standards-based processing of digital elevation models for grid computing through web processing service (WPS),” in ICCSA 2008, Perugia, Italy, Jun. 30–Jul. 3, 2008.

[36] 52 North WPS Website, 52 North, 2011 [Online]. Available: http://52north.org/communities/geoprocessing/wps/index.html

[37] B. Schaeffer, “Behind the buzz of cloud computing – 52North Open Source Geoprocessing Software in the Google Cloud,” in Abstract, FOSS4G, 2009 Free and Open Source Software for Geospatial Conf., Sydney, Australia, Oct. 20–23, 2009.

[38] Deegree WPS Website, Deegree, 2011 [Online]. Available: http://wiki.deegree.org/deegreeWiki/deegree3/ProcessingService

[39] Sextante Website, Sextante, 2011 [Online]. Available: http://www.osor.eu/studies/sextante-a-geographic-information-system-for-the-spanish-region-of-extremadura

[40] FME Website, FME, 2011 [Online]. Available: http://www.safe.com/

[41] GRASS Website, GRASS, 2004 [Online]. Available: http://grass.fbk.eu/

[42] Amazon Website, Amazon, 2011 [Online]. Available: http://aws.amazon.com/ec2

[43] Google Website, Google, 2011 [Online]. Available: http://code.google.com/appengine

[44] Microsoft Website, Microsoft, 2011 [Online]. Available: http://www.microsoft.com/windowsazure

[45] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in Proc. 6th Conf. Operating Systems Design & Implementation, San Francisco, CA, Dec. 6–8, 2004.

[46] Yahoo Website, Yahoo, 2008 [Online]. Available: http://developer.yahoo.com/blogs/hadoop/posts/2008/07/apache_hadoop_wins_terabyte_sort_benchmark/

[47] G. Wang, A. R. Butt, P. Pandey, and K. Gupta, “Using realistic simulation for performance analysis of mapreduce setups,” in Proc. LSAP, Munich, Germany, Jun. 11–13, 2009.

[48] G. Wang, A. R. Butt, P. Pandey, and K. Gupta, “A simulation approach to evaluating design decisions in MapReduce setup,” in Int. Symp. Modelling, Analysis and Simulation of Computer and Telecommunication Systems, London, U.K., Sep. 21–23, 2009.

[49] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, “An analysis of traces from a production mapreduce cluster,” in Proc. 10th IEEE/ACM Int. Conf. Cluster, Cloud and Grid Computing, Melbourne, Australia, May 17–20, 2010.

[50] D. Borthakur, “The Hadoop Distributed File System: Architecture and Design,” 2007 [Online]. Available: http://hadoop.apache.org/common/docs/r0.18.0/hdfs_design.pdf

Zeqiang Chen received the B.Sc. degree in geography from Huazhong Normal University, China, in 2006 and the M.S. and Ph.D. degrees in geographical information systems from Wuhan University, China, in 2008 and 2012, respectively.

He was a Research Associate at the Center for Spatial Information Science and Systems (CSISS), George Mason University, Fairfax, VA. His current research interests include Sensor Web and Smart City.

Nengcheng Chen received the B.Sc. degree in geodesy from Wuhan Technical University of Surveying and Mapping, China, in 1997, the M.S. degree in geographical information systems from Wuhan University in 2000, and the Ph.D. degree in photogrammetry and remote sensing from Wuhan University in 2003.

He was a post-doctoral research associate with the Center for Spatial Information Science and Systems, George Mason University, Greenbelt, MD, from 2006 to 2008. Currently, he is a Professor of geographic information science at the State Key Lab for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, Hubei, China. His research interests include Smart Planet, Sensor Web, Semantic Web, Digital Antarctica, Smart City, and Web GIS.

Prof. Chen is a member of the International Association of Chinese Professionals in Geographic Information Sciences (CPGIS). He was the chair of the 2010 CPGIS Young Scholar Summer Workshop.

Chao Yang received the B.S. and M.S. degrees from East China Institute of Technology in 2006 and 2009, respectively. Currently, he is pursuing the Ph.D. degree at the State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS) at Wuhan University, China.

Since 2010, he has been a Research Associate at the Center for Spatial Information Science and Systems (CSISS), George Mason University, Fairfax, VA. His research topics include sensor web, satellite geometry calibration, cloud computing, and SOA technology.

Liping Di received the B.Sc. degree in remote sensing from Zhejiang University, China, in 1982, the M.S. degree in remote sensing/computer applications from the Chinese Academy of Science, Beijing, in 1985, and the Ph.D. degree in geography from the University of Nebraska-Lincoln in 1991.

He was a Research Scientist at the Chinese Academy of Science from 1985 to 1986 and at the NOAA National Geophysical Data Center from 1991 to 1994. He served as a Principal Scientist from 1994 to 1997, and a Chief Scientist from 1997 to 2000, at Raytheon ITSS. Currently, he is a Professor of geographic information science and the Director of the Center for Spatial Information Science and Systems, George Mason University, Fairfax, VA. His research interests include remote sensing, geographic information science and standards, spatial data infrastructure, global climate and environment changes, and advanced Earth observation technology.