
IMPROVEMENT IN LAND COVER AND CROP CLASSIFICATION BASED ON TEMPORAL FEATURES LEARNING FROM SENTINEL-2 DATA USING RECURRENT-CONVOLUTIONAL NEURAL NETWORK (R-CNN)

Vittorio Mazzia
Department of Electronics and Telecommunications
Politecnico di Torino
10124 Turin, Italy
[email protected]

Aleem Khaliq
Department of Electronics and Telecommunications
Politecnico di Torino
10124 Turin, Italy
[email protected]

Marcello Chiaberge
Department of Electronics and Telecommunications
Politecnico di Torino
10124 Turin, Italy
[email protected]

May 6, 2020

ABSTRACT

Understanding current land cover use, along with monitoring change over time, is vital for agronomists and agricultural agencies responsible for land management. The increasing spatial and temporal resolution of globally available satellite imagery, such as that provided by Sentinel-2, creates new possibilities for researchers to use freely available multi-spectral optical images, with decametric spatial resolution and more frequent revisits, for remote sensing applications such as land cover and crop classification (LC&CC), agricultural monitoring and management, and environmental monitoring. Existing solutions dedicated to cropland mapping can be categorized as pixel-based or object-based; however, classification remains challenging when many agricultural crop classes are considered at a massive scale. In this paper, a novel deep learning model for pixel-based LC&CC is developed and implemented, based on Recurrent Neural Networks (RNN) combined with Convolutional Neural Networks (CNN), using multi-temporal Sentinel-2 imagery of the center-north part of Italy, an area with a diverse agricultural system dominated by economic crop types. The proposed methodology is capable of automated feature extraction by learning the time correlation of multiple images, which reduces manual feature engineering and the modeling of crop phenological stages. Fifteen classes, including major agricultural crops, were considered in this study. We also tested other widely used traditional machine learning algorithms for comparison, namely the support vector machine (SVM), kernel SVM, random forest (RF), and the gradient boosting machine XGBoost. The overall accuracy achieved by our proposed Pixel R-CNN was 96.5%, a considerable improvement in comparison with existing mainstream methods. This study shows that a Pixel R-CNN based model offers a highly accurate way to assess and exploit time-series data for multi-temporal classification tasks.

Keywords satellite imagery · deep learning · pixel-based crop classification · recurrent neural networks · convolutional neural networks

arXiv:2004.12880v2 [cs.CV] 5 May 2020


1 Introduction

The significant increase in global population demands an increase in agricultural productivity; thus, precise land cover and crop classification and the spatial distribution of various crops are becoming significant for governments, policymakers, and farmers to improve decision-making in managing agricultural practices and needs [Gomez(2016)]. Crop maps are produced at relatively large scales, ranging from global [Wu(2014)] to countrywide [Jin(2019)] and local levels [Yang(2007), Wang(2016)]. The growing role of agriculture in the management of sustainable natural resources makes the development of effective cropland mapping and monitoring essential [Matton(2015)]. The Group on Earth Observations (GEO), with its Integrated Global Observing Strategy (IGOS), also emphasizes an operational system for monitoring global land cover and mapping the spatial distribution of crops using remote sensing imagery. Spatial information from crop maps has been the main source for crop growth monitoring [Huang(2015), Battude(2016), Guan(2016)], water resources management [Toureiro(2017)], and decision-making for policymakers to ensure food security [Yu(2017)].

Satellite and Geographic Information System (GIS) data have been an important source in establishing and improving the current systems responsible for developing and maintaining land cover and agricultural maps [Boryan(2011)]. Freely available satellite data offer one of the most widely applied sources for mapping agricultural land and assessing important indices that describe the condition of crop fields [Xie(2008)]. The recently launched Sentinel-2 is equipped with a multispectral imager that provides up to 10 m per pixel spatial resolution with a revisit time of 5 days, which offers a great opportunity for the remote sensing domain.

Multispectral time series data acquired from MODIS and LANDSAT have been widely used in many agricultural applications, such as crop yield prediction [Johnson(2016)], land cover and crop classification [Senf(2015)], leaf area index estimation [Jin(2016)], plant height estimation [Liaqat(2017)], vegetation variability assessment [Senf(2017)], and many more. Two different data sources can also be used together to extract more features and improve results; for example, Landsat-8 and Sentinel-1 have been used together for LC&CC [Kussul(2016)].

There are supervised and unsupervised algorithms for mapping cropland using mono- or multi-temporal images [Xiong(2017), Yan(2015)]. Multi-temporal images have already proven to yield better performance than mono-temporal mapping methods [Gomez(2016), Xiao(2018)]. Imagery covering only the key phenological stages has proved sufficient for crop area estimation [Gallego(2008), Khaliq(2018)]. It has also been found in [Zhou(2013)] that reducing the time-series length affects the average accuracy of the classifier. Crop patterns were established using the Enhanced Vegetation Index derived from 250 m MODIS-Terra time series data and used to classify some major crops, such as corn, cotton, and soybean, in Brazil [Arvor(2011)]. Centimetric-resolution imagery with fine spatial and temporal detail is available only at the high price of commercial satellite imagery or through extensive UAV flight campaigns covering a large area during the whole crop cycle. However, most studies have used moderate-spatial-resolution (10-30 m), freely available satellite imagery for land cover mapping due to its high spectral and temporal resolution, which is difficult to obtain with UAV and high-resolution satellite imagery.

Beyond multispectral time series data, several vegetation indices (VIs) derived from different spectral bands have been exploited to enrich the feature space for vegetation assessment and monitoring [Zhong(2012), Wardlow(2012)]. VIs such as the normalized difference vegetation index (NDVI), the normalized difference water index (NDWI), and the enhanced vegetation index (EVI); textural features such as the grey level co-occurrence matrix (GLCM); and statistical features such as mean, standard deviation, and inertial moment are the features most frequently used for crop classification. It is also possible to increase the accuracy of the algorithms using ancillary data such as elevation, census data, road density, or coverage. Nevertheless, all these derived features, along with phenological metrics, involve a huge volume of data, which may increase computational complexity with little improvement in accuracy [Löw(2013)]. Several feature selection methods have been proposed [Löw(2013)] to deal with this problem. In [Hao(2015)], various features were derived from MODIS time series, and the best feature selection was made using a random forest algorithm.

LC&CC methods can also be classified as pixel-based or object-based. Object-based image analysis (OBIA), described by Blaschke, segments satellite images into homogeneous image segments and can be applied with high-resolution sensors [Blaschke(2010)]. Various object-based classification approaches have been proposed to produce crop maps using satellite imagery [Novelli(2016), Long(2013), Li(2015)].

In this work, we propose a unique deep neural network architecture for LC&CC, which comprises a Recurrent Neural Network (RNN) that extracts temporal correlations from time series of Sentinel-2 data, in combination with a Convolutional Neural Network (CNN) that analyzes and encapsulates crop patterns through its filters. The remainder of this paper is organized as follows. Section II reviews related work on LC&CC along with an overview of RNNs and CNNs. Section III provides an overview of the raw data collected and exploited during the research. Section IV provides detailed information on the proposed model and the training strategies. Section V contains a complete description of the experiments, results, and discussion, along with a comparison with previous state-of-the-art results. Finally, Section VI draws some conclusions.


2 Related work

2.1 Temporal feature representation

Various studies have been proposed in the past to address LC&CC. A common approach for classification tasks is to extract temporal features and phenological metrics from VI time series derived from remotely sensed imagery. There are also simple statistics and threshold-based procedures used to calculate vegetation-related metrics, such as the maximum VI and the time of peak VI [Walker(2014), Walker(2015)], which have improved classification accuracy compared to using only the VI as a feature [Simonneaux(2008)]. More complex methods have been adopted to extract temporal features and patterns that describe vegetation phenology [Shao(2016)]. Further, VI time series have been represented by sets of functions [Galford(2008)], linear regression [Funk(2009)], Markov models [Siachalou(2015)], and curve-fitting functions. The sigmoid function has been exploited by [Xin(2015), Xin(2016)] and achieved better results due to its robustness and the ease of deriving phenological features for the characterization of vegetation variability [Dannenberg(2015)]. Although the above-mentioned methods of temporal feature extraction offer many alternatives and flexibility in deployment for assessing vegetation dynamics, in practice there are important factors, such as manually designed models and feature extraction, intra-class variability, uncertain atmospheric conditions, and empirical seasonal patterns, that make the selection of such methods difficult. Thus, an appropriate approach is needed to fully utilize the sequential information of VI time series to extract temporal patterns. As our proposed DNN architecture is based on pixel classification, the following subsections provide the relevant studies and descriptions.

2.2 Pixel-based crop classification

A detailed review of state-of-the-art supervised pixel-based methods for land cover mapping was performed in [Khatami(2016)]. It found that the support vector machine (SVM) was the most efficient approach for mono-temporal image classification, with an overall accuracy (OA) of about 75%; the second was neural network (NN) based classifiers, with almost the same OA of 74%. SVM is complex and resource-consuming for time-series multispectral data applications with broad-area classification. Another common approach in remote sensing applications is random forest (RF) based classifiers [Gislason(2006)]; nevertheless, multiple features should be derived to feed the RF classifier for effective use. Deep Learning (DL) is a branch of machine learning and a powerful tool that is widely used to solve a wide range of problems related to signal processing, computer vision, image processing, image understanding, and natural language processing [LeCun(2015)]. The main idea is to discover not only the mapping from representation to output but also the representation itself. That is achieved by breaking a complex problem into a series of simple mappings, each described by a different layer of the model, and then composing them in a hierarchical fashion. A large number of state-of-the-art models, frameworks, architectures, and benchmark databases of reference imagery exist in the image classification domain.

2.3 Recurrent neural network (RNN)

Sequence data analysis is an important aspect of many domains, ranging from natural language processing, handwriting recognition, and image captioning to robot automation. In recent years, Recurrent Neural Networks (RNNs) have proven to be a fundamental tool for sequence learning [Sutskever(2014)], allowing information to be represented from context windows of hundreds of elements. Moreover, the research community has, over the years, come up with different techniques to overcome the difficulty of training over many time steps. For example, architectures based on long short-term memory (LSTM) [Sutskever(2000)] and the gated recurrent unit (GRU) [Chung(2014)] have achieved ground-breaking results [Chung(2015), Bahdanau(2014)] in comparison to standard RNN models. In remote sensing applications, RNNs are commonly used when sequential data analysis is needed. For example, Lyu et al. [Lyu(2016)] employed an RNN to exploit sequential properties such as the spectral correlation and intra-band variability of multispectral data. They further used an LSTM model to learn a combined spectral-temporal feature representation from an image pair acquired at two different dates for change detection [Lyu(2018)].

2.4 Convolutional neural network (CNN)

Convolutional Neural Networks (CNNs) date back decades [LeCun(1998)], emerging from the study of the brain's visual cortex [Hubel(1962)] and classical concepts of computer vision theory [Hubel(2010)], [Hubel(2015)]. Since the 1990s, they have been applied successfully to image classification [LeCun(1998)]. However, due to technical constraints, mainly the lack of hardware performance, the large amount of data required, and theoretical limitations, CNNs did not scale to large applications. Nevertheless, Geoffrey Hinton and his team demonstrated at the annual ImageNet ILSVRC competition [Krizhevsky(2012)] the feasibility of training large architectures capable of learning several layers of features with increasingly abstract internal representations [Zeiler(2014)]. Since that breakthrough achievement,


CNNs became the ultimate symbol of the Deep Learning revolution [LeCun(2015)], embodying the concepts that underpin the entire movement. In recent years, DL has been widely used in data mining and remote sensing applications. In particular, image classification studies have exploited several DL architectures due to their flexibility in feature representation and their capability for end-to-end learning. In DL models, features can be extracted automatically for classification tasks, without feature crafting algorithms, by integrating autoencoders [Wan(2017), Mou(2017)]. 2D CNNs have been broadly used in remote sensing studies to extract spatial features from high-resolution images for object detection and image segmentation [Kampffmeyer(2016), Kampffmeyer(2017), Audebert(2018)]. In crop classification, 2D convolution in the spatial domain has performed better than 1D convolution in the spectral domain [Kussul(2017)]. These studies stacked multiple convolutional layers to extract spatial and spectral features from remotely sensed imagery.

3 Study area and data

The study site near Carpi, Emilia-Romagna, situated in the center-north part of Italy with central coordinates 44°47′01″N, 10°59′37″E, was considered for LC&CC, as shown in Figure 1. The Emilia-Romagna region is one of the most fertile plains of Italy. An area of almost 2640 km² was considered, which covers diverse cropland. The major crop fields in this region are maize, lucerne, barley, wheat, and vineyards. The yearly average temperature and precipitation for this region are 14 °C and 843 mm, respectively. Most farmers practice single cropping in this area.

Table 1: Bands used in this study.

Bands used   Description                       Central wavelength (µm)   Resolution (m)
Band 2       Blue                              0.49                      10
Band 3       Green                             0.56                      10
Band 4       Red                               0.665                     10
Band 8       Near infrared                     0.705                     10
NDVI         (Band8 - Band4)/(Band8 + Band4)   -                         10

To characterize the spatial distribution of crops, we studied the Land Use Cover Area frame statistical Survey (LUCAS) in depth and extracted all the information needed for ground-truth data. LUCAS is carried out by Eurostat to monitor agriculture, climate change, biodiversity, forests, and water across almost all of Europe [Kussul(2017)].

The technical reference document of LUCAS-2015 was used to prepare the ground-truth data. Microdata containing spatial information on crops and several land cover types, along with geo-coordinates for the considered region, were imported into the Quantum Geographic Information System (QGIS), an open-source software package used for the visualization, editing, and analysis of geographical data. Pixels were selected manually by overlapping images and LUCAS data, so that a proper amount of ground-truth pixels was extracted for training and testing the algorithm. The Sentinel-2 mission consists of twin polar-orbiting satellites launched by the European Space Agency (ESA) in 2015 and can be used in various application areas, such as land cover change detection, natural disaster monitoring, forest monitoring, and, most importantly, agricultural monitoring and management [Kussul(2017)].

It is equipped with multi-spectral optical sensors that capture 13 bands of different wavelengths. We used only the high-resolution bands with 10 m/pixel resolution, shown in Table 1. It also has a high revisit time: ten days at the equator with one satellite and five days with the twin satellites (Sentinel-2A, Sentinel-2B). It has become popular in the remote sensing community because it offers free access to data products through the ESA Sentinel Scientific Data Hub, reasonable spatial resolution (10 m for the red, green, blue, and near-infrared bands), high revisit time, and reasonable spectral resolution among the freely available data sources. In our study, we used ten well co-registered multi-temporal Sentinel-2 images, reported in Table 2, acquired from July 2015 to July 2016 with close to zero cloud coverage. The initial image selection was performed based on the cloudy-pixel contribution at the granule level. This pre-screening was followed by further visual inspection of the scenes and resulted in a multi-temporal layer stack of ten images. The Sentinel Application Platform (SNAP) v5.0 along with sen2cor v2.5.1 was used to apply radiometric and geometric corrections to derive Bottom of Atmosphere (BOA) Level 2A images from Top of Atmosphere (TOA) Level 1C products. Further details about the geometric and radiometric correction algorithms used in sen2cor can be found in [Kaufman(1988)]. The 10 m/pixel bands, along with the derived Normalized Difference Vegetation Index (NDVI), were used for the experiments, as shown in Table 1.
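The NDVI row of Table 1 is just a band ratio, so it can be reproduced directly from the 10 m reflectance rasters. Below is a minimal sketch in Python/NumPy; the array names and shapes are illustrative, not taken from the paper's processing chain.

```python
import numpy as np

def ndvi(b8, b4, eps=1e-8):
    """NDVI from near-infrared (band 8) and red (band 4) reflectances.

    b8, b4: reflectance arrays of identical shape (e.g., a 10 m BOA scene).
    eps guards against division by zero over no-data pixels.
    """
    b8 = b8.astype(np.float32)
    b4 = b4.astype(np.float32)
    return (b8 - b4) / (b8 + b4 + eps)

# Example with random reflectances in [0, 1] standing in for a real scene:
nir = np.random.rand(512, 512).astype(np.float32)
red = np.random.rand(512, 512).astype(np.float32)
print(ndvi(nir, red).shape)  # (512, 512), values roughly in [-1, 1]
```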


Figure 1: The study site, located near Carpi in the Emilia-Romagna region, shown with its geo-coordinates (WGS84). An RGB composite derived from Sentinel-2 imagery acquired in August 2015 is shown; the yellow markers show the geo-locations of ground-truth land cover extracted from the Land Use and Coverage Area frame Survey (LUCAS-2015).

Table 2: Sentinel-2 data acquisition.

Date         DOY   Sensing orbit    Cloud pixel percentage
7/4/2015     185   22-Descending    0
8/3/2015     215   22-Descending    0.384
9/2/2015     245   22-Descending    4.795
9/12/2015    255   22-Descending    7.397
10/22/2015   295   22-Descending    7.606
2/19/2016    50    22-Descending    5.8
3/20/2016    80    22-Descending    19.866
4/29/2016    120   22-Descending    18.61
6/18/2016    170   22-Descending    15.52
7/18/2016    200   22-Descending    0


Figure 2: A few examples of zoomed-in parts of the crop classes considered as ground truth. Shapefiles are used to extract pixels for the reference data.

4 Convolutional and recurrent neural networks for pixel-based crop classification

4.1 Formulation

A single multi-temporal, multi-spectral pixel can be represented as a two-dimensional matrix X^(i) ∈ R^(t×b), where t and b are the number of time steps and spectral bands, respectively. Our goal is to compute from X^(i) a probability distribution F(X^(i)) consisting of K probabilities, where K is equal to the number of classes. To achieve this objective, we propose a compact representation learning architecture composed of three main building blocks:

• Time correlation representation - this operation extracts temporal correlations from the multi-spectral, multi-temporal pixels X^(i), exploiting a sequence-to-sequence recurrent neural network based on Long Short-Term Memory (LSTM) cells. A final Time-Distributed layer is used to compress the output while maintaining a sequence-like structure, preserving the multidimensional nature of the data. In this way, it is possible to take advantage of temporal and spectral correlations simultaneously.

• Temporal pattern extraction - this operation consists of a series of convolutional operations, followed by rectifier activation functions, that non-linearly map the elaborated temporal and spectral patterns onto high-dimensional representations. The RNN output temporal sequences are thus processed by a subsequent cascade of filters which, in a hierarchical fashion, extracts the essential features for the successive stage.

• Multiclass classification - this final operation maps the feature space to a probability distribution F(X^(i)) with K different probabilities, where K, as previously stated, is equal to the number of classes.

The comprehensive pipeline of operations constitutes a lightweight, compact architecture able to non-linearly map multi-temporal information in accordance with its intrinsic nature, achieving results considerably better than previous state-of-the-art solutions. Human brain mental imagery studies [Mellet(1996)], where images are a form of internal neural representation, inspired the presented architecture. Moreover, the joint effort of RNN and CNN distributes the knowledge representation through the entire model, exploiting one of the most powerful characteristics of deep learning, known as distributed learning. An overview of the overall model, dubbed Pixel R-CNN, is depicted in Fig. 3. Each pixel is extracted simultaneously from all images taken at the different time steps t, with all its spectral bands b. In this way, it is possible to create an instance X^(i) that feeds the first layer of the network. First, the model extracts temporal representations from the input sample. Subsequently, these temporal features are further enriched by the convolutional layers, which extract patterns in a hierarchical manner. The overall model acts as a function F(X^(i)) that maps the input sample to its K related probabilities. So, by evaluating the probability distribution, it is possible to identify the class to which the input sample belongs.

It is worth noting that the model in Fig. 3 is shown in its unrolled-through-time representation: only after all time steps have been processed is the CNN able to analyze and transform the temporal pattern. In the following subsections, we describe each individual block in detail.


Figure 3: An overview of the Pixel R-CNN model used for classification. Given a multi-temporal, multi-spectral input pixel X^(i), the first layer of LSTM units extracts sequences of temporal patterns. A stack of convolutional layers hierarchically processes the temporal information.

4.1.1 Time correlation representation

Nowadays, a popular strategy in time series data analysis is the use of RNNs, which have proven to deliver excellent results in many application fields over the years. The simplest possible RNN, shown in Fig. 4 and composed of just one layer, looks very similar to a feedforward neural network, except that it also has a connection going backward. Indeed, the layer is not only fed by an input vector x^(i); it also receives h^(i) (the cell state), which is equal to the output of the neuron itself, y^(i). So, at each time step t, this recurrent layer receives an input 1-D array x_t^(i) as well as its own output from the previous time step, y_(t-1)^(i). Since the output of a recurrent neuron at time step t is a function of all inputs from previous time steps, it has, intuitively, a sort of memory that influences all successive outputs. In this example, it is straightforward to compute a cell's output, as shown in Eq. (1).

y_t^(i) = φ(x_t^(i) · W_x + y_(t-1)^(i) · W_y + b)    (1)

Figure 4: A recurrent layer and its unrolled-through-time representation. A multi-temporal, multi-spectral pixel X^(i) is made of a sequence of time steps x_t^(i) that, along with the previous output h^(i), feed the next iteration of the network.
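As a concrete illustration of Eq. (1), the basic recurrent step can be written in a few lines of NumPy. This is a toy sketch: the sizes mirror the paper's setting (t = 9 steps, b = 5 bands) plus an assumed 32 hidden units, and the random weights and the choice of tanh for φ are illustrative assumptions.

```python
import numpy as np

def rnn_step(x_t, y_prev, W_x, W_y, b, phi=np.tanh):
    """One step of the basic recurrent layer of Eq. (1):
    y_t = phi(x_t . W_x + y_{t-1} . W_y + b)."""
    return phi(x_t @ W_x + y_prev @ W_y + b)

b_bands, n_units, t_steps = 5, 32, 9        # hypothetical layer sizes
rng = np.random.default_rng(0)
W_x = rng.normal(size=(b_bands, n_units)) * 0.1
W_y = rng.normal(size=(n_units, n_units)) * 0.1
b = np.zeros(n_units)

X = rng.normal(size=(t_steps, b_bands))     # one multi-temporal pixel X^(i)
y = np.zeros(n_units)                       # output starts at zero
for x_t in X:                               # unroll through time
    y = rnn_step(x_t, y, W_x, W_y, b)
print(y.shape)                              # (32,)
```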


Figure 5: LSTM with peephole connections. A time step t of a multi-spectral pixel x_t^(i) is processed by the memory cell, which decides what to add to and forget in the long-term state c_(t) and what to discard for the present state y_t^(i).

where, in the context of this research, x_t^(i) ∈ R^(1×b) is a single time step of a pixel whose number of inputs is equal to the number of spectral bands b. y_t^(i) and y_(t-1)^(i) are the outputs of the layer at times t and t-1, respectively, and W_x and W_y are the weight matrices. It is important to point out that y_t and x_t^(i) are vectors and can have an arbitrary number of elements, but the representation of Fig. 4 does not change; all neurons are simply hidden in the depth dimension. Unfortunately, the basic cell just described suffers from major limitations, most of all the fact that, during training, the gradient of the loss function gradually fades away. For this reason, for the time correlation representation, we adopted a more elaborate cell known as the peephole LSTM unit, see Fig. 5. It is an improved variation of the concept proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber [Sutskever(2000)]. The key idea is that the network can learn what to store in a long-term state c_(t), what to throw away, and what to use for the current states h_(t) and y_(t), which, as for the basic unit, are equal. This is performed with simple element-wise multiplications working as "valves" for the fluxes of information. These elements, V1, V2, and V3, are controlled by fully connected (FC) layers whose inputs are the current input state x_(t) and the previous short-term memory h_(t-1). Moreover, for the peephole LSTM cell, the previous long-term state c_(t-1) is added as an input to the FC layers of the forget gate, V1, and the input gate, V2. Finally, the current long-term state c_(t) is added as an input to the FC layer of the output gate. All "gate controllers" have sigmoid activation functions (green boxes in Fig. 5), whereas tanh is used to process the signals themselves (red boxes). So, to summarize, a peephole LSTM block has three signals as input and output: x_(t) and the cell output y_(t) are the standard input and output, while c and h are the long- and short-term states, respectively, which the unit, through its internal controllers and valves, can feed with useful information. Formally, as for the basic cell seen before, Eqs. (2)-(7) summarize how to compute the cell's long-term state, its short-term state, and its output at each time step for a single instance.

i_(t) = σ(W_ci^T · c_(t-1) + W_hi^T · h_(t-1) + W_xi^T · x_(t)^(i) + b_i)    (2)

f_(t) = σ(W_cf^T · c_(t-1) + W_hf^T · h_(t-1) + W_xf^T · x_(t)^(i) + b_f)    (3)

o_(t) = σ(W_co^T · c_(t) + W_ho^T · h_(t-1) + W_xo^T · x_(t)^(i) + b_o)    (4)

g_(t) = tanh(W_hg^T · h_(t-1) + W_xg^T · x_(t)^(i) + b_g)    (5)

c_(t) = f_(t) ⊗ c_(t-1) + i_(t) ⊗ g_(t)    (6)

y_(t) = h_(t) = o_(t) ⊗ tanh(c_(t))    (7)

In conclusion, the multi-temporal, multi-spectral pixel X^(i) is processed by a first layer of LSTM peephole cells, obtaining a cumulative output Y_lstm^(i). Finally, a TimeDistributedDense layer is applied, which simply executes a Dense function across every output over time, using the same set of weights and preserving the multidimensional nature of the processed data, Eq. (8). Fig. 6 presents a graphical representation of the first layer of the network: LSTM cells extract temporal representations from the input samples X^(i), with x_t^(i) ∈ R^(1×b) as columns, and the output matrix Y_lstm^(i) feeds the subsequent TimeDistributedDense layer.

F_timeD(F_lstm(X^(i))) = W · Y_lstm^(i) + B    (8)

Figure 6: Pixel R-CNN first layer for time correlation extraction. Peephole LSTM cells extract temporal representations from input instances X^(i) ∈ R^(t×b). The output matrix Y_lstm^(i) feeds a TimeDistributedDense layer, which preserves the multidimensional nature of the processed data while extracting multi-spectral patterns.
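To make Eqs. (2)-(7) concrete, the following NumPy sketch steps a single peephole LSTM cell over the nine time steps of one pixel. All weight shapes, the random initialization, and the 32-unit size are illustrative assumptions; a production implementation would come from a deep learning framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(x_t, h_prev, c_prev, W, b):
    """One peephole LSTM step following Eqs. (2)-(7). The input and forget
    gates peek at c_{t-1}; the output gate peeks at the updated c_t."""
    i = sigmoid(c_prev @ W["ci"] + h_prev @ W["hi"] + x_t @ W["xi"] + b["i"])  # Eq. (2)
    f = sigmoid(c_prev @ W["cf"] + h_prev @ W["hf"] + x_t @ W["xf"] + b["f"])  # Eq. (3)
    g = np.tanh(h_prev @ W["hg"] + x_t @ W["xg"] + b["g"])                     # Eq. (5)
    c = f * c_prev + i * g                                                     # Eq. (6)
    o = sigmoid(c @ W["co"] + h_prev @ W["ho"] + x_t @ W["xo"] + b["o"])       # Eq. (4)
    h = o * np.tanh(c)                                                         # Eq. (7)
    return h, c

b_bands, units = 5, 32                      # hypothetical sizes
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(dim, units)) * 0.1
     for k, dim in [("ci", units), ("hi", units), ("xi", b_bands),
                    ("cf", units), ("hf", units), ("xf", b_bands),
                    ("co", units), ("ho", units), ("xo", b_bands),
                    ("hg", units), ("xg", b_bands)]}
b = {k: np.zeros(units) for k in "ifgo"}

h = c = np.zeros(units)
for x_t in rng.normal(size=(9, b_bands)):   # nine time steps of one pixel
    h, c = peephole_lstm_step(x_t, h, c, W, b)
print(h.shape)                              # (32,)
```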

4.1.2 Temporal patterns extraction

The first set of layers extracts a 2-dimensional tensor Y_timeD^(i) for each instance. In the second operation, after a simple reshaping that increases the dimensionality of the input tensor from two to three so that the following operations can be applied, we map each of these 3-dimensional arrays Y_reshape^(i) into a higher-dimensional space. That is accomplished by two convolutional operations, built on top of each other, that hierarchically apply learned filters, gradually extracting more abstract representations. More formally, the temporal pattern extraction is expressed, for the first convolutional layer, as an operation F_conv1:

F_conv1(F_timeD(F_lstm(X^(i)))) = max(0, W_1 ∗ Y_reshape^(i) + B_1)    (9)

where W_1 and B_1 represent the filters and biases, respectively, and '∗' is the convolutional operation. W_1 contains n_1 filters with kernel dimension f_1 × f_1 × c, where f_1 is the spatial size of a filter and c is the number of input channels. As is common for CNNs, the rectified linear unit (ReLU), max(0, x), has been chosen as the activation function for the units of both layers. A graphical scheme of this section of the model is depicted in Fig. 7.

Figure 7: Pixel R-CNN convolutional layers. First, the output of the TimeDistributedDense layer, Y_timeD^(i), is reshaped into a 3-dimensional tensor Y_reshape^(i); it then feeds a stack of two convolutional layers that progressively reduce the first two dimensions, gradually extracting higher-level representations.

So, summarizing, the output matrix Y_timeD^(i) of the TimeDistributedDense layer, after adding an extra dimension, feeds a stack of two convolutional layers that progressively reduce the first two dimensions, gradually extracting higher-level representations and generating high-dimensional arrays. Moreover, since the n_1 and n_2 filters are shared across all units, the same operation carried out with similarly sized dense, fully connected layers would require a much greater number of parameters and computational power. Instead, the synergy of RNN and CNN makes it possible to elaborate the overall temporal canvas in an optimized and efficient way.
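The dimension bookkeeping of this stage can be checked with a couple of stock Keras layers. A tentative sketch, using the filter sizes of Section 5.3 (f_1 = 3 with n_1 = 16, f_2 = 7 with n_2 = 32) and valid padding, which is an assumption consistent with the stated 9×9 to 1×1×32 reduction:

```python
import tensorflow as tf

# A batch of 4 reshaped Y_timeD maps: 9x9 with a single channel added.
x = tf.random.normal((4, 9, 9, 1))
conv1 = tf.keras.layers.Conv2D(16, 3, activation="relu")  # f1 = 3, n1 = 16
conv2 = tf.keras.layers.Conv2D(32, 7, activation="relu")  # f2 = 7, n2 = 32
y = conv2(conv1(x))   # 9 -> 7 after the 3x3 kernel, 7 -> 1 after the 7x7 kernel
print(y.shape)        # (4, 1, 1, 32): a one-dimensional feature array per pixel
```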

4.1.3 Multiclass classification

In the last stage of the network, the extracted feature tensor Y_conv2^(i), after removing the extra dimensions with a simple flatten operation, is mapped to a probability distribution consisting of K probabilities, where K is equal to the number of classes. This is achieved by a weighted sum followed by a softmax activation function:

p̂_k = σ(s(x))_k = exp(s_k(x)) / Σ_(j=1..K) exp(s_j(x)),    for k = 1, ..., K    (10)

where s(x) = W^T · y_(flatten-conv2)^(i) + B is a vector containing the scores of each class for the input vector y_(flatten-conv2)^(i), which after the flatten operation is a 1-dimensional array. The weights W and bias B are learned during the training process in such a way as to classify arrays of the high-dimensional space into the K different classes. So, p̂_k is the estimated probability that the extracted feature vector y_(flatten-conv2)^(i) belongs to class k, given the scores of each class for that instance.


4.2 Training

Learning the overall mapping function F requires the estimation of all network parameters Θ across the three different parts of the model. This is achieved by minimizing the loss between each pixel class prediction F(X^(i)) and the corresponding ground truth y^(i) with a supervised learning strategy. So, given a data set with n pixel samples {X^(i)} and the respective set of true classes {y^(i)}, we use the categorical cross-entropy as loss function:

J(Θ) = -(1/n) Σ_(i=1..n) Σ_(k=1..K) y_k^(i) log(p̂_k^(i))    (11)

where y_k^(i) cancels the loss of all classes except the true one. Equation (11) is minimized using the AMSGrad optimizer [Reddi(2019)], an adaptive learning rate method that modifies the basic ADAM algorithm [Kingma(2019)]. The overall update rule, without the debiasing step, is:

m_t = β_1 m_(t-1) + (1 - β_1) g_t    (12)

v_t = β_2 v_(t-1) + (1 - β_2) g_t²    (13)

v̂_t = max(v̂_(t-1), v_t)    (14)

θ_(t+1) = θ_t - η m_t / (√v̂_t + ε)    (15)

Equations (12) and (13) are the exponential moving averages of the gradient and of the squared gradient, respectively. Eq. (14), by keeping the larger v_t term, results in a much lower learning rate η, fixing the exponential moving average and preventing convergence to a sub-optimal point of the cost function. Moreover, we use a technique known as cosine annealing to cyclically vary the learning rate between certain boundary values [Smith(2017)]. These values can be obtained with a preliminary training procedure, linearly increasing the learning rate while observing the value of the loss function. Finally, we employ, as the only regularization methodology, Dropout [Srivastava(2014)] in the time representation stage, inserted between the LSTM and Time-Distributed layers. This simple tweak allows training a temporal pattern extraction stage that is more robust and resilient to noise. Indeed, forcing the CNN to work without relying on certain temporal activations can greatly improve the abstraction of the generated representations, distributing the knowledge across all available units.
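For reference, Eq. (11) and the AMSGrad update of Eqs. (12)-(15) can be sketched in NumPy as below. The β values follow Section 5.3; the toy quadratic loss, step count, and default learning rate are assumptions made only for the demonstration.

```python
import numpy as np

def cross_entropy(y_true, p_hat, eps=1e-12):
    """Eq. (11): mean categorical cross-entropy over n one-hot labelled samples."""
    return -np.mean(np.sum(y_true * np.log(p_hat + eps), axis=1))

def amsgrad_step(theta, g, state, eta=1e-3, beta1=0.86, beta2=0.98, eps=1e-9):
    """Eqs. (12)-(15) without the debiasing step; betas as in Section 5.3."""
    state["m"] = beta1 * state["m"] + (1 - beta1) * g          # Eq. (12)
    state["v"] = beta2 * state["v"] + (1 - beta2) * g**2       # Eq. (13)
    state["vhat"] = np.maximum(state["vhat"], state["v"])      # Eq. (14)
    return theta - eta / (np.sqrt(state["vhat"]) + eps) * state["m"]  # Eq. (15)

theta = np.array([1.0, -2.0])
state = {k: np.zeros_like(theta) for k in ("m", "v", "vhat")}
for _ in range(1000):
    grad = 2 * theta            # gradient of the toy loss ||theta||^2
    theta = amsgrad_step(theta, grad, state)
print(theta)                    # has moved toward the minimum at the origin
```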

5 Experimental results and discussion

We first processed the raw data to create a set of n pixel samples X = {X^(i)} with the related ground-truth labels Y = {y^(i)}. Then, to visually inspect the data set, principal component analysis (PCA), one of the most popular dimensionality reduction algorithms, was applied to project the training set onto a lower, three-dimensional hyperplane. Finally, quantitative and qualitative results are discussed, with a detailed description of the architecture settings.

5.1 Training data

Sample pixels must be extracted from the raw data and then reordered to feed the devised architecture. Indeed, the first RNN stage requires data points to be collected as slices of time series. So, we separated the labeled pixels from the raw data and divided them into chunks, forming a three-dimensional tensor X ∈ R^(i×t×b) for the successive pre-processing pipeline. Fig. 8 shows a visual representation of the generation of the data set tensor X where, fixing the first dimension, X_(i,:,:) are the individual pixel samples X^(i), with t = 9 time steps and b = 5 spectral bands. It is worth noting that the number of time steps and bands is entirely an arbitrary choice dictated by the availability of the raw data.

Subsequently, we adopted a simple two-step pipeline to pre-process the data. Stratified sampling was applied to divide the data set tensor X, with shape (92116, 9, 5), into a training and a test set. Due to the naturally unbalanced number of instances per class present in the data set (Table 3), this is an important step to preserve the same class percentages in the two sets. After selecting a training split of 60%, we obtained two tensors, X_train and X_test, with shapes (55270, 9, 5) and (36846, 9, 5), respectively. Secondly, as is common practice, to facilitate the training we adopted standard scaling, (x - µ)/σ, to normalize the two sets of data points.
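The split-and-scale pipeline described here maps to a few lines of scikit-learn. A sketch with a random stand-in for the labelled tensor; only the shapes, the 60% stratified split, and the (x - µ)/σ scaling come from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random placeholders for the labelled pixel tensor X and the 15-class labels y.
X = np.random.rand(92116, 9, 5).astype(np.float32)
y = np.random.randint(0, 15, size=92116)

# Stratified 60/40 split preserves the per-class percentages of Table 3.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0)

# Standard scaling fitted on the training set only, applied to the
# flattened (time x band) features and then reshaped back.
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr.reshape(len(X_tr), -1)).reshape(X_tr.shape)
X_te = scaler.transform(X_te.reshape(len(X_te), -1)).reshape(X_te.shape)
print(X_tr.shape, X_te.shape)  # ~(55270, 9, 5) and ~(36846, 9, 5), up to rounding
```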


Figure 8: Overview of the generation of the tensor X ∈ R^(i×t×b). The first dimension i represents the collected instances X^(i), the second t the different time steps, and the last b the five spectral bands: red, green, blue, near-infrared, and NDVI. On top, labeled pixels are extracted simultaneously from all time steps and bands, starting from the raw satellite images. Then, the X_(i,j,k) are reshaped to set up the X^(i) = X_(i,:,:) of the data set tensor X.

Table 3: Land cover types contribution in the reference data.

Class         Pixels   Percentage
Tomatoes      3020     3.20%
Artificials   9343     10.14%
Trees         7384     8.01%
Rye           4382     4.75%
Wheat         12826    13.92%
Soya          5836     6.33%
Apple         849      0.92%
Peer          495      0.53%
Temp Grass    1744     1.89%
Water         2451     2.66%
Lucerne       17942    19.47%
Durum Wheat   1188     1.28%
Vineyard      6110     6.63%
Barley        2549     2.76%
Maize         15997    17.37%
Total         92116    100%

5.2 Dataset Visualization

To explore and visualize the generated set of points, we exploit Principal Component Analysis (PCA) to reduce the high dimensionality of the data set. For this operation, we considered the components t and b as features of our data points. So, applying Singular Value Decomposition (SVD) and then selecting the first three principal components, W_d = (c1, c2, c3), it was possible to plot the different classes in a three-dimensional space, providing a visual representation of the projected data points, shown in Fig. 9. Except for water bodies, it is worth pointing out how much intra-class variance is present: most of the classes lie on more than one hyperplane, demonstrating the difficulty of the task and of the studied data set. Finally, it was possible to analyze the explained variance ratio as the number of dimensions varies. From Fig. 10, it is worth noting that, approaching the higher components, the explained variance stops growing quickly; this point can be considered the intrinsic dimensionality of the data set. Due to this fact, it is reasonable to assume that reducing the number of time steps would not significantly affect the overall results.
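The projection behind Fig. 9 and Fig. 10 can be reproduced with scikit-learn's PCA, treating the t × b entries of each pixel as a flat feature vector. The data here are random placeholders:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the training tensor: flatten the time and band
# dimensions so each pixel becomes a 45-element feature vector.
X = np.random.rand(55270, 9, 5).reshape(55270, -1)

pca = PCA(n_components=3)          # first three principal components via SVD
X3 = pca.fit_transform(X)          # points to scatter in 3-D, as in Fig. 9
print(X3.shape)                    # (55270, 3)
print(pca.explained_variance_ratio_.sum())  # variance preserved by c1, c2, c3
```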


Figure 9: Visual representation of the data points projected into the three-dimensional space using PCA. The three principal components taken into account preserve 64.5% of the original data set variance.

Figure 10: Pareto Chart of the explained variance as a function of the number of components.

5.3 Experimental settings

In this section, we examine the settings of the final network architecture. The basic Pixel R-CNN model, shown in Fig. 3, is the result of a careful design aimed at obtaining the best performance in terms of accuracy and computational cost. Indeed, the final model is lightweight, with 30,936 trainable parameters (less than 1 MB), and is faster and more accurate than the existing state-of-the-art solutions. With the suggested approach, we employed a single RNN layer with 32 output units, where each peephole LSTM cell is randomly turned off, with probability p = 0.2, by a Dropout regularization operation. In all experiments, peephole LSTM cells showed an improvement in overall accuracy of around 0.8% over standard LSTM cells. Then, a TimeDistributedDense layer transforms Y_lstm^(i) into a 9×9 square matrix that feeds a stack of two CNN layers with n_1 = 16 and n_2 = 32 features, respectively. The first layer has a filter size of f_1 = 3 and the second one f_2 = 7, producing a one-dimensional output array. Finally, a fully connected layer with a SoftMax activation function maps Y_conv2^(i) to the probabilities of the K = 15 different classes. Except for the final layer, we adopted ReLU as the activation function. To find the best training hyperparameters for the optimizer, we used 10% of the training set to perform a random search evaluation, with few epochs, to select the most promising values. After this first preliminary phase, the analysis focused only on the most promising hyperparameter values, fine-tuning them with a grid search strategy.

So, for the AMSGrad optimizer we set β_1 = 0.86, β_2 = 0.98, and ε = 10^(-9). Finally, as previously introduced, with a preliminary procedure we linearly increased the learning rate η while observing the value of the loss function, in order to estimate the initial value of this important hyperparameter. In conclusion, we fed our model with more than 62000 samples for 150 epochs with a batch size of 128, while cyclically varying the learning rate with a cosine annealing strategy. All tests were carried out with the TensorFlow framework on a workstation with 64 GB of RAM, an Intel Core i7-9700K CPU, and an Nvidia 2080 Ti GPU.
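Putting the pieces of Section 4 together under these settings gives, for example, the following Keras sketch. Keras's stock LSTM has no peephole connections, so it stands in here for the peephole cells of Section 4.1.1, and the built-in AMSGrad variant of Adam replaces the custom cosine-annealing schedule; with these layer sizes, the summary reports 30,936 trainable parameters, matching the figure stated above.

```python
import tensorflow as tf
from tensorflow.keras import layers

t_steps, n_bands, K = 9, 5, 15   # time steps, spectral bands, classes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(t_steps, n_bands)),
    layers.LSTM(32, return_sequences=True),   # sequence-to-sequence RNN stage
    layers.Dropout(0.2),                      # the only regularization used
    layers.TimeDistributed(layers.Dense(9)),  # (9, 32) -> (9, 9) square matrix
    layers.Reshape((9, 9, 1)),                # add the channel dimension
    layers.Conv2D(16, 3, activation="relu"),  # f1 = 3, n1 = 16 -> (7, 7, 16)
    layers.Conv2D(32, 7, activation="relu"),  # f2 = 7, n2 = 32 -> (1, 1, 32)
    layers.Flatten(),
    layers.Dense(K, activation="softmax"),    # Eq. (10) over 15 classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(amsgrad=True),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```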

5.4 Classification

The performance of the classifier was evaluated by user's accuracy (UA), producer's accuracy (PA), overall accuracy (OA), and the kappa coefficient (K), shown in the confusion matrix of Table 4, which is the most common set of metrics used for classification tasks [Kussul(2016), Congalton(1991), Conrad(2014), Skakun(2015)]. Overall accuracy indicates the overall performance of our proposed Pixel R-CNN architecture, calculated as the ratio between the total number of correctly classified pixels and the total number of ground-truth pixels over all classes. The diagonal elements of the matrix represent the pixels that were classified correctly for each class. Individual class accuracies were calculated by dividing the number of correctly classified pixels in each category by the total number of pixels in the corresponding row (producer's accuracy) or column (user's accuracy). PA indicates the probability that a certain crop type on the ground is classified as such. UA represents the probability that a pixel classified in a given class actually belongs to that class.
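These quantities follow mechanically from the confusion matrix. A small sketch with a made-up 3-class matrix, laid out like Table 4 (rows are ground truth, columns are classified classes):

```python
import numpy as np

def accuracy_metrics(cm):
    """OA, per-class UA/PA, and the kappa coefficient from a confusion matrix
    whose rows are ground-truth classes and columns are predicted classes."""
    n = cm.sum()
    oa = np.trace(cm) / n                       # overall accuracy
    pa = np.diag(cm) / cm.sum(axis=1)           # producer's accuracy (per row)
    ua = np.diag(cm) / cm.sum(axis=0)           # user's accuracy (per column)
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, ua, pa, kappa

cm = np.array([[90, 5, 5],     # toy 3-class example
               [10, 80, 10],
               [0, 5, 95]])
oa, ua, pa, kappa = accuracy_metrics(cm)
print(round(oa, 3), ua.round(2), pa.round(2), round(kappa, 3))
```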

Figure 11: (a) Final classified map using Pixel R-CNN; (b) zoomed-in region of the classified map; (c) raw Sentinel-2 RGB composite of the zoomed region.

Our proposed pixel-based Pixel R-CNN method achieved OA = 96.5% and kappa = 0.914 with 15 classes over a diverse, large-scale area, which exhibits a significant improvement compared to other mainstream methods. Water bodies and trees stand highest in terms of UA, with 99.1% and 99.3%, respectively. That is mainly due to low intra-class variability and the minor change of NIR band reflectance over time, which was easily learned by our Pixel R-CNN. Most of the classes, including the major crop types such as Maize, Wheat, Lucerne, Vineyard, Soya, Rye, and Barley, were classified with more than 95% UA. Grassland was the worst class, classified with PA = 65% and UA = 68%. The major confusion of the grassland class was with Lucerne and Vineyard. It is worth mentioning that the Artificial class, which comprises roads, buildings, and urban areas and therefore represents pixels of mixed reflectance, was accurately detected, with UA = 97% and PA = 99%.

For the class Apple, the obtained PA was 86% while UA = 64%, which shows that 86% of the ground-truth Apple pixels were identified as Apple, but only 64% of the pixels classified as Apple actually belonged to the class.


Table 4: Obtained confusion matrix.

Rows: ground truth. Columns: classified classes.

Ground Truth        TM    AR    TR    RY    WH    SY   AP   PR   GL   WT    LN   DW    VY   BL    MZ  Total    PA
Tomatoes (TM)     1096     0     0     0     4    11    0    0    0    0     0    0     0    0     0   1111   98%
Artificial (AR)      0  3752     8     1     2     0    2    1    9    9    12    2     6    0     4   3808   99%
Trees (TR)           0    31  2967     1     0     0    0    3   10    0    17    0     2    0     0   3031   98%
Rye (RY)             0     1     0  1960    25     0    0    0    0    0     0    0     0    5     0   1991   98%
Wheat (WH)          38     7     0   221  4981     6    0    0   10    0    14    1     2   38    42   5360   93%
Soya (SY)            3     0     0     0     3  1226    0    0    0    0    11    0     3    0    41   1287   95%
Apple (AP)           0     0     0     0     0     0  142    0    0    0     2    0    21    0     0    165   86%
Peer (PR)            0     0    11     0     0     0   27  124    0    0     0    0     6    0     0    168   73%
Grassland (GL)       0    39     3     7     0     1    0    0  239    0    72    0     3    0     4    368   65%
Water (WT)           0     0     0     0     0     0    0    0    0  906     0    0     0    0     0    906  100%
Lucerne (LN)         0     0     0     2     0     2    0    0   48    0  7250    0    26    0    10   7338   98%
Durum Wheat (DW)     0     4     0     0     0     0    0    0    2    0     0  322     0    0     0    328   98%
Vineyard (VY)       11     7     4     4    11     1   50    1   21    0    93    0  2139    0     7   2349   91%
Barley (BL)          0     1     0     2    24     0    0    0    1    0     1    0     0  817     0    846   96%
Maize (MZ)          17    14     0     0    10    24    0    3   10    0    16    1     6    0  7689   7790   99%
Total             1165  3856  2993  2198  5060  1271  221  132  350  915  7488  326  2214  860  7797
UA                 94%   97%   99%   89%   98%   96%  64%  93%  68%  99%   96%  99%   96%  95%   98%

Table 5: An overview and performance of recent studies.

Study                                       Sensor                   Features              Classifier           Accuracy   Classes
Ours                                        Sentinel-2               BOA reflectances      Pixel R-CNN          96.50%     15
Rußwurm and Körner [Lawrence(2018)], 2018   Sentinel-2               TOA reflectances      Recurrent encoders   90%        17
Skakun et al. [Skakun(2015)], 2016          Radarsat-2 + Landsat-8   Optical + SAR         NN and MLPs          90%        11
Conrad et al. [Conrad(2014)], 2014          RapidEye                 Vegetation indices    RF and OBIA          86%        9
Vuolo et al. [Vuolo(2018)], 2018            Sentinel-2               Optical               RF                   91-95%     9
Hao et al. [Hao(2015)], 2015                MODIS                    Stat + phenological   RF                   89%        6
J.M. Peña-Barragán [Peña(2011)], 2011       ASTER                    Vegetation indices    OBIA + DT            79%        13

Some pixels (see Table 4) belonging to Peer and Vineyard were mistakenly classified as Apple. The final classified map is shown in Fig. 11, together with a zoomed-in example and the actual RGB image. To the best of our knowledge, a multi-temporal benchmark data set is not available to compare classification approaches on an equal footing. Some data sets are available online for crop classification, but without ground truth for other land cover types such as trees, artificial land (build-ups), water bodies, and grassland; therefore, it is difficult to compare classification approaches on an equal footing. Indeed, a direct quantitative comparison with the classification performed in these studies is difficult due to various dependencies, such as the number of evaluated ground-truth samples, the extent of the considered study area, and the number of classes to be evaluated. Nonetheless, in Table 5 we provide an overview of recent studies in this domain through their applied approaches, the number of considered classes, the sensors used, and the achieved overall accuracy. Hao et al. [Hao(2015)] achieved 89% OA by using an RF classifier on phenological features extracted from MODIS time-series data. They determined that good classification accuracies can be achieved with handcrafted features and classification algorithms if the temporal resolution of the data is sufficient, though MODIS data are not suitable for classifying areas outside large homogeneous regions due to their low spatial resolution (500 m). Conrad et al. [Conrad(2014)] used high-spatial-resolution data from the RapidEye sensor and achieved 86% OA for nine considered classes. In [Skakun(2015)], features from optical and SAR data were extracted and used by a committee of neural networks of multilayer perceptrons to classify a considerably diverse agricultural region. Recurrent encoders were employed in [Lawrence(2018)] to classify a large area with 17 considered classes using high-spatial-resolution (10 m) Sentinel-2 data, achieving 90% OA, which proved that recurrent encoders are useful to capture the temporal information of spectral features, leading to higher accuracy. Vuolo et al. [Vuolo(2018)] also used Sentinel-2 data and achieved a maximum 95% classification accuracy using an RF classifier, but only nine classes were considered.

In conclusion, it is interesting to observe the neuron activations inside the network during the classification process. Indeed, it is possible to plot unit values when the network receives specific inputs and compare how the model's behavior changes. In Fig. 12, four samples belonging to the same class, "artificials", feed the model, creating individual activations in the input layer. Even though they all belong to the same class, the four instances taken into account present a noticeable variance.


Figure 12: Visual representation of the activation of the internal neurons of Pixel R-CNN, where darker colors indicate values close to zero and vice versa. (a) Four samples of the same class, "artificials"; (b) the related activations inside the network at the output of the TimeDistributedDense layer, Y_timeD. It is interesting to notice how the four inputs are quite different from each other, while the network representations at this level are already similar.

Either the spectral features at a specific time instance (rows) or their temporal variation (columns) present different patterns that make these samples difficult to classify. However, already after the first LSTM layer with the TimeDistributedDense block, the resulting 9×9 matrices Y_timeD show a clear pattern that can be used by the following layers to classify the different instances into their respective classes. So, during the training process, the network learns to identify specific temporal schemes, which allows it to build strong distributed and disentangled representations.

5.5 Non-deep-learning classifiers

We tried four other traditional classifiers on the same data set for comparison: Support Vector Machine (SVM), kernel SVM, Random Forest (RF), and XGBoost. These classifiers are well known for their high performance and are also considered baseline models in classification tasks [Srivastava(2014)]. SVM separates classes with hyperplanes and can perform nonlinear classification using kernel functions. The widely used RF classifier is an ensemble of decision trees based on the bagging approach [Fernández(2014), Shi(2016)]. XGBoost is a state-of-the-art classifier based on a gradient boosting model of decision trees, which has attracted much attention in the machine learning community. RF and SVM have been widely used in remote sensing applications [Löw(2013), Hao(2015), Gislason(2006)]. Each classifier involves hyperparameters that need to be tuned during classification model development.

We followed a "random search" approach to optimize the major hyperparameters [Lawrence(2006)]. The best hyperparameter values were selected based on the classification accuracy achieved on the validation set, and are highlighted in bold in Table 6. Further details about the hyperparameters and the achieved overall accuracy (OA) for SVM, kernel SVM, RF, and XGBoost are reported in Table 6. Among these non-deep-learning classifiers, SVM stands highest with OA = 79.6%, while RF, kernel SVM, and XGBoost achieved 77.5%, 76.8%, and 77.2%, respectively. From the results presented in Table 6, our proposed Pixel R-CNN classifier achieved OA = 96.5%, a far better result than the non-deep-learning classifiers. Learning temporal and spectral correlations from multi-temporal images over a large data set is challenging for traditional non-deep-learning techniques. The introduction of deep learning models in the remote sensing domain brought more flexibility to exploit temporal features in such a way as to increase the amount of usable information and obtain much better and more reliable results for classification tasks.
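A sketch of this tuning loop with scikit-learn's RandomizedSearchCV, shown for two of the four baselines. The data are random placeholders; the candidate grids echo Table 6, while the cross-validation settings and iteration counts are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Random placeholders for flattened pixel samples (9 steps x 5 bands = 45
# features) and the 15 class labels.
X = np.random.rand(2000, 45)
y = np.random.randint(0, 15, size=2000)

# Candidate values echo Table 6; random search samples a subset of combinations.
searches = {
    "Kernel SVM": RandomizedSearchCV(
        SVC(kernel="rbf"),
        {"C": [0.01, 0.1, 1, 10, 100, 1000],
         "gamma": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]},
        n_iter=10, cv=3, random_state=0),
    "Random Forest": RandomizedSearchCV(
        RandomForestClassifier(),
        {"n_estimators": [10, 20, 100, 200, 500],
         "max_depth": [5, 10, 15, 30],
         "min_samples_split": [3, 5, 10, 15, 30],
         "min_samples_leaf": [1, 3, 5, 10]},
        n_iter=10, cv=3, random_state=0),
}
for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```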

6 Conclusion

In this study, we developed a novel deep learning model combining Recurrent and Convolutional Neural Networks, called Pixel R-CNN, to perform land cover and crop classification using multi-temporal decametric Sentinel-2 imagery of the center-north part of Italy. Our proposed Pixel R-CNN architecture exhibits a significant improvement compared to other mainstream methods, achieving 96.5% overall accuracy with kappa = 0.914 for 15 classes.



We also tested widely used non-deep-learning classifiers, namely SVM, Kernel SVM, RF, and XGBoost, to compare with our proposed classifier, and found that these methods are less effective, especially when temporal feature extraction is the key to increasing classification accuracy. The main advantage of our architecture is its capability of automated feature extraction by learning the time correlation of multiple images, which reduces manual feature engineering and the modeling of crop phenological stages.

Acknowledgments

This work has been developed with the contribution of the Politecnico di Torino Interdepartmental Centre for Service Robotics PIC4SeR (https://pic4ser.polito.it) and SmartData@Polito (https://smartdata.polito.it).

References

[Gomez(2016)] Gomez, C., White, J. C., Wulder, M. A. Optical remotely sensed time series data for land cover classification: A review. ISPRS J. Photogramm. Remote Sens. 2016, 116, 55-72.

[Wu(2014)] Wu, W.-B.; Yu, Q.-Y.; Peter, V.H.; You, L.-Z.; Yang, P.; Tang, H.-J. How Could Agricultural Land Systems Contribute to Raise Food Production Under Global Change? J. Integr. Agric. 2014, 13, 1432–1442.

[Jin(2019)] Jin, Z.; Azzari, G.; You, C.; Di Tommaso, S.; Aston, S.; Burke, M.; Lobell, D.B. Smallholder maize area and yield mapping at national scales with Google Earth Engine. Remote Sens. Environ. 2019, 228, 115–128.

[Yang(2007)] Yang, P.; Wu, W.-B.; Tang, H.-J.; Zhou, Q.-B.; Zou, J.-Q.; Zhang, L. Mapping Spatial and Temporal Variations of Leaf Area Index for Winter Wheat in North China. Agric. Sci. China 2007, 6, 1437–1443.

[Wang(2016)] Wang, L.A.; Zhou, X.; Zhu, X.; Dong, Z.; Guo, W. Estimation of biomass in wheat using random forest regression algorithm and remote sensing data. Crop J. 2016, 4, 212–219.

[Guan(2016)] Guan, K.; Berry, J.A.; Zhang, Y.; Joiner, J.; Guanter, L.; Badgley, G.; Lobell, D.B. Improving the monitoring of crop productivity using spaceborne solar-induced fluorescence. Glob. Chang. Biol. 2016, 22, 716–726.

[Matton(2015)] Matton, N., Canto, G., Waldner, F., Valero, S., Morin, D., Inglada, J., Defourny, P. An automated method for annual cropland mapping along the season for various globally-distributed agrosystems using high spatial and temporal resolution time series. Remote Sens. (Basel) 2015, 7(10), 13208-13232.


[Battude(2016)] Battude, M.; Al Bitar, A.; Morin, D.; Cros, J.; Huc, M.; Marais Sicre, C.; Le Dantec, V.; Demarez, V. Estimating maize biomass and yield over large areas using high spatial and temporal resolution Sentinel-2 like remote sensing data. Remote Sens. Environ. 2016, 184, 668–681.

[Huang(2015)] Huang, J.; Tian, L.; Liang, S.; Ma, H.; Becker-Reshef, I.; Huang, Y.; Su, W.; Zhang, X.; Zhu, D.; Wu, W. Improving winter wheat yield estimation by assimilation of the leaf area index from Landsat TM and MODIS data into the WOFOST model. Agric. For. Meteorol. 2015, 204, 106–121.

[Toureiro(2017)] Toureiro, C.; Serralheiro, R.; Shahidian, S.; Sousa, A. Irrigation management with remote sensing: Evaluating irrigation requirement for maize under Mediterranean climate condition. Agric. Water Manag. 2017, 184, 211–220.

[Yu(2017)] Yu, Q.; Shi, Y.; Tang, H.; Yang, P.; Xie, A.; Liu, B.; Wu, W. eFarm: A Tool for Better Observing Agricultural Land Systems. Sensors 2017, 17, 453.

[Boryan(2011)] Boryan, C., Yang, Z., Mueller, R., Craig, M. Monitoring US agriculture: the US Department of Agriculture, National Agricultural Statistics Service, Cropland Data Layer program. Geocarto Int. 2011, 26(5), 341-358.

[Xie(2008)] Xie, Y., Sha, Z., Yu, M. Remote sensing imagery in vegetation mapping: a review. Plant Ecol. 2008, 1(1), 9-23.

[Johnson(2016)] Johnson, D. M. A comprehensive assessment of the correlations between field crop yields and commonly used MODIS products. Int. J. Appl. Earth Obs. Geoinf. 2016, 52, 65-81.

[Senf(2015)] Senf, C., Leitão, P. J., Pflugmacher, D., van der Linden, S., Hostert, P. Mapping land cover in complex Mediterranean landscapes using Landsat: Improved classification accuracies from integrating multi-seasonal and synthetic imagery. Remote Sens. Environ. 2015, 156, 527-536.

[Jin(2016)] Jin, H., Li, A., Wang, J., Bo, Y. Improvement of spatially and temporally continuous crop leaf area index by integration of CERES-Maize model and MODIS data. Eur. J. Agron. 2016, 78, 1-12.

[Liaqat(2017)] Liaqat, M. U., Cheema, M. J. M., Huang, W., Mahmood, T., Zaman, M., Khan, M. M. Evaluation of MODIS and Landsat multiband vegetation indices used for wheat yield estimation in irrigated Indus Basin. Comput. Electron. Agric. 2017, 138, 39-47.

[Senf(2017)] Senf, C., Pflugmacher, D., Heurich, M., Krueger, T. A Bayesian hierarchical model for estimating spatial and temporal variation in vegetation phenology from Landsat time series. Remote Sens. Environ. 2017, 194, 155-160.

[Kussul(2016)] Kussul, N., Lemoine, G., Gallego, F. J., Skakun, S. V., Lavreniuk, M., Shelestov, A. Y. Parcel-based crop classification in Ukraine using Landsat-8 data and Sentinel-1A data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9(6), 2500-2508.

[Xiong(2017)] Xiong, J., Thenkabail, P. S., Gumma, M. K., Teluguntla, P., Poehnelt, J., Congalton, R. G., Thau, D. Automated cropland mapping of continental Africa using Google Earth Engine cloud computing. ISPRS J. Photogramm. Remote Sens. 2017, 126, 225-244.

[Yan(2015)] Yan, L., Roy, D. P. Improved time series land cover classification by missing-observation-adaptive nonlinear dimensionality reduction. Remote Sens. Environ. 2015, 158, 478-491.

[Xiao(2018)] Xiao, J., Wu, H., Wang, C., Xia, H. Land Cover Classification Using Features Generated From Annual Time-Series Landsat Data. IEEE Geosci. Remote Sens. Lett. 2018, 15(5), 739-743.

[Khaliq(2018)] Khaliq, A., Peroni, L., Chiaberge, M. Land cover and crop classification using multitemporal Sentinel-2 images based on crops phenological cycle. In IEEE Workshop on Environmental, Energy, and Structural Monitoring Systems (EESMS); IEEE, 2018; pp. 1-5.

[Gallego(2008)] Gallego, J., Craig, M., Michaelsen, J., Bossyns, B., Fritz, S. Best practices for crop area estimation with remote sensing. Ispra: Joint Research Centre, 2008.

[Zhou(2013)] Zhou, F., Zhang, A., Townley-Smith, L. A data mining approach for evaluation of optimal time-series of MODIS data for land cover mapping at a regional level. ISPRS J. Photogramm. Remote Sens. 2013, 84, 114-129.

[Arvor(2011)] Arvor, D., Jonathan, M., Meirelles, M. S. P., Dubreuil, V., Durieux, L. Classification of MODIS EVI time series for crop mapping in the state of Mato Grosso, Brazil. Int. J. Remote Sens. 2011, 32(22), 7847-7871.

[Zhong(2012)] Zhong, L., Gong, P., Biging, G. S. Phenology-based crop classification algorithm and its implications on agricultural water use assessments in California's Central Valley. Photogramm. Eng. Remote Sens. 2012, 78(8), 799-813.


[Wardlow(2012)] Wardlow, B. D., Egbert, S. L. A comparison of MODIS 250-m EVI and NDVI data for crop mapping: a case study for southwest Kansas. Int. J. Remote Sens. 2012, 31(3), 805-830.

[Löw(2013)] Löw, F., Michel, U., Dech, S., Conrad, C. Impact of feature selection on the accuracy and spatial uncertainty of per-field crop classification using support vector machines. ISPRS J. Photogramm. Remote Sens. 2013, 85, 102-119.

[Hao(2015)] Hao, P., Zhan, Y., Wang, L., Niu, Z., Shakir, M. Feature selection of time series MODIS data for early crop classification using random forest: A case study in Kansas, USA. Remote Sens. 2015, 7(5), 5347-5369.

[Blaschke(2010)] Blaschke, T. Object based image analysis for remote sensing. ISPRS J. Photogramm. Remote Sens. 2010, 65(1), 2-16.

[Novelli(2016)] Novelli, A., Aguilar, M. A., Nemmaoui, A., Aguilar, F. J., Tarantino, E. Performance evaluation of object based greenhouse detection from Sentinel-2 MSI and Landsat 8 OLI data: A case study from Almería (Spain). Int. J. Appl. Earth Obs. Geoinf. 2016, 52, 403-411.

[Long(2013)] Long, J. A., Lawrence, R. L., Greenwood, M. C., Marshall, L., Miller, P. R. Object-oriented crop classification using multitemporal ETM+ SLC-off imagery and random forest. GISci. Remote Sens. 2013, 50(4), 418-436.

[Li(2015)] Li, Q., Wang, C., Zhang, B., Lu, L. Object-based crop classification with Landsat-MODIS enhanced time-series data. Remote Sens. (Basel) 2015, 7(12), 16091-16107.

[Walker(2014)] Walker, J. J., De Beurs, K. M., Wynne, R. H. Dryland vegetation phenology across an elevation gradient in Arizona, USA, investigated with fused MODIS and Landsat data. Remote Sens. Environ. 2014, 144, 85-97.

[Walker(2015)] Walker, J. J., De Beurs, K. M., Henebry, G. M. Land surface phenology along urban to rural gradients in the US Great Plains. Remote Sens. Environ. 2015, 165, 42-52.

[Simonneaux(2008)] Simonneaux, V., Duchemin, B., Helson, D., Er-Raki, S., Olioso, A., Chehbouni, A. G. The use of high-resolution image time series for crop classification and evapotranspiration estimate over an irrigated area in central Morocco. Int. J. Remote Sens. 2008, 29(1), 95-116.

[Shao(2016)] Shao, Y., Lunetta, R. S., Wheeler, B., Iiames, J. S., Campbell, J. B. An evaluation of time-series smoothing algorithms for land-cover classifications using MODIS-NDVI multi-temporal data. Remote Sens. Environ. 2016, 174, 258-265.

[Galford(2008)] Galford, G. L., Mustard, J. F., Melillo, J., Gendrin, A., Cerri, C. C., Cerri, C. E. Wavelet analysis of MODIS time series to detect expansion and intensification of row-crop agriculture in Brazil. Remote Sens. Environ. 2008, 112(2), 576-587.

[Funk(2009)] Funk, C., Budde, M. E. Phenologically-tuned MODIS NDVI-based production anomaly estimates for Zimbabwe. Remote Sens. Environ. 2009, 113(1), 115-125.

[Siachalou(2015)] Siachalou, S., Mallinis, G., Tsakiri-Strati, M. A hidden Markov models approach for crop classification: Linking crop phenology to time series of multi-sensor remote sensing data. Remote Sens. (Basel) 2015, 7(4), 3633-3650.

[Xin(2015)] Xin, Q., Broich, M., Zhu, P., Gong, P. Modeling grassland spring onset across the Western United States using climate variables and MODIS-derived phenology metrics. Remote Sens. Environ. 2015, 161, 63-77.

[Xin(2016)] Gonsamo, A., Chen, J. M. Circumpolar vegetation dynamics product for global change study. Remote Sens. Environ. 2016, 182, 13-26.

[Dannenberg(2015)] Dannenberg, M. P., Song, C., Hwang, T., Wise, E. K. Empirical evidence of El Niño–Southern Oscillation influence on land surface phenology and productivity in the western United States. Remote Sens. Environ. 2015, 159, 167-180.

[Khatami(2016)] Khatami, R., Mountrakis, G., Stehman, S. V. A meta-analysis of remote sensing research on supervised pixel-based land-cover image classification processes: General guidelines for practitioners and future research. Remote Sens. Environ. 2016, 177, 89-100.

[Gislason(2006)] Gislason, P. O., Benediktsson, J. A., Sveinsson, J. R. Random forests for land cover classification. Pattern Recognit. Lett. 2006, 27(4), 294-300.

[LeCun(2015)] LeCun, Y., Bengio, Y., Hinton, G. Deep learning. Nature 2015, 521(7553), 436-444.

[Congalton(1991)] Congalton, R. G. A review of assessing the accuracy of classifications of remotely sensed data. Remote Sens. Environ. 1991, 37(1), 35-46.


[Sutskever(2014)] Sutskever, I., Vinyals, O., Le, Q. V. Sequence to sequence learning with neural networks. In Adv. Neural Inf. Process. Syst.; NIPS, 2014; pp. 3104-3112.

[Sutskever(2000)] Gers, F. A., Schmidhuber, J., Cummins, F. Learning to forget: Continual prediction with LSTM. Neural Computation 2000, 850-855.

[Chung(2014)] Chung, J., Gulcehre, C., Cho, K., Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv 2014, 1412.3555.

[Bahdanau(2014)] Bahdanau, D., Cho, K., Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv 2014, 1409.0473.

[Chung(2015)] Chung, J., Gulcehre, C., Cho, K., Bengio, Y. Gated feedback recurrent neural networks. In Proc. Int. Conf. Machine Learning, 2015, pp. 2067-2075.

[Lyu(2016)] Lyu, H., Lu, H., Mou, L. Learning a transferable change rule from a recurrent neural network for land cover change detection. Remote Sens. 2016, 8(6), 506.

[Lyu(2018)] Lyu, H., Lu, H., Mou, L., Li, W., Wright, J., Li, X., Gong, P. Long-term annual mapping of four cities on different continents by applying a deep information learning method to Landsat data. Remote Sens. 2018, 10(3), 471.

[LeCun(1998)] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 1998, 86(11), 2278-2324.

[Hubel(1962)] Hubel, D. H., Wiesel, T. N. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. (Lond.) 1962, 160(1), 106-154.

[Hubel(2010)] Szeliski, R. Computer vision: algorithms and applications. Springer Science & Business Media; Springer, London, 2010.

[Hubel(2015)] Dalal, N., Triggs, B. Histograms of oriented gradients for human detection. In Proc. IEEE Conf. on Comput. Vision and Patt. Recog. 2005, pp. 886-893.

[Krizhevsky(2012)] Krizhevsky, A., Sutskever, I., Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Adv. Neural Inf. Process. Syst.; 2012, pp. 1097-1105.

[Zeiler(2014)] Zeiler, M. D., Fergus, R. Visualizing and understanding convolutional networks. In Comput. Vis. ECCV; Springer, Cham, 2014, pp. 818-833.

[Wan(2017)] Wan, X., Zhao, C., Wang, Y., Liu, W. Stacked sparse autoencoder in hyperspectral data classification using spectral-spatial, higher order statistics and multifractal spectrum features. Infrared Phys. Technol. 2017, 86, 77-89.

[Mou(2017)] Mou, L., Ghamisi, P., Zhu, X. X. Unsupervised spectral–spatial feature learning via deep residual Conv–Deconv network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2017, 56(1), 391-406.

[Kampffmeyer(2016)] Kampffmeyer, M., Salberg, A. B., Jenssen, R. Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In Proceedings of the IEEE Conf. on Comput. Vision and Patt. Recog. Workshops; 2016, pp. 1-9.

[Kampffmeyer(2017)] Maggiori, E., Tarabalka, Y., Charpiat, G., Alliez, P. High-resolution aerial image labeling with convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 55(12), 7092-7103.

[Audebert(2018)] Audebert, N., Le Saux, B., Lefèvre, S. Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20-32.

[Kussul(2017)] Kussul, N., Lavreniuk, M., Skakun, S., Shelestov, A. Deep learning classification of land cover and crop types using remote sensing data. IEEE Geosci. Remote Sens. Lett. 2017, 14(5), 778-782.

[LUCAS(2019)] Land Use and Coverage Area frame Survey (LUCAS) Details. [online] Available: https://ec.europa.eu/eurostat/web/lucas [Accessed on 14-01-2019]

[Sentinel-2(2019)] Sentinel-2 MSI Technical Guide. [online] Available: https://sentinel.esa.int/web/sentinel/technical-guides/sentinel-2-msi [Accessed on 14-01-2019]

[Kaufman(1988)] Kaufman, Y. J., Sendra, C. Algorithm for automatic atmospheric corrections to visible and near-IR satellite imagery. Int. J. Remote Sens. 1988, 9(8), 1357-1381.

[Mellet(1996)] Mellet, E., Tzourio, N., Crivello, F., Joliot, M., Denis, M., Mazoyer, B. Functional anatomy of spatial mental imagery generated from verbal instructions. J. Neurosci. 1996, 16(20), 6504-6512.

[Reddi(2019)] Reddi, S. J., Kale, S., Kumar, S. On the convergence of Adam and beyond. arXiv preprint arXiv 2019, 1904.09237.


[Kingma(2019)] Kingma, D. P., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv 2014, 1412.6980.

[Smith(2017)] Smith, L. N. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conf. on Appl. of Comput. Vision (WACV); IEEE, 2017, pp. 464-472.

[Srivastava(2014)] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15(1), 1929-1958.

[Fernández(2014)] Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 2014, 15(1), 3133-3181.

[Shi(2016)] Shi, D., Yang, X. An assessment of algorithmic parameters affecting image classification accuracy by random forests. Photogramm. Eng. Remote Sens. 2016, 82(6), 407-417.

[Lawrence(2006)] Lawrence, R. L., Wood, S. D., Sheley, R. L. Mapping invasive plants using hyperspectral imagery and Breiman Cutler classifications (RandomForest). Remote Sens. Environ. 2006, 100(3), 356-362.

[Bergstra(2012)] Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281-305.

[Lawrence(2018)] Rußwurm, M., Körner, M. Multi-temporal land cover classification with sequential recurrent encoders. ISPRS Int. J. Geoinf. 2018, 7(4), 129.

[Conrad(2014)] Conrad, C., Dech, S., Dubovyk, O., Fritsch, S., Klein, D., Löw, F., Zeidler, J. Derivation of temporal windows for accurate crop discrimination in heterogeneous croplands of Uzbekistan using multitemporal RapidEye images. Comput. Electron. Agric. 2014, 103, 63-74.

[Skakun(2015)] Skakun, S., Kussul, N., Shelestov, A. Y., Lavreniuk, M., Kussul, O. Efficiency assessment of multi-temporal C-band Radarsat-2 intensity and Landsat-8 surface reflectance satellite imagery for crop classification in Ukraine. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 9(8), 3712-3719.

[Vuolo(2018)] Vuolo, F., Neuwirth, M., Immitzer, M., Atzberger, C., Ng, W. T. How much does multi-temporal Sentinel-2 data improve crop type classification? Int. J. Appl. Earth Obs. Geoinf. 2018, 72, 122-130.

[Peña(2011)] Peña-Barragán, J. M., Ngugi, M. K., Plant, R. E., Six, J. Object-based crop identification using multiple vegetation indices, textural features and crop phenology. Remote Sens. Environ. 2011, 115(6), 1301-1316.
