REMOTE SENSING CLASSIFICATION.
ALGORITHMS ANALYSIS APPLIED TO
LAND COVER CHANGE.
GERMAN ALBA
Seminar
Master in Emergency Early Warning and Response Space Applications. Mario Gulich
Institute, CONAE. Argentina
October, 2014
Contents
1. Introduction
2. Classification Algorithms: Categorization
   2.1. Supervised and Unsupervised
   2.2. Parametric and Non-parametric
   2.3. Per-pixel and Subpixel classifiers and hard and soft classifiers
3. Maximum likelihood
4. Decision Tree
5. Support Vector Machine
6. Comparison of SVM, MLC and DT performance
7. Fuzzy Classification
   7.1. Fuzzy sets and fuzzy logic theory
   7.2. Soft classifiers
8. Land cover change detection
9. Conclusions
10. References
1. Introduction
Remote-sensing research focusing on image classification has long attracted the
attention of the remote-sensing community because classification results are the basis for many
environmental and socioeconomic applications. However, classifying remotely sensed data into a
thematic map remains a challenge because many factors, such as the complexity of the landscape in a
study area, selected remotely sensed data, and image-processing and classification approaches, may
affect the success of a classification (D. LU and Q. WENG, 2007).
Remote-sensing classification is a complex process and requires consideration of many factors.
The major steps of image classification may include determination of a suitable classification system,
selection of training samples, image preprocessing, feature extraction, selection of suitable
classification approaches, post-classification processing, and accuracy assessment. The user’s need,
scale of the study area, economic condition, and analyst’s skills are important factors influencing the
selection of remotely sensed data, the design of the classification procedure, and the quality of the
classification results (D. LU and Q. WENG, 2007).
This report will focus on the image classification process itself, exploring four approaches: Maximum Likelihood, Decision Tree, Support Vector Machine, and fuzzy classification theory, together with some examples.
2. Classification Algorithms: Categorization
Classification algorithms can be grouped into several categories, depending on the criteria on which we focus.
2.1. Supervised and Unsupervised
The first big separation is between supervised and unsupervised classification algorithms, which depends on whether training samples are needed or not.
Supervised classification is the most widely used because of its accuracy, but field information is needed. Land cover classes are defined, and sufficient reference data has to be available and used as training samples. The signatures generated from the training samples are then used to train the classifier to classify the spectral data into a thematic map (D. LU and Q. WENG, 2007). Some examples of this kind of approach are: maximum likelihood, minimum distance, artificial neural network, and decision tree classifiers. Later on, maximum likelihood and decision tree will be explained.
Unsupervised classification is also widely used as a first step to understand the spectral response of the different covers in a satellite image. Clustering-based algorithms are used to partition the spectral
image into a number of spectral classes based on the statistical information inherent in the image. No
prior definitions of the classes are used. The analyst is responsible for labeling and merging the
spectral classes into meaningful classes (D. LU and Q. WENG, 2007). ISODATA and K-means are the
most used algorithms for this.
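As a minimal sketch of how a clustering-based unsupervised classifier partitions pixel spectra, the following Python/NumPy example implements a basic K-means loop on synthetic two-cover data; the band values and cluster count are illustrative assumptions, not from the text:

```python
import numpy as np

# Synthetic 3-band pixel spectra for two spectrally distinct covers
# (band values are illustrative, not taken from any real sensor).
rng = np.random.default_rng(0)
pixels = np.vstack([
    rng.normal([50, 60, 40], 5, size=(100, 3)),
    rng.normal([120, 110, 90], 5, size=(100, 3)),
])

def kmeans(data, k, n_iter=20, seed=0):
    """Minimal K-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(n_iter):
        # Assign each pixel to its nearest spectral centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean spectrum of its cluster.
        centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids, labels

centroids, labels = kmeans(pixels, k=2)
# The algorithm only produces spectral classes; the analyst must still
# label and merge them into meaningful land-cover classes.
```

Note that, as the text says, the output is purely spectral: the cluster ids carry no thematic meaning until the analyst labels them.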
2.2. Parametric and Non-parametric
Another important criterion for categorization is parametric versus non-parametric classifiers. This distinction depends on whether parameters such as the mean vector and covariance matrix are used or not.
In parametric classifiers, a Gaussian distribution is assumed. The parameters (e.g. mean vector and covariance matrix) are often generated from training samples. When the landscape is complex, parametric
classifiers often produce ‘noisy’ results. Another major drawback is that it is difficult to integrate
ancillary data, spatial and contextual attributes, and non-statistical information into a classification
procedure (D. LU and Q. WENG, 2007). Examples of this are Maximum likelihood and linear
discriminant analysis.
In non-parametric classifiers, no assumption about the data distribution is required. They do not employ statistical parameters to calculate class separation and are especially suitable for incorporation of non-remote-sensing data into a classification procedure (D. LU and Q. WENG, 2007). Examples are artificial
neural network, decision tree classifier, evidential reasoning, support vector machine, expert system.
Later on, decision tree and support vector machine will be explained.
2.3. Per-pixel and Subpixel classifiers and hard and soft classifiers
This separation mainly distinguishes classical classification from fuzzy classification. It depends on which kind of pixel information is used and whether the output is a definitive decision about the land cover class or not. Per-pixel classifiers, which include most traditional classifiers, typically develop a signature by combining the spectra of all training-set pixels for a given feature. The resulting signature contains the contributions of all materials present in the training-set pixels, ignoring the mixed-pixel problem. Related to these are the hard classifiers: they make a definitive decision about the land cover class of each pixel, allocating it to a single class. Area estimation by hard classification may produce large errors, especially from coarse spatial resolution data, due to the mixed-pixel problem. Most classifiers are examples of this, such as maximum likelihood, minimum distance, artificial neural network, decision tree, and support vector machine (D. LU and Q. WENG, 2007).
In subpixel classifiers, the spectral value of each pixel is assumed to be a linear or non-linear
combination of defined pure materials (or end members), providing proportional membership of each
pixel to each end member (D. LU and Q. WENG, 2007). Soft classifiers are similar: for each pixel, a measure of the degree of similarity to every class is provided. Soft classification generates more information and potentially a more accurate result, especially for coarse spatial resolution data. This is how fuzzy classifiers work in general, such as fuzzy-set classifiers, subpixel classifiers, and spectral mixture analysis. Later on, fuzzy set logic will be explained.
3. Maximum likelihood
The maximum likelihood classification algorithm is one of the well-known parametric classifiers used for supervised classification.
The maximum-likelihood classifier is a parametric classifier that relies on the second- order
statistics of a Gaussian probability density function (pdf) model for each class. It is often used as a
reference for classifier comparison because, if the class pdf's are indeed Gaussian, it is the optimal
classifier (Paola J. D., Schowengerdt R. A, 1995). The basic discriminant function for each class is

$$g_i(X) = P(\omega_i)\, p(X \mid \omega_i) = P(\omega_i)\, \frac{1}{(2\pi)^{n/2}\,|\Sigma_i|^{1/2}} \exp\!\left[-\tfrac{1}{2}(X - U_i)^T \Sigma_i^{-1} (X - U_i)\right]$$

where $n$ is the number of bands, $X$ is the data vector, $U_i$ is the mean vector of class $i$, $\Sigma_i$ is the covariance matrix of class $i$, and $P(\omega_i)$ is the a priori probability of class $i$.
The values of the mean vector $U_i$ and the covariance matrix $\Sigma_i$ are estimated from the training data by the unbiased estimators

$$U_i = \frac{1}{P_i} \sum_{X \in \omega_i} X, \qquad \Sigma_i = \frac{1}{P_i - 1} \sum_{X \in \omega_i} (X - U_i)(X - U_i)^T$$

where $P_i$ is the number of training patterns in class $i$. Note that in order for the inverse of the covariance matrix to be calculated, $P_i$ must be at least one greater than the number of image bands. The discriminant function can be reduced, by taking the natural log and discarding the constant $(2\pi)^{n/2}$ term, to

$$g_i(X) = \ln P(\omega_i) - \tfrac{1}{2}\ln|\Sigma_i| - \tfrac{1}{2}(X - U_i)^T \Sigma_i^{-1} (X - U_i)$$

If the a priori probabilities are assumed to be equal, the first term is a constant and can be ignored. The second term is a constant for each class. This leaves only the third term to be calculated for each pixel during classification. The discriminant $g_i(X)$ is calculated for each class, and the class with the highest value is selected for the final classification map (Paola J. D., Schowengerdt R. A, 1995).
The advantage of the MLC as a parametric classifier is that it takes into account the variance-covariance within the class distributions, and for normally distributed data the MLC performs better than the other known parametric classifiers (Erdas, 1999). However, for data with a non-normal distribution, the results may be unsatisfactory (Otukei, Blaschke, 2010).
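The reduced discriminant described above can be sketched in a few lines of Python/NumPy. The two-band training data below are synthetic and purely illustrative, and equal priors are assumed, so the $\ln P(\omega_i)$ term is dropped:

```python
import numpy as np

def train_mlc(samples_per_class):
    """Estimate U_i and Sigma_i for each class from its training samples."""
    stats = []
    for X in samples_per_class:
        U = X.mean(axis=0)                  # mean vector U_i
        S = np.cov(X, rowvar=False)         # unbiased covariance (P_i - 1 divisor)
        stats.append((U, np.linalg.inv(S), np.log(np.linalg.det(S))))
    return stats

def classify_mlc(stats, x):
    """Pick the class maximizing the reduced log-discriminant
    (equal priors assumed, so the ln P(w_i) term is dropped)."""
    scores = [-0.5 * log_det - 0.5 * (x - U) @ S_inv @ (x - U)
              for U, S_inv, log_det in stats]
    return int(np.argmax(scores))

# Synthetic two-band training data for two classes (values illustrative).
rng = np.random.default_rng(1)
class_a = rng.normal([40, 30], 3, size=(50, 2))   # e.g. a "water"-like class
class_b = rng.normal([90, 110], 5, size=(50, 2))  # e.g. a "forest"-like class
stats = train_mlc([class_a, class_b])
print(classify_mlc(stats, np.array([42.0, 31.0])))  # near class_a's mean -> 0
```

Note that each class here has 50 training patterns for 2 bands, respecting the requirement that $P_i$ exceed the number of bands so that $\Sigma_i$ is invertible.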
4. Decision Tree
A decision tree classifier is a non-parametric classifier that does not require any a priori
statistical assumptions to be made regarding the distribution of data. The process of building the
decision tree is presented in Quinlan (1993). The basic structure of the decision tree however, consists
of one root node, a number of internal nodes and finally a set of terminal nodes. The data is
recursively divided down the decision tree according to the defined classification framework. At each
node, a decision rule is required, and this can be implemented using a splitting test often of the form

$$a^T x > c$$

where $x$ represents the measurement vector on the $n$ selected features, $a$ is a vector of linear discriminant coefficients, and $c$ is the decision threshold; in univariate decision trees, $a$ selects a single feature, so the test reduces to a threshold on one measurement (Brodley and Utgoff, 1992).
In this framework, a data set is classified by sequentially sub-dividing it according to the
decision framework defined by the tree, and a class label is assigned to each observation according to
the leaf node into which the observation falls (Friedl M. A. , Brodley C.E.,1997).
The DTs are known to produce results of higher accuracies in comparison to traditional
approaches such as the ‘‘box’’ and ‘‘minimum distance to means’’ classifiers but the performance of
DTs can be affected by a number of factors including: pruning and boosting methods used and
decision thresholds (Mahesh and Mather, 2003).
Decision trees have several advantages over traditional supervised classification procedures
used in remote sensing such as maximum likelihood classification. In particular, decision trees are
strictly non-parametric and do not require any assumptions regarding the distributions of the input data.
In addition, they handle non-linear relationships between features and classes, allow for missing values,
and are capable of handling both numeric and categorical inputs in a natural fashion (Hampson and
Volper, 1986; Fayyad and Irani, 1992a). Finally, decision trees have significant intuitive appeal because
the classification structure is explicit and therefore easily interpretable (Friedl M. A. , Brodley
C.E.,1997).
Numerous tree construction approaches have been developed over the past thirty or so years.
For classification problems that utilize data sets that are both well understood and well behaved,
classification trees may be defined solely on analyst expertise. However, this procedure is difficult to
implement in practice because the exact values of thresholds can vary substantially in both time and
space, and are therefore difficult to specify based on user knowledge alone (Friedl M. A. , Brodley
C.E.,1997).
More commonly, the splits defined at each internal node of a decision tree are estimated from
training data using a statistical procedure. The specific techniques that are used for this work are called
“learning algorithms”, which have been developed within the machine learning and pattern recognition
communities. They require high quality training data from which relationships among the features and
classes present within the data are “learned". Therefore, a set of training samples representative of the
population to be classified must be available to construct an accurate decision tree (Friedl M. A. ,
Brodley C.E.,1997).
A classic example of this approach is the classification and regression tree (CART) model
described by Breiman et al. (1984). In CART a tree-structured decision space is estimated by
recursively splitting the data at each node based on a statistical test that increases the homogeneity of
the training data in the resulting descendant nodes. The basic decision tree classification model
described by CART was tested using remotely sensed data (Hansen, et al. 1996). The results from this
work showed that the decision tree performed comparably to a maximum likelihood classifier in terms
of classification accuracy for the data set examined, and that decision trees have significant advantages
for feature selection and handling of disparate data types and missing data (Friedl M. A. , Brodley
C.E.,1997).
A key step in any decision tree estimation problem is to correct the tree for overfitting by pruning it back. Conventionally, a tree is grown such that all training observations are correctly classified (i.e., training classification accuracy = 100%). If the training data contain errors, then overfitting the tree to the data in this manner can lead to poor performance on unseen data. Therefore, the tree must be pruned back to reduce classification errors when data outside of the training set are to be classified. Quinlan (1987) and Mingers (1989) describe common methods for pruning trees.
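To make the root/internal/leaf structure concrete, here is a hypothetical analyst-defined tree written as plain Python; the NDVI/NDWI features and the thresholds are invented for illustration only, not taken from the text:

```python
# Hypothetical analyst-defined decision tree; the NDVI/NDWI features and
# the thresholds are invented for illustration, not taken from the text.
def classify_pixel(ndvi, ndwi):
    if ndwi > 0.3:           # root node: splitting test on NDWI
        return "water"       # terminal (leaf) node
    if ndvi > 0.5:           # internal node: splitting test on NDVI
        return "forest"
    if ndvi > 0.2:           # internal node
        return "grassland"
    return "bare soil"       # terminal node

print(classify_pixel(ndvi=0.7, ndwi=0.0))  # forest
```

Each observation is routed down the tree by the sequence of splitting tests until it reaches a leaf, whose label it receives; the explicit structure is what makes decision trees so easy to interpret.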
5. Support Vector Machine
Support vector machines (SVMs) are a set of related learning algorithms used for classification and regression. Like the DT classifiers, SVMs are also non-parametric classifiers. The theory of the SVM was originally proposed by Vapnik and Chervonenkis (1971) and later discussed in detail by Vapnik (1999). The success of the SVM depends on how well the process is trained. The easiest way to train the SVM is by using linearly separable classes. According to Osuna et al. (1997), if the training data with $k$ samples are represented as $\{X_i, y_i\}$, $i = 1, \ldots, k$, where $X_i \in R^N$ is a point in $N$-dimensional space and $y_i \in \{-1, +1\}$ is a class label, then these classes are considered linearly separable if there exists a vector $W$ perpendicular to the linear hyperplane (which determines the direction of the discriminating plane) and a scalar $b$ showing the offset of the discriminating hyperplane from the origin. For the two classes, i.e. class 1 represented as -1 and class 2 represented as +1, two hyperplanes can be used to discriminate the data points in the respective classes (Otukei, Blaschke, 2010). These are expressed as

$$W \cdot X_i + b \geq +1 \quad \text{for } y_i = +1$$
$$W \cdot X_i + b \leq -1 \quad \text{for } y_i = -1$$
The two hyper-planes are selected so as not only to maximize the distance between the two
given classes but also not to include any points between them. The overall goal is to find out in which
class the new data points fall (Otukei, Blaschke, 2010). Overall, the SVMs are reported to produce
results of higher accuracies compared with the traditional approaches but the outcome depends on: the
kernel used, the choice of parameters for the chosen kernel, and the method used to generate the SVM
(Huang et al., 2002).
The inductive principle behind SVM is structural risk minimization (SRM). According to Vapnik (1995), the risk of a learning machine, $R$, is bounded by the sum of the empirical risk estimated from training samples, $R_{emp}$, and a confidence interval. The strategy of SRM is to keep the empirical risk fixed and to minimize the confidence interval, or to maximize the margin between a separating hyperplane and the closest data points. A separating hyperplane refers to a plane in a multi-dimensional space that separates the data samples of two classes. The optimal separating hyperplane is the separating hyperplane that maximizes the margin from the closest data points to the plane.
Let the training data of two separable classes with $k$ samples be represented by $(x_1, y_1), \ldots, (x_k, y_k)$, where $x \in R^n$ is an $n$-dimensional feature vector and $y \in \{+1, -1\}$ is the class label. Suppose the two classes can be separated by two hyperplanes parallel to the optimal hyperplane:

$$w \cdot x_i + b \geq +1 \quad \text{for } y_i = +1, \qquad w \cdot x_i + b \leq -1 \quad \text{for } y_i = -1$$

[Figure: the optimal separating hyperplane between (a) separable samples and (b) non-separable data samples.]
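The separability condition expressed by the two parallel hyperplanes can be checked numerically. In the sketch below, the data, the weight vector $W$ and the offset $b$ are hand-picked for illustration, not produced by an actual SVM optimizer:

```python
import numpy as np

# Synthetic linearly separable samples in a 2-D feature space.
X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5]])
y = np.array([-1, -1, 1, 1])

# Hand-picked candidate hyperplane W.x + b = 0 (not from an SVM solver).
W = np.array([1.0, 1.0])
b = -5.5

def margins(X, y, W, b):
    """y_i (W.x_i + b): when every value is >= 1, the two hyperplanes
    W.x + b = +1 and W.x + b = -1 separate the classes with no points
    falling between them."""
    return y * (X @ W + b)

m = margins(X, y, W, b)
assert (m >= 1).all()  # both classes satisfy the hyperplane conditions
```

Training an actual SVM amounts to searching for the $(W, b)$ that satisfies these constraints while maximizing the margin between the two hyperplanes.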
Training data selection is one of the major factors determining to what degree the classification
rules can be generalized to unseen samples (Paola and Schowengerdt,1995). A previous study showed
that this factor could be more important for obtaining accurate classifications than the selection of
classification algorithms (Hixson et al. 1980).
With data sizes fixed, training pixels can be selected in many ways. A commonly used sampling
method is to identify and label small patches of homogeneous pixels in an image (Campbell 1996).
However, adjacent pixels tend to be spatially correlated or have similar values (Campbell 1981).
Training samples collected this way underestimate the spectral variability of each class and are likely to give degraded classifications (Gong and Howarth 1990). A simple method to minimize the effect of spatial correlation is random sampling (Campbell 1996). Two strategies for this are equal sample rate (ESR), in which a fixed percentage of pixels is randomly sampled from each class as training data, and equal sample size (ESS), in which a fixed number of pixels is randomly sampled from each class as training data (C. Huang et al., 2002). The study performed by C. Huang et al. (2002) shows that for most training cases slightly higher accuracies were achieved when the training samples were selected using the ESR method. Considering that the ESR method has the disadvantage of undersampling or even totally missing rare classes, the sampling rate of very rare classes should be increased when this method is employed.
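The ESR and ESS strategies described above can be sketched as follows; the class map and the sampling settings are illustrative assumptions:

```python
import numpy as np

def sample_esr(labels, rate, rng):
    """Equal sample rate: a fixed percentage of pixels from each class."""
    idx = []
    for c in np.unique(labels):
        pool = np.flatnonzero(labels == c)
        n = max(1, int(round(rate * len(pool))))  # keep at least one rare-class pixel
        idx.append(rng.choice(pool, n, replace=False))
    return np.concatenate(idx)

def sample_ess(labels, n_per_class, rng):
    """Equal sample size: a fixed number of pixels from each class."""
    idx = []
    for c in np.unique(labels):
        pool = np.flatnonzero(labels == c)
        idx.append(rng.choice(pool, min(n_per_class, len(pool)), replace=False))
    return np.concatenate(idx)

# Illustrative class map: class 0 is common, class 1 is rare.
rng = np.random.default_rng(0)
labels = np.array([0] * 900 + [1] * 100)
esr = sample_esr(labels, rate=0.1, rng=rng)        # 90 pixels of class 0, 10 of class 1
ess = sample_ess(labels, n_per_class=50, rng=rng)  # 50 pixels of each class
```

The contrast is visible in the rare class: ESR yields only 10 of its pixels, while ESS draws 50, which is why the text recommends raising the sampling rate of very rare classes when ESR is used.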
Furthermore, the minimum number of samples for adequately training an algorithm may depend
on the algorithm concerned, the number of input variables, the method used to select the training
samples, and the size and spatial variability of the study area (C. Huang et al., 2002).
Also the kernel function plays a major role in locating complex decision boundaries between
classes. By mapping the input data into a high-dimensional space, the kernel function converts non-
linear boundaries in the original data space into linear ones in the high-dimensional space, which can
then be located using an optimization algorithm. Therefore the selection of kernel function and
appropriate values for corresponding kernel parameters, referred to as kernel configuration, may affect
the performance of the SVM (C. Huang et al., 2002). The parameter to be predefined for using the
polynomial kernels is the polynomial order p. Rapid increases in computing time as p increases limited
experiments with higher p values. In general, the linear kernel ( p=1) performed worse than nonlinear
kernels, which is expected because boundaries between many classes are more likely to be non-linear.
Previous studies suggest that the polynomial order p has different impacts on kernel performance when different numbers of input variables are used (C. Huang et al., 2002). With large numbers of input variables, complex nonlinear decision boundaries can still be mapped into linear ones using relatively low-order polynomial kernels. However, if a data set has only a few variables, it is necessary to try high-order polynomial kernels in order to achieve optimal performance with a polynomial SVM (C. Huang et al., 2002).
Summarizing, the training of the SVM is affected by training data size, kernel parameter setting
and class separability. Generally, when the training data size is doubled, the training time would be
more than doubled. Training the SVM to classify two highly mixed classes could take several times
longer than training it to classify two separable classes. As expected, increases in training data size
generally led to improved performances (C. Huang et al., 2002).
6. Comparison of SVM, MLC and DT performance
In C. Huang et al. (2002), the SVM was compared to three other popular classifiers, including
the maximum likelihood classifier (MLC), neural network classifiers (NNC) and decision tree
classifiers (DTC). Below, some conclusions of this comparison are cited, excluding the NNC analysis.
The MLC had lower accuracies than the non-parametric algorithms. The SVM was more
accurate than DTC in 22 out of 24 training cases. The higher accuracies of the SVM should be
attributed to its ability to locate an optimal separating hyperplane. Statistically, the optimal separating
hyperplane found by the SVM algorithm should be generalized to unseen samples with fewer errors
than any other separating hyperplane that might be found by other classifiers. However, the SVM had less success in transforming non-linear class boundaries in a very low-dimensional space into linear ones in a high-dimensional space.
In terms of algorithm stability, the SVM gave more stable overall accuracies than the other three
algorithms except when trained using 6% pixels with three variables. Of the other three algorithms,
DTC gave slightly more stable overall accuracies than the MLC, whose accuracies varied over wide ranges. In terms of training speed, the MLC and DTC were much faster than the SVM. The SVM
was affected by training data size, kernel parameter setting and class separability.
The training speeds of the classifiers were substantially different. In all training cases, training the MLC and DTC did not take more than a few minutes on a SUN Ultra 2 workstation, while training the SVM took from hours to days.
7. Fuzzy Classification
Fuzzy systems are an alternative to classical notions of set membership and logic that have their origins in ancient Greek philosophy (Brule, 1985).
7.1 Fuzzy sets and fuzzy logic theory
The fuzzy set framework introduces vagueness, with the aim of reducing complexity, by
eliminating the sharp boundary dividing the members of a class from non-members. In some
situations, these sharp boundaries may be arbitrary, or powerless, as they cannot capture the semantic
flexibility inherent in complex categories. The grades of membership correspond to the degree of
compatibility with the concepts represented by the class concerned: the direct evaluation of grades
with adequate measures is a significant stage for subsequent decision- making processes (McBratney et
al., 1997).
In a formal definition of a fuzzy set, it is presupposed that X = {x} is a finite set (or space) of points, which could be elements, objects or properties; a fuzzy subset A of X is defined by a membership function $\mu_A(x)$ in the ordered pairs:

$$A = \{(x, \mu_A(x))\}, \quad x \in X \qquad (1)$$

In plain language, a fuzzy subset is defined by the membership function defining the membership grades of fuzzy objects in the ordered pairs consisting of the objects and their membership grades. The relation $\mu_A(x)$ is therefore termed a membership function (MF) defining the grade of membership of x (the object) in A, and $x \in X$ indicates that x is an object of, or is contained in, X. For all A, $\mu_A(x)$ takes on values between and including 0 and 1. In practice, X = {x1, x2, . . . , xn} and Eq. (1) is written as:

$$A = \mu_A(x_1)/x_1 + \mu_A(x_2)/x_2 + \ldots + \mu_A(x_n)/x_n$$

where the + is used in the set-theoretic sense. If $\mu_A(x_i) = 0$, then $x_i$ is omitted (McBratney et al., 1997).
A fuzzy membership function (FMF) is thus an expression defining the grade of membership of x in A. In other words, it is a function that maps the fuzzy subset A to a membership value between and including 0 and 1. In contrast to the characteristic function in conventional set theory, which implies that an individual object either belongs to a subset or does not at all, $\chi_{A'}: X \rightarrow \{0, 1\}$ where A' is the non-fuzzy equivalent of fuzzy subset A, the FMF of x in A is expressed as:

$$\mu_A: X \rightarrow [0, 1]$$

which associates with each element $x \in X$ its grade of membership $\mu_A(x) \in [0, 1]$. Thus $\mu_A(x) = 0$ means that x does not belong to the subset A, $\mu_A(x) = 1$ indicates that x fully belongs, and $0 < \mu_A(x) < 1$ means that x belongs to some degree; partial membership is therefore possible (McBratney et al., 1997).
Fuzzy logic therefore uses 'soft' linguistic variables (e.g. deep, sandy, steep, etc.) which are defined by a continuous range of truth values, or FMFs, in the interval [0,1], instead of the strict binary (TRUE or FALSE) decisions and assignments, as is the case with Boolean logic. The linguistic input-output associations, when combined with an inference procedure, constitute fuzzy rule-based systems (McBratney et al., 1997).
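A membership function for a 'soft' linguistic variable such as deep can be sketched as a simple trapezoid; the depth breakpoints below are hypothetical, chosen only to illustrate partial membership:

```python
def trapezoidal_mf(x, a, b, c, d):
    """Trapezoidal fuzzy membership function: 0 outside [a, d],
    1 on [b, c], with linear ramps in between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# 'deep' as a soft linguistic variable for water depth in metres
# (the breakpoints are hypothetical).
def deep(depth):
    return trapezoidal_mf(depth, 2.0, 5.0, 50.0, 60.0)

print(deep(1.0), deep(3.5), deep(10.0))  # 0.0 0.5 1.0
```

A depth of 3.5 m belongs to deep with grade 0.5: neither fully in nor fully out, which is exactly the partial membership a crisp characteristic function cannot express.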
7.2 Soft classifiers
Rule-based expert systems are often applied to classification problems in various application
fields, like fault detection, biology, and medicine. Fuzzy logic can improve such classification and
decision support systems by using fuzzy sets to define overlapping class definitions. The application of
fuzzy if- then rules also improves the interpretability of the results and provides more insight into the
classifier structure and decision making process (Roubos J.A. et al., 2003).
The automated construction of fuzzy classification rules from data has been approached by
different techniques. In Roubos J.A. et al. (2003), different examples of various approaches are listed,
like neuro-fuzzy methods in D. Nauck, R. Kruse, (1999) and S. Mitra, Y. Hayashi (2000), genetic-
algorithm based rule selection in H. Ishibuchi, T. Nakashima (1999), and fuzzy clustering in
combination with other methods such as fuzzy relations in M. Setnes, R. Babuska (1999) and genetic
algorithm (GA) optimization in M. Setnes, J.A. Roubos (2000).
In the field of classification problems, we often encounter classes with very different proportions of patterns: some classes with a high percentage of patterns and others with a low percentage. These problems are called "classification problems with imbalanced data-sets". This occurs when the number of instances of one class is much lower than the number of instances of the other classes. Most classifiers generally perform poorly on imbalanced data-sets because they are designed to minimize the global error rate, and in this manner they tend to classify almost all instances as negative (i.e., as the majority class). Here resides the main problem for imbalanced data-sets, because the minority class may be the most important one, since it can define the concept of interest, while the other class(es) represent(s) the counterpart of that concept. For this type of data-set, fuzzy classification methods have been proposed to improve the results, for example in Fernández et al. (2007).
In addition, various approaches may be used to derive a soft classifier. These approaches are based on specific uncertainty representation frameworks, such as the fuzzy set theory explained before. In addition to the use of specific representation frameworks, the output of "hard" classifiers, such as the maximum likelihood classifier and the multilayer perceptron, can be softened to derive measures of the strength of class membership (Schowengerdt, 1996; Wilkinson, 1996).
The most common solutions adopt a fuzzy set framework (Pedrycz, 1990; Binaghi and
Rampini, 1993; Ishibuchi et al., 1993). The apparatus of the fuzzy set theory serves as a natural
framework for modeling the gradual transition from membership to non-membership in intrinsically
vague classes.
Finally, after the production of soft results, a hardening process is sometimes performed to obtain final crisp assignments to classes. This is done by applying appropriate ranking procedures and decision rules based on the inherent uncertainty and the total amount of information dormant within the data.
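A common hardening rule is the maximum-membership decision; as a minimal sketch with made-up membership grades:

```python
import numpy as np

# Made-up soft output: per-pixel membership grades for three classes.
memberships = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.15, 0.75],
    [0.40, 0.35, 0.25],
])

# Hardening by the maximum-membership rule: each pixel gets the class
# with the highest grade, yielding a crisp (hard) classification.
hard = memberships.argmax(axis=1)
print(hard)  # [0 2 0]
```

Note how the third pixel, with grades 0.40/0.35/0.25, is forced into a single class even though its memberships are nearly tied: exactly the information that is discarded when soft results are hardened.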
However, despite the sizable achievements obtained, the use of soft classifiers is still limited by
the lack of well-assessed and adequate methods for the evaluation of the accuracy of their outputs, an
element of primary concern, which must be considered an integral part of the overall classification
procedure. In Binaghi et al. (1999), a method for assessment of soft classification is proposed.
8. Land cover change detection
Change detection is the process of identifying differences in the state of an object, a surface, or
a process by observing it at different times. Methods of change detection in remote sensing typically
analyze sequential images of the same area, and involve the detection and display of the change in the
image space (Abuelgasim A. A. et al., 1999).
The underlying assumption in using remotely sensed data for change detection is that changes
in the land-cover result in significant differences in the remote sensing measurements between two or
more dates. In addition, these differences must be larger or somehow distinguishable from other
changes in the images due to changing atmospheric conditions, seasons, illumination conditions, and
sensor calibration (Abuelgasim A. A. et al., 1999).
For these kinds of studies, classification algorithms are used to identify the changes in classified classes. Multiple options are available, and the selection of approaches and resources depends on the purposes of the study. While some comparisons of algorithm performance have been published, there are no generally accepted criteria for selecting the most appropriate classification algorithm for a given set of circumstances (DeFries and Cheung-Wai Chan, 2000).
A simple taxonomy of land cover change might start with separation of land cover changes that
are continuous versus categorical. In continuous land cover changes, there is a change in the amount or
concentration of some attribute of the landscape that can be continuously measured (Abuelgasim A. A.
et al., 1999). An example might be change in a forest attribute like forest cover or basal area or leaf
area index. In this context, the goal of change detection would be to measure the degree of change in an
amount or concentration through time (Abuelgasim A. A. et al., 1999).
The second type of change is categorical, in which the changes in time are between land cover
or land use categories. A simple example in this context is deforestation, in which areas that were once forest are no longer forested. Urbanization, expansion of agriculture, or reforestation are other examples (Abuelgasim A. A. et al., 1999).
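A categorical (post-classification) change analysis reduces to cross-tabulating the two class maps; the tiny maps below are invented for illustration:

```python
import numpy as np

# Invented classified maps of the same pixels at two dates
# (0 = forest, 1 = agriculture).
date1 = np.array([0, 0, 0, 1, 1, 0])
date2 = np.array([0, 1, 1, 1, 1, 0])

# Cross-tabulate the two maps: rows are the date-1 class, columns the
# date-2 class; off-diagonal cells count pixels that changed category.
n_classes = 2
change_matrix = np.zeros((n_classes, n_classes), dtype=int)
for c1, c2 in zip(date1, date2):
    change_matrix[c1, c2] += 1

print(change_matrix)  # the cell [0, 1] counts forest-to-agriculture pixels
```

Here the off-diagonal cell in the forest row counts the deforestation-style conversions, while the diagonal cells hold the pixels whose category did not change.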
The second type of change determination is sensitive to slight quantitative changes in the input
vector introduced by noise that lead to changes in the most likely class. The problem is particularly
acute for change determinations between classes that are spectrally similar. The approach that has been
used in recent literature involves examining the output signal and a comparison of the magnitude of
likelihood values or fuzzy membership values between classes (Abuelgasim A. A. et al., 1999). Using
maximum likelihood, this approach involves examination of the distribution of likelihood values for the
various classes (Foody et al., 1992).
Techniques for extracting land cover information are very difficult to generalize, but the effort is to automate them to the degree possible in order to process these large volumes of data. In addition, the
techniques need to be objective, reproducible, and feasible to implement within available resources. For
example, there are a lot of international efforts to characterize the extent of forest cover globally from
satellite data at repeated intervals over time. This task can only realistically be achieved through
techniques that minimize time-consuming human interpretation and maximize automated procedures
for data analysis (DeFries and Cheung-Wai Chan, 2000).
Comparing the classifiers in terms of performance, all of them are affected by the selection of
training samples. The initial trend of improved classification accuracy for all classifiers as training
data size increased underlines the need for adequate training samples in land cover classification.
Feature selection is another factor affecting classification accuracy: substantial increases in accuracy
are normally achieved when as much information as possible is used in deriving the land cover
classification from satellite images.
The literature on this kind of study contains a vast number of publications describing very
different techniques; see, for example, DeFries and Cheung-Wai Chan (2000), DeFries et al. (1998),
and Rogan et al. (2002). For fuzzy techniques applied to this topic, see Lizarazo (2012), Fisher (2010),
and Gopal et al. (1999). For implementation strategy and more information about approaches to these
complex phenomena, see Lambin et al. (1999). For classification accuracy assessment, see Foody (2002).
9.Conclusions
Remote sensing classification is a very broad topic, and many techniques are available in the
different software packages. The choice among them depends on the analyst and on the specific
purpose of the study. Still, some general observations can be made about the main problems that
have to be solved, mostly regarding training areas and training the algorithms.
SVMs are a very promising classification approach, but they are slower and more difficult to
train to good results. Simpler algorithms such as MLC are very useful and make it easier to obtain
results, but they have problems with data that are not normally distributed. DTs are widely used in
land cover change studies, and a large body of literature reports that they are very useful for this purpose.
To take advantage of their respective strengths and weaknesses, a combination of techniques is,
in my perspective, a good approach. More complex analyses can be made with fuzzy systems, which
can be a good second step in understanding these complex processes.
10.References
Abuelgasim, A. A., Ross, W. D., Gopal, S., Woodcock, C. E. 1999. Change Detection Using
Adaptive Fuzzy Neural Networks: Environmental Damage Assessment after the Gulf War. Remote
Sensing of Environment 70, 208–223.
Binaghi, E., Rampini, A. 1993. Fuzzy decision making in the classification of multisource
remote sensing data. Optical Engineering 6, 1193–1203.
Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J. 1984. Classification and Regression
Trees. Belmont, CA: Wadsworth International Group, 358 pages.
Brodley, C. E., Utgoff, P. E. 1992. Multivariate versus univariate decision trees. Technical
Report 92-8. University of Massachusetts, Amherst, MA, USA.
Brule, F. J. 1985. Fuzzy systems: a tutorial. Internet newsgroup comp.ai;
http://www.quadralay.com/www/Fuzzy/tutorial.html.
Campbell, J. B. 1981. Spatial correlation effects upon accuracy of supervised classification
of land cover. Photogrammetric Engineering and Remote Sensing 47, 355–363.
Campbell, J. B. 1996. Introduction to Remote Sensing. New York: The Guilford Press.
Carpenter, G., Gopal, S., Martens, S., Woodcock, C. 1999. Evaluation of mixture
estimation methods for vegetation mapping. Technical Report CAS/CNS-97-014, Boston
University. Remote Sensing of Environment, in press.
DeFries, R. S., Hansen, M., Townshend, J. R. G., Sohlberg, R. 1998. Global land cover
classifications at 8 km spatial resolution: the use of training data derived from Landsat imagery in
decision tree classifiers. International Journal of Remote Sensing 19(16), 3141–3168.
DeFries, R. S., Cheung-Wai Chan, J. 2000. Multiple Criteria for Evaluating
Machine Learning Algorithms for Land Cover Classification from Satellite Data. Remote Sensing
of Environment 74, 503–515.
Erdas Inc. 1999. Erdas Field Guide. Erdas Inc., Atlanta, Georgia.
Fernández, A., García, S., del Jesus, M. J., Herrera, F. 2007. A study of the behaviour of
linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets.
Available online at www.sciencedirect.com.
Fisher, P. F. 2010. Remote sensing of land cover classes as type 2 fuzzy sets. Remote Sensing
of Environment 114, 309–321.
Foody, G. M., Campbell, N. A., Trodd, N. M., Wood, T. F. 1992. Derivation and applications
of probabilistic measures of class membership from maximum likelihood classification.
Photogrammetric Engineering and Remote Sensing 58, 1335–1341.
Foody, G. M. 2002. Status of land cover classification accuracy assessment. Remote Sensing
of Environment 80, 185–201.
Friedl, M. A., Brodley, C. E. 1997. Decision Tree Classification of Land Cover From Remotely
Sensed Data. Remote Sensing of Environment 61, 399–409.
Gong, P., Howarth, P. J. 1990. An assessment of some factors influencing multispectral
land-cover classification. Photogrammetric Engineering and Remote Sensing 56, 597–603.
Gopal, S., Woodcock, C. E., Strahler, A. H. 1999. Fuzzy Neural Network Classification of
Global Land Cover from a 1° AVHRR Data Set. Remote Sensing of Environment 67, 230–243.
Hansen, M., Dubayah, R., DeFries, R. 1996. Classification trees: An alternative to
traditional land cover classifiers. International Journal of Remote Sensing 17, 1075–1081.
Hixson, M., Scholz, D., Fuhs, N., Akiyama, T. 1980. Evaluation of several schemes for
classification of remotely sensed data. Photogrammetric Engineering and Remote Sensing 46,
1547–1553.
Huang, C., Davis, L. S., Townshend, J. R. G. 2002. An assessment of support vector machines
for land cover classification. International Journal of Remote Sensing 23, 725–749.
Ishibuchi, H., Nakashima, T. 1999. Voting in fuzzy rule-based systems for pattern classification
problems. Fuzzy Sets and Systems 103, 223–238.
Ishibuchi, H., Nozaki, K., Tanaka, H. 1993. Efficient fuzzy partition of pattern space for
classification problems. Fuzzy Sets and Systems 59, 295–304.
Lizarazo, I. 2012. Quantitative land cover change analysis using fuzzy segmentation.
International Journal of Applied Earth Observation and Geoinformation 15, 16–27.
Lu, D., Weng, Q. 2007. A survey of image classification methods and techniques for improving
classification performance. International Journal of Remote Sensing 28(5), 823–870.
McBratney, A. B., Odeh, I. O. A. 1997. Application of fuzzy sets in soil science: fuzzy logic,
fuzzy measurements and fuzzy decisions. Geoderma 77, 85–113.
Mingers, J. 1989. An empirical comparison of pruning methods for decision tree induction.
Machine Learning 4, 227–243.
Mitra, S., Hayashi, Y. 2000. Neuro-fuzzy rule generation: Survey in soft computing framework.
IEEE Transactions on Neural Networks 11, 748–768.
Nauck, D., Kruse, R. 1999. Obtaining interpretable fuzzy classification rules from medical data.
Artificial Intelligence in Medicine 16, 149–169.
Osuna, E. E., Freund, R., et al. 1997. Support Vector Machines: Training and Applications. A.I.
Memo No. 1602, C.B.C.L. Paper No. 144. Massachusetts Institute of Technology, Artificial
Intelligence Laboratory, Massachusetts.
Otukei, J. R., Blaschke, T. 2010. Land cover change assessment using decision trees, support
vector machines and maximum likelihood classification algorithms. International Journal of Applied
Earth Observation and Geoinformation 12S, S27–S31.
Pal, M., Mather, P. M. 2003. An assessment of the effectiveness of the decision tree
method for land cover classification. Remote Sensing of Environment 86, 554–565.
Paola, J. D., Schowengerdt, R. A. 1995. A Detailed Comparison of Backpropagation Neural
Network and Maximum-Likelihood Classifiers for Urban Land Use Classification. IEEE Transactions
on Geoscience and Remote Sensing 33(4).
Paola, J. D., Schowengerdt, R. A. 1995. A review and analysis of backpropagation neural
networks for classification of remotely sensed multi-spectral imagery. International Journal of
Remote Sensing 16, 3033–3058.
Pedrycz, W. 1990. Fuzzy sets in pattern recognition: methodology and methods. Pattern
Recognition 23, 121–146.
Quinlan, J. R. 1987. Simplifying decision trees. International Journal of Man-Machine Studies
27, 221–234.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers,
San Mateo.
Rogan, J., Franklin, J., Roberts, D. A. 2002. A comparison of methods for monitoring
multitemporal vegetation change using Thematic Mapper imagery. Remote Sensing of Environment
80, 143–156.
Roubos, J. A., Setnes, M., Abonyi, J. 2003. Learning fuzzy classification rules from labeled
data. Information Sciences 150, 77–93.
Schowengerdt, R. A. 1996. On the estimation of spatial-spectral mixing with classifier
likelihood functions. Pattern Recognition Letters 17, 1379–1387.
Setnes, M., Roubos, J. A. 2000. GA-fuzzy modeling and classification: Complexity and
performance. IEEE Transactions on Fuzzy Systems 8, 509–522.
Setnes, M., Babuska, R. 1999. Fuzzy relational classifier trained by fuzzy clustering. IEEE
Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 29, 619–625.
Vapnik, V. N. 1999. An overview of statistical learning theory. IEEE Transactions on Neural
Networks 10, 988–999.
Vapnik, V. N., Chervonenkis, A. Y. 1971. On the uniform convergence of the relative
frequencies of events to their probabilities. Theory of Probability and its Applications 17, 264–280.
Wilkinson, G. G. 1996. Classification algorithms: where next? In: Binaghi, E., Brivio, P. A.,
Rampini, A. (Eds.), Soft Computing in Remote Sensing Data Analysis. Series in Remote Sensing,
Vol. 1, pp. 93–100.