Extended adaptive neuro-fuzzy inference systems
Chun Yin Lau, University of Wollongong
Lau, Chun Yin, Extended adaptive neuro-fuzzy inference systems, PhD thesis, School of Information Technology and Computer Science, University of Wollongong, 2006. http://ro.uow.edu.au/theses/564
This thesis is posted at Research Online: http://ro.uow.edu.au/theses/564
Extended Adaptive Neuro-Fuzzy Inference Systems
A thesis submitted in fulfillment of the requirements for the
award of the degree:
Doctor of Philosophy
UNIVERSITY OF WOLLONGONG
Lau Chun Yin, Dip.CompSc., B.CompSc., M.CompSc.
Faculty of Informatics
School of Information Technology & Computer Science
2006
Certification
I, Lau Chun Yin, declare that this thesis, submitted in partial fulfillment of the
requirements for the award of the degree of Doctor of Philosophy, in the School of
Information Technology & Computer Science, Faculty of Informatics, University of
Wollongong, is wholly my own work unless otherwise referenced or acknowledged.
This document has not been submitted for qualifications at any other academic institution.
Lau Chun Yin
3rd October 2006
Acknowledgements
Pursuing a PhD degree has been an unforgettable experience in my life. After I received a Master of Computer Science degree at the University of Wollongong, I worked at Chu Hai College of Higher Education in Hong Kong. The College is in the process of upgrading its standing to become a university. The accreditation process by the Hong Kong Council for Academic Accreditation provided the impetus for the College to upgrade the qualifications of its staff, and its facilities, accordingly. I am one of the staff members who participated in the staff development program to study for a PhD degree.
I would like to thank my wife Freda and my family for their full support. I would also like to express my sincere gratitude to my supervisor, Prof. Ah Chung Tsoi, formerly of the University of Wollongong and now at Monash University, and my co-supervisor, Prof. Tommy Chow Wai Shing of the City University of Hong Kong, for their profound knowledge of my research area, and for their inspiration and expert guidance. Finally, I would like to thank Dr. Kong Yau Pak of Chu Hai College, and Prof. Ah Chung Tsoi again, for providing me with the opportunity to study for a PhD degree at the University of Wollongong.
Abstract
This thesis presents a novel extension to the Adaptive Neuro-Fuzzy Inference System (ANFIS), which we call the extended ANFIS (EANFIS). The extension includes: the introduction of an output-class-based membership function architecture, in which each output class in a discrete output situation has its own membership function (in the case of a continuous output, there is only one class); the possibility of determining the structure of the rule base from the underlying structure of the input variables; the determination of a possibly non-symmetric membership function whose parameters can be determined automatically from the given input variables; and the possibility of incorporating global information on the input variables, through a Linear Discriminant Analysis, in combination with the local input variable structure as represented by the membership functions. The possibility of determining the structure of the rule section before the training process commences means that the proposed EANFIS architecture can be applied to possibly large-scale practical problems, as it does not require the formation of all possible combinations of rules before the training process commences. In other words, the EANFIS architecture, together with its structure-determining procedures, overcomes the current limitation facing the ANFIS architecture when it is applied to systems with a large number of inputs. The possibility of determining a membership function from the input variables means the user no longer needs to select a membership function from a set of candidate membership functions. The possibility of incorporating global information on the input variables, in addition to the local information, means that the EANFIS architecture can take advantage of situations in which such global information is useful in improving the performance of the neuro-fuzzy system. The new EANFIS architecture is evaluated on a number of standard benchmark problems, and has been found to have superior performance. In addition, as this is an EANFIS, rules can be extracted from the trained system, thus providing information on the way in which the underlying system operates. The proposed EANFIS recommends itself readily for applications in practical systems.
Abbreviations
ANFIS Adaptive Network Based Fuzzy Inference System
ANN Artificial Neural Network
COA Centroid Of Area
FIS Fuzzy Inference System
FLD Fisher Linear Discriminant
LDA Linear Discriminant Analysis
MLP Multilayer Perceptron
N2Lmap Nonlinear to Linear mapping
NOAA National Oceanic and Atmospheric Administration
PCA Principal Component Analysis
QRcp QR factorization with column pivoting
RBFN Radial Basis Function Network
RMS Root Mean Square
SOM Self Organizing Map
SVD Singular Value Decomposition
SVM Support Vector Machines
TSK Takagi-Sugeno-Kang Fuzzy Model
Notation
$[\,]$: continuous set
$\{\}$: discrete set
plain text: a plain typeface indicates a scalar variable
bold: a bold face indicates a vector
BOLD: a bold face with capital letters indicates a matrix
CAPITAL: a capital letter indicates the upper bound of a variable
$\alpha_{rd}$: weight in the Takagi-Sugeno-Kang (TSK) fuzzy model of rule $r$ in dimension $d$
$\delta$: a desired value
$\Delta$: a small constant
$\eta$: a learning constant
$\theta_r$: output from the $r$-th Linear Discriminant Analysis (LDA) node
$\pi_r$: output from the probability layer of rule $r$
$\bar{\pi}_r$: normalised output from the probability layer of rule $r$
$\varphi_r$: output from the rule layer of rule $r$
$\bar{\varphi}_r$: normalised output from the rule layer of rule $r$
$\tau$: output cluster type, where $\tau \in \{1, \ldots, T\}$; $T$ is the upper bound of $\tau$
$\sigma$: the spread of a Gaussian function
$\phi$: an activation function
$B_r$: the $r$-th basis function of the Radial Basis Function Network (RBFN)
$\mathbf{c}_r$: membership function centre of rule $r$, where $\mathbf{c}_r \in \mathbb{R}^D$
$d$: data dimension, where $d \in \{1, \ldots, D\}$; $D$ is the upper bound of $d$
$e$: an error value
$g$: index of a grid point, where $g \in \{1, \ldots, G\}$; $G$ is the upper bound of $g$
$i$: index of an input vector, where $i \in \{1, \ldots, I\}$; $I$ is the upper bound of $i$
$r$: index of a fuzzy rule, where $r \in \{1, \ldots, R\}$; $R$ is the upper bound of $r$
$\mathbf{x}_i$: the $i$-th input vector, where $\mathbf{x}_i \in \mathbb{R}^D$
$w_r$: a weight attached to rule $r$
$y_i$: the system output value of the $i$-th input pair
$z_{gd}$: the $g$-th nonlinear grid centre in dimension $d$
Contents
Certification
Acknowledgements
Abstract
Abbreviations
Notation
1 Introduction
1.1 Introduction
1.2 Neuro-fuzzy systems
1.3 Objectives
1.4 The contribution of this thesis
1.5 Structure of the thesis
2 Neuro-Fuzzy Systems
2.1 Background on Fuzzy concepts
2.2 Fuzzy rules
2.2.1 Reasoning with fuzzy rules
2.3 Fuzzy inference System
2.4 Neuro-Fuzzy Inference System
2.5 Adaptive Neuro Fuzzy Inference System (ANFIS)
2.5.1 A Feed-Forward Network
2.5.2 Network Training
2.5.3 Network Pruning
2.6 Radial Basis Function Networks (RBFN)
2.7 Nonlinear Approximation Method proposed by Schilling et al.
2.8 Kohonen Self-Organising Map (SOM)
3 Non-uniform Grid Construction in a Radial Basis Function Neural Network
3.1 Motivation
3.2 Method of obtaining non-linear grid points
3.3 Application Examples
3.3.1 Van der Pol Oscillator
3.3.2 Currency exchange rate between the US Dollar and the Indonesian Rupiah
3.3.3 Sunspot Cycle Time Series
3.3.4 Experiments with the Iris Dataset
3.4 Conclusions
4 Extended Adaptive Neuro-Fuzzy Inference Systems
4.1 Motivation
4.2 Introduction to Extended Adaptive Neuro-Fuzzy Inference System
4.3 Architecture of the Extended Adaptive Neuro-Fuzzy Inference System
4.3.1 Remarks
4.4 Structure determination of the proposed neuro-fuzzy architecture
4.4.1 A proposed algorithm for rule formation
4.5 Determination of candidate membership function and the required number of membership functions in each input variable
4.6 Parameter estimation
4.6.1 Error Correction Layer (Layer 4) parameter learning
4.6.2 Output layer parameter learning
4.6.3 Summary
4.7 Experimental Results
4.7.1 Exclusive OR problem
4.7.2 Sunspot Cycle Time Series
4.7.3 Mackey-Glass Time Series example
4.7.4 Iris Dataset
4.7.5 Wisconsin Breast Cancer example
4.7.6 Inverted Pendulum on a cart
4.8 Conclusions
5 Combining Local and Global Input Structures for the Extended Adaptive Neuro-Fuzzy Inference System
5.1 Motivation
5.2 Introduction
5.3 Possible architectures for combining the local and global methods
5.4 Possible Global methods
5.4.1 Principal Component Analysis (PCA)
5.4.2 Linear Discriminant Analysis
5.4.3 Selection of the combined architecture
5.5 Application Examples
5.5.1 Sunspot Cycle
5.5.2 Iris Dataset
5.5.3 Wisconsin Breast Cancer
5.6 Conclusions
6 Conclusions and Recommendations
6.1 Conclusions
6.2 Future areas of research
References
Appendix
A Network Training Algorithm
A.1 RBFN Network Training Algorithm
A.2 ANFIS Network Training Algorithm
A.3 Gradient determination in EANFIS Layer 4 for continuous output values
A.4 Gradient determination in EANFIS Layer 4 for discrete output types
B An example of Linear Discriminant Analysis transformation
List of Figures
2.1 A block diagram of a Fuzzy Inference System [12].
2.2 Mamdani fuzzy inference system using min and max operators [12].
2.3 An alternative Mamdani fuzzy inference system using product and max operators [12].
2.4 Tsukamoto fuzzy inference system [12].
2.5 Sugeno fuzzy inference system [12].
2.6 Architecture of a Neuro Fuzzy System.
2.7 Architecture of ANFIS.
2.8 Architecture of RBFN.
2.9 An example of the distribution of Gaussian function centers in a uniformly distributed fashion.
2.10 An example of a non-uniformly distributed scheme of Gaussian function centers.
2.11 The pseudo-code implementation of Schilling et al.'s mapping function [39].
2.12 Block diagram of Schilling et al.'s mapping method.
2.13 A flow chart showing the implementation of a non-linear grid in a Radial Basis Function Neural Network.
2.14 SOM Mexican hat update function.
3.1 A pseudocode representation of the proposed turning point detection algorithm.
3.2 Uniform grid point distribution in the d-th dimension of a given signal.
3.3 Updating algorithm for finding the set of non-linear grid points.
3.4 A diagram illustrating a triangular function on uniformly distributed grid points.
3.5 A diagram illustrating the determination of the centers and spreads of a non-uniformly distributed set of grid points.
3.6 The magnitude update using triangular functions in the grid point location updating algorithm.
3.7 The magnitude update using Gaussian functions in the grid point location updating algorithm.
3.8 The magnitude of the update using one grid point on either side of the function center.
3.9 The magnitude of the update in the updating algorithm using two grid points on either side of grid function center b.
3.10 Performance comparisons of RBFN using three different regimes: linear grid, nonlinear grid, and the transformation of the nonlinear grid to the linear grid (N2Lmap) method.
3.11 The set of turning points superimposed on the original signal for the van der Pol equation.
3.12 The distribution of the set of grid points. The upper graph shows the distribution using a linear grid, while the lower graph shows the location of the grid points using a nonlinear grid regime. The total number of grid points used is 15.
3.13 The actual output of a RBFN using 15 grid points in a linear grid regime. It is observed that the output is significantly different from the original output of the van der Pol equation.
3.14 The differences in the output of the van der Pol equation and the reconstructed one using a RBFN with 12 grid points in a linear grid regime.
3.15 The output of a RBFN using 15 grid points and a nonlinear grid regime.
3.16 The output differences between the original signal from the van der Pol equation and a reconstructed signal using a RBFN with 15 grid points in a nonlinear grid regime.
3.17 The output of a RBFN with 15 grid points using a nonlinear grid regime. In this case, we use the nonlinear grid mapped onto a linear grid using the method proposed by Schilling et al. [39].
3.18 The output differences between the original signal and the reconstructed signal using a nonlinear to linear grid mapping as proposed by Schilling et al. [39]. The number of grid points used is 15.
3.19 The distribution of the grid points. The upper graph shows the linear grid point distribution, while the lower graph shows the distribution using a nonlinear grid. The number of grid points used is 40.
3.20 The output of a RBFN with 40 grid points using a linear grid regime.
3.21 The output differences between the original signal and the reconstruction using a RBFN in a linear grid regime with 40 grid points.
3.22 The output of a RBFN using a nonlinear grid regime with 40 grid points.
3.23 The output differences between the original signal and the reconstruction using a RBFN with a nonlinear grid regime using 40 grid points.
3.24 The output of a RBFN using a nonlinear grid mapped onto a linear grid with 40 grid points.
3.25 The output differences between the original signal and a reconstructed signal using a RBFN with a nonlinear grid mapped onto a linear grid using 40 grid points.
3.26 The currency exchange time series between the US dollar and the Indonesian Rupiah, between 1st January, 1994 and 31st December, 1999. Note that the vertical axis of this graph is normalised, with 0 denoting 1 USD to 2,160 IDR and the maximum 1 denoting 1 USD to 16,475 IDR.
3.27 The variation of the root mean square error values as a function of the number of grid points used.
3.28 The set of turning points in the time series of USD to IDR. Note that we have connected the points so as to make it easier to discern where the turning points are.
3.29 The distribution of the grid points. The upper graph shows the distribution of the linear grid points, while the lower graph shows the distribution of the nonlinear grid points.
3.30 The output of a RBFN using 25 grid points with a linear grid regime.
3.31 The output differences between the original signal and the reconstruction using a RBFN with 25 grid points in a linear grid regime.
3.32 The output of a RBFN using 25 grid points with a nonlinear grid regime.
3.33 The output differences between the original signal and the reconstruction using a RBFN with 25 grid points in a nonlinear grid regime.
3.34 The output of a RBFN using 25 grid points but with a mapping from the nonlinear grid to a linear grid.
3.35 The output differences between the original signal and the reconstruction using 25 grid points with a mapping from the nonlinear grid to a linear grid.
3.36 The distribution of the grid points. The upper graph shows the distribution of the linear grid points, while the lower graph shows the distribution of the nonlinear grid points. The number of grid points used is 100.
3.37 The output of a RBFN using 100 grid points with a linear grid regime.
3.38 The output differences between the original signal and the one reconstructed from a RBFN using 100 grid points with a linear grid regime.
3.39 The output of a RBFN using a nonlinear grid regime with 100 grid points.
3.40 The output differences between the original signal and the one reconstructed from a RBFN with a nonlinear grid regime using 100 grid points.
3.41 The output of a RBFN using a mapping from the nonlinear grid to a linear grid regime using 100 grid points.
3.42 The output differences between the original signal and the one reconstructed from a RBFN with a mapping from the nonlinear grid to a linear grid regime using 100 grid points.
3.43 The monthly average sunspot number time series from January 1749 to July 2004. The x-axis is normalised to lie between 0 and 1; similarly, the y-axis is also normalised to lie between 0 and 1.
3.44 The set of turning points for the NOAA sunspot number time series.
3.45 The variation of the RMS error values as a function of the number of grid points.
3.46 Grid point distribution using 50 grid points. The upper graph shows the distribution of the linear grid points, while the lower graph shows the distribution of the nonlinear grid points.
3.47 The output and the differences between the original signal and the reconstructed one using a RBFN with 50 linear grid points.
3.48 The output and the differences between the original signal and the reconstructed one using a RBFN with a nonlinear grid regime using 50 grid points.
3.49 The output and the differences between the original signal and the reconstructed one using a RBFN with a mapping from the nonlinear grid to a linear grid regime with 50 grid points.
3.50 Grid point distribution using 100 grid points for the sunspot cycle time series. The upper graph shows the linear grid distribution, while the lower graph shows the distribution of grid points using a nonlinear grid regime.
3.51 The actual output and the differences between the original signal and the output reconstructed using a RBFN with 100 linear grid points.
3.52 The actual output and the differences between the original signal and the reconstructed output using a RBFN with 100 grid points in a nonlinear grid regime.
3.53 The actual output and the differences between the original signal and the reconstructed output using a RBFN with a mapping of the nonlinear grid to a linear grid regime with 100 grid points.
3.54 The RMS error values as a function of the number of grid points per dimension.
3.55 Grid point distributions of the iris data set. The upper graphs in each sub-graph show the distribution of the linear grid points, while the lower graphs show the nonlinear grid points.
3.56 Basis function coverage in two dimensions.
3.57 Number of neurons used in a RBFN when the input dimension is four.
4.1 EANFIS architecture.
4.2 An example illustrating the determination of the maximum itemset in the Apriori algorithm.
4.3 An example to illustrate the proposed rule formation algorithm.
4.4 Illustration of the distribution of grid points in the d-th dimension of the input x_d.
4.5 The pseudo-code implementation of the proposed algorithm for the formation of clusters. The small diagram in the top right-hand corner illustrates when a new cluster is formed, and when grid points are merged together to form a cluster.
4.6 Example of finding the membership function using the proposed self-organising mountain clustering membership function method.
4.7 A diagram to illustrate the training of the EANFIS for the case of discrete output classes. The notation ¬ denotes the negative of the output class τ.
4.8 The resulting data clusters for the exclusive OR problem after the application of the mountain clustering membership function; '*' denotes the first cluster, 'o' denotes the second cluster.
4.9 The Gaussian membership functions for the ANFIS architecture for the exclusive OR problem. There are two membership functions per dimension.
4.10 The architecture of the XOR problem as found by the proposed EANFIS architecture.
4.11 The monthly average sunspot number. The training data (55 years) are shown in dark, while the testing data (200 years) are shown in a lighter colour.
4.12 The self-organising mountain clustering membership functions of the sunspot cycle time series; '*' denotes the first cluster, 'o' denotes the second cluster.
4.13 The combination of the fuzzy rules for the EANFIS architecture.
4.14 The prediction results of the monthly average sunspot number time series using an EANFIS architecture with a linear grid regime and the self-organising mountain clustering membership functions.
4.15 The prediction results of the monthly average sunspot number time series using an ANFIS architecture with Gaussian membership functions.
4.16 The architecture found by using the proposed EANFIS architecture.
4.17 The output of the Mackey-Glass equation.
4.18 The Gaussian membership function for the Mackey-Glass equation example.
4.19 The outputs of a neuro-fuzzy network using the ANFIS architecture with 16 Gaussian membership functions for the Mackey-Glass equation.
4.20 The outputs of the EANFIS architecture with 12 Gaussian membership functions for the Mackey-Glass equation.
4.21 The difference between the output of the ANFIS architecture with 16 Gaussian membership functions and the original signal for the Mackey-Glass equation.
4.22 The difference between the output of the EANFIS architecture with 12 Gaussian membership functions and the original signal for the Mackey-Glass equation.
4.23 The architecture found by using the proposed EANFIS architecture.
4.24 The membership function (mountain clustering) for the Iris data set.
4.25 The architecture found using the proposed EANFIS architecture.
4.26 The self-organising mountain clustering membership functions for the Wisconsin breast cancer example. The solid line denotes "benign", and the dashed line denotes "malignancy".
4.27 The architecture found using the proposed EANFIS architecture.
4.28 Inverted pendulum on a cart.
4.29 Block diagram of the inverted pendulum on a cart control system.
4.30 Control force of the training system.
4.31 Input status of the inverted pendulum on a cart control system.
4.32 Control force of the EANFIS system.
4.33 Input status of the inverted pendulum on a cart control system using EANFIS.
4.34 Control force of the ANFIS system.
4.35 Input status of the inverted pendulum on a cart control system using ANFIS.
4.36 The architecture found using the proposed EANFIS architecture.
5.1 A block diagram to show the preprocessing method of combining the EANFIS architecture and a global method.
5.2 A block diagram to show the preprocessing method of combining the ANFIS architecture and a global method.
5.3 A block diagram showing the parallel connection of the global module with the membership function module in an extended EANFIS architecture.
5.4 A block diagram showing the series-parallel connection of the global module and the series connection of the membership function module and the competitive and normalisation layers in the EANFIS architecture.
5.5 The raw data of the first and second dimensions of the iris data set.
5.6 The iris flower dataset projected down to one transformed dimension using the PCA algorithm.
5.7 The iris flower dataset projected from three dimensions to two transformed dimensions using the PCA algorithm.
5.8 The iris flower data set projected down onto one transformed dimension using the class-dependent LDA method.
5.9 The iris flower data set projected down to two transformed dimensions with a class-dependent LDA method.
5.10 The iris flower data set projected down to one transformed dimension with class-independent LDA.
5.11 The iris flower data set projected down to two transformed dimensions with class-independent LDA.
5.12 The extended adaptive neuro-fuzzy inference system with the LDA method.
5.13 Monthly average sunspot number time series, in which the first 55 years of data are used for training and the remaining 200 years of data are used for testing. The training data are shown as a continuous line, while the testing data are shown as a dotted line.
5.14 The sunspot number time series data set using a class-dependent LDA technique to project it to one transformed dimension.
5.15 The sunspot number time series data set using a class-independent LDA technique to project it to one transformed dimension.
5.16 The prediction outputs of the class-dependent LDA method combined with the EANFIS architecture.
5.17 The prediction output errors of the class-dependent LDA method combined with the EANFIS architecture.
5.18 The membership function formed using the self-organizing mountain clustering membership function method. '.' denotes iris-setosa, 'o' denotes iris-versicolor and '+' denotes iris-virginica.
5.19 The breast cancer dataset projected down onto one transformed dimension with the class-dependent LDA method.
5.20 The breast cancer dataset projected down to one transformed dimension with the class-independent LDA method.
B.1 Input data.
B.2 Class-dependent LDA transformation.
B.3 Class-independent LDA transformation.
List of Tables
3.1 Output results comparisons for the van der Pol equation example.
3.2 Comparison of the root mean square values between using 25 grid points and 100 grid points for the currency exchange time series.
3.3 Output results comparisons.
4.1 The input-output pairs of the exclusive OR problem.
4.2 The fuzzy rules found for the exclusive OR problem using our proposed method for rule formation.
4.3 The results of the XOR problem, comparing three methods: the EANFIS architecture with the mountain clustering membership function, the EANFIS architecture with Gaussian membership functions, and the ANFIS architecture with Gaussian membership functions.
4.4 The fuzzy rules found for the sunspot cycle time series.
4.5 The RMS errors of applying various methods to the sunspot number time series. Please see the text for an explanation of the experimental conditions.
4.6 The RMS errors on the Mackey-Glass example, comparing ANFIS, EANFIS with the self-organising mountain clustering membership function, and EANFIS with Gaussian membership functions.
4.7 The outcomes of applying the EANFIS architecture and the ANFIS architecture to the Iris data set. The values reported in this table are obtained from an average of 100 experiments using randomly selected sets of 99 training data samples and 51 testing data samples.
4.8 The extracted fuzzy rules for the Iris dataset. These rules are used in the EANFIS architecture.
4.9 Comparison of the prediction capabilities of the ANFIS architecture with membership functions and the EANFIS architecture with linear and nonlinear grid regimes, using a single-weight output layer.
4.10 Comparison of the prediction capabilities of the ANFIS architecture with membership functions and the EANFIS architecture with linear and nonlinear grid regimes, using a TSK output layer.
4.11 The extracted fuzzy rules for the Wisconsin breast cancer example. These rules are used in the EANFIS architecture.
4.12 The RMS errors of ANFIS compared with the different EANFIS architectures.
5.1 Using the LDA methods as pre-processing methods in combination with the ANFIS or EANFIS architectures.
5.2 Prediction RMS errors for the sunspot number time series.
5.3 Prediction classification accuracy comparison on the Iris data set.
5.4 Output accuracy comparison of the Wisconsin breast cancer data set using various architectures.
5.5 Output accuracy comparison on the Wisconsin breast cancer data set.
B.1 Input data.
Chapter 1
Introduction
1.1 Introduction
The dream of a machine which can mimic human thinking has always been a research topic in computer science. Even with the advent of more and more powerful computers, this dream has remained elusive. In the 1970s much research was carried out on what are commonly called "expert systems", as part of the overall idea of having a computing machine which can mimic human thoughts and human actions. An expert system normally has two components: a knowledge base and an inference engine. The knowledge base stores relevant facts about the system under study in a machine-understandable manner, while the inference engine is capable of reasoning based on the knowledge stored in the knowledge base, in response to queries by the user. Indeed, there were a
number of widely publicised expert systems, among them one called MYCIN, which was capable of medical diagnosis, and another called PROSPECTOR, which was capable of mineral prospecting [2]. These systems are alternatively called expert systems or rule based systems.
However, it soon transpired that while expert systems like MYCIN and PROSPECTOR worked well in their respective domains, it was quite difficult to use the methodology on new domains. It also soon transpired that to implement one of these expert systems, a knowledge engineer (an expert in acquiring the necessary knowledge, and able to "translate" it into a format compatible with expert systems) would be required to solicit the knowledge from domain experts. The domain experts are people who have knowledge of the domains of interest, but who may not be well versed in the expert system methodology. This knowledge solicitation effort is often called knowledge acquisition. It was discovered that the knowledge acquisition process is far from trivial and requires extensive collaboration between the domain expert and the knowledge engineer. It was also soon discovered that expert systems are relatively "brittle". In other words, these systems can provide inference from user queries; however, if the query is "vague", or if the knowledge stored in the knowledge base is "vague", then the expert system may behave in a relatively erratic manner. Here "vague" may mean imprecise knowledge in the knowledge encoding process, or that the queries differ slightly from the knowledge which was encoded in the knowledge base.
On the other hand, there was another innovation which can be dated even earlier
than expert systems, or rule based systems. This is the study of artificial neural
networks. Artificial neural networks were first proposed by McCulloch and Pitts [4] in the 1940s, based on their observation that the human brain consists of many interconnected neurons. These neurons have seemingly simple mechanisms, and yet the interconnection of these seemingly simple neurons constitutes what we know as "human intelligence". Hence, if we can model these neurons sufficiently well, then by interconnecting them the resulting system will be able to infer from given facts. Indeed, Rosenblatt [5] studied such a concept in the 1950s, and provided impressive examples to show that such a system is capable of mimicking human thought. However, Minsky and Papert, in their famous book Perceptrons [3] in the late 1960s, showed that such a network of simple interconnected artificial neurons cannot even solve the simple exclusive OR problem. This dampened the enthusiasm in this area for a considerable time.
In the meantime, Zadeh in the 1960s proposed a concept called the fuzzy set [17]. He argued that humans do not reason using numerical values; instead, we reason using "categories" which are not based on numerical values. For example, we may say "today is hot". The "hotness" depends on the person's perception. Thus, to someone living in the tropics, "hotness" may mean 40 degrees C, while to someone living in more temperate zones, "hotness" may mean 20 degrees C. We then reason using such concepts, as in: "if it is hot and if it rains heavily, then I do not play golf". Here, the concepts of "hotness" and "raininess" are both "fuzzy"; in other words, different people may have different conceptualisations of these qualities. Thus, the main feature of a "fuzzy" set is that it is capable of providing information in the continuum between 0 and 1, or between "black" and "white", etc.
This concept of a "fuzzy" set soon found its way into rule based systems, in what are called "Fuzzy Rule Based Systems" [19]. The concept was greatly promoted when Japanese companies revealed that some of their most advanced railway systems were controlled by "fuzzy control" systems [18]. This generated much enthusiasm among researchers exploring the concept of fuzzy systems. However, it was soon discovered that for a fuzzy control system to work properly, one needs to find an appropriate fuzzy membership function [12]. The effort of designing an appropriate fuzzy membership function is comparable to the effort of knowledge acquisition: in both cases, someone needs to discuss the concepts with the domain experts, solicit the knowledge, and then represent such knowledge either in fuzzy membership functions or in the knowledge base.
In the meantime, there was a revival of studies in artificial neural networks through the seminal work of Rumelhart, McClelland, and Hinton [13] in 1986, which provided what they called the backpropagation training algorithm for the training of a multilayer perceptron. This opened up a new era of artificial neural network studies, as it was shown that such multilayer perceptrons can solve nonlinear problems, can learn from a set of training examples, and are capable of generalising to unseen examples.
Thus, it is observed from this brief description that there were two main streams
of thought in mimicking human reasoning by machines:
• Rule based systems. These are systems based on a set of rules. Often the rules
take on the following form: “If expression 1 is true, and expression 2 is true,
..., and expression N is true, then consequence". These rules are found by humans, and may be based on human experience mined by knowledge engineers in the process of knowledge acquisition. The rules express our understanding of the way that the system works. The rules may be expressed using fuzzy membership functions, in what are called fuzzy rule based systems, or expressed using Bayesian reasoning, which assigns a conditional probability to each piece of evidence and each hypothesis and reasons using Bayes' theorem, in what is commonly called a Bayesian rule based system. It requires a significant amount of work to elucidate the rules; however, once the rules are obtained, they provide a way in which the human operator can understand the operation of the system, and the operator may be able to provide more rules as time progresses, as more and more operating experience is accumulated.
• Artificial neural networks. These are systems which intend to mimic human reasoning by providing a set of interconnected artificial neurons. These artificial neurons are relatively simple devices; the main inference power of the artificial neural network comes from the way in which the artificial neurons are interconnected. Once the architecture (the way the neurons are interconnected) is decided, the weights (synaptic weights) can be learned by presenting a set of examples to the artificial neural network. Once the weights are trained, the artificial neural network can be used to generalise (infer) to unseen examples. The issue here is that humans cannot easily understand the way in which such a system works, as it is quite difficult to extract human-understandable rules from such a trained network. However, such networks have been shown to have good potential. For example, one particular configuration of such neurons is called the multilayer perceptron [13]. The multilayer perceptron has been shown to be a universal approximator, i.e., it is capable of approximating any given nonlinear function arbitrarily closely, provided that there is a sufficient number of hidden layer neurons [14].
Thus it is observed that there is considerable tension between the study of artificial neural networks and that of rule based systems, in their basic premises and their applications. Both methods have a considerable following, and both claim to provide good mimicking of human reasoning capabilities.
An innovative idea was engendered in the 1990s, when researchers asked the obvious question: is it possible to combine the best capabilities of rule based systems and artificial neural networks to provide machines with human-like reasoning capabilities? In other words, it would be ideal to have a reasoning system which allows us to train the parameters based on a set of training examples, and to use rule based reasoning to infer on user queries, so that the human operator may understand the operation of the system. In other words, we wish to have a system which takes away the difficulties related to crafting or acquiring the knowledge from experts, by relying instead on a set of training examples, but whose actions are transparent enough for the human operator to understand. One such system is called the neuro-fuzzy system.
1.2 Neuro-fuzzy systems
As indicated in the previous section, neuro-fuzzy systems attempt to combine the
best features of an artificial neural network and a fuzzy rule based system. Before
we can appreciate what the neuro-fuzzy system proposes, we will need to provide sufficient details¹ about a fuzzy inference system.
¹ The concept of a neuro-fuzzy system will be discussed in more detail in the next chapter. In this chapter, sufficient details will be provided so as to allow us to discuss the concepts, and to understand the background to this thesis.
A fuzzy inference system is an inference system which is based on fuzzy concepts
in representing its underlying variables, and it will make its inference based on such
fuzzy variables. Thus, a fuzzy inference system consists of the following components:
• A fuzzy membership section. In this component, the knowledge or experience
is represented by a set of fuzzy membership functions. Such membership func-
tions may take a number of typical forms, e.g., triangular function, trapezoidal
function, or Gaussian function. Each of these candidate membership functions
will have a number of parameters. These parameters can be determined by an
expert.
• Rules section. In this component the outputs from the membership functions
are combined in various ways to form rules. There are usually no parameters
associated with the rule section. There are however different ways in which
the outputs of the fuzzy membership section can be combined.
• Inference engine. Inferences are made based on the rules. The simplest inference mechanism is: the output is a linear combination of the outputs of the rule section. There are parameters associated with the inference engine. (A minimal code sketch of these three components follows this list.)
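The following minimal sketch puts the three components together in code, assuming Gaussian membership functions, a product T-norm for the rules, and a linear combination as the inference engine. It is an illustration of the generic FIS described above, not the thesis's EANFIS; all function names and parameter values are invented.

```python
from itertools import product
import numpy as np

def gaussian(x, c, s):
    """Degree of membership of x in a Gaussian fuzzy set (centre c, spread s)."""
    return np.exp(-((x - c) ** 2) / (2.0 * s ** 2))

def fis_forward(x, centres, spreads, weights):
    """Forward pass of a minimal FIS.

    x: input vector of length D.
    centres, spreads: (D, M) arrays giving M membership functions per dimension.
    weights: one output weight per rule (M**D rules in total).
    """
    D, M = centres.shape
    # Fuzzy membership section: membership degree of x[d] in each of the M sets.
    mu = gaussian(x[:, None], centres, spreads)            # shape (D, M)
    # Rule section: all combinations of one set per dimension, product T-norm.
    firing = np.array([np.prod([mu[d, idx[d]] for d in range(D)])
                       for idx in product(range(M), repeat=D)])
    # Inference engine: the output is a linear combination of the rule outputs.
    return firing @ weights

# Two inputs with two membership functions each -> 2**2 = 4 rules.
centres = np.array([[0.0, 1.0], [0.0, 1.0]])
spreads = np.full((2, 2), 0.5)
weights = np.array([0.0, 1.0, 1.0, 0.0])    # an XOR-like rule table
print(fis_forward(np.array([0.0, 1.0]), centres, spreads, weights))
```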
The way in which the features of a fuzzy rule based system and artificial neural
network are exploited can be summarised as follows:
• Since artificial neural networks can learn from a set of training examples, one way in which such capabilities can be deployed is to use some kind of learning algorithm to learn the output weights. Note that often the fuzzy membership parameters are not learned using the set of training examples, as the relationships between the outputs and the fuzzy membership function parameters are highly nonlinear. On the other hand, if a linear combination is used for the inference engine section, then there is a very simple linear relationship between the output and the linear combination weights (a least-squares sketch of this is given after this list).
• The rule section. This component combines the outputs of the fuzzy membership functions. There are no trainable weights in the rule section, apart from the choice of the ways in which the outputs of the fuzzy membership functions are combined.
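Because the output depends linearly on the combination weights, those weights can be fitted in closed form by least squares. The sketch below assumes the rule firing strengths for all training examples have been collected into a matrix (e.g. with a forward pass such as the one sketched earlier); the names `firing_matrix` and `fit_output_weights` are ours, not the thesis's.

```python
import numpy as np

def fit_output_weights(firing_matrix, targets):
    """Least-squares estimate of the linear inference-engine weights.

    firing_matrix: (N, R) array, row i holding the R rule firing strengths
                   for training example i.
    targets:       (N,) array of desired outputs.
    """
    # Solve min_w ||firing_matrix @ w - targets||^2 via the pseudo-inverse.
    w, *_ = np.linalg.lstsq(firing_matrix, targets, rcond=None)
    return w
```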
Thus provided that the parameters of the fuzzy membership functions can be
found somehow, such a neuro-fuzzy system combines the best features of both the
artificial neural networks, and a fuzzy rule based system. In general, it was found
that the neuro-fuzzy system is not too sensitive to the shape of the fuzzy membership
functions, or their parameters, as long as there is a sufficient number of them.
A particular neuro-fuzzy system is called the adaptive neuro-fuzzy inference system (ANFIS), which has been quite popular among practitioners [68, 69]. Indeed, it was shown that if the fuzzy membership function is a Gaussian function, and the inference mechanism is a simple linear combination of the outputs of the rule section, then ANFIS is similar to a multilayer perceptron with a single hidden layer whose hidden layer neuron activation function is Gaussian [70]. Obviously, if the ANFIS uses a different inference mechanism, e.g. the Takagi-Sugeno-Kang (TSK) mechanism (details concerning this method are given in the next chapter), then the ANFIS is not equivalent to a multilayer perceptron with Gaussian functions as the hidden layer neuron activation functions.
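To see this equivalence concretely, the forward pass can be written out using the symbols of the Notation section. The derivation below is our own illustration of the statement above, not a passage reproduced from the thesis; $c_{rd}$ and $\sigma_{rd}$ denote the centre and spread of the Gaussian membership function of rule $r$ in dimension $d$:

```latex
\varphi_r(\mathbf{x})
  = \prod_{d=1}^{D} \exp\!\left(-\frac{(x_d - c_{rd})^2}{2\sigma_{rd}^2}\right)
  = \exp\!\left(-\sum_{d=1}^{D} \frac{(x_d - c_{rd})^2}{2\sigma_{rd}^2}\right),
\qquad
y = \sum_{r=1}^{R} w_r\,\varphi_r(\mathbf{x}).
```

Each rule activation $\varphi_r$ is exactly a multivariate Gaussian unit, so the linear-output ANFIS coincides with a single-hidden-layer network of Gaussian neurons with output weights $w_r$.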
However, there are a number of issues which the ANFIS faces:
• The fuzzy membership function. As indicated previously, ANFIS is relatively insensitive to the shape and the number of fuzzy membership functions used. However, intellectually this is unsatisfactory, in that we somehow do not make use of all the information provided in the problem. Thus, the question is: would it be possible to model the input variables somehow, and what effect would such modelling have on the performance of the ANFIS?
• The number of rules. As no a priori information is provided to ANFIS, all possible combinations of rules need to be formed in the rule section. This creates problems if the number of inputs or the number of fuzzy membership functions is high. For example, if the number of inputs is 10, and there are 3 fuzzy membership functions per input variable, then the total number of rules that needs to be formed in the rule section is $3^{10} = 59049$, which is a large number. However, without any a priori information on the way the fuzzy membership function outputs are combined, it is difficult to see how the number of rules can be reduced. This is a common bottleneck in ANFIS, preventing its application to large scale problems which may include many input variables.
• While it is not immediately apparent, the ANFIS exploits only the local neighbourhood of the input variables. The issue is: if there is information on the global nature of the input variables, would such information improve the performance of ANFIS?
1.3 Objectives
The objectives of this thesis are
• to design a novel neuro-fuzzy system that will have the following capabilities:
– automatic determination of the structure of the neuro-fuzzy system, in terms of the number of rules required for a particular problem;
– automatic determination of the shape of the membership functions;
– the capability of taking into account both local information and global information in the structure of the input variables.
1.4 The contribution of this thesis
There are a number of contributions of this thesis.
• The thesis proposes an extension of the ANFIS in what we call an extended ANFIS (EANFIS) architecture. This architecture is based on the observation that it is possible to improve the performance of the ANFIS, in cases where the outputs are discrete variables, by incorporating some kind of logistic function², in a manner inspired by the incorporation of a logistic operation in artificial neural networks [15]. From this insight we propose an EANFIS architecture which can be used for both discrete and continuous output variables.
• A way to reduce the number of rules formed in the EANFIS a priori. In other words, we propose a novel way of determining the appropriate architecture from the input structure of the problem. In this architecture, only the rules which will be used for inference are formed; the rules which will not be used in the inference process are not formed. The proposed method is inspired by the Apriori algorithm [41] from associative rule data mining techniques, though it is not exactly the same as the popular Apriori algorithm. Thus we find a way to produce an architecture which is appropriate for a particular problem at hand, based on an analysis of the input structure of the problem. This is significant in that it allows our proposed architecture to be used for practical problems of possibly high input dimension, as the limiting factor is no longer the large number of rules which must be formed: only the set of rules necessary to the inference process is formed.
² The sigmoid function is an example of the logistic function [10].
• A way to use the input variables to form possibly non-symmetric fuzzy membership functions. In other words, instead of pre-supposing the shape of the fuzzy membership functions, in our proposed method the shape of the fuzzy membership functions is determined from the input variables. The parameters of the fuzzy membership functions are determined in the process as well. Thus, using our proposed method, the user does not need to supply the parameters of the fuzzy membership functions by hand.
• By further analysing the input structure, it is found that if the global structure
of the input variables is known, then such information can lead to improved
performance of the EANFIS. Obviously there are situations in which such
global information may not help, as the problem may only be dependent on
the local structure of the input variables. However, where such information is
useful, it is found that the combination of local and global structures of the
input can lead to better performance of the EANFIS.
• A minor contribution of the thesis is that we find a way in which a nonlinear grid decomposition can be obtained from input variables. This method allows us to find an appropriate nonlinear grid structure for a set of inputs. This can be applied to the radial basis function neural network in finding the centres and spreads of the radial basis functions; a small code sketch after this list illustrates the role of these centres and spreads.
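As a concrete illustration of where such a grid is used, the sketch below evaluates a one-dimensional RBFN whose centres come from a non-uniform grid, with each spread set from the gap to the neighbouring grid points. The grid values and the nearest-neighbour spread rule are invented for the example; the thesis's own method derives the grid from the turning points of the signal (Chapter 3).

```python
import numpy as np

def rbfn_eval(x, centres, spreads, weights):
    """Evaluate a 1-D RBFN: a weighted sum of Gaussian basis functions
    whose centres/spreads come from a (possibly non-uniform) grid."""
    phi = np.exp(-((x[:, None] - centres[None, :]) ** 2)
                 / (2.0 * spreads[None, :] ** 2))
    return phi @ weights

# A non-uniform grid: points crowd where the signal varies quickly.
centres = np.array([0.00, 0.05, 0.15, 0.40, 0.80, 1.00])
# One simple choice: spread proportional to the gap to the nearest neighbour.
gaps = np.diff(centres)
spreads = np.concatenate([[gaps[0]], np.minimum(gaps[:-1], gaps[1:]), [gaps[-1]]])
weights = np.random.default_rng(0).normal(size=centres.size)

x = np.linspace(0.0, 1.0, 5)
print(rbfn_eval(x, centres, spreads, weights))
```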
1.5 Structure of the thesis
The structure of the thesis will be as follows:
• In Chapter 2, we will provide more details about the ANFIS architecture as
a background to the thesis. In addition, we will show how the parameters
of the inference section of the ANFIS architecture can be determined using a
backpropagation type of learning algorithm.
• In Chapter 3, we will give a description of the nonlinear grid determination method. This chapter can be read as a standalone contribution.
• In Chapter 4, we will describe the proposed EANFIS architecture. We will further describe our proposed method of customising the EANFIS architecture for a particular problem, based on the input variable structure as provided in the set of training examples. We will further describe the ways in which the fuzzy membership function can be determined from the set of inputs, and how the associated parameters may be determined.
• In Chapter 5, we will describe a method for combining global and local infor-
mation concerning input variables in the EANFIS architecture. It is shown
that where appropriate such combination can improve the performance of the
EANFIS architecture.
• Chapter 6 will draw some conclusions from the work presented in this thesis,
and will provide some directions for future research.
Chapter 2
Neuro-Fuzzy Systems
In this chapter we will give some background on neuro-fuzzy systems. We will first describe a fuzzy inference system, followed by a description of a particular adaptive neuro-fuzzy system, viz. the adaptive neuro-fuzzy inference system (ANFIS). Note that we make a distinction in this thesis between a neuro-fuzzy system and a fuzzy system. A fuzzy system is one in which the membership functions are pre-assigned and the consequent part is also pre-assigned, while a neuro-fuzzy system is one in which the membership functions are unknown and need to be determined, and the parameters of the consequent part also need to be determined.
2.1 Background on Fuzzy concepts
Instead of giving a detailed background on fuzzy concepts, we direct the reader to books on fuzzy systems such as [7]. We provide only sufficient material in this chapter to understand the concepts required in this thesis.
2.2 Fuzzy rules
One of the main concepts in a fuzzy system is a rule which expresses the relationship
of entities in fuzzy terms, viz., a fuzzy rule. A fuzzy rule can be expressed as follows:
IF $x$ is $A$ THEN $y$ is $B$ $\qquad$ (2.1)
where $x$ and $y$ are linguistic variables, and $A$ and $B$ are linguistic values determined
by fuzzy sets on the universes of discourse $X$ and $Y$ respectively. The IF statement
is sometimes referred to as the "premise", while the THEN statement is sometimes
referred to as the "consequence".
A fuzzy IF-THEN rule uses natural language to express the premise and the
consequence. The premise may have more than one conditional expression joined by
a logical operator <logical OP>: "AND" or "OR".
$R_i$: IF $x_1$ is $\mu_{i1}$ <logical OP> $\cdots$ <logical OP> $x_d$ is $\mu_{id}$ THEN $y$ is $\mu_i$

where $\mu_{id}$ represents the membership function of rule $i$ in dimension $d$.
2.2.1 Reasoning with fuzzy rules
Reasoning with fuzzy rules typically involves two parts:
• Evaluation of the rule antecedent (the IF part of the rule)
• Applying the result of the evaluation of the antecedent to the consequent (the
THEN part of the rule).
In a fuzzy rule reasoning situation, when the antecedent is partially true, all rules
fire to some extent, dependent on the degree of incursion of the input into each rule's
membership functions (i.e. on the overlap among the membership functions).
2.3 Fuzzy inference System
A Fuzzy Inference System (FIS) mimics the human reasoning process by implementing
fuzzy sets and an approximate reasoning mechanism which uses numerical values
instead of logical values.
A Fuzzy Inference System (FIS) consists of three conceptual components:
• a rule base containing a set of fuzzy rules,
• a database which defines the membership functions used in the fuzzy rules,
and,
• a reasoning mechanism which performs the inference procedures.
Figure 2.1 gives a block diagram representation of a FIS [12].
Figure 2.1: A block diagram of a Fuzzy Inference System [12]
Fuzzy inference may be defined as the process of mapping from a given input
to an output using the theory of fuzzy sets. Fuzzy reasoning is also known as
approximate reasoning, in that the process draws a conclusion provided that the
fuzzy implication A → B is true. Fuzzy reasoning includes four steps [12]:
• Fuzzification: find the degree of similarity between the input and each membership function.

• Inference: combine the degrees of similarity from different membership functions using fuzzy AND or OR operators to form the firing strength of each rule.

• Aggregation: apply the firing strengths to the consequent membership functions to generate qualified consequent membership functions.

• Defuzzification: aggregate all the qualified consequent membership functions to produce a crisp output.
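To make the first two steps concrete, the following minimal sketch (in Python; the Gaussian membership shape, its parameters and the product T-norm are illustrative assumptions, not choices prescribed in this thesis) shows the fuzzification and inference steps for a single two-input rule:

    import numpy as np

    def gauss_mf(x, c, sigma):
        # Fuzzification: degree of similarity between the crisp input x and
        # a Gaussian membership function with centre c and spread sigma.
        return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

    # Rule: IF x1 is "low" AND x2 is "high" THEN ... (illustrative values)
    x1, x2 = 0.3, 0.8
    mu_low = gauss_mf(x1, c=0.0, sigma=0.5)    # fuzzification of x1
    mu_high = gauss_mf(x2, c=1.0, sigma=0.5)   # fuzzification of x2
    firing = mu_low * mu_high                  # inference: product as fuzzy AND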
There are three types of popular fuzzy models used in various applications. The
main difference among these models is in the aggregation and defuzzification stages.
Mamdani Fuzzy Model
The Mamdani fuzzy inference system was proposed in [19, 20]. The original design
uses a min-max composition as shown in Figure 2.2. One possible variation is
shown in Figure 2.3. The difference between the original Mamdani model and
the alternative one is as follows: the original Mamdani method uses a min (T-norm) operator
for the implication operation (AND) and a max (T-conorm) operator for the
aggregation operation (OR), while the alternative uses the algebraic product
($\mu_a\mu_b$) for the T-norm operation and the probabilistic sum ($\mu_a + \mu_b - \mu_a\mu_b$) for the T-conorm operation.
Figures 2.2 and 2.3 respectively show two crisp values (x, y), the input variables
to two fuzzy rules, and the degrees of similarity obtained as a result. The rule used is:
($R_i$: IF $x$ is $\mu_{i1}$ AND $y$ is $\mu_{i2}$ THEN $z$ is $\mu_{ic}$). The inference process combines the firing
strengths of the rules. The aggregator generates a qualified consequent membership
Figure 2.2: Mamdani fuzzy inference system using min and max operators [12].
Figure 2.3: An alternative Mamdani fuzzy inference system using product and max operators [12].
function by using these firing strengths. The defuzzification process aggregates all
qualified consequent membership functions with a max operation and extracts a
crisp output value from the fuzzy set. Generally, there are seven commonly used
methods to extract a crisp output value from a fuzzy set [30]. The most commonly
used method is the Centroid of Area (COA) technique, shown in Equation 2.2;
this is analogous to computing the expected value of a probability distribution [12].
$$Z_{COA} = \frac{\int_z \mu_{C'}(z)\, z\, dz}{\int_z \mu_{C'}(z)\, dz} \qquad (2.2)$$

where $\mu_{C'}$ is the aggregated output membership function.
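As a concrete illustration of Equation (2.2), the centroid may be computed numerically from a sampled aggregated membership function; the discretisation and the example fuzzy set below are assumptions for illustration only:

    import numpy as np

    z = np.linspace(0.0, 10.0, 1001)                    # sampled output universe
    mu = np.maximum(0.0, 1.0 - np.abs(z - 4.0) / 2.0)   # an example aggregated fuzzy set
    z_coa = np.sum(mu * z) / np.sum(mu)                 # discrete COA; here approx. 4.0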
Tsukamoto Fuzzy Model
In the Tsukamoto Fuzzy Model (Figure 2.4), the consequent of each fuzzy rule is
a monotonic membership function. The output of each fuzzy rule is a crisp value
induced by a firing strength. The overall output is the weighted average of each rule’s
output. Because the output of each fuzzy rule is a crisp value, the monotonic
membership function avoids a time-consuming defuzzification process [12].
Sugeno (TSK) Fuzzy Model
In the Sugeno Fuzzy Model, alternatively referred to as the Takagi-Sugeno-Kang
(TSK) model, the consequence membership function is replaced by a polynomial
equation as shown in Figure 2.5 [12]. The output of each rule is a linear combination
Figure 2.4: Tsukamoto fuzzy inference system [12].
of input variables plus a bias, as shown in Equation 2.3. The weights $\alpha_{ri}$, $i =
1, 2, \ldots, d$, for a particular rule $r$ ($r$ is the rule number) can be adjusted by a
steepest descent algorithm [32] or other similar methods. The overall output is the
weighted average of each rule's output.

$$z_r = \alpha_{r0} + \alpha_{r1} x_1 + \cdots + \alpha_{rd} x_d \qquad (2.3)$$

where the $\alpha_{ri}$ are the weights and $\alpha_{r0}$ is the bias of rule $r$.
The TSK model allows a polynomial inference mechanism. In the TSK model [21],
the linguistic value in the consequence is replaced by a functional unit with inputs
taken from the premise. Usually, the unit performs a first order polynomial
operation on the inputs:
Figure 2.5: Sugeno fuzzy inference system [12].
$R_i$: IF $x_1$ is $\mu_{i1}$ <logical OP> $\cdots$ <logical OP> $x_d$ is $\mu_{id}$ THEN $y_i = f_i(x_1, \ldots, x_d)$

where $f_i$ is a first order polynomial.
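A minimal sketch of the TSK output mechanism follows; the rule weights and firing strengths below are illustrative assumptions. Each rule output follows Equation (2.3), and the overall output is the firing-strength weighted average of the rule outputs:

    import numpy as np

    def tsk_output(x, alphas, firing):
        # x: (d,) input; alphas: (R, d+1) rows [alpha_r0, alpha_r1 .. alpha_rd];
        # firing: (R,) rule firing strengths.
        z = alphas[:, 0] + alphas[:, 1:] @ x          # z_r for every rule (Eq. 2.3)
        return np.sum(firing * z) / np.sum(firing)    # weighted average output

    y = tsk_output(np.array([0.3, 0.8]),
                   np.array([[0.1, 1.0, -0.5],
                             [0.4, 0.2, 0.3]]),
                   np.array([0.7, 0.2]))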
Depending on the reasoning method, whether it is a Mamdani, a Tsukamoto,
or a TSK model, there are a number of parameters which need to be determined.
These parameters can be determined by either experts or through a learning process
from the training data set. If they are determined by experts then this is known as
a fuzzy system. On the other hand if they are determined using a learning process,
then it is known as a neuro-fuzzy system.
2.4 Neuro-Fuzzy Inference System
The neuro-fuzzy inference system lies between a fuzzy inference system and an adaptive
neuro-fuzzy inference system. It consists of five feed-forward layers [23].
Figure 2.6: Architecture of a Neuro Fuzzy System.
• Layer 1 (Input Layer): In this layer the input x is passed to the corresponding membership functions in the next layer.

• Layer 2 (Matching): In this layer each node is a membership function. It calculates the similarity between the input and the membership function. The membership function can be a logistic function as shown below.

$$\varphi_{d,c}(x) = \frac{2}{1 + \exp\left(-a\,(x - c)^2\right)} \qquad (2.4)$$

where $c$ is the function center.
• Layer 3 (MIN): This layer performs a fuzzy AND operation.

$$\phi_r = \min\left(\varphi_{1,c}(x),\, \varphi_{2,c}(x),\, \ldots,\, \varphi_{d,c}(x)\right) \qquad (2.5)$$
• Layer 4 (MAX): This layer performs a fuzzy OR operation to integrate the rules which have the same consequence.

$$\bar{\phi}_t = \max\left(\phi_1,\, \phi_2,\, \ldots,\, \phi_t\right) \qquad (2.6)$$
• Layer 5 (Defuzzification): This layer aggregates the outputs from the different rules.

$$y = \frac{\sum_t \bar{\phi}_t\, w_t\, \mu_t(\bar{\phi}_t)}{\sum_t \bar{\phi}_t} \qquad (2.7)$$

where $w_t$ and $\mu_t(\bar{\phi}_t)$ are the weights and output membership functions respectively.
The weights and the parameters in the membership function can be trained
by a back-propagation type of algorithm.
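The following sketch traces one forward pass through these five layers, using the matching function of Equation (2.4) and the operations of Equations (2.5)-(2.7); the rule index structure, the grouping of rules by consequence, and all parameter values are illustrative assumptions:

    import numpy as np

    def membership(x, a, c):
        # Layer 2 matching function (Equation 2.4); c is the centre.
        return 2.0 / (1.0 + np.exp(-a * (x - c) ** 2))

    def forward(x, centres, a, rule_index, groups, w, mu):
        # Layers 1-2: each input dimension meets its membership functions.
        phi = [[membership(x[d], a, c) for c in centres[d]] for d in range(len(x))]
        # Layer 3 (MIN, Eq. 2.5): each rule selects one membership per dimension.
        rules = [min(phi[d][g[d]] for d in range(len(x))) for g in rule_index]
        # Layer 4 (MAX, Eq. 2.6): combine rules sharing the same consequence.
        phi_bar = [max(rules[i] for i in grp) for grp in groups]
        # Layer 5 (Eq. 2.7): mu[t] stands in for the output membership value.
        return sum(p * wt * mt for p, wt, mt in zip(phi_bar, w, mu)) / sum(phi_bar)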
2.5 Adaptive Neuro Fuzzy Inference System (ANFIS)
A neuro-fuzzy system is functionally equivalent to a fuzzy inference system (FIS). A
FIS requires a domain expert to define the membership functions and to determine
the associated parameters both in the membership functions, and the reasoning sec-
tion. However, there is no standard for the knowledge acquisition process and thus
the results may be different if a different knowledge engineer is at work in acquir-
ing the knowledge from experts. A neuro-fuzzy system can replace the knowledge
acquisition process by humans using a training process with a set of input-output
training data set. Thus, instead of depending on human experts, the neuro-fuzzy
system will determine its associated parameters through
a training process, by minimising an error criterion. A popular neuro-fuzzy system
is called an adaptive neuro-fuzzy inference system (ANFIS) [22]. It consists of five
feed-forward layers as shown in Figure 2.7. The ANFIS is functionally equivalent
to the Sugeno (TSK) Fuzzy Model presented in Section 2.3. It can also express its
knowledge in the IF-THEN rule format as follows [12]:
Rule1 : IF x1 is μ11 and x2 is μ21 then y1 = α10 + α11x1 + α12x2
Rule2 : IF x1 is μ11 and x2 is μ22 then y2 = α20 + α21x1 + α22x2
Rule3 : IF x1 is μ12 and x2 is μ21 then y3 = α30 + α31x1 + α32x2
Rule4 : IF x1 is μ12 and x2 is μ22 then y4 = α40 + α41x1 + α42x2
2.5.1 A Feed-Forward Network
In this section, the basics of the ANFIS architecture are briefly described. This
will form the background to the proposed extended ANFIS architecture. An ANFIS
architecture in general consists of three sections:
Figure 2.7: Architecture of ANFIS.
• The input section – in this section, the input variables are modeled by using
fuzzy membership functions. There are many possible candidate fuzzy mem-
bership functions, e.g. triangular, trapezoidal, or Gaussian functions. We will
call this collection of possible membership functions the set of candidate
membership functions. A membership function is chosen by the user from this
set of candidate membership functions.
• The rules section – in this section, rules are formed from the set of membership
functions. Since there is no a priori reason to exclude any possible combination
of membership functions, all possible combinations of the membership
functions are formed.
• The output section – in this section, the outputs are formed by a combination
of the outputs of the rules. There are a number of possibilities. However, in
practice there are two popular choices:

(1) a zeroth order output function, and

(2) the Takagi-Sugeno-Kang (TSK) output function.

The TSK mechanism allows a direct influence by the input on the output,
while the zeroth order output function consists of a linear combination of the
internal signals of the ANFIS architecture as the output.
We will defer the architectural details until Chapter 4, where we will give a
description of each section. For other varieties of neuro-fuzzy systems, please
refer to [73].
The ANFIS architecture, while popular, is known to have a number of shortcomings.
These include: the lack of scalability to a large number of input variables;
the almost ad hoc manner in which a membership function is chosen; and the number
of membership functions required for each input variable. Fortunately there is some
empirical evidence that the prediction of the output is largely insensitive to the
choice of membership function type. There is also some empirical evidence that
as long as a sufficient number of membership functions is chosen for each input
variable, the architecture appears to be able to handle the modeling of the input
variables. However, there is no remedy for the number of rules needed by ANFIS.
Indeed, as the ANFIS architecture does not make any a priori assumption on the
rule structure, it is required to form all possible combinations of the input
variables. Hence, if the number of input variables is high, then it is quite difficult
to implement the architecture in practice.
In Chapter 4, we will provide an extension of the ANFIS architecture which
alleviates these deficiencies. We claim that the EANFIS architecture
overcomes some of these issues sufficiently well that the neuro-fuzzy architecture, of
which ANFIS is one variant, can be recommended for practical applications.
2.5.2 Network Training
The adaptive neuro-fuzzy network can be trained using a steepest descent algo-
rithm as shown in Equation 2.8. The detailed derivation of the training algorithm
will be shown in Appendix A.2. Alternatively we may use a normal equation [33]
formulation if the parameter dependence is linear 1.
$$\alpha_{rd}^{new} = \alpha_{rd}^{old} + \eta\, e\, \phi_r\, x_d \qquad (2.8)$$
where η is a learning constant, e is the error between the desired output and the
output of the system, φr is the set of internal states of the system, while xd is the
input. For further details please see Section A.2.
1There are a number of equivalent formulations in this case. The problem can be solved in
one shot by solving the normal equation. Alternatively it can be solved in a recursive fashion
using a recursive least squares training technique.
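A minimal sketch of one steepest descent update step, following Equation (2.8), is given below; the error e and the internal states φ are assumed to be available from a forward pass, and the function name is our own:

    def update_alpha(alpha, eta, e, phi, x):
        # One steepest descent step of Equation (2.8) on the consequent
        # weights: alpha_rd <- alpha_rd + eta * e * phi_r * x_d.
        for r in range(len(phi)):
            for d in range(len(x)):
                alpha[r][d] += eta * e * phi[r] * x[d]
        return alpha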
2.5.3 Network Pruning
The membership functions in ANFIS should cover the entire input space. However, if the
model includes as many rules as possible, it is the number of membership functions
per dimension which determines the number of rules. Because the
input data is usually not distributed uniformly in the input space, not every rule in
the network is useful or "fired". A pruning process can be used to shrink the set of
generated rules according to their significance (in terms of firing strength). A survey
of various pruning algorithms is provided in [35]. An orthogonal transformation
method proposed in [36] can determine the less important nodes, i.e. those
which have little or no firing strength, which can hence be eliminated.
An orthogonal transformation method for Network Pruning
This method is proposed by Partha and Debarag in [36]. It uses a singular value
decomposition (SVD) method [37] and a QR [38] decomposition method with column
pivoting factorization (QRcp) for the transformation. The SVD mainly serves as a
null space detector, and the QRcp coupled with the SVD is used for subset selection [36],
which can identify the less important nodes in the network.
The SVD is given in Equation 2.9, where $\phi \in \mathbb{R}^{I \times R}$, $i = 1, 2, \ldots, I$, $r = 1, 2, \ldots, R$, is
the normalized firing strength matrix coming from the RBFN or ANFIS, $U \in \mathbb{R}^{I \times I}$ and
$V \in \mathbb{R}^{R \times R}$ are the left and right orthogonal matrices respectively, with the properties
$U^T U = I$ and $V^T V = I$, and $\Sigma \in \mathbb{R}^{I \times R} = \mathrm{diag}[\sigma_1, \ldots, \sigma_R]$ is a pseudodiagonal matrix
in which the singular values are sorted in descending order, $\sigma_1 \ge \ldots \ge \sigma_R$. The
singular values in $\Sigma$ are the square roots of the eigenvalues of $\phi\phi^T$ or $\phi^T\phi$, and they reflect
the number of important nodes in $\phi$.

$$\phi = U \Sigma V^T \qquad (2.9)$$
The QR factorization is given in Equation 2.10. Let $V_q \in \mathbb{R}^{R \times q}$ constitute the
first $q$ columns of $V$ from Equation 2.9, where $q$ is the number of important columns
of $\Sigma$. $P \in \mathbb{R}^{R \times R}$ is a permutation matrix, $Q \in \mathbb{R}^{q \times q}$ is an orthogonal matrix and
$R \in \mathbb{R}^{q \times R}$ is an upper triangular matrix.

$$V_q^T P = Q R \qquad (2.10)$$

The matrix $F \in \mathbb{R}^{I \times q}$ in Equation 2.11 is the constructed normalized output firing
strength matrix, constituted of the $q$ most important columns of the input normalized firing
strength matrix $\phi$; $P_q \in \mathbb{R}^{R \times q}$ consists of the first $q$ columns of the permutation matrix $P$.

$$F = \phi P_q \qquad (2.11)$$
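A minimal sketch of this subset selection, using standard SVD and QR-with-column-pivoting routines, is given below; the interpretation of the pivot order as a rule ranking follows [36], while the function name and interface are our own:

    import numpy as np
    from scipy.linalg import qr

    def select_important_rules(phi, q):
        # phi: (I, R) normalized firing strength matrix; q: rules to keep.
        # SVD: singular values reveal the effective number of useful nodes.
        U, s, Vt = np.linalg.svd(phi, full_matrices=False)
        Vq = Vt[:q, :]                       # V_q^T of Equation (2.10), shape (q, R)
        # QR with column pivoting: the pivot order ranks the R rule columns.
        _, _, piv = qr(Vq, pivoting=True)
        keep = piv[:q]                       # indices of the q most important rules
        F = phi[:, keep]                     # reduced firing matrix, Equation (2.11)
        return keep, F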
In the ANFIS system, the network includes as many rules as possible to cover the
input space. Although the pruning method can detect the less important nodes in the
network, this approach is not recommended: when the network is fully expanded,
it requires a huge amount of memory if the network is large. Another approach is to use a
data mining method that determines the important rules before they are
generated. A possible method to achieve this will be shown in Section 4.4.
2.6 Radial Basis Function Networks (RBFN)
Radial basis function networks play a crucial role in adaptive neuro-fuzzy systems. They
incorporate an adaptive feature into the system, so that a domain expert is not required to
specify the membership functions. The membership functions used are radial basis
functions. A RBFN is a linear combination of these basis functions
(see Equation 2.12).

$$y(x) = \sum_{r=1}^{R} w_r B_r(x) + w_0 \qquad (2.12)$$

where $x \in \mathbb{R}^D$ is an input vector, $B_r(x)$, $r = 1, 2, \ldots, R$, is the basis function of the $r$-th
neuron, $w_r$ is a constant weight attached to each neuron and $w_0$ is a bias. Figure 2.8 shows
the RBFN architecture.
A typical basis function is a Gaussian function (please refer to Equation 2.13)
or a logistic function (please refer to Equation 2.14).
• Gaussian function

$$B_r(x) = \exp\left(-\frac{\|x - c_r\|^2}{2\sigma_r^2}\right) \qquad (2.13)$$

where $\|\cdot\|$ denotes the Euclidean norm, and $c_r$ and $\sigma_r$ are respectively the centre
and spread.
Figure 2.8: Architecture of RBFN.
• Logistic function

$$B_r(x) = \frac{1}{1 + \exp\left(\dfrac{\|x - c_r\|^2}{\sigma_r^2}\right)} \qquad (2.14)$$

where $c_r$, $\sigma_r$ have the same meaning as in Equation (2.13).
Generally speaking, a basis function consists of a center cr, and a spread of the
effective area σr. These parameters will need to be determined from the inputs.
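A minimal sketch of the RBFN forward pass of Equations (2.12) and (2.13) is given below, assuming the centres and spreads have already been determined; the helper name is our own:

    import numpy as np

    def rbfn_output(x, centres, spreads, w, w0):
        # x: (D,); centres: (R, D); spreads: (R,); w: (R,); w0: bias.
        # Gaussian firing strengths (Equation 2.13) ...
        B = np.exp(-np.sum((centres - x) ** 2, axis=1) / (2.0 * spreads ** 2))
        # ... combined linearly with a bias (Equation 2.12).
        return w @ B + w0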
An improved version of the RBFN uses the weighted average shown in Equation
(2.15), instead of the weighted sum of each neuron's firing strength.

$$y(x) = \frac{\sum_{r=1}^{R} w_r B_r(x)}{\sum_{r=1}^{R} B_r(x)} + w_0 \qquad (2.15)$$
The final output is obtained by a linear combination of the normalized firing
strengths of the neurons. When the overlap between two or more receptive
fields is large, the weighted average method produces a well-interpolated overall
output between the outputs of the overlapping receptive fields [12]. In this thesis
we will not use this version of the RBFNs and hence it will not be considered any
further.
The weights $w_r$ are usually adjusted by using either a steepest descent algorithm
[32] (as shown in Equation (2.16)) or a normal equation [33] (as shown in
Equation (2.17)).
• Steepest descent learning algorithm:

$$w_r^{new} = w_r^{old} + \eta\, e_i\, B_r(x_i) \qquad (2.16)$$

where $\eta$ is a learning constant, $x_i$ is the $i$-th input vector, $i = 1, 2, \ldots, I$, and $e_i$ is
the error, defined as $e_i \triangleq \delta_i - y_i$, where $\delta_i$ is the desired value of the $i$-th output $y_i$.
$B_r(x_i)$ is the basis function of the $r$-th neuron, $r = 0, 1, \ldots, R$. The detailed learning
algorithm is derived in Appendix A.1.
• Normal equation

$$w = \left(B B^T\right)^{-1} B\, \delta^T \qquad (2.17)$$

where $w \in \mathbb{R}^{R+1}$ is the weight vector, $B \in \mathbb{R}^{(R+1) \times I}$, $i = 1, 2, \ldots, I$, is the basis
function matrix which contains the firing strengths computed from the input matrix
$X \in \mathbb{R}^{I \times D}$, $d = 1, 2, \ldots, D$, and $\delta \in \mathbb{R}^{1 \times I}$ is the desired output value vector.
The normal equation can be solved in a recursive manner if desired.
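The following sketch trains the weights via the formulation of Equation (2.17); for numerical robustness it solves the associated least squares problem rather than forming the matrix inverse explicitly. The helper names are our own:

    import numpy as np

    def gaussian_basis(X, centres, spreads):
        # X: (I, D) inputs; centres: (R, D); spreads: (R,).
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)  # (I, R)
        return np.exp(-d2 / (2.0 * spreads ** 2))

    def train_weights(X, delta, centres, spreads):
        I = X.shape[0]
        # Rows of B are basis functions plus a row of ones for the bias w0,
        # matching w = (B B^T)^{-1} B delta^T in Equation (2.17).
        B = np.vstack([np.ones((1, I)), gaussian_basis(X, centres, spreads).T])
        # lstsq on B^T is numerically safer than inverting (B B^T) directly.
        w, *_ = np.linalg.lstsq(B.T, delta, rcond=None)
        return w  # w[0] is the bias w0, w[1:] the neuron weights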
In the application of the RBFN neural network, the center and the spread for
various neurons would need to be determined before the network training process.
Normally, the basis functions should cover the entire input space and be distributed
uniformly, as shown in Figure 2.9. Very often, for convenience, it is assumed that the
centres are distributed uniformly over the input space. Under certain circumstances,
a non-uniformly distributed center scheme may be used instead, as shown in Figure
2.10.
Figure 2.9: An example of the distribution of Gaussian function centers in a uniformly distributed fashion.
This non-uniformly distributed centre scheme will be considered in
Chapter 3.
Figure 2.10: An example of a non-uniformly distributed scheme of Gaussian function centers.
2.7 Nonlinear Approximation Method proposed
by Schilling et al.
Schilling et al. [39] considered both a zeroth order and a first order radial basis
function neural network (RBFN) as an approximation of continuous signals using a
raised cosine function. In Section 3.3, we will consider explicitly the case of a zeroth
order RBFN. In a zeroth order RBFN [39]:

$$y_i = w^T B(x_i) \qquad (2.18)$$

where $y_i$ is a scalar output, and $B(x_i) \in \mathbb{R}^r$ is a vector denoting the outputs of $r$
radial basis function neurons. $x_i \in \mathbb{R}^D$, $i = 1, 2, \ldots, I$, is the normalized input. A
radial basis function neuron is an artificial neuron with a radial basis function as
the activation function [12]. There are a number of possible radial basis function
activation functions. A popular basis function is the Gaussian function shown in Equa-
tion (2.19). The one recommended in [39] is a raised cosine function as shown in
Equation (2.20). In this thesis, we will use the Gaussian function
$$B_r(x_i) = \exp\left(-\frac{\|x_i - c_r\|^2}{2\sigma_r^2}\right) \qquad (2.19)$$
where $c_r$ is the centre and $\sigma_r$ is the spread. The vector $w \in \mathbb{R}^r$ denotes the constant
weights connecting the outputs of the radial basis function neurons to the output.
The weights w can be obtained using a normal equation [33] as shown in Equation
(2.17).
$$B_r(x_i) = \begin{cases} \dfrac{1 + \cos\left(\pi \|x_i - c_r\|^2\right)}{2} & \|x_i - c_r\|^2 \le 1 \\[4pt] 0 & \|x_i - c_r\|^2 > 1 \end{cases} \qquad (2.20)$$
It is assumed that there are $G_d$ non-linear grid points, obtained as in Section 3.2,
along each dimension of the input $x_i \in \mathbb{R}^D$, $i = 1, 2, \ldots, I$. The nonlinear grid
points along the $d$-th dimension are denoted by $z_{d,1} \le z_{d,2} \le \ldots \le z_{d,G}$. The mapping
function $p_d : [z_{d,1}, z_{d,G}] \to [1, G]$ is shown in Equation (2.21). By construction, the
mapping $p_d$ maps the $g$-th grid point $z_{d,g}$ into the array index $g$. The pseudocode
implementing the algorithm is shown in Figure 2.11, in which the input $X$ is mapped to
$\bar{X}$. Here $x_{d,i}$ is an element of $X \in \mathbb{R}^{D \times I}$, $d = 1, 2, \ldots, D$, $i = 1, 2, \ldots, I$;
$\bar{x}_{d,i}$ is an element of $\bar{X} \in \mathbb{R}^{D \times I}$; and $z_{d,g}$ is an element of $Z \in \mathbb{R}^{D \times G}$,
$g = 1, 2, \ldots, G$.
$$p_d(x_{d,i}) = \begin{cases} 1 + \dfrac{x_{d,i} - z_{d,1}}{z_{d,2} - z_{d,1}} & x_{d,i} < z_{d,1} \\[4pt] g + \dfrac{x_{d,i} - z_{d,g}}{z_{d,g+1} - z_{d,g}} & z_{d,g} \le x_{d,i} < z_{d,g+1} \\[4pt] G + \dfrac{x_{d,i} - z_{d,G-1}}{z_{d,G} - z_{d,G-1}} & x_{d,i} \ge z_{d,G} \end{cases} \qquad (2.21)$$
X̄ = mapping(X, Z)
for each dimension d
    for each data point i
        if (x_{d,i} < z_{d,1})
            x̄_{d,i} = 1 + (x_{d,i} − z_{d,1}) / (z_{d,2} − z_{d,1})
        elseif (x_{d,i} ≥ z_{d,G})
            x̄_{d,i} = G + (x_{d,i} − z_{d,G−1}) / (z_{d,G} − z_{d,G−1})
        else
            x̄_{d,i} = g + (x_{d,i} − z_{d,g}) / (z_{d,g+1} − z_{d,g}),  where z_{d,g} ≤ x_{d,i} < z_{d,g+1}
        end
    end
end

Figure 2.11: The pseudo-code implementation of Schilling et al.'s mapping function [39].
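A compact implementation of this mapping for one input dimension is sketched below; for simplicity the upper boundary is handled by extending the interior branch linearly, which differs from the literal third branch of Equation (2.21) only in its indexing convention at the boundary:

    import numpy as np

    def n2l_map(x, z):
        # Map values x onto fractional (1-based) array indices of the sorted
        # grid z, following Figure 2.11 / Equation (2.21).
        x = np.asarray(x, dtype=float)
        G = len(z)
        # g is the interval containing each x, clipped so values below z_1
        # or above z_G are extrapolated from the nearest interval.
        g = np.clip(np.searchsorted(z, x, side="right") - 1, 0, G - 2)
        return (g + 1) + (x - z[g]) / (z[g + 1] - z[g])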
Schilling et al. [39] assume that there already exists a way to find the nonlinear
grid points. Hence they only provide a method for transforming the nonlinear grid
points to a set of linear grid points. After the transformation from the nonlinear
grid to the linear grid, the transformed data can be put into a linear grid supported
RBFN for training purposes. The basis function centres are given by the indices $g$,
where $g \in \{1, \ldots, G_d\}$. Figure 2.12 shows a block diagram implementing Schilling et
al.'s mapping method [39]. This method allows the transformation from a nonlinear
grid to a linear grid, whose output is then used in the training of a RBFN. We will call this method
the nonlinear to linear map (N2Lmap) method.
Thus we have three possibilities:
• Use a linear grid regime. The linear grid regime will provide the centers and
spreads of the radial basis functions used in a RBFN.
• Use a nonlinear grid regime. The nonlinear grid regime will provide the centers
and spreads of the radial basis functions used in a RBFN.
• Use a mapping from the nonlinear grid to a linear grid regime using Schilling et
al.'s method [39]. In this case, the transformed linear grid provides the centers
and spreads of the radial basis functions used in a RBFN. This method is called
the N2Lmap method.
It would be interesting to investigate the performance of the RBFN using these
three methods of obtaining the centers and spreads. This will be carried out in
Section 3.3.
Figure 2.12: Block diagram of Schilling et al.'s mapping method.
In summary, there are two alternatives after the set of non-linear grid points is
obtained.
(1) Put the non-linear grid points into a non-linear grid supported RBFN, and
(2) Put the non-linear grid points into Schilling et al.'s mapping function [39], which
maps the input data $x$ to $\bar{x}$, and then use a linear grid supported RBFN.
A flowchart of the main system incorporating Schilling et al.'s method [39] and
the proposed method indicated in this chapter is shown in Figure 2.13.

Figure 2.13: A flow chart showing the implementation of a non-linear grid in a Radial Basis Function Neural Network.
2.8 Kohonen Self-Organising Map (SOM)
Kohonen’s Self-Organising Map is proposed in [40] which is known as an unsuper-
vised learning method. It does not require a desired output value in its training
process. During the training process, similar patterns would gather together. If a
new pattern which is close to a pattern that is laready stored in the network then
it is classified as the stored class.
The algorithm can be expressed as follows:
• Step 1: Select a 2D map size; each node (at the intersections of the grid in the
2D map) contains a random feature vector. The feature vector has the same
length as the input vector and is initialized with values lying between 0 and
1.
• Step 2: Apply an input vector to each node to find a node which has the
smallest Euclidean distance; this node is the winner node.

$$\min_j \|X_i - W_j\| = \min_j \sqrt{\sum_{d=1}^{D} (x_{d,i} - w_{d,j})^2} \qquad (2.22)$$

where $d = 1, 2, \ldots, D$ indexes the input dimensions, $i = 1, 2, \ldots, I$ indexes the
input vectors, and $j = 1, 2, \ldots, J$ indexes the neurons in the map.
• Step 3: Update the feature vector of the winner node and those of the neurons in its
immediate neighbourhood.

$$w_{dj}^{new} = w_{dj}^{old} + \Delta w_{dj} \qquad (2.23)$$

$$\Delta w_{dj} = \begin{cases} \eta\,(x_d - w_{dj}) & j \in \Lambda_j \\ 0 & j \notin \Lambda_j \end{cases} \qquad (2.24)$$

where $\Lambda_j$ is a neighborhood function which usually takes the form of a Mexican
hat shape around the winner node $j$, as shown in Figure 2.14.
Figure 2.14: SOM Mexican hat update function
Steps 2 and 3 are repeated until all input vectors have been fed into the network. The
training process should run until the network has converged.
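A minimal sketch of the SOM training loop follows; for brevity a square neighbourhood of fixed radius is used in place of the Mexican hat function, and all parameter values and names are illustrative assumptions:

    import numpy as np

    def train_som(X, map_h=8, map_w=8, eta=0.1, epochs=20, radius=1):
        I, D = X.shape
        W = np.random.rand(map_h, map_w, D)      # Step 1: random features in [0, 1]
        rows, cols = np.indices((map_h, map_w))
        for _ in range(epochs):
            for x in X:
                # Step 2: winner node = smallest Euclidean distance (Eq. 2.22)
                d2 = ((W - x) ** 2).sum(axis=2)
                r, c = np.unravel_index(np.argmin(d2), d2.shape)
                # Step 3: update winner and its neighbourhood (Eqs. 2.23-2.24)
                hood = (np.abs(rows - r) <= radius) & (np.abs(cols - c) <= radius)
                W[hood] += eta * (x - W[hood])
        return W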
Chapter 3
Non-uniform Grid Construction in
a Radial Basis Function Neural
Network
3.1 Motivation
Basically, a neuro-fuzzy system employs a radial basis function network (RBFN)
for the premise and the Takagi-Sugeno-Kang (TSK) method for the consequent in
a fuzzy system. Improving the performance of the RBFN is thus equivalent to "tuning up"
the performance of the neuro-fuzzy system. In an adaptive neuro-fuzzy inference
system (ANFIS) the basis functions should span the entire input space. The normal
implicit assumption is that the input is distributed uniformly over the entire input
space. However, in general, the input data may not be distributed uniformly in the
input space. A sparse or flat area in the input space requires fewer neurons
to represent it adequately. On the contrary, a dense or "bumpy" area
requires more neurons. Experiments carried out in this thesis (in
Section 3.3) show that using a non-uniform grid distribution (this will be alterna-
tively referred to as a “non-linear grid”) outperforms one which uses a linear grid.
Such a method for obtaining the nonlinear grid will complement the method proposed
by Schilling et al. [39]. In [39], they proposed a method for transforming a nonlinear
grid to a linear grid. However, their method assumes that the nonlinear grid
is obtained by a domain expert; they did not provide any explicit method for the
determination of the non-uniform grid from the inputs. Our proposed method will
fill this gap. The following sections provide a detailed discussion of our proposed
method for generating a non-linear grid. The experimental results will be shown in
Section 3.3 which will demonstrate the potential of the proposed method.
3.2 Method of obtaining non-linear grid points
Given a signal, x ∈ RD, the problem is to find a set of nonlinear grid points which
will “adequately” approximate the signal. Here “adequately” means the error with
respect to an error criterion is small, or smaller than a prescribed threshold. The
elements of $x$ are denoted by $x_{d,i}$; $d = 1, 2, \ldots, D$; $i = 1, 2, \ldots, I_d$. In other
words, we allow the $D$-dimensional signal to have a different number of points in each
dimension. Generally $I_d = I$, i.e. each dimension will have the same number of
points.
The reasons why we wish to find a set of nonlinear grid points can be stated as
follows:
• As a preprocessing method for the method proposed by Schilling et al. [39].
• As a standalone method which will provide an approximate signal for a given
signal.
The method of obtaining a set of non-linear grid points involves two steps.
1. Find the turning points of the given signal, and
2. Find a set of non-linear grid points by non-uniform sampling of the given
signal.
The first step in the method involves finding the set of turning points in the
given signal. A turning point is defined as the point where the gradient of the
signal changes. There are a number of ways in which a change in the gradient of a
signal can be detected. For example, we can compute the approximate change in the
gradient of the signal and test $|(x_{d,i} - x_{d,i-1}) - (x_{d,i+1} - x_{d,i})| > \tau$, where $\tau$ is a given
threshold and xd,i is the ith value of the d-th dimension of the given signal. If the
input data is noisy then the value of τ should be higher. Otherwise the unwanted
noise will be included. On the other hand, if the signal is not noisy, then the value
of τ would be lower. In general, the value of τ is determined by a trial and error
method 1. The turning point set is the set of points at which the change in gradient
is larger than the threshold. The intermediate points, i.e. the points which are
not part of the set of turning points, will not be used in the second step. The turning
point sampling algorithm is shown in Figure 3.1; it maps $X \to \bar{X}$, where the
elements of $X$ are denoted by $x_{d,i}$, $d = 1, 2, \ldots, D$, $i = 1, 2, \ldots, I$, and the elements
of $\bar{X}$ are denoted by $\bar{x}_{d,j}$, $d = 1, 2, \ldots, D$, $j = 1, 2, \ldots, J_d$.
X̄ = turning_point_sampling(X)
for each input dimension d
    j = 1
    for each data point i
        if |(x_{d,i} − x_{d,i−1}) − (x_{d,i+1} − x_{d,i})| > τ
            x̄_{d,j} = x_{d,i}
            j = j + 1
        end
    end
end

Figure 3.1: A pseudocode representation of the proposed turning point detection algorithm.
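For one signal dimension, the turning point detection can be sketched as follows; the second difference computed by the code corresponds to the gradient-change test above:

    import numpy as np

    def turning_points(x, tau):
        x = np.asarray(x, dtype=float)
        # |(x[i] - x[i-1]) - (x[i+1] - x[i])| for the interior points
        change = np.abs(np.diff(x, 2))
        idx = np.where(change > tau)[0] + 1   # +1: second difference starts at i = 1
        return x[idx], idx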
Usually, the grid points are uniformly distributed in the $d$-th dimension as shown
in Figure 3.2, with $z_{d,1} = x_{\min_d}$ and $z_{d,G} = x_{\max_d}$, where $x_{\min_d}$ and $x_{\max_d}$ are respectively
the minimum and maximum values of the input signal in the $d$-th dimension. In
1Note that the issue of determination of the threshold value of τ is related to the prevention of
information loss from the signal reconstruction process as explained in this chapter. If a large value
of τ is chosen, then this may lead to fewer grid points being chosen, thus may lead to information
loss in the reconstruction of the signal. On the other hand, if a smaller value of τ is chosen, this
may lead to noise being allowed to pass through to the signal reconstruction. As the value of τ is
found by a trial and error method, in general the issue of information loss will depend on the
judgement of the user.
other words, $z_{d,g} - z_{d,g-1} = \Delta$, where $g = 2, 3, \ldots, G$, and $\Delta$ is a constant [1].

Figure 3.2: Uniform grid point distribution in the d-th dimension of a given signal.
However, intuitively, it can be surmised that there may be two practical situa-
tions:
(1) where the given signal varies rapidly over a certain region, it makes sense to
allocate more grid points over the rapidly varying region, and
(2) in a region where the signal varies slowly, fewer grid points are required.
In this case, it will be advantageous to use a non-constant value of Δ.
In the second step, Equation (3.1) is used to determine the non-uniform grid
points. Let the values of the nominal grid in the $d$-th dimension be denoted by the vector $z_d$,
a row of the matrix $Z \in \mathbb{R}^{D \times G}$ which stores the values of the nominal grid; it is
assumed that there are $G_d$ grid points in the $d$-th dimension, with $G = \max_d G_d$.
Consider an element of zd, denoted by zd,g, d = 1, 2, . . . , D and g = 1, 2, . . . , Gd.
Initially, the points zd, d = 1, 2, . . . , D are assumed to be uniformly distributed on
the zd axis. We wish to have an algorithm which will adjust the grid points such
that the original signal is approximated. There are a number of possibilities. For
example, one may use a simple algorithm which approximates the gradient of the
underlying signal, as represented by the set of turning points. However, such an
algorithm may not be optimal, as when the gradients change rapidly, or when the
value of τ is set too small, it may result in too many grid points. In this thesis we
propose an updating algorithm which is inspired by the updating rule of the
self-organising map algorithm [40]. The updating equation for $z_{d,g}$ is given in Equation
(3.1) and a pseudo-code representation of the proposed updating algorithm is shown
in Figure 3.3.
$$z_{d,g}^{new} = z_{d,g}^{old} - \eta\, \xi_{d,g}(\bar{x}_{d,j};\, a, b, c)\, e_{d,g} \qquad (3.1)$$

where $\bar{x}_{d,j}$ is an element of $\bar{X} \in \mathbb{R}^{D \times J_d}$, $d = 1, 2, \ldots, D$, $j = 1, 2, \ldots, J_d$, the
set of turning points obtained in Step 1; $\eta$ is a learning constant, usually selected
such that $0 \le \eta \le 1$; and $\xi(x; a, b, c)$ is a triangular function given as follows [12]:
$$\xi(x; a, b, c) = \max\left(\min\left(\frac{x - a}{b - a},\; \frac{c - x}{c - b}\right),\; 0\right) \qquad (3.2)$$
The triangular function shown in Figure 3.4 has a height of 1 at the point $b$.
The base of the triangle is located at the points $a$ and $c$ respectively. By allowing $b$
to be different from $\frac{1}{2}(a + c)$, we allow for a non-symmetrical triangular function.
Z = non_uniform(X̄, Z, η, a, b, c)
for each iteration t
    for each input dimension d
        for each data point j
            for each grid point g
                z_{d,g} = z_{d,g} − η ξ_{d,g}(x̄_{d,j}; a, b, c) e_{d,g}
            end
        end
    end
end

Figure 3.3: Updating algorithm for finding the set of non-linear grid points.
Figure 3.4: A diagram illustrating a triangular function on uniformly distributed grid points.
The constants $a$, $b$, $c$ are chosen a priori. The error is given by $e_{d,g} = \bar{x}_{d,j} - z_{d,g}$,
where $\bar{x}_{d,j}$ is a turning point found in Step 1 of the proposed algorithm for
the $d$-th dimension of the input, and $g = 1, 2, \ldots, G_d$. When the sum of the errors $e_{d,g}$
computed over all grid points $g = 1, 2, \ldots, G_d$ is smaller than a prescribed
threshold, or the number of iterations has reached a preset constant, the algorithm
stops. The converged values $z_{d,g}$, $d = 1, 2, \ldots, D$, $g = 1, 2, \ldots, G_d$, form the
set of nonlinear grid points in the $d$-th dimension. This set of nonlinear grid points will
approximate the original signal.
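A minimal sketch of the updating algorithm for one input dimension is given below; the symmetric triangular neighbourhood centred on each grid point and its half-width are our own assumptions, and the sign of the update is taken so that grid points move towards nearby turning points, as described above:

    import numpy as np

    def triangular(x, a, b, c):
        # Triangular neighbourhood function of Equation (3.2).
        return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

    def non_uniform_grid(x_turn, z, eta=0.05, iters=200, half_width=0.1):
        z = np.array(z, dtype=float)
        for _ in range(iters):            # fixed iteration count as the stopping rule
            for x in x_turn:              # turning points from Step 1
                for g in range(len(z)):
                    # neighbourhood (a, b, c) centred on grid point g (assumed)
                    a, b, c = z[g] - half_width, z[g], z[g] + half_width
                    # update of Equation (3.1), signed so that z[g] is
                    # attracted towards the turning point x
                    z[g] += eta * triangular(x, a, b, c) * (x - z[g])
        return np.sort(z)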
Now since the set of nonlinear grid points found using the proposed algorithm
is a set of discrete points, to approximate the original signal, it will be necessary to
use either interpolation algorithms, or a set of radial basis functions to interpolate
the signal over these nonlinear grid points. In this thesis we will only consider
the radial basis function method.
There are two parameters associated with each radial basis function, viz. the
center and the spread. Once these two parameters are determined the shape of
the radial basis function is determined. There are many ways in which these two
parameters associated with the radial basis function can be determined. For exam-
ple, [64] provided an offline method which uses a clustering algorithm to cluster the
given data into clusters, and from such clusters, the centers and spreads of the set
of radial basis functions can be determined accordingly. In [1] a method is proposed
for determining the centers and spreads of a set of radial basis functions if a grid is
provided. In this case, Tsoi and Tan [1] suggested that one simple way to find the
centers and spreads is to assume that the centers are located at the intersection of
the grid points, and the spread is determined by the interval between the grid points.
This is a very simple method if a grid on the input space is provided. This scheme of
determining the centre and spread of the radial basis function, given a grid, will be
the method which we will use in this thesis. Hence once a grid is obtained then the
centres and spreads of the radial basis functions can be determined. The question
is: how do we find the grid in the first instance.
The centre and spread of the radial basis function shown in Figure 3.5 will be
denoted respectively by $z_{d,g}$ and $z_{d,g+1} - z_{d,g}$, $g = 1, 2, \ldots, G_d$, $d = 1, 2, \ldots, D$.
Figure 3.5: A diagram illustrating the determination of the centers and spreads of a non-uniformly distributed set of grid points.
This algorithm can be conceptualised as follows: if the error is small, and the
turning point lies within the range $[a, c]$, then the grid point $z_{d,g}$ is updated. The
magnitude of the update depends on the relative position of the turning point $\bar{x}_{d,j}$
with respect to the apex $b$ of the triangular function with parameters $a$, $b$, $c$:
whether $\bar{x}_{d,j}$ lies to the left or to the right of $b$, the grid point $z_{d,g}$ moves a
little closer to it, by an amount determined by the constant $\eta$. If the turning point
lies outside the region of interest $[a, c]$, then there is no update to the value of $z_{d,g}$.
Here in this exposition, we have chosen a triangular function to determine the
region of interest in the updating algorithm. It is obviously possible to use other
functions, e.g. a Gaussian function. We found from a number of preliminary
experiments that the Gaussian function performs worse than the triangular function.
Figure 3.7 shows the magnitude of the update using a Gaussian function and a
triangular function respectively. Because the magnitude of the update is the product of the
function $\xi_{d,g}$ and the error $e_{d,g}$, and noting that the Gaussian function is non-zero
at the boundary points $a$ and $c$, the Gaussian function produces a "leakage"
effect outside the region bracketed by the points $a$ and $c$. Hence, we will only
consider the deployment of $\xi$ as a triangular function, as shown in Figure 3.6.
This algorithm is in the spirit of the self organizing map (SOM) technique pro-
posed by Kohonen [40] in the sense that it updates the weights associated with the
winning node and the neighborhood nodes respectively. The SOM update equation
is shown in Equation (3.3). Here the SOM uses $\Omega_c$, a neighborhood function, to control
the update around the winning node. On the other hand, our proposed algorithm
uses a triangular function as a neighborhood function. Because the triangular neigh-
borhood function has zero values if the input is outside the region bracketed by the
points a and c, this algorithm is not required to find the winning node. Secondly,
in the SOM updating algorithm [40], the size of the neighborhood function shrinks
as the algorithm progresses. On the other hand, in the proposed algorithm, we use
a neighborhood function with a constant cover. The end result of our proposed
Figure 3.6: The magnitude of the update using triangular functions in the grid point location updating algorithm.
Figure 3.7: The magnitude of the update using Gaussian functions in the grid point location updating algorithm.
algorithm is that the grid points will move towards the turning points, thus creating
a set of non-linear grid points.
$$z_k^{new} = z_k^{old} - \eta\, \Omega_c(x_i)\, e_k \qquad (3.3)$$

where $\eta$ is a learning constant, $\Omega_c$ is a neighborhood function and $c$ denotes the winning
node, $k = 1, 2, \ldots, K$.
Once the set of non-linear grid points zd,g is obtained, we can then use the classic
radial basis function neural network architecture with the non-linear grid support as described
in Section 2.6, or use it as a preprocessing step to obtain the non-linear grid input for
Schilling et al.'s method as shown in Section 2.7, to construct the approximate
signal.
The behaviour of our proposed algorithm is governed by:
• The number of grid points. We can control the number of grid points used in
the algorithm. The number of grid points provides an indirect control on the
goodness of fit on the original signal. If we use relatively few grid points, then
it is found that only the coarse nature of the signal is approximated. On the
other hand, if we use a relatively large number of grid points, then some finer
features of the signal will be approximated. In this respect, it would be difficult
to characterize the exact nature of the approximation. We cannot, as with
wavelet functions, state the approximation capabilities in terms of the coarse
and fine features of the signal. It is also not possible to give a statement on
how well the method approximates the given signal as a function of the number
of grid points used. However, we have employed this method on a number of
practical signals, and it is found that the method works well. It is also found
that this method is able to filter out some of the noise content in the signal
by controlling the number of grid points.
• The parameters a, b, c. These parameters interact with the number of grid
points. The parameters a and c control the effective area, the parameter b
is the center. When the number of grid points along the input dimension is
large it would be advisable to increase the effective area of the neighborhood
function. On the other hand, if the number of grid points is small along the
input dimension it would be advisable to decrease the effective area covered
by the neighborhood function. Figures 3.8 and 3.9 respectively show
the magnitudes of the updates when using one or two grid points in the effective
area on either side of the center b. It is observed that with more grid
points on either side of the center, the triangular function works well.
• The learning constant η. It is a small constant or a monotonically decreasing
function. It is found that if η is relatively large, then the algorithm exhibits
an oscillatory behaviour. On the other hand if η is relatively small, it will take
considerable time for the algorithm to converge.
• The stopping criterion. There are a number of possible stopping criteria: an
accumulated error criterion, or a prescribed maximum number of iterations. In
this thesis, we opt to use a fixed maximum number of iterations.
Figure 3.8: The magnitude of the update using one grid point on either side of the function center.
Figure 3.9: The magnitude of the update using two grid points on either side of the function center b.
3.3 Application Examples
In this section, we will demonstrate the effectiveness of the proposed method using
a number of examples: one based on the van der Pol equation [71], one based on real
data: the currency exchange rate between US Dollar and Indonesian Rupiah during
the tumultuous days of Asian economic meltdown between 1997 and 1998, one based
on the famous Sunspot cycle time series, and one multidimensional example based
on the iris dataset. These examples are chosen because they represent different types
of problems, some artificially generated, e.g. the van der Pol equation, and some
practical problems, e.g. sunspot cycle time series, the currency exchange problem.
In addition, the iris dataset provides a good example of a multi-dimensional dataset.
These datasets would help us to evaluate the effectiveness of the proposed algorithm.
3.3.1 Van der Pol Oscillator
The Van der Pol Oscillator is a nonlinear oscillatory model exhibiting a limit cycle
behaviour. In this section, we will demonstrate the application of our proposed
method on the system by varying the number of grid points. The response of the van
der Pol equation is simulated using a random initial value. This response is stored
in a file which represents the input to this investigation. The implementation of the
algorithms is in Matlab. This will allow rapid evaluation of various concepts without
being bogged down in details of code development. We performed experiments which
use different values of the total number of grid points using our proposed method.
We plotted the root mean square (RMS) errors as a function of the number of grid
points as shown in Figure 3.10. The upper graph corresponds to the performance of
the linear grid regime, the middle one corresponds to that of a nonlinear grid regime
and the lowermost one corresponds to that of the N2Lmap method. It is observed
that from a RMS error’s point of view, the RBFN using the nonlinear grid and the
one using the N2Lmap both perform better than the RBFN using a linear grid.
Figure 3.10: Performance comparisons of RBFN using three different regimes: linear grid, nonlinear grid, and the transformation of the nonlinear grid to the linear grid (N2Lmap) method.
This result is interesting in that it shows that the nonlinear grid regime outperforms
a linear grid regime. In addition, it also shows that if we use a small number
of grid points, then the differences in the RMS error values are accentuated. On the other hand, if a
sufficiently large number of grid points is used, then there is hardly any difference
between the three methods. This result is hardly surprising. When a large number
of grid points is used, there will be a sufficient number of grid points in the linear grid
regime to cover the rapidly varying portion of the signal. Hence, the advantage
of the nonlinear grid regime is lost. On the other hand, if only a small number
of grid points is used, then in the linear grid regime, it can be surmised that the
rapidly varying portion of the signal does not have a sufficient number of grid points
to represent it adequately, and hence the RMS error values would be worse
than those found using the nonlinear grid counterpart. It is observed from Figure 3.10 that if
we use 40 grid points then the performances of the linear grid and nonlinear grid are
comparable. On the other hand, if we use only 15 grid points, then there is a large
difference in performance between the linear grid and non-linear grid methods.
Figure 3.11 shows the set of turning points obtained for the van der Pol equation
and Figure 3.12 shows the non-linear grid point distribution using 15 neurons while
in Figure 3.13 the actual output of the linear grid method using 15 neurons is shown.
Figure 3.15 shows the actual output of the nonlinear grid method using 15 neurons
and Figure 3.17 shows the actual output of the N2Lmap method using 15 neurons.
Figure 3.14 shows the differences in outputs of the original signal and the recon-
structed signal using 15 linear grid points.
Figure 3.16 shows the differences between the original signal and the reconstruction using a
nonlinear grid with 15 points, and Figure 3.18 shows the differences in output of
the N2Lmap method using 15 grid points.
It is observed from Figures 3.14, 3.16 and 3.18 that the linear grid method
Figure 3.11: The set of turning points superimposed on the original signal for the van der Pol equation.
Figure 3.12: The distribution of the set of grid points. The upper graph shows the distribution using a linear grid while the lower graph shows the location of the grid points using a nonlinear grid regime. The total number of grid points used is 15.
Figure 3.13: The actual output of a RBFN using 15 grid points in a linear grid regime. It is observed that the output is significantly different from that of the original output of the van der Pol equation.
Figure 3.14: The differences in the output of the van der Pol equation and the reconstructed one using a RBFN with 15 grid points in a linear grid regime.
Figure 3.15: The output of a RBFN using 15 grid points and a nonlinear grid regime.
Figure 3.16: The output differences between the original signal from the van der Pol equation and a reconstructed signal using a RBFN with 15 grid points and using a nonlinear grid regime.
Figure 3.17: The output of a RBFN with 15 grid points using a nonlinear grid regime. In this case, we use the nonlinear grid mapped onto a linear grid using the method proposed by Schilling et al. [39].
Figure 3.18: The output differences between the original signal and the reconstructed signal using a nonlinear to linear grid mapping as proposed in Schilling et al. [39]. The number of grid points used is 15.
produces the highest discrepancies between the original signal and the reconstructed
signal which means the nonlinear grid methods outperform the linear grid method.
This set of figures using 15 grid points confirms the information provided in
Figure 3.10 that the grid may be too coarse to represent the van der Pol equation
output adequately. In particular it is observed that where the signal is rapidly
changing (near the peak), the error is most pronounced. On the other hand, where
the signal is relatively flat, the error is not as pronounced. It is further observed
that there are some differences between the two nonlinear grid methods: one using
the nonlinear grid and the other one using a mapping from a nonlinear grid to a
linear grid. This result is somewhat surprising, as one would have assumed that
the nonlinear mapping as determined in Equation (2.21) is lossless. However, on
further examination of the equations, it is observed that the method is not lossless.
In other words, there is an information loss mapping from a nonlinear grid to a
linear grid. Hence the results are not surprising. What is more surprising is that
the mapping from the nonlinear grid to the linear grid appears to perform better
than the nonlinear grid method. This difference may be surmised to be caused by
the fact that the nonlinear grid is tuned to the signal, while the mapping from a
nonlinear grid to a linear grid is not tuned to the signal, and hence would have a better
generalisation capability.
To further confirm our intuition, we will repeat the same set of experiments
except this time we will use 40 grid points. The number 40 is chosen because from
Figure 3.10, it is observed that the RMS values using either the linear grid regime,
or the nonlinear grid regime are sufficiently close to one another. This implies that
we do not anticipate finding much difference in the errors when we use either
the linear grid or the nonlinear grid method.
Figure 3.19 shows the nonlinear grid distribution using 40 grid points.
Figure 3.19: The distribution of the grid points. The upper graph shows the linear grid point distribution, while the lower graph shows the distribution using a nonlinear grid. The number of grid points used is 40.
Figure 3.20 shows the actual output of the linear grid method using 40 neurons,
Figure 3.22 shows the actual output of the nonlinear grid method using 40 neurons, and
Figure 3.24 shows the actual output of the N2Lmap method using 40 neurons.
Figure 3.21 shows the differences in output from the original signal using 40 linear grid
neurons, Figure 3.23 shows the differences in output using
40 nonlinear grid neurons, and Figure 3.25 shows the differences in output of the N2Lmap method
Figure 3.20: The output of a RBFN with 40 grid points using a linear grid regime.
Figure 3.21: The output differences between the original signal and the reconstruction using a RBFN with a linear grid regime and 40 grid points.
Figure 3.22: The output of a RBFN using a nonlinear grid regime with 40 grid points.
Figure 3.23: The output differences between the original signal and the reconstruction using a RBFN with a nonlinear grid regime using 40 grid points.
using 40 neurons.
Figure 3.24: The output of a RBFN using a nonlinear grid mapped onto a linear grid with 40 grid points.
Again, the linear grid method has the largest error magnitudes, as shown in Figures
3.21, 3.23 and 3.25, which implies that the nonlinear grid methods outperform the
linear grid method.
Figure 3.25: The output differences between the original signal and a reconstructed signal using a RBFN with a nonlinear grid mapped onto a linear grid using 40 grid points.
                             Linear RBFN   Nonlinear RBFN   N2Lmap
  RMS error, 15 grid points    0.0658          0.0480       0.0430
  RMS error, 40 grid points    0.0073          0.0048       0.0040

Table 3.1: Comparison of the RMS error results for the van der Pol equation example.
It is noted that from Figures 3.13, 3.15 and 3.17, the approximation using 15 grid
points is not as good as one using a higher number of grid points, e.g., using 40 grid
points. Using more grid points produces better approximation of the original signal.
It is nevertheless observed that the nonlinear grid method outperforms the linear grid
method (please see Figure 3.10 and Table 3.1) when between 10 and 50 grid points are used.
Once the distribution of grid points is sufficiently dense to cover the input space,
then there is no significant difference between the performance of the linear grid and
the nonlinear grid.
3.3.2 Currency exchange rate between the US Dollar and
the Indonesian Rupiah
We use the data on the currency exchange rate between the US Dollar (USD) and the Indonesian Rupiah (IDR) between 01/01/1994 and 31/12/1999. The minimum was 1 USD to 2160 IDR, while the maximum was 1 USD to 16,475 IDR. The input time series is first normalized to lie between 0 (0 denotes 1 USD to 2160 IDR) and 1 (1 denotes 1 USD to 16,475 IDR). There are a total of 2191 data points in this time series. Weekend and holiday values are not included in the total number of data points, as the recorded value on these days is zero. The currency exchange time series is shown in Figure 3.26.
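The normalisation described above is an ordinary min-max scaling. A minimal sketch, assuming the raw daily rates (with weekends and holidays already removed) sit in a NumPy array named rate, a name introduced here purely for illustration:

    import numpy as np

    # Illustrative values only; the real series has 2191 daily observations.
    rate = np.array([2160.0, 2405.0, 2398.0, 9875.0, 16475.0])

    # Min-max scaling: 0 maps to the minimum (2160 IDR per USD),
    # 1 maps to the maximum (16475 IDR per USD).
    normalised = (rate - rate.min()) / (rate.max() - rate.min())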
Figure 3.26: The currency exchange time series between the US Dollar and the Indonesian Rupiah, between 1st January, 1994, and 31st December, 1999. Note that the vertical axis of this graph is normalised, with 0 denoting 1 USD to 2160 IDR, while the maximum 1 denotes 1 USD to 16,475 IDR.
We wish to apply the proposed method discussed in this chapter to study this
time series. It is observed that this time series is quite challenging to approximate.
It has a major peak around day 1700. In addition, the time series appears quite
noisy. As indicated previously, the performance of the proposed method depends on
how many grid points are used. As a result, we first compute the RMS error as a
function of the number of grid points used. The variation of the RMS error values
as a function of the number of grid points is shown in Figure 3.27. There are three
graphs in Figure 3.27: the uppermost one represents the performance of a RBFN
using a linear grid, the middle graph shows the performance of a RBFN using a
nonlinear grid but mapped to a linear grid (N2Lmap), while the lowermost graph
shows the performance of a RBFN using a nonlinear grid regime.
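Curves such as these are obtained by sweeping the number of grid points. The sketch below assumes a hypothetical training routine fit_rbfn(x, y, n_grid, regime) returning the reconstructed signal; the actual training procedure is described in the text rather than in code, so only the shape of the experiment is shown:

    import numpy as np

    def rms(a, b):
        return np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

    def sweep(x, y, fit_rbfn, grid_counts=range(20, 101, 10)):
        # Record the RMS error for each grid size and each grid regime.
        results = {"linear": [], "nonlinear": [], "n2lmap": []}
        for n in grid_counts:
            for regime in results:
                y_hat = fit_rbfn(x, y, n_grid=n, regime=regime)  # hypothetical trainer
                results[regime].append(rms(y, y_hat))
        return results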
Figure 3.27: The variation of the root mean square error values as a function of the number of grid points used.
It is observed that with 100 grid points all methods will yield good results. In
order to show the effect of the number of grid points, we will give two sets of results,
one with 25 grid points only, and the other one with 100 grid points. These two
values are chosen with the assistance of Figure 3.27. It is observed from Figure 3.27
that at 25 grid points, the performances of the three methods appear to be quite
well separated. On the other hand, at 100 grid points the performances of all three
methods appear to be approximately the same. The RMS error values for the case
of 25 grid points and 100 grid points respectively are shown in Table 3.2. Note that these results are also presented graphically in Figure 3.27; Table 3.2 gives them in numerical form.
RMS error using 25 grid points:
    Linear RBFN     Nonlinear RBFN     N2Lmap
    0.0483          0.0264             0.0380

RMS error using 100 grid points:
    Linear RBFN     Nonlinear RBFN     N2Lmap
    0.0212          0.0196             0.0186

Table 3.2: The comparison of the root mean square error values between using 25 grid points and 100 grid points for the currency exchange time series.
It is observed that with 25 grid points, the RBFN using a nonlinear grid performs best, giving the lowest RMS values, while the RBFN using a nonlinear grid mapped to a linear grid performs slightly worse. This may be due to the effect of noise: as the time series is quite noisy, mapping the nonlinear grid to a linear grid may reduce its resistance to the underlying noise. Note that this conclusion is different from that of the van der Pol equation. In the van der Pol equation, there is no noise; hence we surmise that the nonlinear grid mapped to a linear grid performs better there, as the mapping allows a better generalisation result. Here, as the underlying data is noisy, the mapping from a nonlinear grid to a linear grid does not perform as well as a RBFN using a nonlinear grid.
Figure 3.28 shows the set of turning points. Figure 3.29 shows the grid point distributions using 25 grid points; the upper graph shows the distribution of the grid points under a linear grid regime, while the lower graph shows the grid point distribution under a nonlinear grid regime.
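One simple way to locate turning points of the kind shown in Figure 3.28 is to look for sign changes in the first difference of the series; this is a sketch of the general idea, not necessarily the exact algorithm proposed earlier in this chapter:

    import numpy as np

    def turning_points(y):
        # Indices where the first difference changes sign, i.e. local maxima/minima.
        y = np.asarray(y, dtype=float)
        d = np.diff(y)
        return np.where(d[:-1] * d[1:] < 0)[0] + 1

    # Example: a single smooth bump has one turning point, at its peak.
    t = np.linspace(0.0, 1.0, 201)
    bump = np.exp(-((t - 0.5) ** 2) / 0.01)
    print(turning_points(bump))   # -> [100]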
Figure 3.28: The set of turning points in the time series of USD to IDR. Note that we have connected the points so as to make it easier to discern where the turning points are.
Figure 3.30 shows the output of a RBFN using 25 grid points with a linear grid regime, Figure 3.32 shows the output of a RBFN using 25 grid points but with a nonlinear grid regime, and Figure 3.34 shows the output of a RBFN using 25 grid points with the mapping of the nonlinear grid to a linear grid regime.

Figure 3.29: The distribution of the grid points. The upper graph shows the distribution of the linear grid points, while the lower graph shows the distribution of the nonlinear grid points.
Figure 3.30: The output of a RBFN using 25 grid points with a linear grid regime.
Figure 3.31 shows the differences in the outputs of the original signal and the reconstruction using 25 grid points with a linear grid regime, Figure 3.33 shows the differences of the outputs of the original signal and the reconstruction using a RBFN with a nonlinear grid regime, and Figure 3.35 shows the differences of the outputs of the original signal and the reconstruction using the nonlinear grid mapped to a linear grid regime. It is observed that even though the differences between the outputs of the original signal and the reconstruction using a RBFN with a mapping from the nonlinear grid to the linear grid are smaller than the counterpart using a RBFN with a nonlinear grid, nevertheless the overall cumulative root mean square error is larger than the one using a nonlinear grid.
Figure 3.31: The output differences between the original signal and the reconstruction using a RBFN with 25 grid points with a linear grid regime.
Figure 3.32: The output of a RBFN using 25 grid points with a nonlinear grid regime.
Figure 3.33: The output differences between the original signal and the reconstruction using a RBFN with 25 grid points in a nonlinear grid regime.
Figure 3.34: The output of a RBFN using 25 grid points but with a mapping from the nonlinear grid to a linear grid.
Figure 3.35: The output differences between the original signal and the reconstruction using 25 grid points with a mapping from the nonlinear grid to a linear grid.
This supports our argument that the difficulty in the use of the mapping from a nonlinear grid to a linear grid regime lies in the noise in the time series. However, it is observed from both Figures 3.32 and 3.34 that the reconstructions appear to have captured the essence underlying the original signal, including a minor narrow peak at day 1485. In contrast, the one using a linear grid cannot capture this narrow peak.
We will repeat the set of experiments, this time using a larger number of grid points, viz. 100. It is observed that with this number of grid points there is very little difference between the three methods, as they all yield similar RMS values. The main reason is that there is a sufficient number of grid points for the reconstruction of the underlying system; hence there is not much difference in their performance. Figure 3.36 shows the grid point distributions using 100 grid points. The upper graph shows the grid point distribution of a linear grid, while the lower graph shows the distribution of the nonlinear grid points.
Figure 3.36: The distribution of the grid points. The upper graph shows the distribution of the linear grid points, while the lower graph shows the distribution of the nonlinear grid points. The number of grid points used is 100.
Figure 3.37 shows the output using a RBFN with 100 grid points with a linear grid
regime, Figure 3.39 shows the output of a RBFN using a nonlinear grid regime, and
Figure 3.41 shows the output reconstruction using a RBFN with a mapping from
the nonlinear grid to a linear grid regime.
Figure 3.38 shows the output differences between the original signal and the one
reconstructed using a RBFN with a linear grid regime using 100 grid points, Figure
3.40 shows the output differences of the original signal and the one reconstructed
Figure 3.37: The output of a RBFN using 100 grid points with a linear grid regime.
Figure 3.38: The output differences between the original signal and the reconstructed one from a RBFN using 100 grid points with a linear grid regime.
Figure 3.39: The output of a RBFN using a nonlinear grid regime with 100 grid points.
Figure 3.40: The output differences between the original signal and the one reconstructed from a RBFN with a nonlinear grid regime using 100 grid points.
from a RBFN using a nonlinear grid regime, and Figure 3.42 shows the output differences of the original signal and the reconstruction using a RBFN with a mapping from the nonlinear grid to a linear grid regime.
Figure 3.41: The output of a RBFN using a mapping from the nonlinear grid to a linear grid regime using 100 grid points.
It is observed, as shown in Figures 3.30, 3.32 and 3.34, that the details exhibited in the original signal are not well approximated using only 25 grid points. However, if we use a larger number of grid points, say 100, the peaks are approximated better, as shown in Figures 3.37, 3.39 and 3.41. This implies that the number of grid points provides an indirect control on the extent to which the fine features of the original signal can be approximated. It is observed that with a finer grid, the linear grid is able to approximate the narrow peak occurring at day 1485, while using 25 grid points it was not able to capture this fine feature. With reference to Figures 3.29 and 3.36, it is observed that more grid points are allocated
Figure 3.42: The output differences between the original signal and the reconstructed one from a RBFN with a mapping from the nonlinear grid to a linear grid regime using 100 grid points.
to the “bumpy” areas, which allows the nonlinear grid methods to perform better than the linear grid method in all cases (please see Figure 3.27), especially when fewer grid points are used.
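The thesis's nonlinear grid construction is given earlier in this chapter; purely to illustrate the effect of allocating grid points to “bumpy” regions, here is one plausible quantile-based sketch (all names and the quantile scheme are our own assumptions, not the proposed algorithm):

    import numpy as np

    def nonlinear_grid(anchors, n_grid):
        # Grid points at equally spaced quantiles of the anchors: regions dense
        # in anchors (e.g. turning points) receive more grid points.
        anchors = np.sort(np.asarray(anchors, dtype=float))
        return np.quantile(anchors, np.linspace(0.0, 1.0, n_grid))

    # Example: anchors clustered near 0.2 pull most of the grid there.
    rng = np.random.default_rng(0)
    anchors = np.concatenate([rng.normal(0.2, 0.02, 80), rng.normal(0.8, 0.1, 20)])
    print(nonlinear_grid(anchors, 10))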
3.3.3 Sunspot Cycle Time Series
The sunspot cycle time series is compiled by the US National Oceanic and Atmospheric Administration (NOAA). The sunspot number has been collected daily since January 1749 at the Zurich Observatory [54]. This time series appears to exhibit a cyclic behaviour, with a cycle of approximately 11 years. Note that there has been some discussion as to whether the sunspot cycle time series “really” has a cycle of 11 years, as it appears that there may be other cycles within the time series. We will not consider this aspect in this thesis. We take the average sunspot number for each month, and the data set consists of data from January 1749 to July 2004. The sunspot data is shown in Figure 3.43.
The sunspot values have been normalised to lie in the range from 0 to 1. It is
observed that there is a considerable amount of noise in the time series, especially
around the peak values. This makes it challenging to use the proposed methods to
approximate the time series. Figure 3.44 shows the set of turning points as obtained
using the proposed algorithm.
It is noted that there are many turning points, as the time series is rather “peaky”. What this implies is that a larger number of grid points will be required in order to represent the time series more “faithfully”. This observation is backed up by the behaviour of the RMS error values as a function of the number of grid points, shown in Figure 3.45.
Figure 3.43: The monthly average sunspot number time series from January 1749 to July 2004. The x-axis is normalised to lie between 0 and 1. Similarly, the y-axis is also normalised to lie between 0 and 1.
Figure 3.44: The set of turning points for the NOAA sunspot number time series.
Figure 3.45: The variation of the RMS error values as a function of the number of grid points.
It is noted that there is a considerable difference in the RMS error values among the three methods when using 50 grid points. However, there are no significant differences among the three methods beyond 100 grid points. It is noted that the mapping of the nonlinear grid to the linear grid regime appears to perform worse than the nonlinear grid regime. This is in line with our intuition as observed in the currency exchange time series.
Figure 3.46: Grid point distribution using 50 grid points. The upper graph shows the distribution of the linear grid points, while the lower graph shows the distribution of the nonlinear grid points.
Here the difference in the performance of the mapping from the nonlinear grid to the linear grid regime, when compared with the nonlinear grid regime, may be attributed to the noise content of the time series. It may be surmised that in mapping the nonlinear grid to a linear grid, the generalisation capability of the method might have been degraded. We will choose two different values for the number of grid points, viz., 50 and 100, to carry out our investigations. 50 grid points was chosen because, from Figure 3.45, it appears that all three methods show a significant difference in their behaviours, while at 100 grid points all three methods show little difference in their RMS error values.
Figure 3.46 shows the grid point distribution using 50 grid points.
Figures 3.47, 3.48 and 3.49 respectively show the output, and the differences between the original signal and the outputs reconstructed, using a RBFN with the linear grid regime, the nonlinear grid regime, and the mapping of the nonlinear grid to the linear grid regime.
Figure 3.47: The output and differences of outputs between the original signal and the reconstructed one using a RBFN with 50 linear grid points.
It is observed that the linear grid regime produces relatively larger errors. In addition, it is observed that while the mapping from a nonlinear grid to a linear grid regime produces in general errors of smaller magnitude than the nonlinear grid regime, overall the mapping method suffers from contamination by the noise content in the signal. This may be the reason why the overall RMS error value is larger than that of the nonlinear grid.
We will repeat the set of experiments with 100 grid points. Figure 3.50 shows
the grid point distribution using 100 grid points.
Figures 3.51, 3.52 and 3.53 respectively show the actual output and the differences between the original signal and the reconstructed output using a RBFN with a linear grid regime, a nonlinear grid regime, and a mapping of the nonlinear grid to a linear grid regime.
Figure 3.48: The output and differences of the original signal and the reconstructed one using a RBFN with a nonlinear grid regime using 50 grid points.
Figure 3.49: The output and differences in the original signal and the reconstructed one using a RBFN with a mapping from the nonlinear grid to a linear grid regime with 50 grid points.
Figure 3.50: Grid point distribution using 100 grid points for the sunspot cycle time series. The upper graph shows the linear grid distribution, while the lower graph shows the distribution of grid points using a nonlinear grid regime.
Table 3.3 shows the performance of the RBFN using 50 and 100 grid points
respectively.
Figure 3.51: The actual output and the differences in the original signal and the output reconstructed using a RBFN with 100 linear grid points.
Figure 3.52: The actual output and the differences in the original signal and the reconstructed output using a RBFN with 100 grid points with a nonlinear grid regime.
Figure 3.53: The actual output and the differences in the original signal and the reconstructed output using a RBFN with a mapping of the nonlinear grid to a linear grid regime with 100 grid points.
RMS error using 50 grid points:
    Linear RBFN     Nonlinear RBFN     N2Lmap
    0.1194          0.0866             0.1059

RMS error using 100 grid points:
    Linear RBFN     Nonlinear RBFN     N2Lmap
    0.0628          0.0630             0.0626

Table 3.3: Output results comparisons for the sunspot cycle time series.
It is observed that the nonlinear grid methods outperform the linear grid method
as one would have expected. The results of this experiment confirm our conclusion
from the currency exchange time series experiments.
So far, our experiments have been conducted on one-dimensional signals. In the
next subsection, we will consider an example with higher dimensions.
3.3.4 Experiments with the Iris Dataset
The Iris dataset consists of three species of iris flower. These are: Iris-Virginica, Iris-
Versicolor and Iris-Setosa. Each species measures sepal length, sepal width, petal
length, and petal width. Iris-Virginica and Iris-Versicolor are linearly inseparable.
We randomly select 51 data points for testing and 99 data points for training. Once
the training and testing data is chosen, the same data sets will be normalized and
put into different methods for performance evaluation.
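A split of this kind can be reproduced as in the following sketch, which uses scikit-learn's copy of the dataset; the 99/51 split matches the text, while the random seed and the choice to normalise with training-set statistics are our own assumptions:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)   # 150 samples, 4 features, 3 classes

    # 99 training and 51 testing points, as described above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=99, test_size=51, random_state=0)

    # Min-max normalisation (here using the training set's ranges).
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    X_train = (X_train - lo) / (hi - lo)
    X_test = (X_test - lo) / (hi - lo)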
Figure 3.54 shows the variation of the RMS errors as a function of the number of grid points per dimension.
It is noted that the RMS error values appear to behave quite “oddly”, in that they increase with the number of grid points beyond 6 grid points per dimension. This is odd because one would have expected the RMS error values to be a monotonically decreasing function of the number of grid points. This “odd” behaviour is explained below.
Figure 3.55 shows the grid point distribution using three grid points per dimension.
It may be observed from Figure 3.54 that the nonlinear grid methods outperform the linear grid method when using 2 to 4 grid points per dimension. However, the results for numbers of grid points greater than 3 may be of concern: basically, there is an insufficient number of data points to support any concrete conclusions.
Figure 3.54: The RMS error values as a function of the number of grid points per dimension.
As a rule of thumb, the number of training data points should be larger than the number of parameters.
When using 4 grid points per dimension the number of parameters is larger than
the number of examples in the training data set. For 6 grid points per dimension,
the system becomes unstable. This is the main reason why the behaviour of the
RMS error curve seems to behave “oddly”: there is an insufficient number of data points in the training data set to support any concrete conclusions beyond 3 grid points
per dimension. Since we are dealing with a signal of higher dimensions, one way in which we can process the signal is to assume that the signal can be processed individually in each dimension first, with the results combined in the product space. Hence, in order for the basis functions to cover all input dimensions, the basis functions should be “joined” across all dimensions. For example, if there are two input dimensions and three grid points per dimension, then the total number of neurons will be 9, as illustrated in Figure 3.56.
Figure 3.55: Grid point distributions of the Iris data set. The upper graphs in each sub-graph show the distribution of the linear grid points, while the lower graphs show the nonlinear grid points.
Figure 3.56: Basis function coverage in two dimensions.
Figure 3.57 shows the number of neurons used for the Iris data set as the number of grid points increases.
In general, for D-dimensional inputs where the maximum number of grid points on any input dimension is G, the total number of neurons is given by G^D. Thus, if D is large, the number of neurons grows exponentially. This is one of the main reasons why the current approach has limited applications; as pointed out in [1], such an approach can rarely be applied when the number of input dimensions is high. In the next chapter, we will show that by using an extended
Figure 3.57: Number of neurons used in a RBFN when the input dimension is four.
ANFIS approach such limitations may be overcome.
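The G^D growth is easy to make concrete. The sketch below builds the tensor-product grid with itertools.product and prints the neuron counts for D = 4 (the Iris input dimension), reproducing the shape of the curve in Figure 3.57:

    from itertools import product

    D = 4  # input dimensions, e.g. the four Iris features
    for G in range(2, 11):
        centres = list(product(range(G), repeat=D))  # one RBF neuron per grid cell
        assert len(centres) == G ** D
        print(f"G = {G:2d} -> {G ** D:6d} neurons")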
3.4 Conclusions
In this chapter, we have provided a simple method for generating the nonlinear
grid which can be used in association with the technique proposed by [39] or used
in a non-linear grid supported RBFN architecture. It is demonstrated where the
performance of the linear grid and the nonlinear grid differs 2. The performances of
the nonlinear grid in all cases exceed those of the corresponding linear grid. The method can be extended to multi-dimensional inputs. However, the RBFN type of neural network suffers from an exponential growth in the number of neurons required to represent the signal. This will be overcome in Chapter 4 using an extended ANFIS approach. It is also demonstrated that the extent to which fine features can be approximated is indirectly controlled by the number of grid points used: the more grid points are used, the better the approximation of the fine features in the signal.

It is found that when the signal has no or little noise, the mapping from a nonlinear grid to a linear grid seems to outperform the nonlinear grid. However, when there is noise in the inputs, the nonlinear grid outperforms the mapping from a nonlinear grid to a linear grid.
2There are three classes of algorithms: the linear grid algorithm and the nonlinear grid algorithm, which are used as standalone algorithms without any need for the radial basis function neural network algorithm; and the nonlinear to linear mapping algorithm, which is used to map a nonlinear grid to a linear grid before it is input into the radial basis function neural network. Hence, it is expected that there will be a difference in the performance of the linear grid method and the nonlinear grid method, if the proposed algorithm (nonlinear grid determination) has any benefits at all. It is also expected that the nonlinear to linear mapping algorithm will have performance close to the nonlinear grid method, as it is essentially a nonlinear grid method, except that it is transformed so that it can be used in association with the radial basis function neural network. These intuitions are confirmed in the experiments conducted in this chapter and demonstrated in the results shown in Tables 3.1, 3.2 and 3.3.
Chapter 4
Extended Adaptive Neuro-Fuzzy
Inference Systems
4.1 Motivation
In the last chapter, we have shown that a RBFN with a nonlinear grid regime outperforms one with a linear grid regime. However, the RBFN uses a Gaussian function which requires the determination of the center and spread. Furthermore, it was found that the proposed method discussed in the last chapter cannot be used for inputs of high dimensions. Hence, it would be useful if we can find a method which can be applied to higher input dimensions, and to investigate whether there are methods which do not require the determination of the centers and spreads of Gaussian functions. Since it is known that a RBFN is equivalent to a
neuro-fuzzy system with Gaussian membership functions [12], this problem may be
transformed into one which concerns the determination of the membership function
in a neuro-fuzzy system. In other words, would it be possible to find other types of membership functions which “automatically” adjust their shapes to the inputs? In this chapter, we will introduce a novel neuro-fuzzy system which we call an extended adaptive neuro-fuzzy inference system (EANFIS) 1.
The EANFIS allows us to handle higher input dimensions by avoiding the need to form all the rules in the first instance. The EANFIS is an extension of the common adaptive neuro-fuzzy inference system (ANFIS) through the incorporation of additional layers. Secondly, we will deploy a self organising mountain clustering function as a membership function, which has the property of “automatically” adjusting its shape to the inputs.
4.2 Introduction to Extended Adaptive Neuro-
Fuzzy Inference System
The adaptive neuro-fuzzy inference system (ANFIS) [12] is a popular neuro-fuzzy
system. It consists of a number of layers implementing the premises, and the conse-
quences of a fuzzy system. It accepts various membership functions in the premises.
However, it is known that ANFIS cannot be applied to inputs with high dimensions.
The reason is that the ANFIS forms the pairwise combination of the inputs at the
premises part. Thus, if there are many inputs, there will be an explosion in the number of rules required in the premises part of the ANFIS. This has been a limiting factor in the application of ANFIS to practical systems with high input dimensionality.

1The concepts of ANFIS and extended ANFIS will be explained in this chapter.
In this chapter, we will use the following intuition: if it is possible to “prune” the
number of rules before they are formed in the premises part of an ANFIS, then it
might be possible to apply the ANFIS to a higher input dimension. In other words,
we recognise that the limiting factor in the ANFIS is due to its requirement to form
the pairwise combination of the inputs, whether they are required or not. However,
if there is a way in which we can determine what rules need to be formed at the
premises part, then we may avoid this explosion of the rules if the input dimension is
high. In this case, the limiting factor will be the complexity of the problem at hand.
If the underlying system is very complex and requires many rules being formed, then
this will be a limiting factor, as there is only a limited memory available to allow
the formation of rules. On the other hand, if the underlying system is not complex, even
though it may contain a high input dimension, then by avoiding the formation of
the rules at the premises part, we may still utilise the idea of ANFIS to model the
system. The problem is that it is not immediately obvious how one may determine
which rule to form before the adaptation process begins.
In this chapter, we will discuss a way of augmenting the ANFIS by the incorporation of two additional layers, based on the observation that logistic functions appear to perform well in classification problems. Hence we extend the ANFIS architecture by incorporating two extra layers: one involving neurons with logistic functions, and the other performing a normalisation process. Then, inspired
by the a priori algorithm in the data mining literature [41], we introduce a method
whereby the number of rules can be “pruned” thus avoiding the need to form all the
unnecessary ones (those that would not fire sufficiently even if they are formed).
This architecture may deploy various membership functions, e.g., the triangular membership function, the trapezoidal membership function, or the Gaussian membership function. Most of these membership functions require the determination of their parameters in a separate step. They would have difficulties adapting to an input which may require a non-symmetric membership function 2. In this chapter, we will use the self organising mountain clustering membership function as originally suggested in [44]. This membership function is in reality an approximate implementation of the kernel density estimation technique common in non-parametric statistics [16]. Hence, the membership function, not necessarily symmetric, can adapt to any input shape.
Thus the EANFIS architecture consists of the following elements:
1. Membership function generation,
2. Rule formulation,
3. Parameter learning related to the parameters associated with the layers, and
4. Output layer parameter learning.
2They can adapt to non-symmetric input functions by increasing the number of membership
functions required to represent the inputs.
The first two steps are related to the structure of the architecture, while the second two steps are related to parameter estimation once the structure is determined. We will describe the parameter estimation algorithm in Section 4.6.2. We will discuss the structure determination aspects of the architecture in Section 4.4, after a discussion of the architecture of the proposed EANFIS in Section 4.3.
4.3 Architecture of the Extended Adaptive Neuro-
Fuzzy Inference System
In this section, we will describe the extended ANFIS architecture in Figure 4.1.
One of the insights derived from working with ANFIS is that when the architecture is applied to a discrete output, the performance could be improved if we
use a softmax-type of output 3. This convinced us that perhaps a softmax type
of formulation may help in extending an ANFIS architecture. However, as may be
observed below, formulating the problem as a softmax-type problem dramatically increases the complexity of the problem. Indeed, this greatly increases the number
of rules which need to be formed. This motivates us to look for ways in which
the structure of the architecture can be determined before the determination of the
parameters associated with the inference engine. Inspired by the ways that associative rules can be found for data mining problems, we worked out a way in which
the structure of the architecture can be determined before the parameters need to be estimated. Thus, using such a method, we find that there is no need to implement all the possible combinations of membership function outputs. We only need to implement those few combinations of membership function outputs which, using our proposed method, are required for the particular practical problem at hand. Thus, this alleviates the problem of needing to form all combinations of membership function outputs in the ANFIS architecture. Next, we turned our attention to the issue of membership functions. We note that most membership function determination approaches are essentially “open-loop”. In other words, one postulates that a candidate membership function can be used, and applies the function, through a trial and error method, in the determination of the number of membership functions required. There does not appear to be any attention paid to “what the input data is trying to tell us”. Thus the idea of a possibly non-symmetric membership function began to germinate in our minds. We find that a particular method, called the self organising mountain clustering method, allows us to find the shape of the membership function required. The self organising mountain clustering method is essentially an approximation of the well known kernel based method for finding the probability density function of given data. However, a straightforward application of such a method leads to a large number of membership functions. Hence, we modify the concept of the self organising mountain clustering method so that it can be applied to our situation.

3Softmax is a device used by neural network researchers to provide discriminative training for discrete valued outputs. It essentially normalises the outputs by taking the exponential of each output and normalising it by the sum of the exponentials of all outputs. In the neural network literature, this has been found to be useful for training discrete valued outputs.
In this section, we will describe the proposed general extended ANFIS architecture. In Section 4.4 we will describe the proposed method for determining the structure of the architecture, and in Section 4.5 we will describe the modified version of the self organising mountain clustering method for determining the membership function from the given training data.
In the EANFIS architecture, we need to expand the formulation of the ANFIS architecture depending on the output. In particular, we need to consider two different situations: when the output is discrete, and when the output is continuous. We will deal with the discrete output situation first before we deal with the continuous output situation. Let us assume that the output is discrete and there are T output classes. We will denote the desired output as $d_i^{\tau}$, where $i = 1, 2, \ldots, I$, $I$ being the total number of training instances, and $\tau = 1, 2, \ldots, T$, $T$ being the number of output classes. To model such discrete output classes, we assume the output of the EANFIS architecture to be discrete, i.e., $y_i^{\tau}$, where $y_i^{\tau}$ is the output for the $i$-th input, with a corresponding desired output $d_i^{\tau}$. From this, we will need to build a “separate” ANFIS for each output class $\tau$. We will describe this more formally following the same approach as the classic ANFIS architecture, albeit with some modifications.
Figure 4.1: EANFIS architecture

Layer 1 (Input Layer): In this layer the input is a vector $x \in \mathbb{R}^D$ with components $x_d$, $d = 1, 2, \ldots, D$. Each input $x_d$ is fed into a membership module which consists of C membership function nodes. The membership function node can be a traditional Gaussian function, a triangular function, a trapezoidal function, a bell shaped membership function [12], or the proposed modified self-organizing mountain clustering membership function (for more details please see Section 4.5). There are C membership functions for each input $x_d$ and for each output class $\tau$. Let the output of the membership module be
$\varphi_{dc}^{\tau}(x_d)$; $d = 1, 2, \ldots, D$, $c = 1, 2, \ldots, C$ and $\tau = 1, 2, \ldots, T$. In effect, the outputs of the membership functions $\varphi_{dc}^{\tau}(x_d)$ can be considered as a measure of the similarity between the input $x_d$ and the $c$-th membership function for a particular output class $\tau$. If they are close, then the output of $\varphi_{dc}^{\tau}(x_d)$ will be high. On the other hand, if the match between the $c$-th membership function and the input $x_d$ is low, then the corresponding output $\varphi_{dc}^{\tau}(x_d)$ will be low.
Layer 2 (Rule Layer): In this layer the membership function outputs are multiplied together according to a specific scheme, as follows. Assume that for a particular output class $\tau$ each input dimension has C membership functions. We start with $d = 1$, which has membership functions $\varphi_{1,1}^{\tau}, \varphi_{1,2}^{\tau}, \ldots, \varphi_{1,C}^{\tau}$. For $d = 2$, there are the variables $\varphi_{2,1}^{\tau}, \varphi_{2,2}^{\tau}, \ldots, \varphi_{2,C}^{\tau}$. We need to form the pairwise combinations of these variables $\varphi_{2,i}^{\tau}$, $i = 1, 2, \ldots, C$, with those of $\varphi_{1,j}^{\tau}$, $j = 1, 2, \ldots, C$, as follows: $\varphi_{1,1}^{\tau}\varphi_{2,1}^{\tau}, \varphi_{1,1}^{\tau}\varphi_{2,2}^{\tau}, \ldots, \varphi_{1,1}^{\tau}\varphi_{2,C}^{\tau}, \ldots, \varphi_{1,C}^{\tau}\varphi_{2,1}^{\tau}, \ldots, \varphi_{1,C}^{\tau}\varphi_{2,C}^{\tau}$, a total of $C^2$ terms. Then for $d = 3$, we form the terms $\varphi_{3,1}^{\tau}, \varphi_{3,2}^{\tau}, \ldots, \varphi_{3,C}^{\tau}$ with each of the products formed by concatenating input dimensions 1 and 2 together. There will be a total of $C^3$ terms, with the general form $\varphi_{1,i}^{\tau}\varphi_{2,j}^{\tau}\varphi_{3,k}^{\tau}$. The method can be generalised to a general value of $d$, until $d = D$. Thus there will be in general $C^D$ rules for each value of $\tau$, giving a total of $TC^D$ rules. We will denote each rule by $\phi_r^{\tau}$, with the general form $\varphi_{1,i}^{\tau}\varphi_{2,j}^{\tau}\cdots\varphi_{D,k}^{\tau}$. Each membership output product is equivalent to performing the fuzzy T-norm operation, representing the firing strength of this rule [22].

Note that the total number of rules is $R = TC^D$.
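A direct, unpruned implementation of this rule layer is a tensor product over the input dimensions. The following sketch (our own illustration, for a single output class $\tau$) forms all $C^D$ products:

    import numpy as np
    from itertools import product

    def rule_layer(phi):
        # phi: array of shape (D, C), the membership outputs for one class tau.
        # Returns the C**D rule firing strengths (products across dimensions).
        D, C = phi.shape
        strengths = [np.prod([phi[d, c] for d, c in enumerate(combo)])
                     for combo in product(range(C), repeat=D)]
        return np.array(strengths)

    # Example: D = 2 inputs with C = 3 membership functions each gives 3**2 = 9 rules.
    phi = np.random.rand(2, 3)
    print(rule_layer(phi).shape)   # -> (9,)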
Layer 3 (Normalized Layer): This layer calculates the ratio of each rule's firing strength to the total firing strength. In other words, this layer computes the normalized outputs of the rule layer:

\bar{\phi}_r^{\tau} = \frac{\phi_r^{\tau}}{\sum_{r=1}^{R} \phi_r^{\tau}} \qquad (4.1)
Note that layers 1 to 3 are the layers of the classic ANFIS [22], except that in this case we separate out each output class into a separate strand. Note also that $\bar{\phi}_r \in [0, 1]$, which denotes the normalised similarity of the rule corresponding to the $d$-th input and the $c$-th membership function.
Layer 4 (Error Correction Layer): This layer is used to “fine tune” the output of layer 3 by using a logistic function:

\ell(\bar{\phi}_r, \gamma) = \frac{2}{1 + \exp\left(-(1 - \gamma)(\bar{\phi}_r - 1)\right)} \qquad (4.2)

where $\gamma \in \mathbb{R}$ is an adjustable parameter. Thus the logistic function is one in which the slope $\gamma$ can be adjusted 4. The output of this layer is $\pi_r = \ell(\bar{\phi}_r, \gamma_r)\bar{\phi}_r$.

4We will show in a later section how this parameter may be adjusted.
In general, the fuzzy rule node outputs (in layer 3) in an ANFIS architecture
may contain contradictions, overlaps or inconsistencies which may be attributed to
the noise in the training data set or to the blurred cluster regions among different
output clusters. The proposed error correction layer (layer 4) presents one way
to solve these problems. Thus, this layer will become effective if the output of
the neuro-fuzzy system is ambiguous, e.g., if two rules are giving rise to similar
outputs. In this case, it will be difficult to distinguish the effectiveness of the rules; using this layer, however, we will be able to do so. Obviously, this layer will not be required if all the rules give rise to well separated outputs.
The logic of this layer can be understood as follows:

1. If the degree of similarity $\bar{\phi}_r$ is close to 1 and $\ell(\bar{\phi}_r, \gamma_r)$ is high, then the output $\pi_r$ is high.

2. If the degree of similarity $\bar{\phi}_r$ is close to 1 and $\ell(\bar{\phi}_r, \gamma_r)$ is small, then the output $\pi_r$ is still high. This can be thought of as a rarity situation, in which only very few samples of the case exist.

3. If the degree of similarity $\bar{\phi}_r$ is close to 0.5 and $\ell(\bar{\phi}_r, \gamma_r)$ is low, then the output $\pi_r$ is low. This fuzzy rule has a low $\ell(\bar{\phi}_r, \gamma_r)$, which means it contributed many errors during the training process; the output from this rule is untrustworthy, and the output strength $\pi_r$ is lowered accordingly.

4. If the degree of similarity $\bar{\phi}_r$ is close to 0.5 and $\ell(\bar{\phi}_r, \gamma_r)$ is high, then the output $\pi_r$ is medium: a high $\ell(\bar{\phi}_r, \gamma_r)$ does not apply any discount to the output $\pi_r$.
This layer may be formally represented as follows:

Rule r: If $x_1$ is $\varphi_{1,c}^{\tau}$ and ... and $x_D$ is $\varphi_{D,c}^{\tau}$ then $\pi_r = \bar{\phi}_r \, \ell(\bar{\phi}_r, \gamma_r)$.
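Equation (4.2) and the rule output $\pi_r$ translate directly into code. A minimal sketch, with $\gamma$ treated as a vector of per-rule parameters (array names are ours):

    import numpy as np

    def error_correction(phi_bar, gamma):
        # Layer 4: logistic fine-tuning of the normalised firing strengths, Eq. (4.2).
        # phi_bar, gamma: arrays of length R, one entry per rule.
        ell = 2.0 / (1.0 + np.exp(-(1.0 - gamma) * (phi_bar - 1.0)))
        return phi_bar * ell   # pi_r = phi_bar_r * ell(phi_bar_r, gamma_r)

    # A rule with phi_bar close to 1 keeps a high output; one near 0.5 is discounted.
    print(error_correction(np.array([1.0, 0.5]), np.array([0.0, 0.0])))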
Layer 5 (Normalized Layer): This layer performs the normalisation of the outputs of Layer 4:

\bar{\pi}_r = \frac{\pi_r}{\sum_{r=1}^{R} \pi_r} \qquad (4.3)
The output of this layer is normalised to lie between 0 and 1.
Layer 6 (Output Layer): The output layer can accept two possible forms, viz.
the zeroth order output and the TSK scheme respectively. This is exactly the same
as shown in Section 2.5.1.
For continuous outputs, we will assume that τ = 1. Thus the continuous output
case can be considered as a special case of the more general discrete output case.
The adjustable parameters of this architecture are: $\gamma_r$, $r = 1, 2, \ldots, R$, and $w_r$, $r = 0, 1, 2, \ldots, R$, in the case of the zeroth order output function; in addition, $\alpha_{rd}$ and $\alpha_{r0}$, for $r = 1, 2, \ldots, R$ and $d = 1, 2, \ldots, D$, in the case of the TSK output mechanism. Furthermore, there are parameters associated with the membership functions as well; these depend on the membership function used. For example, if we use a Gaussian membership function, then there will be two parameters associated with each membership function. On the other hand, if we use a triangular membership function, then there will be three parameters associated with each membership function.
4.3.1 Remarks
1. What is the difference between this architecture and the classic multilayer per-
ceptron architecture? One may collapse Layers 1, 2, and 3 into an aggregate
input layer, and Layers 5 and 6 (assuming, for simplicity, that the output layer is a
zeroth order output function) into one aggregate output layer, then we have a
general input layer followed by a logistic function layer and then followed by
a general output layer. It is tempting to try to compare this with the MLP
architecture. However, this architecture is different from the classic MLP ar-
chitecture in that in the classic MLP architecture, the output is formed by a
combination of the outputs of hidden layer neurons. On the other hand, in the
proposed architecture, the output is formed by a single weight $w_r$ associated with a logistic function with an adjustable slope. In other words, in the proposed architecture, the output is formed from the output of a single “hidden layer neuron” (in the language of neural networks), and the hidden layer neuron in this case has an adjustable slope. This is different from the classic MLP
architecture in which the logistic function normally has a fixed slope.
Indeed, there has been research work which associates a MLP architecture with the output of the neuro-fuzzy architecture [12]. However, in that case, it was shown that the performance of the cascaded ANFIS and MLP architecture is not good. This is one reason why we choose the proposed architecture, which draws an individual output from an individual logistic function for each output of the “hidden layer neurons” (in the language of neural networks). As will be shown in later sections, this provides much better outputs.
2. The consequence of the above remark is that we cannot say anything about the universal approximation capability of the proposed EANFIS architecture. Indeed, it is known that a neuro-fuzzy architecture does not have universal approximation capability; hence it is suspected that the proposed architecture does not have universal approximation capability either. On the other hand, it is rather difficult to extract “meaningful” rules from a classic MLP architecture, whereas with a neuro-fuzzy architecture it is possible to do so. Indeed, as will be shown in a later section, the proposed architecture can extract meaningful rules from the trained architecture. Hence this can be considered as a tradeoff between the universal approximation capability and the ability to extract rules from the trained architecture.
3. It is known in the literature [2] that there may be situations when the outputs of the neuro-fuzzy architecture are ambiguous, in that the outputs are similar and yet they belong to different classes. In this case, one solution is to use the concept of a “certainty factor” [2], which manually assigns some weighting to the outputs to reflect what the user considers important. In the proposed architecture, we have introduced an automatic method of assigning the importance of the outputs through the deployment of an adjustable slope of a logistic function. Indeed, it is possible to derive the adjustable slope of a logistic function through Bayes' theorem, very much in the same spirit as some of the work in the classic certainty factor approach. However, we decided to introduce the concept as indicated above, as we believe this may give a more direct insight into the proposed architecture.
4.4 Structure determination of the proposed neuro-
fuzzy architecture
In this section, we will consider the issue of “pruning” the number of rules. This issue arises because in a neuro-fuzzy network, in the antecedent, the normal approach is to form all possible pairwise combinations of inputs, as indicated in Layer 2 in Section 2.5.1. As might be surmised, this leads to an explosion in the number of rules if the input dimension is high. For example, if we have an input dimension of 10, two membership functions per dimension and a continuous output, then the total number of rules will be $2^{10} = 1024$. The formation of these rules is undertaken whether or not the rules are required in the neuro-fuzzy architecture. Note that while some pruning process may be applied, e.g. see [47, 48],
nevertheless, this is only applied after the rules are formed. In other words, whether
the rules are required or not, they need to be generated first. This explosion effect is
a limiting factor for the neuro-fuzzy architecture to be applied to a practical system
with high input dimensionality. In this section, we will propose a method guiding
the formation of the rules. The method is inspired by the Apriori algorithm in data
mining using associative rules [41]. However, as it will become clear, the proposed
method is not the same as the Apriori algorithm [41]. It is only in the spirit of the
Apriori algorithm. Note that in this case, only the rules which are required will be
formed rather than forming all possible combinations of the rules. In other words,
the proposed method will form the required rules (those that will be fired by the
neuro-fuzzy network), and will not form those rules which will not be “fired” by
the network, in response to inputs. Obviously, the rules which need to be formed
are dependent on the input set. This method will facilitate the use of the proposed neuro-fuzzy architecture for practical problems, as the method is only bounded by the total number of rules required for the problem for a particular set of inputs. Obviously, if the underlying characteristics of the input set change, the rule set will need to be changed as well. This could be incorporated in a “sentinel” module, which monitors the underlying characteristics of the inputs, and then decides when to change the rule sets. In this thesis, we will assume that the input set characteristics are relatively static, and hence we do not need a “sentinel” unit to monitor their characteristics, nor multiple rule sets to characterize the problem. This is reserved as a problem for future research.
We will first briefly describe the Apriori algorithm before describing the proposed method.
The Apriori algorithm (a shortened form of “a priori algorithm”) is a popular algorithm used in data mining with associative rules [41]. It was originally proposed in [41] to study the “shopping basket” problem, which may be stated briefly as follows: if a customer is purchasing a certain group of items, what is the likelihood of the person buying another group of items in the same shopping session? The set of items that a customer purchases is called an itemset. Assume for convenience that each customer's transaction has $k$ items. The algorithm finds all itemsets $L_k$ whose occurrence counts in the transactions exceed a threshold. The set $L_k$ is then used to generate the candidate set $C_k$, formed from the union $L_k \cup L_{k-1}$. The candidate set is used to form another, larger itemset by removing all those itemsets which are below the threshold. The algorithm repeats until $L_k$ is empty [41].
We will illustrate this algorithm using a simple example. The procedure is shown
graphically in Figure 4.2.
Transactions (assume threshold = 1):

    TID     Items
    T100    1,2,4
    T101    2,3
    T102    2,4
    T103    1,2
    T104    1,2,3
    T105    1,3

Table L1 (itemsets remaining after removing all itemsets at or below the threshold):

    Itemset   {1}   {2}   {3}   {4}
    Counts     4     5     3     2

Table C2 (join):

    Itemset   {1,2}  {1,3}  {1,4}  {2,3}  {2,4}  {3,4}
    Counts      3      2      1      2      2      0

Table L2 (C2 with all itemsets at or below the threshold removed):

    Itemset   {1,2}  {1,3}  {2,3}  {2,4}
    Counts      3      2      2      2

Table C3 (join):

    Itemset   {1,2,3}  {1,2,4}  {1,3,4}  {2,3,4}
    Counts       1        1        0        0
Figure 4.2: An example illustrating the determination of the maximum itemset in the Apriori algorithm.
The input data to the Apriori algorithm consists of a set of transaction records. The TID (Transaction Identifier) column gives the transaction ID, and the Items column lists the item numbers involved in each transaction. For example, in the transaction ID T100, the
items purchased are 1, 2, and 4. In this example, there are six transactions. We
assume that a threshold of 1 has been set. Thus, we look at the TID column, and
find out if the occurrence of items 1, 2, 3, or 4 is less than or equal to one. In this
example, all items occurred at least twice. Hence we cannot eliminate any item.
In this case, we will form an itemset denoted as Table L1 in Figure 4.2. Since all
items are present (as none of them is below the threshold), and hence we have four
itemsets in this set: L1. Since there is only one candidate item in each itemset, hence
the count column in this table essentially counts the occurrence of each item. From
the TID table we find that there are four occurrences of the item 1 (in T100, T103,
T104, and T105), and hence the entry in Table L1 for itemset set {1} is 4. The
Table C2 is obtained by joining the itemsets in Table L1 together. Here the joining
is performed in lexicographical order, with no repeats. Thus, for example, from itemset {1} in Table L1, we can form the following itemsets: {1, 2}, {1, 3}, {1, 4}. Now we can compare the pattern of the itemset {1, 2} with the TID column and
find out the number of occurrence of this pattern. In this case we find that there are
three occurrences (T100, T103, and T104). Hence the entry in the column Count in
Table C2 is 3. We can remove all itemsets in Table C2 which are below the threshold.
In this case, we have itemsets {1, 4} and {3, 4} which are below the threshold, and
hence they will not be considered further. Table L2 is formed by removing all these
itemsets which are below the threshold with the corresponding count of occurrences.
Table C3 can be formed by joining (concatenating) the itemsets in Table L2 with those in Table L1. Thus the first candidate itemset would be {1, 2} from Table L2 joined with itemset {1}. However, this cannot happen, as item 1 already exists in the itemset {1, 2}. Hence the next possibility would be {1, 2} in Table L2 concatenated with itemset {3}, which results in the candidate itemset {1, 2, 3}. Then we compare this pattern with those in Table TID, and find that there is only one such occurrence (T104). It is found from Table C3 that all occurrences are less than or equal to the threshold. Hence the process stops.
In this “shopping basket” example, we find that not every item combination
exists in the transaction record. The Apriori algorithm removes such a combination
if it does not exist or if it is below a prescribed threshold. The procedure may be
extended to provide information on the support and confidence of a particular rule
found.
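As an illustration of these steps, the following minimal Python sketch reproduces the itemset counting on the transactions of Figure 4.2. The join step here (extending each surviving itemset by one lexicographically larger item) is our simplified rendering of the join, not the thesis implementation, which was carried out in Matlab.

from itertools import chain

# Transactions of Figure 4.2.
transactions = [{1, 2, 4}, {2, 3}, {2, 4}, {1, 2}, {1, 2, 3}, {1, 3}]
threshold = 1              # itemsets occurring <= threshold times are pruned

def count(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set(chain.from_iterable(transactions)))
L = [frozenset([i]) for i in items if count(frozenset([i])) > threshold]
k = 1
while L:
    print(f"L{k}:", {tuple(sorted(s)): count(s) for s in L})
    # Join: extend each surviving itemset by one larger item, then prune.
    candidates = {s | {i} for s in L for i in items if i > max(s)}
    L = [c for c in candidates if count(c) > threshold]
    k += 1

Running this reproduces the tables L1 and L2 of Figure 4.2 and terminates once no candidate of Table C3 exceeds the threshold.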
4.4.1 A proposed algorithm for rule formation
In this section, we will give details of a proposed algorithm for rule formation. This
algorithm is inspired by the way the maximum itemset is found in the Apriori
algorithm. However, our proposed algorithm is different from the one used in the
Apriori algorithm.
In a way the proposed algorithm is like running the maximum itemset determi-
nation algorithm backwards. Instead of considering each item by itself, we will start
with the clusters identified in the self organising mountain clustering membership
function method (for details please see Section 4.5.). In the self organising mountain
clustering membership function technique, as will be described in Section 4.5, we
identify sets of clusters. These clusters are formed by the closeness of a particular
set of data points with respect to the underlying grid points. For example, in the
d-th dimension, there are Gd grid points. Then we can evaluate the closeness of a
particular data point xd,i with respect to each grid point using the following:
\omega^\tau_{d,i} = \arg\max_{g \in G_d} \left( \exp\left( -\,\frac{\left( x^\tau_{d,i} - z^\tau_{d,g} \right)^2}{2 \left( z^\tau_{d,(g+1)} - z^\tau_{d,g} \right)^2} \right) \right) \qquad (4.4)
This operation will identify the association of the data points xd,i, i = 1, 2, . . . , I
with particular grid points in the underlying grid. Note that it is possible that each
grid point may be associated with more than one data point x_{d,i}, i = 1, 2, . . . , I.
Assume that each grid point g, g = 1, 2, . . . , G_d, is associated with η_g data points.
Note that η_g may be 0, as it may occur that none of the data points is close to a
particular grid point. Then the user can decide to group a number of grid points
together to form segments. Such segments will be the equivalent of the itemsets in
the associative rule mining techniques. Let us denote each segment as S_j. Thus, for
example, in one particular dimension, we may have, say, data points {1, 2, 3} which
are closest to, say, grid point 2; data points {4, 5, 6} closest to grid point 4;
and data points {7, 8} closest to grid point 7. Then the user decides that there
should be two segments (where two is an arbitrary decision; it could easily have been
three segments), one stretching from grid point 1 to grid point 5, while the second
segment covers grid points 6 to 10. In this case, segment 1 will contain
data labels S_1 = {1, 2, 3, 4, 5, 6} while segment 2 will contain data labels S_2 = {7, 8}. The data in this dimension is then represented as S_1 ∪ S_2. Thus the segments S_j
play the same role as the itemsets in associative rule formation.
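To make Eq. (4.4) and the subsequent segmenting concrete, the following minimal Python sketch assigns samples in one dimension to their closest grid points and then groups the winning grid points into two user-chosen segments. The grid, the sample values and the two-segment split are illustrative assumptions, not values from the thesis.

import numpy as np

grid = np.linspace(0.0, 1.0, 10)     # grid points z_{d,g}, uniform regime
spacing = grid[1] - grid[0]          # z_{d,(g+1)} - z_{d,g}, constant here
samples = np.array([0.05, 0.12, 0.15, 0.31, 0.33, 0.35, 0.72, 0.78])

# Eq. (4.4): closeness of each sample to each grid point, then the winner.
closeness = np.exp(-(samples[:, None] - grid[None, :]) ** 2
                   / (2.0 * spacing ** 2))
winners = closeness.argmax(axis=1)   # omega_{d,i}: winning grid index per sample

# Group the first and second halves of the grid into two segments,
# collecting the data labels (1-based sample indices) won by each half.
segments = {1: [], 2: []}
for label, g in enumerate(winners, start=1):
    segments[1 if g < 5 else 2].append(label)
print(segments)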
These will be the starting point of the proposed algorithm for rule formation.
Assume that there are clusters of grid points formed on each input dimension.
Assume that there are D input dimensions, each dimension consists of Cd, d =
1, 2, . . . , D, clusters. We will consider each dimension in turn. We will consider
first d = 1. In this dimension there are C1 clusters. These will be the itemsets
for the next iteration. We will call this set of itemsets L1. Then we concatenate
the clusters of dimension d = 1 with those in dimension d = 2 by forming the
operation \{\omega^\tau_{1,1}, \omega^\tau_{1,2}, \ldots, \omega^\tau_{1,\kappa^\tau_1}\} \cap \{\omega^\tau_{2,1}, \omega^\tau_{2,2}, \ldots, \omega^\tau_{2,\kappa^\tau_2}\}, where \omega^\tau_{d,g} denotes the data
labels stored in the g-th grid point of the d-th dimension, and \kappa^\tau_d denotes the
number of clusters found in this process for input dimension x_d. The value of \omega^\tau_{d,i}
is determined as shown in Eq. (4.4). From this concatenation operation, we detect
the common elements. If the number of common elements in the concatenation is above
a prescribed threshold, this will be passed on to the next iteration as an element
in the itemset. On the other hand, if the number of common elements in the
concatenation is smaller than the prescribed threshold, this will be eliminated from
further consideration. It is this discarding of elements from further consideration that
is the key to reducing the number of rules which need to be formed. The process
repeats until d = D. Then the rules will be obtained from an inspection of the final
table Lk.
We will consider a simple example to illustrate the proposed rule formation
method. We will first consider the d = 1 axis. There are two clusters: {1, 2, 3} and
{4, 5, 6} respectively. This is shown in the table called Clustered Data in Figure
4.3. For convenience we will label {1, 2, 3} as cluster 1, and {4, 5, 6} as cluster
2. Each cluster consists of three data labels. This information is displayed in the
table called L1 in Figure 4.3. The threshold is 1. The column “Itemset” has two
entries, denoting cluster {1, 2, 3} and cluster {4, 5, 6} respectively. The
column “Combination of clusters” gives the labels we assigned to these two clusters,
i.e., cluster 1 and cluster 2 respectively. The column “Count of elements” gives
the number of elements in the cluster. In both cases, there are three elements in
the cluster. Then we concatenate the clusters in dimension d = 1 with those in
dimension d = 2 using a join operation. As there are two clusters in the d = 2
dimension: {1, 2} and {3, 4, 5, 6}, we can concatenate the clusters of the dimension
d = 1 with those in dimension d = 2 and find the common elements. Thus, in
the table called C2 in Figure 4.3, the first element in the column “itemset” shows
the concatenation of the first cluster {1, 2, 3} in dimension d = 1 with the first
cluster {1, 2} in dimension d = 2, written {1, 2, 3} ∩ {1, 2}; in the column
“Combination of clusters” this is denoted by (1), 1. In this case,
we find that there are two common elements, 1 and 2, and hence the entry of 2 in the
column “Counts of common elements”. In a similar manner, we can find the values
on all the columns of the table: C2. Since the threshold is 1, we can remove the
entry corresponding to {4, 5, 6} ∩ {1, 2}, as there are no common elements in this
concatenation. Then we transfer this information into the table: L2. The entries
in the column “Itemset” are the concatenation of the clusters and the common
elements. Thus for example, the first entry of column “itemset” is obtained by
the concatenation of cluster 1 {1, 2, 3} in d = 1 dimension and cluster 1 {1, 2} in
dimension d = 2. The common elements in these two clusters are {1, 2}. This will
be the itemset. There are only two common elements, and hence the entry in the
column “Counts of common elements” is 2. The entry in the column “Combination
of clusters” denoted by (1), 1 signifies the result is obtained by the concatenation
of cluster 1 in d = 1 dimension with that of cluster 1 in the d = 2 dimension. The
entries in the table called C3 denote the join operation of the results of Table L2
with those clusters on the d = 3 dimension. Thus the first element is formed by
concatenation of the common elements found by concatenating cluster 1 in dimension
d = 1 and cluster 1 in dimension d = 2 with cluster 1 in the d = 3 dimension.
This is denoted by {1, 2} ∩ {1, 2, 3, 4}; hence the first entry in
the column “Combination of clusters” reads (1, 1), 1. Here we find that there are two
common elements, viz. {1, 2}. Hence the first entry in the column “Counts of
common elements” is 2. This process is repeated for the other clusters, and Table
C3 is fully populated. As the threshold is 1, we can eliminate the entries
{1, 2} ∩ {5, 6}, and {4, 5, 6} ∩ {1, 2, 3, 4}. The remaining information is transferred
to the table called L3. Here there are only two values which are above the threshold
{1, 2} ∩ {1, 2, 3, 4}, and {4, 5, 6} ∩ {5, 6}. Hence the entries in the final column of
Table L3 are both 2 denoting that there are only two common elements. The entries
in the column “Combination of clusters” denote the way in which the clusters are
formed. For example, the first element is formed by the concatenation of cluster 1
in d = 1 dimension, cluster 1 in d = 2 dimension and cluster 1 in d = 3 dimension.
The “Itemset” column denotes the common elements as a result of the concatenation
process. Since there are only three input dimensions, the process stops.
In this example, we finally conclude that there are two fuzzy rules (as in Table
L3 in Figure 4.3 there are only two remaining entries):
Clustered Data:
    d = 1: {1, 2, 3}, {4, 5, 6}
    d = 2: {1, 2}, {3, 4, 5, 6}
    d = 3: {1, 2, 3, 4}, {5, 6}

L1 (Itemset / Combination of clusters / Count of elements):
    {1,2,3}   (1)   3
    {4,5,6}   (2)   3

C2 (Itemset / Combination of clusters / Counts of common elements):
    {1,2,3} ∩ {1,2}       (1),1   2
    {1,2,3} ∩ {3,4,5,6}   (1),2   1
    {4,5,6} ∩ {1,2}       (2),1   0
    {4,5,6} ∩ {3,4,5,6}   (2),2   3

L2 (Itemset / Combination of clusters / Counts of common elements):
    {1,2}     (1),1   2
    {4,5,6}   (2),2   3

C3 (Itemset / Combination of clusters / Counts of common elements):
    {1,2} ∩ {1,2,3,4}     (1,1),1   2
    {1,2} ∩ {5,6}         (1,1),2   0
    {4,5,6} ∩ {1,2,3,4}   (2,2),1   1
    {4,5,6} ∩ {5,6}       (2,2),2   2

L3 (Itemset / Combination of clusters / Counts of common elements):
    {1,2}   (1,1),1   2
    {5,6}   (2,2),2   2

Assume threshold = 1.

Figure 4.3: An example to illustrate the proposed rule formation algorithm.
Rule1: IF cluster1 in d = 1 dimension ∧ cluster1 in d = 2 dimension ∧ cluster1 in
d = 3 dimension THEN Consequence1.
Rule2: IF cluster2 in d = 1 dimension ∧ cluster2 in d = 2 dimension ∧ cluster2 in
d = 3 dimension THEN Consequence2.
From this description it can be observed that the proposed procedure is quite
different from the maximum itemset determination in the Apriori algorithm. It
seeks to find the combination of clusters such that there are common elements in the
clusters. Note that these common elements are represented by the data labels stored
in the clusters. Nevertheless the proposed algorithm is inspired by the maximum
itemset determination algorithm in the Apriori algorithm.
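The following minimal Python sketch replays this intersection-and-threshold iteration on the clusters of the example in Figure 4.3; the cluster contents and the threshold come from that example, while the data structures are our own illustrative choice.

# Clusters of data labels per input dimension, as in Figure 4.3.
clusters = [
    [{1, 2, 3}, {4, 5, 6}],      # d = 1
    [{1, 2}, {3, 4, 5, 6}],      # d = 2
    [{1, 2, 3, 4}, {5, 6}],      # d = 3
]
threshold = 1

# L1: each cluster of dimension 1 together with its combination label.
L = [(c, (j,)) for j, c in enumerate(clusters[0], start=1)]
for d in range(1, len(clusters)):
    # Join step: intersect every surviving itemset with every cluster of
    # the next dimension, extending the combination of cluster indices.
    C = [(common & c, combo + (j,))
         for common, combo in L
         for j, c in enumerate(clusters[d], start=1)]
    # Keep only combinations whose count of common elements exceeds the threshold.
    L = [(common, combo) for common, combo in C if len(common) > threshold]

for common, combo in L:
    print("Rule from cluster combination", combo, "common elements", sorted(common))

Running this prints exactly the two surviving combinations (1, 1, 1) and (2, 2, 2) of Table L3, corresponding to the two fuzzy rules above.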
It is possible, as with the Apriori algorithm [49], to compute the support and
the confidence of the rules formed. Assume X is the input data, A is the premise
and B is the consequence. The support measures the number of data items that have
both the premise and the consequence in the whole data set:

S_r(A \Rightarrow B) = \frac{\sum_{i \in I^\tau} \pi^\tau_{i,r}}{\#(\forall X)} \qquad (4.5)

where S denotes the support and \# denotes the number of incidences, I = I_1 \cup I_2 \cup \ldots \cup I_T, where I^\tau, \tau = 1, 2, \ldots, T, denotes the training data set belonging to the class \tau, and \pi^\tau_{i,r}
denotes the response of the normalised rule \pi_r to the i-th input⁵. The confidence

⁵Each rule has a dependency on the input value. This is not explicitly denoted in the symbol \pi_r.
measures the number of data items that have both the premise and the consequence
among the data items that satisfy the premise:
C_r(A \Rightarrow B) = \frac{\sum_{i \in I^\tau} \pi^\tau_{i,r}}{\sum_{i \in I} \pi^\tau_{i,r}} \qquad (4.6)
where C denotes the confidence. High support and high confidence indicate high
strength of the particular rule. Thus, the measures of support and confidence provide
a means of ascertaining the importance of a particular rule found.
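For illustration, the following minimal Python sketch evaluates Eqs. (4.5) and (4.6) for one rule; the response values and data set size are made up for the example and are not taken from any experiment in this thesis.

import numpy as np

# Normalised responses pi_{i,r} of rule r: first on the samples of the rule's
# own class tau, then on the whole training set (illustrative values).
pi_class = np.array([0.9, 0.8, 0.7])
pi_all = np.array([0.9, 0.8, 0.7, 0.1, 0.05])
n_total = len(pi_all)                       # #(for all X): size of the data set

support = pi_class.sum() / n_total          # Eq. (4.5)
confidence = pi_class.sum() / pi_all.sum()  # Eq. (4.6)
print(round(support, 3), round(confidence, 3))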
This section provides a method whereby it is possible to form only the rules
which are important to the problem, rather than forming all the rules upfront,
irrespective of whether they will be necessary to the problem or not. By using this rule
formation algorithm the proposed architecture can be applied to practical problems,
as it is not bounded by the input dimension D.
4.5 Determination of candidate membership function and the required number of membership functions in each input variable
The self organising mountain clustering membership function [44] is a fully data-driven
method for finding a suitable membership function for the neuro-fuzzy network.
It does not have a pre-defined shape. The final shape of the membership function
is determined by the data.
We have D inputs, xd. Each dimension of the input xd is sampled. Assume
that we have the input data sampled as xd,i, i = 1, 2, . . . , I, and d = 1, 2, . . . , D.
We further assume that there is an underlying grid in which the closeness of the
input samples, xd,i is measured. With this underlying grid it will be possible to
consider how well the input samples are grouped into clusters. It is assumed that
each input dimension is endowed with a grid with Gd samples in each dimension for
each output class τ. In general, I ≠ G_d. There are two situations, depending on
whether the output is discrete or continuous. We will denote each grid point as z^τ_{d,g},
d = 1, 2, . . . , D, and g = 1, 2, . . . , G_d, where τ = 1, 2, . . . , T and T is the number of
output classes; T = 1 for continuous outputs.
z^\tau_{d,g} = x^{\min}_d + \left( \frac{x^{\max}_d - x^{\min}_d}{G_d - 1} \right) (g - 1) \qquad (4.7)
where x^{\min}_d and x^{\max}_d are respectively the minimum and the maximum values of
x_d in the d-th dimension. In other words, we create a grid for each output class τ in the
case of discrete outputs, and in the case of continuous outputs, we will have only
one underlying grid. Figure 4.4 illustrates the grid point distribution in the d-th
dimension.
In many cases, a uniform grid regime is used, as shown in Equation (4.7).
However, a non-uniform grid regime may be useful for solving some particular prob-
lems when it is intuitively clear that a non-uniform grid regime will be beneficial
to the approximation of the inputs. In this case, we can use a simple method for
[Diagram: grid points z^τ_{d,1}, z^τ_{d,2}, …, z^τ_{d,G_d} spanning the range from x^{min}_d to x^{max}_d.]

Figure 4.4: Illustration of the distribution of grid points in the d-th dimension of the input x_d.
determining a non-uniform grid as proposed in Chapter 3. When the non-uniform
grid regime is used, each output type τ has its own grid.
Once the grid points are obtained, whether using a uniform grid regime or a
non-uniform grid regime, we wish to investigate how well the input data cluster
into groups. For this, the self organising mountain clustering algorithm [44] is
deployed to calculate the input data density distribution using the following:
\Upsilon^\tau_{d,g} = \frac{1}{N^\tau} \left\{ \frac{ \sum_{i \in I^\tau} \exp\left( -\frac{ \left( x^\tau_{d,i} - z^\tau_{d,g} \right)^2 }{ 2 \left( z^\tau_{d,(g+1)} - z^\tau_{d,g} \right)^2 } \right) }{ \sum_{g \in G_d} \sum_{i \in I^\tau} \exp\left( -\frac{ \left( x^\tau_{d,i} - z^\tau_{d,g} \right)^2 }{ 2 \left( z^\tau_{d,(g+1)} - z^\tau_{d,g} \right)^2 } \right) } \right\} \qquad (4.8)
where N τ is the number of points which correspond to the output class τ , I =
{I1 ∪ I2 ∪ . . . ∪ IT}, Iτ , τ = 1, 2, . . . , T denotes the training data set belonging to
the class τ . Note that this self organising mountain clustering algorithm differs
slightly from the one proposed in [44]. In Equation (4.8) the accumulated sum is
normalised, while in [44], they use the un-normalised version. The value of Υτd,g
may be interpreted as follows: the numerator computes, for each output class τ , the
accumulated “strength” of each grid point zτd,g measured with respect to the input
x_d. For a uniform grid regime, z^\tau_{d,(g+1)} - z^\tau_{d,g} is a constant. The denominator is a
normalizing factor which will normalise the accumulated “strength” with respect to
the grid points so that the total contribution will sum to 1. In a way, the numerator
or the normalized value Υτd,g already provides information on the distribution of the
data density along particular dimension d. This can be used as a fuzzy membership
function. However, this does not take into account the distribution of the classes
at a particular grid point. Hence the correct formation of the membership function
will be the value of Υτd,g modulated by the distribution of the classes at each grid
point. This modified self organising membership function will provide the correct
membership function for the inputs.
The value of \Upsilon^\tau_{d,g} can be normalised across all values of \tau at each grid point as
follows:

\bar{\Upsilon}^\tau_{d,g} = \frac{\Upsilon^\tau_{d,g}}{\sum_{\tau=1}^{T} \Upsilon^\tau_{d,g}} \qquad (4.9)

For continuous outputs, \bar{\Upsilon}^\tau_{d,g} = 1, as \tau = 1.
In addition, \Upsilon^\tau_{d,g} can be normalised with respect to the input strength in
each input dimension d as follows:

\tilde{\Upsilon}^\tau_{d,g} = \frac{\Upsilon^\tau_{d,g}}{\|\Upsilon_d\|} = \frac{\Upsilon^\tau_{d,g}}{\max(\mathrm{svd}(\Upsilon_d)) + \beta} \quad \forall g \text{ and } \forall \tau, \text{ for a particular value of } d \qquad (4.10)
where ‖Υd‖ denotes the matrix norm of Υd, and the T ×Gd matrix Υd is formed by
collecting the rows corresponding to the d-th dimension of values Υτd,g for all values
of τ and g. The algorithm will run for d = 1, 2, . . . , D. β is a small number and
SVD stands for singular value decomposition of a matrix. The evaluation of the
matrix norm ‖ · ‖ using singular value decomposition is a standard procedure [33].
The value β is included to prevent the situation when the largest singular value is
close to zero.
Thus the membership function can be obtained as follows:

\varphi^\tau_{d,g} = \bar{\Upsilon}^\tau_{d,g}\, \tilde{\Upsilon}^\tau_{d,g} \qquad (4.11)
Note that there will be T ×Gd membership functions associated with each input
dimension d.
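A minimal Python sketch of the chain from Eq. (4.7) to Eq. (4.11), for a single input dimension with two output classes, may be helpful; the sample values below are illustrative only, while the grid, the two normalisations and the final product follow the equations above.

import numpy as np

data = {0: np.array([0.49, 0.38, 0.38, 0.44, 0.33]),  # class tau = 1 samples
        1: np.array([0.11, 0.33, 0.22])}               # class tau = 2 samples
G, lo, hi = 5, 0.1, 0.5
z = np.linspace(lo, hi, G)      # Eq. (4.7): uniform grid z_{d,g}
dz = z[1] - z[0]                # z_{d,(g+1)} - z_{d,g}, constant here

# Eq. (4.8): per-class normalised density strength at each grid point.
U = np.zeros((2, G))            # rows: tau, columns: g
for t, x in data.items():
    k = np.exp(-(x[:, None] - z[None, :]) ** 2 / (2 * dz ** 2)).sum(axis=0)
    U[t] = k / (len(x) * k.sum())

U_bar = U / U.sum(axis=0, keepdims=True)   # Eq. (4.9): probability measure
beta = 1e-6                                # guards against a near-zero singular value
U_tilde = U / (np.linalg.svd(U, compute_uv=False).max() + beta)  # Eq. (4.10)
phi = U_bar * U_tilde                      # Eq. (4.11): membership values
print(np.round(phi, 4))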
In order to decrease the computational time, we combine the grid points to-
gether into a cluster if the grid points are close together. This can be performed
by examining the density strength Υ computed, where Υ is the D ×Gd × T matrix
formed by the normalised values Υτd,g. If the slope of the Υ along the grid direction
changes from positive to negative or if it is zero then we merge the data labels which
belong to the winning grid point ω to form a cluster. On the other hand, if the slope
along the grid point direction changes from negative to positive this indicates that
it is the beginning of a new cluster. Note that in this case, there will be one such
graph associated with each input dimension d and each output class τ . The detailed
algorithm is shown in Figure 4.5.
For each output type τ
    For each dimension d
        Initialization
        For each grid point g
            If (lastpoint > Υ^τ_{d,g})
                slope = -1
            elseif (lastpoint < Υ^τ_{d,g})
                slope = 1
            else
                slope = 0
            End if
            If (lastslope == -1 AND slope == 1)
                Store grid position κ^τ_{d,c} = floor((last grid point + g) / 2)
                Store the data labels in ω^τ_{d,g} to a new cluster
            else
                Merge the data labels in ω^τ_{d,g} into the current cluster
            End if
            lastpoint = Υ^τ_{d,g}
            If (slope != 0)
                Update lastslope = slope
                Update last grid point = g
            End if
        End
    End
End

Figure 4.5: The pseudo code implementation of the proposed algorithm for the formation of clusters. A change of slope of Υ^τ_{d,g} from negative to positive along the grid direction marks the beginning of a new cluster; otherwise grid points are merged into the current cluster.
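The following is a runnable Python rendering of the cluster-formation pass of Figure 4.5 for a single output class and dimension; the strength values and label assignments below are illustrative stand-ins for Υ^τ_{d,g} and ω^τ_{d,g}.

import numpy as np

def form_clusters(strength, labels):
    # Group the data labels won by consecutive grid points into clusters;
    # a slope change of the strength from negative to positive starts a
    # new cluster, as in Figure 4.5.
    clusters = [list(labels[0])]
    last_point, last_slope = strength[0], 0
    for g in range(1, len(strength)):
        if last_point > strength[g]:
            slope = -1
        elif last_point < strength[g]:
            slope = 1
        else:
            slope = 0
        if last_slope == -1 and slope == 1:
            clusters.append(list(labels[g]))   # beginning of a new cluster
        else:
            clusters[-1].extend(labels[g])     # merge into the current cluster
        last_point = strength[g]
        if slope != 0:
            last_slope = slope
    return clusters

strength = np.array([0.10, 0.30, 0.20, 0.10, 0.25, 0.30])
labels = [[1], [2, 3], [4], [], [5], [6, 7]]
print(form_clusters(strength, labels))         # [[1, 2, 3, 4], [5, 6, 7]]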
After the input data distribution is analyzed using the self organising mountain
clustering membership function as indicated, the outputs obtained are discrete, due
to the grouping of points together. Such discrete values will need to be interpolated
so that the intermediate values are available. There are many possible interpolation
schemes. In this thesis, we will use a particular scheme, the Hermite interpolation
function. We chose the Hermite interpolation function because it is a simple
one to apply. In Equations (4.12) and (4.13), there are three parameters which need
to be set: (a) the data input x_d; (b) an array \{z_{d,\kappa^\tau_c-\Delta}, \ldots, z_{d,\kappa^\tau_{c+1}+\Delta}\} that contains the
grid points from the beginning of a cluster to the end of the cluster for the output
class \tau, where \Delta is the discretization interval; and (c) the values \{\Upsilon^\tau_{d,\kappa^\tau_c-\Delta}, \ldots, \Upsilon^\tau_{d,\kappa^\tau_{c+1}+\Delta}\},
where the array \kappa(\cdot) stores the locations of the points which are in the same cluster
c. The range of the points which contains the cluster is extended by \Delta on either side
to prevent the input data from falling outside the kernel, i.e., to overcome “boundary” effects.
P_{d,c}(\tau) = f\left( x_d,\; \{ z^\tau_{d,\kappa^\tau_{d,c}-\Delta}, \ldots, z^\tau_{d,\kappa^\tau_{d,c+1}+\Delta} \},\; \{ \bar{\Upsilon}^\tau_{d,\kappa^\tau_{d,c}-\Delta}, \ldots, \bar{\Upsilon}^\tau_{d,\kappa^\tau_{d,c+1}+\Delta} \} \right) \qquad (4.12)

P_{d,c}(x_d|\tau) = f\left( x_d,\; \{ z^\tau_{d,\kappa^\tau_{d,c}-\Delta}, \ldots, z^\tau_{d,\kappa^\tau_{d,c+1}+\Delta} \},\; \{ \tilde{\Upsilon}^\tau_{d,\kappa^\tau_{d,c}-\Delta}, \ldots, \tilde{\Upsilon}^\tau_{d,\kappa^\tau_{d,c+1}+\Delta} \} \right) \qquad (4.13)
where f(·) is the Hermite interpolation function.
We will need a way to combine the probability measure and the density measure
together so as to give the desired membership function. In this case we will use
the naive Bayes classifier [46]. The naive Bayes classifier algorithm classifies a data
point x by using a density function P (x|τ) and a probability function P (τ):
P(\tau|x) = \frac{P(x|\tau)\, P(\tau)}{P(x)} \qquad (4.14)

where P(x) is a normalizing factor which can be omitted.
When an input x belongs to a particular output class τ , the Bayes’ function
belonging to that particular output class will have a large response.
Because the input dimension is partitioned into clusters in the proposed al-
gorithm, the prior probability P (τ) is the probability function within the cluster
as computed using Equation (4.12) and P (x|τ) is the density probability function
within the cluster as computed using Equation (4.13).
Hence from Bayes’ theorem, we have:
\varphi^\tau_{d,c}(x_d) = P_{d,c}(x_d|\tau)\, P_{d,c}(\tau) \qquad (4.15)
where d = 1, 2, . . . , D denotes the d-th dimension of the input, and c = 1, 2, . . . , C denotes
the c-th cluster in the membership function. Note that there will be as many membership
functions in each output class τ as required.
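For illustration, the following Python sketch smooths discrete probability and density measures over one cluster's grid points and combines them as in Eq. (4.15). SciPy's PCHIP interpolator is used here as a stand-in piecewise cubic Hermite-type scheme, since the thesis does not spell out the exact Hermite variant used; the grid and measure values are illustrative.

import numpy as np
from scipy.interpolate import PchipInterpolator  # piecewise cubic Hermite-type

z = np.array([0.1, 0.2, 0.3, 0.4, 0.5])          # grid points of one cluster
prob = np.array([0.97, 0.82, 0.51, 0.26, 0.10])  # probability measure samples
dens = np.array([0.42, 0.48, 0.39, 0.21, 0.05])  # density measure samples

P_tau = PchipInterpolator(z, prob)    # stands in for f(.) in Eq. (4.12)
P_x_tau = PchipInterpolator(z, dens)  # stands in for f(.) in Eq. (4.13)

def phi(x):
    # Eq. (4.15): membership value as the product of the two measures.
    return P_x_tau(x) * P_tau(x)

print(float(phi(0.25)))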
An example of the Self-organizing mountain clustering membership function
smoothed with Hermite interpolation function is shown in Figure 4.6. In this ex-
ample, there is a one-dimensional input, with two output classes. The values of the
input corresponding to each output class are shown in the table on the top left hand
corner. We create a five point grid for each class. The values of Υτd,g are computed
as shown in the top right hand corner. Then, the values of \bar{\Upsilon}^\tau_{d,g} and \tilde{\Upsilon}^\tau_{d,g} (the
probability and density measures) are shown respectively in the two tables in the
middle row. Then the values of the membership function for each output class are
computed using the naive Bayes theorem, as
indicated previously in Equation (4.15). These membership functions are shown as
graphs in Figure 4.6.
The self organising mountain clustering membership function is a general tech-
nique for constructing a membership function from the given data. Hence this
may be applied to a general ANFIS architecture. In this case, the self organising
mountain clustering membership function will provide an alternative for finding the
centers and spreads of the Gaussian function membership function commonly used
in the ANFIS architecture as follows:
\varphi^\tau_{d,c}(x_d) = \exp\left( -\,\frac{\left( x_d - \frac{z^\tau_{d,\kappa^\tau_{d,c}} + z^\tau_{d,\kappa^\tau_{d,c+1}}}{2} \right)^2}{\left( z^\tau_{d,\kappa^\tau_{d,c+1}} - z^\tau_{d,\kappa^\tau_{d,c}} \right)^2} \right) \qquad (4.16)
Thus this may be a viable alternative to finding the centers and spreads of the
Gaussian functions.
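As a sketch of Eq. (4.16), the center and spread of the Gaussian are taken directly from the boundaries of a cluster of grid points; z_lo and z_hi below are illustrative stand-ins for z^τ_{d,κ^τ_{d,c}} and z^τ_{d,κ^τ_{d,c+1}}.

import numpy as np

def gaussian_mf(x, z_lo, z_hi):
    # Eq. (4.16): center at the cluster midpoint, spread set by its width.
    center = 0.5 * (z_lo + z_hi)
    spread = z_hi - z_lo
    return np.exp(-((x - center) ** 2) / (spread ** 2))

x = np.linspace(0.0, 1.0, 5)
print(np.round(gaussian_mf(x, 0.2, 0.6), 3))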
4.6 Parameter estimation
The adjustable parameters in the architecture can be estimated from the training
data. There are two approaches. The first approach is to assume that all parameters
can be simultaneously adjusted by minimising an output error function, and the second
approach is a phased approach, which adjusts the parameters in each section of
the architecture individually by holding the parameters in other sections of the ar-
chitecture constant. There are three natural sections in the architecture: (1) the
Raw Data (value / type):
    0.49(1)  0.38(1)  0.38(1)  0.44(1)  0.33(1)  0.11(2)  0.33(2)  0.22(2)

Strength Υ from the mountain function (5 grid points in the dimension):
    Type 1: 0.2703  0.3091  0.2513  0.1342  0.0351
    Type 2: 0.0087  0.0702  0.2406  0.3807  0.2998

Probability measure \bar{Υ}:
    Type 1: 0.9687  0.8150  0.5109  0.2606  0.1049
    Type 2: 0.0313  0.1850  0.4891  0.7394  0.8951

Density measure \tilde{Υ}:
    Type 1: 0.4160  0.4756  0.3867  0.2065  0.0541
    Type 2: 0.0134  0.1080  0.3702  0.5859  0.4613

[Plot: the final membership functions for Class 1 and Class 2, output vs. grid.]

Figure 4.6: Example of finding the membership function using the proposed self organising mountain clustering membership function method.
input section (the formation of the membership function), (2) the parameters γ_r,
r = 1, 2, . . . , R, in the error correction layer (Layer 4), and (3) the output layer
(Layer 6). The simultaneous adjustment of all parameters together is quite difficult,
in that it is a highly complex optimisation problem. Hence in this thesis, we
will consider the second approach, viz., to adjust the parameters of a section of the
architecture, while holding the parameters of the other sections constant.
The adjustment of the parameters of the membership function can be carried
out once the type of membership function is chosen. For example, if we choose
the Gaussian function as the membership function, then the parameters pertaining
to the Gaussian membership function may be determined from the training data.
On the other hand, if a self organising mountain clustering membership function is
chosen, then the method shown in Section 4.5 can be used to automatically find the
parameters of the membership function. In this section, we will only be considering
the issues of training the parameters in the error correction layer, and the parameters
in the output layer.
4.6.1 Error Correction Layer (Layer 4) parameter learning
We are given the following training data: xd,i, i = 1, 2, . . . , I, and d = 1, 2, . . . , D,
with an associated desired output set dτi , i = 1, 2, . . . , I; and τ = 1, 2, . . . , T . Here
for simplicity we assume that there is only a scalar output. The training method for
a continuous output and for discrete outputs (with output classes τ = 1, 2, . . . , T)
are different.
1. Continuous output
In this case, the training minimizes the output errors e_i = d_i − y_i, while for
discrete outputs it minimizes the type misclassification errors.
\gamma_i^{new} = \gamma_i^{old} - \eta \frac{\partial E}{\partial \gamma_i} \qquad (4.17)
The parameters to be adjusted in the error correction layer (Layer 4) are those
described⁶ by Equation (4.2). The gradients are shown in Equation (4.18); this
is no more than the common chain rule of differentiation.
\frac{\partial E}{\partial \gamma_r} = \frac{\partial E}{\partial E_i} \sum_{i=1}^{I} \frac{\partial E_i}{\partial e_i}\, \frac{\partial e_i}{\partial y_i} \sum_{r=1}^{R} \frac{\partial y_i}{\partial \bar{\pi}_{ir}}\, \frac{\partial \bar{\pi}_{ir}}{\partial \pi_{ir}}\, \frac{\partial \pi_{ir}}{\partial \gamma_r} \qquad (4.18)
Note that the values of π_{ir} etc. are produced from a particular choice of membership
functions with particular parameters. Hence ∂E/∂γ_r depends directly on
the parameters of the membership function. The gradient descent algorithm
then provides a way to adjust the parameters of the membership functions.
2. Discrete output classes
⁶Note that Eq. (4.2) describes the signals which are generated from the deployment of the mem-
bership functions used in the antecedent. The parameters associated with these membership func-
tions are the ones to be adjusted. As there are many possibilities for the membership functions,
e.g., Gaussian membership function, self organising mountain clustering membership function, we
decided to refer to the equation, instead of providing an indication of the parameters to be adjusted.
The training of the case with discrete output classes is different from that of
the continuous output case. In this case it is necessary to provide the negative
output class as well. In other words, instead of considering only the particular
output class, we need to consider the possibility of the negative counterpart:
the possibility of the output not in the output class. This situation is shown in
Figure 4.7. It is observed that in this case, the positive part and the negative
part are “intertwined” together. In the case of the continuous output, as
τ = 1, there is no need for the introduction of the negative part of
the architecture.
[Diagram: the six-layer EANFIS (Layers 1–6), with the input x feeding the membership functions ϕ and the rule signals π flowing through both the positive (τ) and negative (¬τ) output class paths of the error correction layer, with rule weights w_r at the output.]

Figure 4.7: A diagram to illustrate the training of the EANFIS for the case of discrete output classes. The notation ¬ denotes the negative of the output class τ.
Thus the training for the discrete output classes is to minimize the misclassifi-
cation errors of the output cluster classes in the rule layer. In the classification
process, each fuzzy rule should give an output which belongs to one output
class. Hence, the normalized output for the correct output cluster from the
summation of different fuzzy rule sets in layer 4 should be 1. On the other
hand, the normalized output for the other output clusters should be 0. The
error function for the correct output cluster is shown in Equation (4.19) and
for all the other output clusters is shown in Equation (4.20).
e^\tau_i = 1 - \sum_{r \in \tau} \pi^\tau_{ir} \qquad (4.19)

e^{\neg\tau}_i = 0 - \sum_{r \in \tau} \pi^{\neg\tau}_{ir} \qquad (4.20)
Applying the gradient descent algorithm to minimize these errors results in
Equation (4.21).
\frac{\partial E}{\partial \gamma^\tau_r} = \frac{\partial E}{\partial E_i} \sum_{i=1}^{I} \frac{\partial E_i}{\partial e^\tau_i} \cdot \frac{\partial e^\tau_i}{\partial \pi^\tau_{ir}} \cdot \frac{\partial \pi^\tau_{ir}}{\partial \pi_{ir}} \cdot \frac{\partial \pi_{ir}}{\partial \gamma^\tau_r} \qquad (4.21)
Again the values of π_{ir} etc. depend on the parameters of the membership func-
tions, and hence, it would not be possible to provide a more explicit formula
for the gradient function. Once a particular membership function is chosen,
then it is possible to provide a gradient descent algorithm for the adjustment
of the parameters so as to minimize the error function.
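Because the analytic gradients in Eqs. (4.18) and (4.21) depend on the particular membership functions chosen, the following Python sketch illustrates only the update rule of Eq. (4.17), using numerical gradients and a hypothetical logistic error-correction model; none of these modelling choices is prescribed by the thesis.

import numpy as np

def rule_outputs(gamma, firing):
    # Hypothetical error-corrected, normalised rule responses pi_{ir}.
    corrected = firing / (1.0 + np.exp(-gamma))
    return corrected / corrected.sum(axis=1, keepdims=True)

def loss(gamma, firing, target):
    # Sum of squared errors between rule responses and class targets,
    # in the spirit of Eqs. (4.19)-(4.20).
    return ((rule_outputs(gamma, firing) - target) ** 2).sum()

rng = np.random.default_rng(0)
firing = rng.uniform(0.1, 1.0, size=(8, 4))  # firing strengths, 8 samples x 4 rules
target = np.eye(4)[rng.integers(0, 4, 8)]    # one-hot class targets per sample
gamma = np.zeros(4)
eta, eps = 0.5, 1e-6
for _ in range(200):
    grad = np.array([(loss(gamma + eps * np.eye(4)[r], firing, target)
                      - loss(gamma, firing, target)) / eps for r in range(4)])
    gamma -= eta * grad                      # Eq. (4.17)
print(np.round(gamma, 3))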
4.6.2 Output layer parameter learning
Training in the output layer is straightforward, as it is minimising a cost function
with respect to a linearly parameterized equation. The weights are calculated using
a pseudo-inverse method by using the common normal equation [33].
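A minimal sketch of this step: with the normalised rule responses held fixed, the output weights solve a linear least squares problem, which the pseudo-inverse handles directly; the synthetic data below are illustrative.

import numpy as np

rng = np.random.default_rng(1)
Pi = rng.uniform(size=(100, 2))   # normalised rule responses, 100 samples x 2 rules
true_w = np.array([0.7, -0.3])
d = Pi @ true_w + 0.01 * rng.standard_normal(100)  # desired outputs

# Normal equation solution w = (Pi^T Pi)^{-1} Pi^T d, computed via the
# pseudo-inverse for numerical robustness.
w = np.linalg.pinv(Pi) @ d
print(np.round(w, 3))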
4.6.3 Summary
To summarise the development up to this point, the system consists of four main
parts:
(1) Membership Function. The membership function can be any classic mem-
bership function such as Gaussian function or the proposed self organising
mountain clustering membership function. If we are using the self organising
mountain clustering membership function, the grid over the input space
can be either a linear grid or a non-uniform grid. For the determination of the
non-uniform grid regime we can use the method described in Chapter 3. The
self organising mountain clustering membership function can also be used to
determine the spreads and centers of the Gaussian functions if they are used
as membership functions.
(2) Rule Layer. The rule formation method is inspired by the Apriori Algorithm
which provides a rule formation method. The reduction in the number of rules
which need to be formed would save system memory as it is not necessary
to generate the redundant rules (rules that would not have sufficient firing
strength even if they are formed).
(3) Error Correction Layer. This layer is used for error correction; it uses a
logistic function for each rule. This can lead to an improvement of the
overall performance of the EANFIS architecture.
(4) Output Layer. There are three alternatives in the output layer. They are
(a) discrete output training method, (b) Zeroth order single weight method
and (c) First order Takagi-Sugeno-Kang method. The weight in each rule is
adjusted by using either a steepest descent algorithm or a normal equation.
The proposed EANFIS architecture is very flexible. The error correction layers
(layers 4 and 5) can be switched off, leaving us with the classic ANFIS architecture.
On the other hand, if layers 4 and 5 are switched on then this can lead to improve-
ments in the performance of the overall network. This claim is supported by the
application of such techniques to a number of practical problems.
4.7 Experimental Results
In this section, we will apply the proposed methods to a number of examples⁷. These
include the classic, not linearly separable, Exclusive OR gate problem, the Sunspot cycle
time series based on real world data, the Mackey Glass chaotic time series which is a
7The implementation of the algorithms has been carried out using Matlab, as this allows rapid
evaluation of the various parameters in the algorithm. The attached weights in the output layer
are trained by normal equation.
computer generated example, the iris dataset which is known to be a nonlinearly
separable problem, a high-dimensional real world problem: the Wisconsin Breast
Cancer dataset and a well known neuro-fuzzy control system “Inverted Pendulum
on a cart”. These examples are chosen because they exhibit various characteristics
which are useful in illustrating the properties of the proposed architecture. The
Exclusive OR problem is a classic benchmark problem which will reveal some of
the underlying properties of the proposed architecture, e.g. the number of rules and
their values. The sunspot cycle time series is a classic benchmark problem which
will indicate how well the proposed architecture can estimate a set of rules for the
problem. The Mackey Glass time series is a chaotic time series. This will show how
well the proposed architecture will approximate such a chaotic time series. The iris
dataset will allow us to find out how well the proposed architecture can estimate
the rules. The Wisconsin breast cancer problem is a multi-dimensional problem.
This will provide a demonstration of how well the proposed architecture handles a
high dimensional dataset. The inverted pendulum problem is a well known control
problem which is used to benchmark fuzzy control algorithms. Hence by applying
our proposed architecture on this classic problem we will be able to show how well
it works as compared to other algorithms. In addition, we will compare the results
with those obtained using an ANFIS architecture.
4.7.1 Exclusive OR problem
The first experiment is the classic exclusive OR (XOR) problem. This is the simplest
possible nonlinear problem which cannot be linearly separated [51]. In other words,
it cannot be solved using a linear classifier. By applying the proposed EANFIS
architecture to this very simple problem, it is hoped that it will shed some insight
into the operation of the architecture. For the XOR problem, there are two discrete
output classes: 0 and 1 respectively. Applying the proposed self organising moun-
tain clustering membership function method to this problem results in two clusters
for each input dimension (there are two input dimensions altogether) as shown in
Figure 4.8. The results are shown in Table 4.1.
Table 4.1: The input-output pairs of the exclusive OR problem.

Training Data   Output     Testing Data   Output
0 0             0          0.4 0.4        0
1 0             1          0.6 0.4        1
0 1             1          0.4 0.6        1
1 1             0          0.6 0.6        0
Note that in the exclusive OR problem, we are normally given the training data set,
but not the testing data set, as the testing data set is the same as the training data
set. In Table 4.1 we have created a testing data set (so that the training data set
and the testing data set do not have any overlaps) which consists of four values as
shown. It is observed that the trained model correctly identified the output classes
of the testing data set.
Further analysis of the self organising mountain clustering membership function
together with the error correction layers reveal that the problem is solved by using
four fuzzy rules. This is shown in Table 4.2.
[Four panels: MF of output type 0 in dimension 1; MF of output type 0 in dimension 2; MF of output type 1 in dimension 1; MF of output type 1 in dimension 2.]

Figure 4.8: The resulting data clusters for the exclusive OR problem after the application of the mountain clustering membership function; ‘*’ denotes the first cluster, ‘o’ denotes the second cluster.
Table 4.2: The fuzzy rules found for the exclusive OR problem using our proposed method for rule formation.
R1: If x1 is ϕ11 and x2 is ϕ21 then
Output class is type 0 (Support:25%, Confidence:100%)
R2: If x1 is ϕ12 and x2 is ϕ22 then
Output class is type 0 (Support:25%, Confidence:100%)
R3: If x1 is ϕ11 and x2 is ϕ22 then
Output class is type 1 (Support:25%, Confidence:100%)
R4: If x1 is ϕ12 and x2 is ϕ21 then
Output class is type 1 (Support:25%, Confidence:100%)
Here the functions ϕ_{ij}(·) denote the membership functions as found by the self
organising mountain clustering membership function method (their shapes are as
shown in Figure 4.8), where i denotes the i-th dimension and j denotes the j-th
cluster, i = 1, 2; and j = 1, 2.
In addition, we compare the performance of the following three methods:
• The EANFIS architecture with the self organising mountain clustering mem-
bership function method.
• The EANFIS architecture but with Gaussian membership functions. The cen-
ters and spreads of the Gaussian functions are determined using the self or-
ganising mountain clustering membership function method.
• The ANFIS architecture with Gaussian membership functions. Again the
centers and spreads of the Gaussian membership functions are determined
using the self organising mountain clustering membership function method.
The membership functions are shown in Figure 4.9.
The results of comparison of these three methods are shown in Table 4.3. For
simplicity, the outputs of the three methods all use the single weight mechanism as
discussed in Section 2.5.1.
The architecture found for this problem is shown in Figure 4.10.
The following observations can be made from the results of this experiment:
[Two panels: Gaussian membership functions (output vs. input) in dimension 1 and dimension 2.]

Figure 4.9: The Gaussian membership functions for the ANFIS architecture for the exclusive OR problem. There are two membership functions per dimension.
Table 4.3: The results of the XOR problem comparing three methods: the EANFIS architecture with mountain clustering membership function, the EANFIS architecture with Gaussian membership functions, and the ANFIS architecture with Gaussian membership functions.
Desired Output EANFIS EANFIS ANFIS
(Mountain MF) (Gaussian MF) (Gaussian MF)
Output Layer Single weight Single weight Single weight
0 0 0.463422 0.476742
1 1 0.534846 0.523258
1 1 0.534846 0.523258
0 0 0.466803 0.476742
[Diagram: network with inputs X1 and X2 and output y.]

Figure 4.10: The architecture of the XOR problem as found by the proposed EANFIS architecture.
• The EANFIS architecture with self organising mountain clustering membership functions performs the best.
• The outputs of the EANFIS architecture with Gaussian membership functions
and the ANFIS architecture are comparable. They both yield the correct
classifications if we incorporate a threshold unit to their outputs, so that an
output value above 0.5 will be classified as 1, while a value below 0.5 will
be classified as 0.
• In all cases, four fuzzy rules are required. In this instance the EANFIS ar-
chitecture with the self organising mountain clustering membership function
method is not effective in reducing the number of rules required.
4.7.2 Sunspot Cycle Time Series
In this section, we will apply the proposed methods to study the sunspot cycle
time series considered in the last chapter. We will take the average of the sunspot
numbers of each month and use 55 years of data for training and the remaining 200 years
for prediction. For convenience the sunspot number time series is as shown in Figure
4.11. The training data is shown in bold while the testing data is shown in lighter
colour. This arrangement of the training data and testing data conforms with the
standard ways [65, 66] in which this time series has been applied in benchmarking
various estimation algorithms.
[Plot: sunspot number vs. date, 1750 to 2000.]

Figure 4.11: The monthly average sunspot number. The training data (55 years) are shown in dark, while the testing data (200 years) are shown in lighter colour.
We arrange the data in the format of input-output pair as follows:
[x(t-4), x(t-3), x(t-2), x(t-1); x(t)]
In other words, we will express the nonlinear relationship between the current
data point, as a function of the immediate past four data points.
\hat{x}(t) = f(x(t-1), x(t-2), x(t-3), x(t-4)) \qquad (4.22)

where \hat{x}(t) is the predicted value of x(t) as a function of the immediate past four
data points, and f(·) is a nonlinear function. The aim is to minimize the error
function E:
E = \sum_{t=1}^{N} e(t)^2 = \sum_{t=1}^{N} \left( x(t) - \hat{x}(t) \right)^2 \qquad (4.23)
where N is the total number of data points in the testing data set. As indicated
previously this time series is a standard benchmark for various estimation algorithms
[65,66]. In this thesis we are not interested in obtaining the best results as compared
to those found in the literature [65,66]. Our aim instead is to use the sunspot cycle
time series to illustrate the versatility of the proposed methods in this thesis. In
particular, we wish to illustrate its potential in reducing the number of rules which
need to be formed. We will however be comparing the results using the proposed
method with those obtained using ANFIS architectures.
We apply the proposed methods using the self organising mountain clustering
membership function on the training data. The resulting self organising mountain
clustering membership function is as shown in Figure 4.12.
It is noted that the membership functions obtained are asymmetrical. This may be
[Four panels: membership functions in dimensions 1 to 4.]

Figure 4.12: The self organising mountain clustering membership functions of the sunspot cycle time series; ‘*’ denotes the first cluster, ‘o’ denotes the second cluster.
one reason why so far the other prediction methods [65,66] have not been perform-
ing too well. They have not considered the possibility that the input membership
function may be asymmetrical. Note that if we use Gaussian functions to approxi-
mate the asymmetrical shape, this would require quite a large number of Gaussian
functions for adequate approximation.
It is noted that there are two clusters per dimension (there are four dimensions).
The final output which combines the output from Rule 1 and Rule 2 is shown in
Figure 4.13. The details of the fuzzy rules are shown in Table 4.4.
[Diagram: inputs X1, X2, X3, X4 feed Rule 1 and Rule 2, whose outputs are summed to form the output.]

Figure 4.13: The combination of the fuzzy rules for the EANFIS architecture.
It is noted that for the fuzzy rules, the first three dimensions x(t − 1), x(t − 2),
and x(t − 3) pertain to the same condition, while x(t − 4) is the discriminating
dimension in that its change “triggers” the change of the rules. This “triggers” the
change in the linear combination weights in forming the output of layer 1.
The prediction outputs using a linear grid regime with self organising mountain
clustering membership function is shown in Figure 4.14. The prediction outputs
using an ANFIS architecture with Gaussian membership functions is as shown in
Figure 4.15.
The architecture found by using the EANFIS architecture is shown in Figure 4.16.
We will now compare the results of using the following architectures
Table 4.4: The fuzzy rules found for the sunspot cycle time series.
R1: If x1 >= −34.13 and x1 <= 273.03 AND
x2 >= −34.13 and x2 <= 273.03 AND
x3 >= −34.13 and x3 <= 273.03 AND
x4 >= −21.72 and x4 <= 43.43 THEN
y = −0.076x1 + 0.246x2 + 0.417x3 + 0.505x4 − 1.866
(Support: 23.98%, Confidence:100%)
R2: If x1 >= −34.13 and x1 <= 273.03 AND
x2 >= −34.13 and x2 <= 273.03 AND
x3 >= −34.13 and x3 <= 273.03 AND
x4 >= 0 and x4 <= 260.62 THEN
y = 0.147x1 + 0.116x2 + 0.121x3 + 0.552x4 + 3.839
(Support: 76.02%, Confidence:100%)
[Plot: sunspot number vs. date, 1750 to 2000.]

Figure 4.14: The prediction results of the monthly average sunspot number time series using an EANFIS architecture with linear grid regime using the self organising mountain clustering membership functions.
[Plot: sunspot number vs. date, 1750 to 2000.]

Figure 4.15: The prediction results of the monthly average sunspot number time series using an ANFIS architecture with Gaussian membership functions.
[Diagram: network mapping the inputs X(t−4), X(t−3), X(t−2), X(t−1) to the output X(t).]

Figure 4.16: The architecture found by using the proposed EANFIS architecture.
(see Table 4.5):
• An ANFIS architecture using two ways of forming the outputs: (1) the single
weight method, and (2) the TSK mechanism. In this case, there are a total of
16 rules required. It is observed that while the one trained using the TSK mechanism
has a smaller training error, the prediction error on the testing data set is
much worse than the one using a single weight output regime. This may
be attributed to the fact that the one using the TSK mechanism exhibits an over-
training phenomenon. In other words, while the training error might be small,
the model is trained such that it “accommodates” all the noise content in
the training data set. Hence when it is used to evaluate its generalisation
capabilities, the prediction error on the testing data set is much worse. On
the other hand, the one trained using a single weight output mechanism is
much better in that the error value for the training data set is comparable to
that in the testing data set.
• An EANFIS architecture with Gaussian membership functions. Here the EANFIS
Table 4.5: The RMS errors of applying various methods on the sunspot number time series. Please see the text for explanation of the experimental conditions.
ANFIS
Output Layer Single weight Output Layer TSK
Rule 16 Rule 16
Training Error Prediction Error Training Error Prediction Error
16.0084 16.7495 14.1094 163.0768
EANFIS (Gaussian MF)
linear setup grid Nonlinear setup grid
Output Layer TSK Output Layer TSK
Rule 2 Rule 2
Training Error Prediction Error Training Error Prediction Error
16.1170 16.0396 16.1612 16.0400
EANFIS (Mountain MF)
linear setup grid Nonlinear setup grid
Output Layer TSK Output Layer TSK
Rule 2 Rule 2
Training Error Prediction Error Training Error Prediction Error
16.1568 15.9939 16.1611 15.9901
architecture finds that only two rules will be sufficient. We have evaluated
the performances of using a linear grid regime and a nonlinear grid regime re-
spectively. It is found that the linear grid regime appears to perform slightly
better than the nonlinear grid regime. If the underlying problem is rather
uniform, then the nonlinear grid method may not be as efficient as the linear
grid method. The nonlinear grid method tunes the neuron centers according
to the training data. It may lead to overtraining.
• An EANFIS architecture with self organising mountain clustering membership
functions. Here again we investigate both the linear grid and nonlinear grid
regimes respectively. Again it is found that the EANFIS architecture informs
us that only two rules are required in each case. We also investigated the
performance of the architecture if we use the single weight and TSK output
mechanisms. It is found that there is a slight difference in the performance of
the two methods in that it appears the one using the TSK mechanism has a
slight edge over the one using a single weight regime.
In general, from Table 4.5 we can make the following observations:
• The EANFIS architecture is capable of using a reduced number of rules. Note
that the number of rules is found automatically using our proposed method.
• The EANFIS architecture performs better than the ANFIS architecture.
• There is virtually no difference in the performance of the EANFIS architecture
using a linear grid or a nonlinear grid.
• There is almost no difference in the performance of the EANFIS architecture
using a single weight or a TSK output mechanism.
From this experiment, we can confirm that our proposed methods appear to
work well for this example.
4.7.3 Mackey-Glass Time Series example
Another well known benchmark problem for the evaluation of various prediction
algorithms is the Mackey-Glass chaotic equation (see Equation (4.24)).
\dot{x}(t) = \frac{0.2\, x(t-\tau)}{1 + x^{10}(t-\tau)} - 0.1\, x(t) \qquad (4.24)
where x(t) is the output of the equation. Note that this is a nonlinear delay differen-
tial equation with a delay of τ in the argument. As this equation will produce a con-
tinuous time output, we will use a fourth-order Runge-Kutta method [52] to integrate
the equation and obtain the solution. The initial condition is x (0) = 1.2, τ = 17 and
the step size used is 0.1. In order to eliminate the effect due to the initial condition,
we only extract the data from t = 118 to 1117. The first 500 pairs are used for
training and the rest are used for testing. The input-output data format is arranged
as follow:
[x(t-18), x(t-12), x(t-6), x(t); x(t+6)]
In other words, we wish to estimate a model of the following form:
x(t + 6) = f(x(t), x(t − 6), x(t − 12), x(t − 18)) (4.25)
where f is a nonlinear function. Note that in this case we are only sampling once
every 6 samples. This is the standard way in which data processing for this equation
has been carried out [67]. Hence we will follow this standard approach. Note that
again we are not interested in “pitching” our proposed algorithm against the other
prediction methods [67]. But instead we wish to use this equation to illustrate the
potential of our proposed method. We will however compare the results of our
proposed methods with those obtained using the ANFIS architecture. The results
are shown in Table 4.6.
Table 4.6: The RMS errors on the Mackey-Glass equation, comparing the ANFIS architecture, the EANFIS architecture with the self organising mountain clustering membership function, and the EANFIS architecture with Gaussian membership functions.
ANFIS
Output Layer TSK
Training Error Prediction Error Rule
0.00245 0.00234 16
EANFIS (Mountain MF) EANFIS (Gaussian MF)
Output Layer TSK Output layer TSK
Rule 12 Rule 12
Training Error Prediction Error Training Error Prediction Error
0.01146 0.01136 0.00279 0.00275
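For concreteness, the following Python sketch generates the series and arranges the training pairs under the stated conditions; holding the delayed term fixed within each Runge-Kutta step is a simplification of this sketch, not necessarily the procedure used in the thesis.

import numpy as np

tau, h, steps = 17, 0.1, 11300
lag = int(round(tau / h))
x = np.zeros(steps)
x[0] = 1.2                                  # initial condition x(0) = 1.2

def f(xt, xlag):
    # Right hand side of Eq. (4.24).
    return 0.2 * xlag / (1.0 + xlag ** 10) - 0.1 * xt

for n in range(steps - 1):                  # fourth-order Runge-Kutta, step h
    xlag = x[n - lag] if n >= lag else 0.0  # history before t = 0 taken as 0
    k1 = f(x[n], xlag)
    k2 = f(x[n] + 0.5 * h * k1, xlag)
    k3 = f(x[n] + 0.5 * h * k2, xlag)
    k4 = f(x[n] + h * k3, xlag)
    x[n + 1] = x[n] + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

# Arrange [x(t-18), x(t-12), x(t-6), x(t); x(t+6)] for t = 118 ... 1117.
xt = lambda t: x[int(round(t / h))]
pairs = np.array([[xt(t - 18), xt(t - 12), xt(t - 6), xt(t), xt(t + 6)]
                  for t in range(118, 1118)])
train, test = pairs[:500], pairs[500:]
print(train.shape, test.shape)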
The results show that the self organising mountain clustering membership function
performs worse than the Gaussian membership function. So for this experiment,
we propose to use the self organising mountain clustering membership function
method to determine the spreads and centers of the Gaussian membership func-
tion to be used in the EANFIS architecture and compare them with the ANFIS
using the same Gaussian membership functions. The EANFIS architecture using
Gaussian membership functions requires 12 fuzzy rules while the ANFIS architec-
ture requires 16 fuzzy rules. It is observed from Table 4.6 that the performance of
the EANFIS architecture and the ANFIS architecture are comparable.
Figure 4.17 shows the output of the Mackey-Glass equation, while Figure 4.18
shows the membership function using the EANFIS architecture with the Gaussian
membership functions.
[Plot: amplitude vs. time, t = 700 to 1100.]

Figure 4.17: The output of the Mackey-Glass equation.
[Four panels: Gaussian membership functions in dimensions 1 to 4.]

Figure 4.18: The Gaussian membership function for the Mackey-Glass equation example.
Figure 4.19 shows the output using an ANFIS architecture with 16 Gaussian
membership functions, while Figure 4.20 shows the output using an EANFIS archi-
tecture with 12 Gaussian membership functions. It may be noted that there is very
little difference in the two output signals from different architectures.
[Plot: amplitude vs. time, t = 700 to 1100.]

Figure 4.19: The outputs of a neuro-fuzzy network using the ANFIS architecture with 16 Gaussian membership functions for the Mackey-Glass equation.
As a result it is more convenient to consider the differences of the outputs of
these networks and compare them with the original output of the Mackey-Glass
equation. These are shown in Figure 4.21 and Figure 4.22 respectively.
It is observed from Figures 4.21 and 4.22 that there is little difference between
these two error signals.
The architecture found using the EANFIS architecture is shown in Figure 4.23.
[Plot: amplitude vs. time, t = 700 to 1100.]

Figure 4.20: The outputs of the EANFIS architecture with 12 Gaussian membership functions for the Mackey-Glass equation.
[Plot: error vs. time, t = 700 to 1100.]

Figure 4.21: The difference between the output of the ANFIS architecture with 16 Gaussian membership functions and the original signal for the Mackey-Glass equation.
[Plot: error vs. time, t = 700 to 1100.]

Figure 4.22: The difference between the output of the EANFIS architecture with 12 Gaussian membership functions and the original signal for the Mackey-Glass equation.
These results naturally raise a question: what are the added advantages of using
an EANFIS architecture? An obvious advantage would be that the EANFIS architecture
uses a smaller number of rules. However, this may not be sufficient to justify
the addition of two extra layers. The slightly unexpected observation of comparable
performance between the ANFIS architecture and the EANFIS architecture may
be explained as follows: the Mackey-Glass equation does not contain any noise, so
the advantage of having two extra layers in the case of the EANFIS architecture may
have been lost. In other words, the EANFIS architecture will work better if there is
noise in the output, e.g., the sunspot cycle time series. This
may be the reason why the ANFIS architecture performs slightly better than the
EANFIS architecture.
[Diagram: network mapping the inputs X(t−18), X(t−12), X(t−6), X(t) to the output X(t+6).]

Figure 4.23: The architecture found by using the proposed EANFIS architecture.
4.7.4 Iris Dataset
In this section, we will apply the EANFIS architecture to the Iris example considered
in the last chapter. This dataset consists of three types of iris flowers. Two of them
are linearly inseparable. We will randomly select 51 data points for testing and the
remaining 99 data points for training purposes. Once the training and testing
data sets are chosen, they will remain fixed for one experiment cycle using both
the ANFIS architecture and the EANFIS architecture. This evaluation will run
100 times, each time using a different set of training and testing data sets, and the
average of the performances of these two architectures is then used for comparison.
These are the values reported in Table 4.7.
We carried out the following experiments:
(1) ANFIS architecture with either single weight output scheme or TSK output
scheme. In this case, we use 16 rules. The membership function is Gaussian
membership function.
(2) EANFIS architecture without the error correction layer using the self organ-
ising mountain clustering membership functions and using either the single
weight output scheme or the TSK output scheme.
(3) EANFIS architecture with the error correction layer using the self organising
mountain clustering membership function and using either the single weight
output scheme or the TSK output scheme.
The main reason why we carried out experiment (2) is that we wish to isolate the
effect of having the error correction layer versus not having the error correction
layer. Note that without an error correction layer the main difference between
the ANFIS architecture and the EANFIS architecture is the membership function.
Hence by comparing the results of experiments (1) and (2) we will be able to conclude
the effectiveness of the self organising mountain clustering membership function
as compared with the Gaussian membership function. The other difference is the
number of rules used. In the case without an error correction layer for the EANFIS
architecture, the number of rules used will be 16, while with a probability layer, the
number of rules used will be as observed in the table, only 3. Hence by comparing
the results of experiments (2) and (3) it will inform us on the effectiveness of using
the reduced number of rules.
From Table 4.7, we may make the following observations:
• For the ANFIS architecture with 16 rules and Gaussian membership functions, the single weight output scheme performs significantly better than the TSK output scheme. This implies that the outputs are not strongly dependent on the inputs (as the TSK output scheme incorporates influences from the inputs directly).
• For the EANFIS architecture without the error correction layer using the self
organising mountain clustering membership functions with 16 rules, the TSK
output scheme works better than the single weight output scheme. This result
is at variance with the one using the ANFIS architecture.
• Similarly with an error correction layer, the EANFIS architecture performs
Table 4.7: The outcomes of applying the EANFIS architecture and the ANFIS architecture on the Iris data set. The values reported in this table are obtained from an average of 100 experiments using randomly selected 99 training data samples and 51 testing data samples.
ANFIS:
  Single weight output layer:  Accuracy 98.1569%, Variance 3.48,  Rules 16
  TSK output layer:            Accuracy 74.0392%, Variance 52.90, Rules 16
EANFIS (Mountain MF), TSK output layer:
  Without probability layer:   Accuracy 98.4314%, Variance 3.11,  Rules 16
  With probability layer:      Accuracy 98.4314%, Variance 3.11,  Rules 3
EANFIS (Mountain MF), single weight output layer:
  Without probability layer:   Accuracy 83.1176%, Variance 15.06, Rules 16
  With probability layer:      Accuracy 82.7451%, Variance 14.45, Rules 3
better using a TSK output scheme instead of the single weight output scheme.
• Comparing the results of the EANFIS architecture and the ANFIS architecture, the EANFIS architecture with the TSK output scheme performs better than the ANFIS architecture with the single weight output scheme.
This result is quite difficult to explain in isolation from the other results. Hence we will delay the discussion of these results until later, when we have considered the other results on the Wisconsin breast cancer example.
The following figures show some of the details of the results. Figure 4.24 shows the membership functions used in the EANFIS architecture. It is observed that the membership functions show asymmetry, and hence it is not surprising that the ANFIS architecture requires 16 rules in order to “approximate” these shapes adequately.
Table 4.8 shows the fuzzy rules in the EANFIS architecture for the Iris dataset. It is interesting to note that the three rules are all based on (essentially) the same bracketed region:

0.0429 ≤ x1 ≤ 0.1425
0.0388 ≤ x2 ≤ 0.1553
−0.0177 ≤ x3 ≤ 0.2066
−0.0403 ≤ x4 ≤ 0.2358

Only the consequence part differs among the three rules.
[Four panels: membership functions in dimensions 1 to 4; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 4.24: The membership functions (mountain clustering) for the Iris data set.
Table 4.8: The extracted fuzzy rules for the Iris dataset. These rules are used in the EANFIS architecture.

R1: If x1 is φ^1_{1,1}(x1, 0.0429, 0.1425) AND
    x2 is φ^1_{2,1}(x2, 0.0388, 0.1553) AND
    x3 is φ^1_{3,1}(x3, −0.0177, 0.2066) AND
    x4 is φ^1_{4,1}(x4, −0.0403, 0.2358) THEN
    y = 0.1377x1 − 1.5479x2 + 9.8673x3 + 2.1638x4 + 0.8182
    (Support: 33%, Confidence: 96%)

R2: If x1 is φ^2_{1,1}(x1, 0.0429, 0.1425) AND
    x2 is φ^2_{2,1}(x2, 0.0388, 0.1553) AND
    x3 is φ^2_{3,1}(x3, −0.0177, 0.2066) AND
    x4 is φ^2_{4,1}(x4, −0.0403, 0.2358) THEN
    y = −30.1001x1 − 17.5581x2 + 5.7272x3 + 16.4500x4 + 3.9185
    (Support: 21%, Confidence: 70%)

R3: If x1 is φ^3_{1,1}(x1, 0.0429, 0.1425) AND
    x2 is φ^3_{2,1}(x2, 0.0388, 0.1553) AND
    x3 is φ^3_{3,1}(x3, −0.0177, 0.1692) AND
    x4 is φ^3_{4,1}(x4, −0.0403, 0.2358) THEN
    y = −23.4493x1 − 15.1980x2 + 32.4464x3 + 2.5574x4 + 2.6153
    (Support: 24%, Confidence: 69%)

where φ^τ_{d,c}(input, start of cluster, end of cluster) is a mountain clustering membership function.
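The support and confidence figures attached to each rule can be read in the association-rule sense. A minimal sketch of that computation over crisp antecedent boxes is shown below; the thesis's Apriori-inspired procedure operates on the fuzzy memberships, so the crisp box here is a simplifying assumption.

    import numpy as np

    def rule_support_confidence(X, y, lower, upper, target_class):
        # Antecedent: lower[d] <= x_d <= upper[d] in every dimension d.
        inside = np.all((X >= lower) & (X <= upper), axis=1)
        support = inside.mean()                          # fraction of data covered
        confidence = (y[inside] == target_class).mean()  # covered data in the class
        return support, confidence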
The architecture found using the proposed EANFIS architecture is shown in Figure 4.25.
[Diagram: inputs x1 to x4; output y.]
Figure 4.25: The architecture found using the proposed EANFIS architecture.
4.7.5 Wisconsin Breast Cancer example
The Wisconsin breast cancer data set [53] consists of 699 instances in 9 dimensions; 16 instances have missing attributes. We randomly select 200 instances for training and the remaining 499 instances for testing.
In this example, we have carried out the following experiments:
• The ANFIS architecture with 512 rules (two Gaussian membership functions per dimension over 9 dimensions, giving 2^9 = 512 rules).
• The EANFIS architecture without the error correction layer, with self organising mountain clustering membership functions under either a linear grid regime or a nonlinear grid regime.
• The EANFIS architecture with the error correction layer, with self organising mountain clustering membership functions under either a linear grid regime or a nonlinear grid regime.
We evaluate both the single weight output scheme and the TSK output scheme. The ANFIS architecture uses two membership functions per dimension, for a total of 512 fuzzy rules; the EANFIS architecture uses 2 fuzzy rules. The results are shown in Tables 4.9 and 4.10 for the single weight output scheme and the TSK output scheme respectively.
The following observations may be made from the results:
• The ANFIS architecture with 512 rules and Gaussian membership functions gives an accuracy of about 90%.
• The EANFIS architecture without the error correction layer, but with the self organising mountain clustering membership functions under the linear and nonlinear grid regimes, gives about 94% prediction accuracy in both cases.
• The EANFIS architecture with the error correction layer and the self organising mountain clustering membership functions under the linear and nonlinear grid regimes gives about 96% prediction accuracy in both cases.
Table 4.9: Comparison of the prediction capabilities of the ANFIS architecture with Gaussian membership functions and the EANFIS architecture with linear and nonlinear grid regimes, using the single weight output layer.
EANFIS (mountain clustering MF), single weight output layer:
  Without probability layer, linear setup grid:    Rules 2,   Accuracy 94.44%, Variance 0.59
  With probability layer, linear setup grid:       Rules 2,   Accuracy 96.24%, Variance 0.20
  Without probability layer, nonlinear setup grid: Rules 2,   Accuracy 94.70%, Variance 0.36
  With probability layer, nonlinear setup grid:    Rules 2,   Accuracy 96.80%, Variance 0.16
ANFIS (Gaussian MF), single weight output layer:
  Rules 512, Accuracy 88.28%, Variance 4.76
Table 4.10: Comparison of the prediction capabilities of the ANFIS architecture with Gaussian membership functions and the EANFIS architecture with linear and nonlinear grid regimes, using the TSK output layer.
EANFIS (mountain clustering MF), TSK output layer:
  Without probability layer, linear setup grid:    Rules 2,   Accuracy 96.24%, Variance 0.38
  With probability layer, linear setup grid:       Rules 2,   Accuracy 96.27%, Variance 0.44
  Without probability layer, nonlinear setup grid: Rules 2,   Accuracy 96.27%, Variance 0.39
  With probability layer, nonlinear setup grid:    Rules 2,   Accuracy 96.37%, Variance 0.43
ANFIS (Gaussian MF), TSK output layer:
  Rules 512, Accuracy 90.32%, Variance 3.95
Figure 4.26 shows the membership functions used in the EANFIS architecture. Again, it is quite clear from their shapes that a significant number of Gaussians would be required to approximate them adequately.
The architecture found by using the proposed EANFIS architecture is shown in Figure 4.27.
4.7.6 Inverted Pendulum on a cart
The EANFIS is applied to a classic fuzzy control system, viz. the control of an inverted pendulum sitting on a moving cart. A rod is hinged on top of the cart, which is free to move in the horizontal plane; the objective is to balance the rod in the upright position while keeping the cart in the center position. The mechanical system is shown in Figure 4.28. The system takes four inputs: θ, the angle the rod makes with the vertical axis; θ̇, the angular velocity of the rod; y, the cart position with respect to the center position; and ẏ, the cart velocity. The aim is to use these four inputs to calculate the required control force u [72].
This model is simulated in software. The initial parameters are M (mass of cart) = 2 kg, m (mass of rod) = 0.1 kg, L (length of rod) = 0.5 m and g (gravity) = 9.81 m/sec². The cart and rod should return to the desired angle and position within 2 seconds. The control law is given by Equation 4.26. A block diagram of the control system is shown in Figure 4.29. The state-space equation of the system is given in Equation 4.27.
[Nine panels: membership functions in dimensions 1 to 9.]
Figure 4.26: The self organising mountain clustering membership functions for the Wisconsin breast cancer example. The solid line denotes “benign”, and the dashed line denotes “malignant”.
Table 4.11: The extracted fuzzy rules for the Wisconsin breast cancer dataset. These rules are used in the EANFIS architecture.

R1: If x1 is φ^1_{1,1}(x1, 1, 10) AND x2 is φ^1_{2,1}(x2, 1, 10) AND
    x3 is φ^1_{3,1}(x3, 1, 10) AND x4 is φ^1_{4,1}(x4, 1, 10) AND
    x5 is φ^1_{5,1}(x5, 1, 10) AND x6 is φ^1_{6,1}(x6, 1, 10) AND
    x7 is φ^1_{7,1}(x7, 1, 10) AND x8 is φ^1_{8,1}(x8, 1, 10) AND
    x9 is φ^1_{9,1}(x9, 1, 10) THEN
    y = 0.0310x1 + 0.0428x2 + 0.0475x3 + 0.0531x4 + 0.0433x5 + 0.1129x6 − 0.0108x7 + 0.0574x8 − 0.0785x9 − 1.4208
    (Support: 48.76%, Confidence: 82.70%)

R2: If x1 is φ^2_{1,1}(x1, 1, 10) AND x2 is φ^2_{2,1}(x2, 1, 10) AND
    x3 is φ^2_{3,1}(x3, 1, 10) AND x4 is φ^2_{4,1}(x4, 1, 10) AND
    x5 is φ^2_{5,1}(x5, 1, 10) AND x6 is φ^2_{6,1}(x6, 1, 10) AND
    x7 is φ^2_{7,1}(x7, 1, 10) AND x8 is φ^2_{8,1}(x8, 1, 10) AND
    x9 is φ^2_{9,1}(x9, 1, 10) THEN
    y = 0.0178x1 − 0.0216x2 − 0.0007x3 + 0.0107x4 − 0.0020x5 − 0.0043x6 + 0.0177x7 − 0.0093x8 + 0.0108x9 + 0.9290
    (Support: 39.80%, Confidence: 96.98%)

where φ^τ_{d,c}(input, start of cluster, end of cluster) is a mountain clustering membership function.
[Diagram: inputs x1 to x9; output y.]
Figure 4.27: The architecture found using the proposed EANFIS architecture.
u = −kx   (4.26)

where k is the desired feedback gain vector and x is the input vector x = [x1 x2 x3 x4]^T. In this example k = [−298.15 −60.697 −163.099 −73.394].

\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \\ \dot{x}_3 \\ \dot{x}_4 \end{bmatrix} =
\begin{bmatrix} 0 & 1 & 0 & 0 \\ \frac{M+m}{ML}g & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ -\frac{m}{M}g & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} +
\begin{bmatrix} 0 \\ -\frac{1}{ML} \\ 0 \\ \frac{1}{M} \end{bmatrix} u \qquad (4.27)

where x1 = θ, x2 = θ̇, x3 = y and x4 = ẏ.
Figure 4.28: Inverted pendulum on a cart.
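A minimal closed-loop simulation of Equations (4.26) and (4.27) with the quoted parameters might look as follows; the forward-Euler integrator and step size are simplifying assumptions, not the thesis's simulation code.

    import numpy as np

    M, m, L, g = 2.0, 0.1, 0.5, 9.81
    A = np.array([[0.0, 1.0, 0.0, 0.0],
                  [(M + m) / (M * L) * g, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0],
                  [-(m / M) * g, 0.0, 0.0, 0.0]])
    B = np.array([0.0, -1.0 / (M * L), 0.0, 1.0 / M])
    k = np.array([-298.15, -60.697, -163.099, -73.394])

    x = np.array([0.3, 0.0, 0.0, 0.0])   # theta, theta_dot, y, y_dot (training case)
    dt = 0.001
    for _ in range(int(4 / dt)):          # 4 seconds, as in Figures 4.30 and 4.31
        u = -k @ x                        # state feedback, Equation (4.26)
        x = x + dt * (A @ x + B * u)      # Equation (4.27)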
We train the system with initial conditions θ = 0.3, θ̇ = 0, y = 0 and ẏ = 0. In order to bring the rod to the desired angle and position, we apply the control force
[Diagram: neural fuzzy system with inputs θ, θ̇, y, ẏ and output u.]
Figure 4.29: Block diagram of the inverted pendulum on a cart control system.
shown in Figure 4.30. After the control force is applied, it induces a new position, velocity, rod angle and angular velocity. This is shown in Figure 4.31.
We test the system with another initial condition: θ = −0.3, θ̇ = 0, y = −1 and ẏ = 0. The control force using EANFIS is shown in Figure 4.32 and the input status is shown in Figure 4.33. We can observe that there is no difference between the EANFIS control force and the desired control force. While the ANFIS generates a different control force, it can still balance the rod (as shown in Figure 4.34). The input status using ANFIS is shown in Figure 4.35. Table 4.12 shows that the EANFIS architecture performs better than the ANFIS architecture.
The architecture found using the proposed EANFIS architecture is shown in Figure 4.36.
Figure 4.30: Control force of the training system.
Table 4.12: The RMS errors of the ANFIS architecture and the different EANFIS architectures.
EANFIS (mountain clustering MF), TSK output layer:
  Without probability layer, linear setup grid:    Rules 7,  RMS 0.0000001775
  With probability layer, linear setup grid:       Rules 7,  RMS 0.0000001816
  Without probability layer, nonlinear setup grid: Rules 7,  RMS 0.0000001106
  With probability layer, nonlinear setup grid:    Rules 7,  RMS 0.0000002000
ANFIS (Gaussian MF), TSK output layer:
  Rules 16, RMS 16.4280339280
[Four panels: angle (rad), angular velocity (rad/sec), position (m) and velocity (m/s) versus time.]
Figure 4.31: Input status of the inverted pendulum on a cart control system.
[Plot: control force (N) versus time; legend: desired input, EANFIS.]
Figure 4.32: Control force of the EANFIS system.
4.8 Conclusions
The proposed EANFIS architecture is a robust and efficient adaptive neuro-fuzzy inference system. It has six feed-forward layers and four learning phases, and its network structure is flexible. The self-organizing mountain clustering membership function, the Apriori rule formulation method and the error correction layer are independent components which can be replaced with other classic methods. The self-organizing mountain clustering membership function is completely data driven: it does not have a pre-defined shape and can handle complex data shapes that traditional membership functions find difficult to describe. The determination of the number of fuzzy rules is inspired by the data mining ‘Apriori algorithm’, which establishes the number of required rules before they are generated, so that only useful rules are created. This method is different from other neuro-
[Four panels: angle (rad), angular velocity (rad/sec), position (m) and velocity (m/s) versus time; legend: desired input, EANFIS.]
Figure 4.33: Input status of the inverted pendulum on a cart control system using EANFIS.
[Plot: control force (N) versus time; legend: desired input, ANFIS.]
Figure 4.34: Control force of the ANFIS system.
fuzzy systems, which expand the network first and then shrink it using a pruning process. Moreover, this algorithm can extract the rule support and the rule confidence, information which indicates the rule strength and helps the domain expert implement the neuro-fuzzy system. A fault tolerant mechanism is also built into the network: the error correction layer can improve the accuracy in a noisy environment. The output layer is independent of the architectural considerations; it can be a single weight scheme, a TSK scheme or any other method. We cannot conclude that the TSK mechanism always performs better than the single weight output layer; the TSK mechanism can easily lead the network to be over-trained on the training data set (see Section 4.7.2 and Section 4.7.4, the ANFIS architecture with TSK). The choice of output mechanism therefore depends on the data set and the practical situation. In practice it is best to try both the single weight output mechanism and the TSK output mechanism and see which one provides better results.
[Four panels: angle (rad), angular velocity (rad/sec), position (m) and velocity (m/s) versus time; legend: desired input, ANFIS.]
Figure 4.35: Input status of the inverted pendulum on a cart control system using ANFIS.
[Diagram: inputs θ, θ̇, y, ẏ; output u.]
Figure 4.36: The architecture found using the proposed EANFIS architecture.
In the above examples, the EANFIS architecture achieves good prediction and classification accuracies when applied to some practical problems. However, it is essentially a local method, in that it only considers one dimension at a time in the formation of membership functions using the self organising mountain clustering algorithm. On the other hand, if this self organising mountain clustering algorithm is combined with a method which can extract global information from the training data set, then the combined architecture may further improve the performance. This topic will be covered in the next chapter.
Chapter 5
Combining Local and Global
Input Structures for the Extended
Adaptive Neuro-Fuzzy Inference
System
5.1 Motivation
The self organising mountain clustering membership function determination algo-
rithm in the extended adaptive neuro-fuzzy inference system (EANFIS) proposed in
the last chapter considers one dimension of the input at a time in a multi-dimensional
input situation. This stems from the kernel density method upon which the self organising mountain clustering membership function is based. In the kernel density method, it is common to tackle a multi-dimensional input by dealing with one dimension at a time. This approach, while very useful and convenient, ignores possible interactions among the dimensions when the multi-dimensional nature of the input is taken into account. On the other hand, there are a number of methods, e.g. the principal component analysis method [55] and Fisher's discriminant analysis method [57], which can extract the type of information contained in a multi-dimensional input context. In this chapter, we will consider the possibility of combining a local method, e.g. the self organising mountain clustering membership function method, and a global method, e.g. a principal component analysis method, with a view to enhancing the performance of the combined architecture.
5.2 Introduction
In the last chapter we introduced the concept of extending the traditional neuro-fuzzy network [22] with the addition of two extra layers: one implementing a simple competitive learning paradigm in the discrete output situation, and the second a common normalisation layer so that the outputs are normalised. In addition, we considered the self organising mountain clustering membership function method, which is an approximation of the classical kernel density method. It uses the same approach as kernel density methods when dealing with multi-dimensional inputs, that is, it considers one input dimension at a time.
However, there are a number of methods commonly used in statistics which take
account of the multi-dimensional aspects of the inputs. For example, principal component analysis (PCA) [55] considers the covariance matrix of the multi-dimensional inputs and projects them down onto a lower dimension in an attempt to “extract” the “essence” of the input structure. Here in PCA, the “essence” has a specific meaning: it determines the number of dimensions in which most of the “energy” is conserved. It is this lower dimensional input which is used for further analysis.
There are many other such “global” methods, e.g. Fisher's linear discriminant analysis (LDA) algorithm [57], in which the global structure of the inputs is determined first, and a “reduced input” is used instead for further analysis.
In this chapter, we will explore the possibility of extending the EANFIS architecture to encompass some of the information which might be contained in the “global” structure of the inputs, and to use it as additional information to “guide” the adaptation process.
The structure of this chapter is as follows: in Section 5.3, we propose a general architecture in which a “global” method, e.g. PCA [55], and a “local” method, e.g. EANFIS, can be combined. We then describe some possible global methods, e.g. PCA and LDA, in Section 5.4, before applying these global methods in combination with the EANFIS to some practical examples.
5.3 Possible architectures for combining the local
and global methods
There are three possible ways to combine the EANFIS with a global method.
• Preprocessing stage. In this case, the global method can be considered as a preprocessing stage for the EANFIS architecture. This arrangement is shown in Figure 5.1 for the EANFIS architecture and Figure 5.2 for the ANFIS architecture.
[Diagram: Global Method → EANFIS.]
Figure 5.1: A block diagram to show the preprocessing method of combining the EANFIS architecture and a global method.
[Diagram: Global Method → ANFIS.]
Figure 5.2: A block diagram to show the preprocessing method of combining the ANFIS architecture and a global method.
Thus, in this case, the global method, e.g. PCA or LDA, acts as a preprocessing stage: it extracts the “essence” of the inputs and feeds the transformed inputs into the EANFIS architecture or the ANFIS architecture.
• Parallel architecture. There are two ways in which the global processing module may be connected in parallel with the EANFIS architecture, because there are two possible connection points: one at the output of the membership function module, and the other at the end of the EANFIS architecture. We show the first possibility in Figure 5.3, in which the global module is connected in parallel with the membership function module.
[Diagram: the global method in parallel with the membership function module, both feeding the normalization and error correction layers.]
Figure 5.3: A block diagram showing the parallel connection of the global module with the membership function module in an extended EANFIS architecture.
In this case, the intuitive idea is to take the transformed inputs (from the global module), connect them in parallel with the outputs of the membership functions, and present the combined inputs to the competitive and normalisation layers. In this manner, one may adjust the relative influence of the global module and the membership module on the competitive and normalisation modules.
• Series-parallel architecture. In this case, the global module is connected with the EANFIS architecture as shown in Figure 5.4: it is connected in parallel with the series connection of the membership function module and the competitive and normalisation layers.
[Diagram: the global method in parallel with the series connection of the membership function module and the normalization and error correction layers.]
Figure 5.4: A block diagram showing the series-parallel connection of the global module and the series connection of the membership function module and the competitive and normalisation layers in the EANFIS architecture.
In this approach, the intuitive idea is that we extract the “essence” of the input and use it in parallel with the outputs of the competitive and normalisation module. In other words, we wish to combine the relative influence of the transformed inputs (through the global module) and the outputs of the competitive and normalisation modules. This is in the same spirit as the TSK mechanism, which influences the outputs of the network by the inputs.
Obviously there are other possibilities. For example, instead of using a global method as a preprocessing module, it is possible to use it as a post-processing module. However, we will not consider this further, as it is at the input end, rather than at the output end, that we wish to extract the “essence” of the inputs.
Given these choices, one may ask which one will provide the best performance. There is no a priori reason to select one way or the other to combine the outputs of the global module with those of the competitive and normalisation
modules. This can only be ascertained from practical examples. We will carry out
such an experiment after we have described some possible global methods.
5.4 Possible Global methods
In this section, we will describe two simple global methods: the principal component analysis (PCA) method and the linear discriminant analysis (LDA) method. Note that these are only examples of possible global methods. There are many others, e.g. methods considering the manifold of the input space, or positive matrix factorisation. However, in this thesis we will only explore simple global methods, and hence we will not be concerned with the more advanced ones.
5.4.1 Principal Component Analysis (PCA)
Principal component analysis (PCA) is a commonly used global method. It works by considering the covariance matrix of the inputs and finding the dimensions in which the maximum variances occur. The dimensions with lower variances are ignored, and the high variance dimensions are preserved through a transformation of the inputs.
Consider that we are given a set of inputs X which may have a number of classes, c = 1, 2, ..., C. Let the inputs associated with class c be denoted by X^c. This is a D × N^c matrix, where N^c is the number of input vectors associated with class c, and D is the dimension of the input vector. We form the D × D covariance matrix S^c for class c as shown in Equation (5.1):

S^c = \frac{1}{N^c - 1} \left( X^c - \bar{X}^c \right) \left( X^c - \bar{X}^c \right)^T \qquad (5.1)

where \bar{X}^c is the mean of X^c over the samples of class c, and superscript T denotes the transpose of a matrix.
Since the matrix S^c is symmetric, it is possible to find the eigenvalues of the matrix as well as its corresponding eigenvectors. Let this be represented compactly as shown in Equation (5.2):

V^c \Lambda^c = S^c V^c \qquad (5.2)

where \Lambda^c is a diagonal matrix with diagonal values \lambda^c_1, \lambda^c_2, ..., \lambda^c_D, the eigenvalues of the matrix S^c, and each column of V^c \in \mathbb{R}^{D \times D} is a normalized eigenvector of the covariance matrix S^c. For convenience the eigenvalues are sorted such that \lambda^c_1 \ge \lambda^c_2 \ge \cdots \ge \lambda^c_D. Because the matrix is a covariance matrix, and hence positive semidefinite, the eigenvalues are all non-negative. In addition, we have V^c V^{cT} = V^{cT} V^c = I; in other words, the eigenvectors are orthonormal.
The eigenvalue \lambda^c_i can be thought of as related to the “energy” level associated with that particular dimension i. Since the eigenvalues are sorted in descending order, the “importance” of the dimensions is also sorted in descending order. Hence it may be possible to ignore dimensions which do not have a high energy content. This can be performed by associating an energy index \Theta^c_i with the normalised energy level of dimension i:

\Theta^c_i = \frac{\lambda^c_i}{\sum_{i=1}^{D} \lambda^c_i} \times 100\% \qquad (5.3)

By construction \Theta^c_1 \ge \Theta^c_2 \ge \cdots \ge \Theta^c_D. If we say that a dimension with normalised energy level less than \tau, a preset threshold, can be ignored, then we will have \Theta^c_1 \ge \Theta^c_2 \ge \cdots \ge \Theta^c_{n^c}, where n^c is the total number of significant dimensions for class c. In this case, the diagonal of the eigenvalue matrix \Lambda^c may be expressed as [\lambda^c_1, \lambda^c_2, ..., \lambda^c_{n^c}, 0, ..., 0]. It is possible to reconstruct an approximation of the covariance matrix S^c:

\tilde{S}^c = V^c \tilde{\Lambda}^c V^{cT} \qquad (5.4)

where \tilde{S}^c is the approximation of the covariance matrix S^c and \tilde{\Lambda}^c is a D × D diagonal matrix with diagonal elements [\lambda^c_1, \lambda^c_2, ..., \lambda^c_{n^c}, 0, ..., 0]. Equation (5.4) can be written equivalently as:

\tilde{S}^c = \hat{V}^c \hat{\Lambda}^c \hat{V}^{cT} \qquad (5.5)

where the D × n^c matrix \hat{V}^c consists of the first n^c columns of the matrix V^c, and \hat{\Lambda}^c is an n^c × n^c diagonal matrix with diagonal elements \lambda^c_1, \lambda^c_2, ..., \lambda^c_{n^c}.
Now if we define a transformed input \hat{X}^c as follows:

\hat{X}^c = \bar{X}^c + \left( X^c - \bar{X}^c \right) V^c_{\{1...j\}} V^{cT}_{\{1...j\}} \qquad (5.6)

where \bar{X}^c is the mean of X^c, V^c \in \mathbb{R}^{D \times D} is the normalized eigenspace, j, 1 \le j \le D, is the number of transformed dimensions, and V^c_{\{1...j\}} = [v^c_1, v^c_2, ..., v^c_j]. Each dimension is orthonormal.
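A minimal numerical sketch of Equations (5.1)-(5.6) is shown below. It assumes the data matrix is laid out with rows as samples (N^c × D), which is the transpose of the convention above; everything else follows the equations directly.

    import numpy as np

    def pca_transform(Xc, j):
        # Per-class PCA reconstruction of Equation (5.6): centre the data,
        # project onto the first j eigenvectors of the class covariance
        # (Equations (5.1)-(5.2)), and map back around the mean.
        mean = Xc.mean(axis=0)
        S = np.cov(Xc, rowvar=False)            # D x D covariance, Equation (5.1)
        lam, V = np.linalg.eigh(S)              # eigendecomposition, Equation (5.2)
        order = np.argsort(lam)[::-1]           # sort eigenvalues descending
        theta = 100.0 * lam[order] / lam.sum()  # energy index, Equation (5.3)
        Vj = V[:, order[:j]]                    # first j eigenvectors
        return mean + (Xc - mean) @ Vj @ Vj.T, theta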
We can gain a deeper insight into the operation of PCA through an example. Figure 5.5 plots the first and second dimensions of the raw iris flower data set.
[Scatter plot: d1 versus d2; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 5.5: The raw data of the first and second dimensions of the iris data set.
It is noted that the three classes are intermingled together in this projection onto a two dimensional space.
Figure 5.6 shows the iris flower dataset projected down to one transformed di-
mension with PCA.
[Scatter plot: d1 versus d2; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 5.6: The iris flower dataset projected down to one transformed dimension using the PCA algorithm.
It is noted that in this case the data cluster very well along the one transformed dimension. Figure 5.7 shows the iris flower dataset projected down to two transformed dimensions using the PCA algorithm. Referring to Figure 5.6, the iris-versicolor and iris-virginica data sets are close together, whereas in Figure 5.7 these two data sets overlap considerably. This is because the PCA algorithm transforms the data so as to maximize the variance within each class regardless of the relationship between classes. This is a disadvantage of using PCA when significant differences in the between-class relationships are expected. One way in which this may be overcome is to use Fisher's linear discriminant analysis, which we will consider next.
[Scatter plot: d1 versus d2; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 5.7: The iris flower dataset projected down to two transformed dimensions using the PCA algorithm.
5.4.2 Linear Discriminant Analysis
Linear discriminant analysis (LDA) derives from principal component analysis (PCA) [55], but offers better between-class separation. LDA utilizes Fisher's linear discriminant (FLD) analysis [59] in place of the covariance matrix used in PCA, and maximizes the ratio of between-class scatter to within-class scatter [60]. There are two variants of the LDA method: the class-dependent transformation and the class-independent transformation.
The class-dependent transformation uses a different covariance matrix for each input class, while the class-independent transformation uses a single covariance matrix for all classes. The choice between them depends on the data set and the goals of the classification problem: if the generalization property of the resulting classifier is important, the class independent variety should be used; if good discrimination is important, the class dependent variety is a good choice [58].
In this section, we will first briefly discuss Fisher's linear discriminant analysis method before considering the class-dependent and class-independent LDA methods.
Let x_i, i ∈ c, be the inputs related to class c, and let \mu_c be the mean of the inputs related to class c, defined as follows:

\mu_c = \frac{1}{N_c} \sum_{i \in c} x_i \qquad (5.7)

where N_c is the total number of inputs belonging to class c. Moreover, we define the overall mean of the inputs as:

\bar{x} = \frac{1}{N} \sum_i x_i = \sum_c \frac{N_c}{N} \mu_c \qquad (5.8)
Then we can define the between-class covariance matrix and the within-class covariance matrix respectively as follows:

S_B = \sum_c N_c (\mu_c - \bar{x})(\mu_c - \bar{x})^T \qquad (5.9)

S_W = \sum_c \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^T \qquad (5.10)
Then Fisher's discriminant analysis is obtained by maximising the following cost criterion:

\max_w J(w) = \frac{w^T S_B w}{w^T S_W w} \qquad (5.11)

The maximisation problem can be transformed into the following problem:

\min_w -\frac{1}{2} w^T S_B w \qquad (5.12)

subject to the constraint

w^T S_W w = 1 \qquad (5.13)

Using a Lagrange multiplier, this constrained optimisation problem can be transformed into the minimisation of the following Lagrangian function:

L = -\frac{1}{2} w^T S_B w + \frac{1}{2} \lambda (w^T S_W w - 1) \qquad (5.14)
Differentiating this function with respect to w and setting the result to zero, we obtain:

S_B w = \lambda S_W w \qquad (5.15)

This equation can be transformed into the following generalised eigenvalue problem:

S_W^{-1} S_B w = \lambda w \qquad (5.16)

This can be solved by noting that S_B is symmetric and positive semidefinite. Hence it is possible to form S_B = U \Lambda U^T, where U is an orthonormal matrix and \Lambda is a diagonal matrix. Then S_B = S_B^{1/2} S_B^{1/2}, where S_B^{1/2} = U \Lambda^{1/2} U^T. Let v = S_B^{1/2} w; then we have:

S_B^{1/2} S_W^{-1} S_B^{1/2} v = \lambda v \qquad (5.17)

This is an ordinary eigenvalue problem which can be solved readily. The maximum eigenvalue found in this process corresponds to the solution of Fisher's discriminant analysis problem. Indeed, if we find all the eigenvalues \lambda_i and the corresponding eigenvectors v_i, i = 1, 2, ..., D, then it is possible to express the problem in matrix form as follows:

S V = V \Lambda \qquad (5.18)
where \Lambda is a diagonal matrix with diagonal elements \lambda_1, \lambda_2, ..., \lambda_D, arranged such that \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D, V is the corresponding eigenvector matrix, and S = S_B^{1/2} S_W^{-1} S_B^{1/2}. In this case, similarly to the PCA situation, it is possible to consider a reduced dimension problem, depending on the magnitudes of the eigenvalues \lambda_i, i = 1, 2, ..., D. Thus, if \lambda_j \gg \lambda_{j+1}, then it is possible to set \lambda_{j+1} = 0 and the dimension of the input space is reduced to j instead of D. In this case, the transformed vector is obtained as follows:

\hat{x}_i = \mu + (x_i - \mu) v_{[1,2,...,j]} v_{[1,2,...,j]}^T \qquad (5.19)

where \mu is the class independent mean of the inputs and v_{[1,2,...,j]} denotes the matrix formed by the first j eigenvectors.
This method is called the class independent LDA method.
In a very similar manner, it is possible to formulate a class dependent LDA method. In this case, the cost criterion to be maximised is given as follows:

\max_w J(w) = \frac{w^T S_B w}{w^T S_c w} \qquad (5.20)

where S_c = \frac{1}{N_c - 1} \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^T and \mu_c is as defined previously. Carrying out the same derivation as before, we obtain the following characterisation of the class dependent LDA method:

S_B^{1/2} S_c^{-1} S_B^{1/2} v = \lambda v \qquad (5.21)

In this case, in a very similar manner, the transformed input vector is given by

\hat{x}_i = \mu_c + (x_i - \mu_c) v_{[1,2,...,j]} v_{[1,2,...,j]}^T \qquad (5.22)

where \mu_c is the class dependent mean of the input vectors x_i, with x_i ∈ c, the class c.
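A minimal sketch of the eigenproblems (5.17) and (5.21) is given below; regularisation of near-singular scatter matrices, which a practical implementation would need, is deliberately omitted.

    import numpy as np
    from scipy.linalg import sqrtm, eigh

    def lda_directions(X, y, class_dependent=False, c=None):
        # Solve S_B^{1/2} S^{-1} S_B^{1/2} v = lambda v, where S is the
        # within-class scatter S_W (class independent case) or the
        # covariance S_c of class c (class dependent case).
        classes = np.unique(y)
        mu = X.mean(axis=0)
        SB = sum((y == k).sum() * np.outer(X[y == k].mean(0) - mu,
                                           X[y == k].mean(0) - mu)
                 for k in classes)                       # Equation (5.9)
        if class_dependent:
            S = np.cov(X[y == c], rowvar=False)          # S_c
        else:
            S = sum(((y == k).sum() - 1) * np.cov(X[y == k], rowvar=False)
                    for k in classes)                    # S_W, Equation (5.10)
        SBh = np.real(sqrtm(SB))                         # S_B^{1/2}
        lam, V = eigh(SBh @ np.linalg.inv(S) @ SBh)
        return lam[::-1], V[:, ::-1]                     # descending eigenvalues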
Thus the only difference between the class dependent and class independent LDA methods is the covariance matrix: the class independent LDA method uses the within-class covariance matrix S_W, while the class dependent LDA method uses the per-class covariance matrix S_c. The class independent LDA method maximises the overall separation between the classes with a single transformation, while the class dependent LDA method maximises the separation with a transformation tailored to each class.
To gain a deeper insight into the differences between the class dependent and class independent LDA methods, we apply both methods to the iris flower data set. Figure 5.8 shows the iris flower dataset projected down to one transformed dimension with the class-dependent LDA method, while Figure 5.9 shows the same dataset projected down to two transformed dimensions.
It is obvious from Figures 5.8 and 5.9 respectively that the projection onto one
[Scatter plot: d1 versus d2; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 5.8: The iris flower data set projected down onto one transformed dimension using the class-dependent LDA method.
[Scatter plot: d1 versus d2; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 5.9: The iris flower data set projected down to two transformed dimensions with the class-dependent LDA method.
dimension does not separate the various classes, while in the projection onto two dimensions (as shown in Figure 5.9) the various classes are separated into different clusters.
Using the class independent LDA method, we can carry out the same operations. The results are shown in Figures 5.10 and 5.11 for the projections onto one and two dimensions respectively.
[Scatter plot: d1 versus d2; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 5.10: The iris flower data set projected down to one transformed dimension with the class-independent LDA method.
It is observed that the projection onto one dimension does not separate the classes, while the projection onto two dimensions separates the various classes.
We note the differences in the performance of the class dependent and class independent LDA methods (referring to Figures 5.9 and 5.11 respectively): the class dependent LDA method appears to be able to separate classes better than
[Scatter plot: d1 versus d2; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 5.11: The iris flower data set projected down to two transformed dimensions with the class-independent LDA method.
the class independent LDA method. However, though it is not obvious from these diagrams, the class independent LDA method has a better generalisation capability than the class dependent LDA method.
Note that there is a basic assumption underlying the validity of the Fisher discriminant analysis methods: that the underlying distributions of the inputs are Gaussian in nature. If this assumption is not satisfied, then the performance of Fisher's linear discriminant analysis cannot be predicted. In this case it might be necessary to consider more complex discriminant methods, e.g. nonlinear discriminant analysis methods. In this thesis, we will not consider these more complex topics, but reserve them as topics for further research.
5.4.3 Selection of the combined architecture
As indicated in the previous section, there are a number of possible combinations of the LDA methods with the EANFIS architecture. In this subsection, we carry out some experiments to indicate which may be the better architecture for the combined methods. We run a number of experiments with different architectures, e.g. the ANFIS architecture and the EANFIS architecture, and with various combinations of LDA or PCA methods. We use a number of practical examples for these evaluations: the iris flower data set, the Wisconsin breast cancer data set, and the sunspot number time series.
The following architectures have been considered (together with the notations indicated in Table 5.1):
For the iris flower data set and the Wisconsin breast cancer data set, we carried out the following experiments:
• The ANFIS architecture (ANFIS).
• The LDA method in series with the ANFIS architecture (LDA ⇒ ANFIS). In this case, the LDA is used as a preprocessing algorithm.
• The EANFIS architecture (EANFIS).
• The LDA method in series with the EANFIS architecture (LDA ⇒ EANFIS). In this case the LDA method is used as a preprocessing algorithm.
• The LDA method in parallel with the self organising mountain clustering membership function method (LDA // MF).
• One LDA method used as a preprocessing method, followed by a second LDA method in parallel with the self organising mountain clustering membership function method (LDA ⇒ (LDA // MF)).
The LDA experiments are carried out using both the class dependent and class independent LDA methods.
For the sunspot number time series, we carried out the following experiments:
• The ANFIS architecture (ANFIS).
• The PCA method used as a preprocessing method for the ANFIS architecture (PCA ⇒ ANFIS).
• The EANFIS architecture (EANFIS).
• The PCA method used as a preprocessing method for the EANFIS architecture (PCA ⇒ EANFIS).
• The PCA method in series with a parallel architecture involving the LDA method and the self organising mountain clustering membership function method (PCA ⇒ (LDA // MF)).
Again, the LDA method in this case is used in both its class dependent and class independent versions. Note that the main reason why we carried out the time series experiments with a PCA rather than an LDA preprocessing method is that the proposed method treats a continuous output as a one-class system, which is not as conducive to the LDA method.
Table 5.1: Using the LDA (or PCA) methods as pre-processing methods in combination with the ANFIS or EANFIS architectures.

Breast Cancer (Accuracy / Variance / Rules):
  ANFIS:                                   88.40% / 4.05 / 512
  LDA ⇒ ANFIS:                             96.58% / 0.27 / 512
  EANFIS:                                  94.95% / 0.18 / 2
  LDA ⇒ EANFIS:                            95.84% / 0.43 / 2
  LDA // MF (class independent):           96.87% / 0.20 / 2
  LDA ⇒ (LDA // MF) (class independent):   96.39% / 0.33 / 2
  LDA // MF (class dependent):             96.44% / 0.51 / 2
  LDA ⇒ (LDA // MF) (class dependent):     96.39% / 0.33 / 2

Iris (Accuracy / Variance / Rules):
  ANFIS:                                   98.22% / 2.73 / 16
  LDA ⇒ ANFIS:                             95.75% / 7.54 / 16
  EANFIS:                                  98.43% / 2.80 / 3
  LDA ⇒ EANFIS:                            97.88% / 3.63 / 3
  LDA // MF (class independent):           98.18% / 2.82 / 3
  LDA ⇒ (LDA // MF) (class independent):   98.08% / 3.18 / 3
  LDA // MF (class dependent):             97.84% / 3.53 / 3
  LDA ⇒ (LDA // MF) (class dependent):     98.02% / 3.15 / 3

Sunspot (RMS error / Rules):
  ANFIS:                                   16.7495 / 16
  PCA ⇒ ANFIS:                             17.4060 / 16
  EANFIS:                                  15.9901 / 2
  PCA ⇒ EANFIS:                            16.3934 / 1
  LDA // MF (class independent):           15.9915 / 2
  PCA ⇒ (LDA // MF) (class independent):   N/A / 1
  LDA // MF (class dependent):             15.9894 / 2
  PCA ⇒ (LDA // MF) (class dependent):     N/A / 1
From Table 5.1, it is observed that
• The ANFIS architecture, with or without the preprocessing unit, has an inferior performance compared with the various EANFIS architectures.
• The preprocessing unit (using LDA or PCA), when connected in series with the EANFIS architecture, does not perform well compared with the other EANFIS combinations.
• The parallel connection between the preprocessing unit and the self organising mountain clustering membership function algorithm, by itself, appears to have superior performance when compared with the other EANFIS combinations.
• The addition of an LDA preprocessing unit in front of this parallel connection does not appear to improve the performance.
Based on these experiments, we conclude that the parallel connection between the LDA method and the self organising mountain clustering membership function algorithm appears to provide the best performance. This is then followed by the normalisation layer before the output layer. The combined architecture is shown in Figure 5.12, and it is the one we will use in the rest of this chapter.
Note that this combination of the LDA method and the EANFIS architecture can be explained using the following heuristics: the LDA method is very good at separating the input data into clusters, and the self organising mountain clustering membership function algorithm is one method for dealing with the clustering of the inputs. Hence, if we combine the two in parallel, they will aid one another. By combining them in series, e.g. using the LDA as a preprocessing unit for the EANFIS architecture, we are not using the best capability of the LDA method, and hence it gives a lower performance.
Thus, in this case, the similarity of the input data to cluster r can be calculated using a Gaussian function, as shown in Equation (5.23):

\theta_r = \exp \left( - \frac{\| \hat{x}_r - \bar{\hat{x}}_r \|^2}{2 \sigma_r^2} \right) \qquad (5.23)

where \hat{x}_r is the projected output of input x using the eigenspace of cluster r, \bar{\hat{x}}_r is the mean of the projected cluster r, and \sigma_r is the standard deviation of the projected output \hat{x}_r.
Note that in this combined architecture, the LDA method and the error correction layer in the EANFIS architecture work in concert with one another. The LDA method cannot solve some problems by itself, e.g. the exclusive OR problem, but combined with the error correction layer it can help the decision making in marginal cases. The two can be combined as shown in Equation (5.24), where \phi_r is the output from the error correction layer of the EANFIS architecture, \theta_r is the output from the LDA method, and \alpha is a weight which adjusts the balance between the two methods.

Layer 2A:

\epsilon_r = \alpha \cdot \frac{\phi_r}{\sum_r \phi_r} + (1 - \alpha) \cdot \frac{\theta_r}{\sum_r \theta_r} \qquad (5.24)
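A minimal sketch of Equations (5.23) and (5.24) follows; the array shapes and argument names are illustrative assumptions rather than the thesis's code.

    import numpy as np

    def combined_layer(phi, x_proj, proj_means, sigmas, alpha):
        # theta_r: Gaussian similarity of the projected input to cluster r,
        # Equation (5.23); x_proj[r] is the input projected onto eigenspace r.
        theta = np.exp(-np.sum((x_proj - proj_means) ** 2, axis=1)
                       / (2.0 * sigmas ** 2))
        # Layer 2A blend of error correction and LDA outputs, Equation (5.24).
        return alpha * phi / phi.sum() + (1.0 - alpha) * theta / theta.sum()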
[Diagram: Layers 1 to 6 of the combined architecture. The membership functions φ of the ANFIS part and the global method (LDA) outputs θ_r feed the error correction layer, whose outputs ε_r are weighted by α and β before the normalisation and output layers.]
Figure 5.12: The extended adaptive neuro-fuzzy inference system with the LDA method.
5.5 Application Examples
In this section, we illustrate the application of the concept of combining a local method, e.g. the EANFIS architecture, with a global method, e.g. LDA. We demonstrate it on the sunspot number cycle (a time series), the linearly inseparable Iris dataset, and the high dimensional Wisconsin breast cancer dataset.
5.5.1 Sunspot Cycle
The sunspot number time series is compiled by the US National Oceanic and Atmospheric Administration (NOAA). The number of sunspots has been measured daily since January 1749 at the Zurich Observatory [54]. The time series is quite “jagged”, and it is a common benchmark problem for time series prediction algorithms. Instead of using the daily figures, we take the average for each month and, as is common in time series benchmark studies, use 55 years of data for training and the remaining 200 years of data for prediction. The entire time series is shown in Figure 5.13. The time series appears to exhibit cyclic behaviour, although whether it truly does is a hotly debated topic. There are various estimates of the cycle, e.g. 11 years, while others have indicated that this time series may exhibit chaotic behaviour with no observable periodic component.
We rearrange the data into input-output pairs as follows:
Figure 5.13: Monthly average sunspot number time series, in which the first 55 years of data are used for training and the remaining 200 years for testing. The training data is shown as a continuous line and the testing data as a dotted line.
[x(t-4), x(t-3), x(t-2), x(t-1); x(t)]
In other words, we use the immediately preceding four data points x(t − i), i = 1, 2, 3, 4 as the inputs and attempt to predict the next data point, x(t). We then applied the combined architecture described in Section 5.3 to the time series.
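A minimal sketch of this rearrangement is shown below; the function name is an illustrative choice.

    import numpy as np

    def embed(series, lags=4):
        # Build [x(t-4), x(t-3), x(t-2), x(t-1)] -> x(t) pairs.
        X = np.stack([series[i:len(series) - lags + i] for i in range(lags)],
                     axis=1)
        y = series[lags:]
        return X, y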
Referring to the results in Table 5.2, the combined method using a class dependent LDA technique has the smallest RMS error, although the difference between the EANFIS architecture and the combined architecture is small. Here we are comparing the performance of an ANFIS architecture, which uses 16 rules and Gaussian functions, with those of the EANFIS architecture and the combined architectures, which use a TSK output mechanism and only 2 rules.
Table 5.2: Prediction RMS errors for the sunspot number time series.

  ANFIS:                                Single weight output, linear setup grid,  16 rules, RMS 16.7495
  EANFIS:                               TSK output, nonlinear setup grid,         2 rules,  RMS 15.9901
  Combined EANFIS (class dependent):    TSK output, nonlinear setup grid,         2 rules,  RMS 15.9894
  Combined EANFIS (class independent):  TSK output, nonlinear setup grid,         2 rules,  RMS 15.9915
Figures 5.14 and 5.15 show the effect of the LDA technique in projecting the data set to one transformed dimension using the class dependent method and the class independent method respectively. It is noted that in both cases the data align well within the projected dimension. This may be the reason why the prediction using the combined architecture works well: the LDA technique, either class dependent or class independent, appears to be able to align the data so that they fall within a transformed dimension.
Note that the ANFIS architecture using a TSK output mechanism shows instability (as indicated in Section 4.5), so we use a single weight output mechanism in its output layer. The EANFIS architecture and the combined architectures, on the other hand, do not show any instability using a TSK output mechanism, and hence we compare their results with those of the ANFIS architecture using a single weight output mechanism. The results show that the class dependent LDA method has a slightly better classification ability. The combined method can predict the trend of the sunspot number time series well, as shown in Figure 5.16.
[Scatter plot: d1 versus d2; legend: data points supporting Rule 1, data points supporting Rule 2.]
Figure 5.14: The sunspot number time series data set projected to one transformed dimension using a class dependent LDA technique.
[Scatter plot: d1 versus d2; legend: data points supporting Rule 1, data points supporting Rule 2.]
Figure 5.15: The sunspot number time series data set projected to one transformed dimension using a class independent LDA technique.
The predicted output errors are shown in Figure 5.17.
Figure 5.16: The prediction outputs of the class dependent LDA method combined with the EANFIS architecture.
The maximum error in Figure 5.17 is 93.4475 and the total RMS error is 15.9894, as shown in Table 5.2. Note that the prediction errors in Figure 5.17 appear reasonably balanced around the zero axis, as do the “undulations” around this axis.
5.5.2 Iris Dataset
Fisher's Iris dataset consists of three types of iris flowers: Iris-Virginica, Iris-Versicolor and Iris-Setosa. Each sample has measurements of the sepal length, the sepal width, the petal length, and the petal width. Iris-Virginica and Iris-Versicolor are linearly inseparable. We randomly select 51 data points for testing and 99 data points for training.
Figure 5.17: The prediction output errors of the class dependent LDA method combined with the EANFIS architecture.
The first two dimensions of the raw data are plotted in Figure 5.5; it is noted that the data are well “intertwined”.
Once the training and testing data sets are chosen, the same data sets are put into the different neuro-fuzzy architectures used for performance evaluation. The evaluation process is run 100 times and the average over the 100 runs is used for comparison purposes. In order to show the effect of the combined method, we use a zeroth order polynomial function (constant weights) in the output layer. Figure 5.18 shows the membership functions obtained using the self-organizing mountain clustering membership function method, and Figures 5.8 to 5.11 show the effect of the LDA transformation.
The combined architecture, using an EANFIS architecture and an LDA method, outperforms the EANFIS architecture alone. The classification accuracy is rel-
[Four panels: membership functions in dimensions 1 to 4.]
Figure 5.18: The membership functions formed using the self-organizing mountain clustering membership function method. ‘.’ denotes Iris-setosa, ‘o’ denotes Iris-versicolor and ‘+’ denotes Iris-virginica.
atively close to that obtained using an ANFIS architecture, even though the ANFIS architecture uses 16 rules and the EANFIS architecture only 3. Table 5.3 shows the classification accuracies and root mean square errors of the various architectures on the testing data set. It shows that the RMS error of the combined EANFIS architecture is smaller than that obtained using an ANFIS architecture. As shown in Table 5.3, we have evaluated the following architectures:
• The ANFIS architecture.
• An EANFIS architecture without the error correction mechanism.
• An EANFIS architecture with the error correction and normalisation layers turned on.
• An EANFIS architecture without the error correction and normalisation layers, with a class dependent LDA method.
• An EANFIS architecture without the error correction and normalisation layers, with a class independent LDA method.
• An EANFIS architecture with the error correction and normalisation layers turned on, and with a class dependent LDA method.
• An EANFIS architecture with the error correction and normalisation layers turned on, and with a class independent LDA method.
Table 5.3: Prediction classification accuracy comparison on the Iris data set (single weight output layer).

                                        Accuracy   Var. of Acc.   RMS Error   Var. of RMS   Rules
EANFIS, without error corr. layer:       87.92%     16.06          0.3068      0.00055       3
EANFIS, with error corr. layer:          87.82%     15.40          0.3076      0.00055       3
Combined, class dependent:               97.90%      3.90          0.1638      0.00110       3
Combined, class independent:             98.25%      3.18          0.1550      0.00082       3
Combined, error corr. & class dep.:      97.90%      3.90          0.1607      0.00114       3
Combined, error corr. & class indep.:    98.31%      3.03          0.1515      0.00086       3
ANFIS:                                   98.10%      3.53          0.1920      0.00035       16
5.5.3 Wisconsin Breast Cancer
The Wisconsin breast cancer dataset [53] consists of 699 instances in 9 dimensions; a total of 16 instances have missing attributes. We randomly select 200 instances for training and use the entire 499 remaining instances for testing, as is commonly done when this data set is used for benchmark purposes. The evaluation is run 100 times and we take the average of the 100 runs for comparison purposes. The ANFIS architecture uses two membership functions per dimension, for a total of 512 fuzzy rules; the EANFIS architecture and the combined method use only 2 fuzzy rules. Figure 5.19 shows the breast cancer dataset projected down to one transformed dimension with the class dependent LDA method, and Figure 5.20 shows the transformed data with the class independent LDA method.
[Scatter plot: d1 versus d2; legend: benign, malignant.]
Figure 5.19: The breast cancer dataset projected down onto one transformed dimension with the class dependent LDA method.
[Scatter plot: d1 versus d2; legend: benign, malignant.]
Figure 5.20: The breast cancer dataset projected down to one transformed dimension with the class independent LDA method.
The results shown in Tables 5.4 and 5.5 indicate that the accuracy of the combined method is much higher than that obtained using the ANFIS architecture.
5.6 Conclusions
An adaptive neuro-fuzzy inference system uses membership functions to measure the similarities of the inputs. The self organising mountain clustering membership function uses a local method to find the parameters of an asymmetrical membership function: it works on one dimension first, then on another dimension, and so on. Hence this process does not incorporate any global information on the input space. A method which can incorporate, to some extent, the global information
Table 5.4: The output accuracy comparison of the Wisconsin breast cancer data set using various architectures.

EANFIS (mountain MF), single weight output layer, 2 rules:
  Linear setup grid, no error corr.:     Accuracy 94.43%, Variance 0.45
  Linear setup grid, error corr.:        Accuracy 96.22%, Variance 0.21
  Nonlinear setup grid, no error corr.:  Accuracy 94.63%, Variance 0.28
  Nonlinear setup grid, error corr.:     Accuracy 96.77%, Variance 0.16
ANFIS, TSK output layer, 512 rules, linear setup grid:
  Accuracy 90.58%, Variance 3.87
Table 5.5: The output accuracy comparison on the Wisconsin breast cancer data set (combined EANFIS, 2 rules, output layer: single weight).

Global LDA Method   Class Dependent                       Class Independent
Setup Grid          Linear           Nonlinear            Linear           Nonlinear
Error Corr.         N       Y        N       Y            N       Y        N       Y
Accuracy            95.28%  96.13%   95.69%  96.59%       95.31%  96.59%   95.69%  96.81%
Variance            0.50    1.27     0.39    0.89         0.45    0.31     0.40    0.27
on the input space is the LDA method, which transforms the input data to another
space by maximizing the between-class scatter and minimizing the within-class
scatter. However, the LDA method assumes that the data distribution is Gaussian;
otherwise it may fail to provide a good classification, for instance in the case of
the exclusive OR problem. The combined architecture, involving a global processing
method (e.g. the LDA method) and an EANFIS architecture, can in some cases provide
much improved performance when compared with the EANFIS architecture by itself.
Hence a judicious application of this combined architecture could improve on the
performance of the EANFIS architecture.
Chapter 6
Conclusions and Recommendations
6.1 Conclusions
This thesis presents a novel extension to the architecture of the adaptive neuro-fuzzy
inference system (ANFIS). The extension is inspired by the observation that often in
using the ANFIS architecture on discrete output classification problems, the results
are improved if we include an additional logistic function layer (error correction) in
the architecture, in a manner similar to the use of a logistic function in multilayer
perceptrons when used as classifiers. Hence, in the extended architecture, each
output class is represented by a separate membership function structure for the
input variables. In the case of continuous output, there is only one membership
function structure. The total number of rules required is increased many-fold in
comparison with the ANFIS architecture. The idea is to insert additional sections
into the ANFIS architecture incorporating a logistic function for each rule
formation and its associated normalisation. This counter-intuitive approach (as it
greatly increases the number of rules) allows us to follow up with two innovative
ideas:
• Structural determination. By investigating the input variable structure, using
a method inspired by association rule determination in data mining, we are able
to determine which rules need to be formed prior to learning the system
parameters. The implicit assumption is that the rules which are not formed would
not be needed in the final architecture; in other words, even if these “unused”
rules were formed, their firing strength would be negligible from the point of
view of the training data. Hence there is no need to form them in the EANFIS
architecture. By determining which rules need to be formed before the learning
process commences, we are able to reduce the number of rules, allowing the
architecture to be used for problems with a potentially large number of inputs.
This overcomes one of the major difficulties of the ANFIS architecture, viz.
that it is difficult to use for practical problems with a large number of input
variables, as the number of rules which need to be formed grows explosively with
the number of input variables. By limiting rule formation to the rules necessary
for the inference, we are able to use the proposed architecture for problems with
a large number of input variables (a sketch of this selection criterion is given
after this list).
• Membership function determination. Traditionally in fuzzy system studies, the
idea is to select a membership function from a set of candidate membership
functions, and then to determine the parameters of the selected membership
function. In our proposed approach, we instead investigate the input variables
with a view to finding a suitable, possibly non-symmetric membership function
which is most appropriate to the problem at hand. In other words, we do not fix
the shape of the membership function a priori, but use the input data to
determine the required shape. This approach is called the self organising
mountain clustering membership function approach. If the resulting function
happens to be non-symmetric, so be it: the input data determine the required
membership function.
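To make the structural determination idea concrete, the sketch below (referred to in the first bullet point) illustrates the support-based selection criterion in Python, assuming that per-dimension membership degrees of the training samples have already been computed. The naive enumeration over the full grid is for illustration only; the Apriori algorithm [41] avoids it by never extending combinations whose support is already too low. The function names and the threshold value are illustrative assumptions.

    from itertools import product
    import numpy as np

    def select_rules(memberships, min_support=0.05):
        """memberships: one array per input dimension, of shape
        (n_samples, n_mfs_d), holding membership degrees.  A candidate
        rule is one membership function index per dimension; it is
        kept only if its mean firing strength over the training data
        exceeds the support threshold."""
        rules = []
        for combo in product(*(range(m.shape[1]) for m in memberships)):
            strength = np.prod([m[:, j] for m, j in zip(memberships, combo)],
                               axis=0)            # firing strength per sample
            if strength.mean() > min_support:
                rules.append(combo)
        return rules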
These new algorithms were evaluated on a number of benchmark problems, e.g. the
sunspot cycle problem and the Wisconsin breast cancer problem. It was found that,
in general, the results support our intuition, i.e. by including the softmax-type
layers in the ANFIS architecture, the results are improved. We compared the results
of the extended ANFIS architecture with those obtained by the ANFIS architecture,
and in all cases the EANFIS architecture provides an improved result.
We further enhance the proposed extended ANFIS architecture by asking the
question: what happens if additional information is available on the input
variables? This is based on the observation that the membership function
representation used in the EANFIS architecture, viz. the self organising mountain
clustering membership function approach, is based on “local” properties of the
input variables.
If some information on the “global” structure of the input variables is available,
would this additional information improve the performance of the EANFIS
architecture? The idea is simple: if there is additional global information on the
input variables, then the aggregate input may be a linear combination of the
“local” information, as represented by the self organising mountain clustering
membership functions, and the “global” information, as represented in our case by
linear discriminant analysis.
We have again evaluated the proposed algorithm on a set of benchmark prob-
lems, including the sunspot cycle problem, and the Wisconsin breast cancer problem.
Again we found that the proposed combination of local and global information im-
proves the performance of the proposed architecture, compared with those obtained
by using EANFIS alone.
En route to studying the ANFIS architecture, we have also considered a minor
problem: how to provide a non-uniform grid on the input structures. We have
devised a method which can be used to provide a non-uniform grid on a set of
input variables. A non-uniform grid, instead of a uniform grid, may represent the
inputs more “faithfully”. The application in this case is the radial basis
function neural network (or, equivalently, an ANFIS architecture with a Gaussian
membership function). We compared the results of applying this algorithm to a
number of practical cases, e.g. the currency exchange problem, and found that the
proposed algorithm performs better than the one using a linear grid. Thus, the
nonlinear grid determination idea may be used whenever there is a need to
interpolate a signal in a
nonlinear sampling fashion.
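As a simple illustration of the non-uniform grid idea (a sketch only, not necessarily the method devised in this thesis), grid points can be placed at equally spaced quantiles of the observed samples, so that the grid is denser where the data are denser:

    import numpy as np

    def quantile_grid(x, n_points):
        """Grid points at equally spaced quantiles of the samples x;
        the grid spacing adapts to the sample density."""
        return np.quantile(x, np.linspace(0.0, 1.0, n_points))

An RBF network (or, equivalently, a Gaussian-membership ANFIS) whose centres are placed on such a grid then allocates more basis functions to the densely sampled regions of the input space.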
6.2 Future areas of research
There are a number of areas of future research which may be of interest following
on from the work performed in this thesis. These include:
• We have used a stage-wise approach in training the proposed EANFIS
architecture: we first determine the membership function and its associated
parameters, and then the parameters of the inference mechanism. This decoupling
of the training process appears to work well. The question is: can we combine
the training of the membership function parameters and the inference mechanism
into a single training regime? This is a highly nonlinear process. However, if
it can be performed, it will make the proposed method more automatic from the
users' point of view, as they will not need to consider various steps, but can
instead use the method as a “black box” with little adjustment required.
• There are many ways to incorporate global information on input structures,
e.g. principal component analysis. The one chosen in this thesis is Fisher's
linear discriminant analysis. However, this approach assumes Gaussianity of the
input variables. There are other approaches which allow the extraction of the
global structure of the input variables. The question is: would the use of a
different “global” information method alter the behaviour of the proposed
EANFIS?
• In the rule “selection” method used to decide which rules are formed in the
EANFIS architecture, the assumption is that the characteristics of the training
and testing data sets are similar; hence the selected rules would be useful for
deployment on the testing data set. An interesting question arises: what if the
characteristics of the testing data set and the training data set are different?
Would it be possible to find a method which allows rules to be switched on or
off depending on the data? If this can be done, it will lead to a more adaptive
structure for modelling the data.
These questions will form fruitful areas of further research.
References
[1] A. C. Tsoi, S. Tan, “Recurrent neural networks: A constructive algorithm, and
its properties”, Neurocomputing, vol. 15, pp. 309-326, June. 1997
[2] J. Durkin, Expert Systems, Design and Development, Prentice Hall, 1994.
[3] M. Minsky, S. A. Papert, Perceptrons: An Introduction to Computational Geometry, MIT Press, Cambridge, MA, 1969.
[4] W. McCulloch, W. Pitts, “A logical calculus of the ideas immanent in nervous activity”, Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
[5] F. Rosenblatt, Principles of Neurodynamics Spartan, Washington DC, 1962.
[6] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
[7] M. Negnevitsky, Artificial Intelligence, A Guide to Intelligent Systems, Addison
Wesley, First Edition, 2002.
[8] E. Cox, The fuzzy systems handbook: a practitioner’s guide to building, using,
and maintaining fuzzy systems, Academic Press, 1994.
[9] R. J. Schalkoff, Artificial Neural Networks, McGraw-Hill, 1997.
[10] S. Haykin, Neural Networks, Prentice Hall, Second Edition, 1999.
[11] T. M. Mitchell, Machine Learning, McGraw Hill, 1997.
[12] J.S.R. Jang, C.T. Sun, E. Mizutani, Neuro-Fuzzy and Soft Computing, A Com-
putational Approach to Learning and Machine Intelligence, Prentice Hall, 1997.
[13] J. L. McClelland, D. Rumelhart and the PDP Research Group Parallel Dis-
tributed Processing Volumes 1 and 2, MIT Press, Cambridge, MA, 1986.
[14] K. Hornik, M. Stinchcombe, H. White, “Multilayer feedforward networks are
universal approximators” Neural Networks Vol. 2, pp.359-366, 1989.
[15] C. M. Bishop, Neural Networks for Pattern Recognition Oxford University
Press, 1995.
[16] W. Härdle, Applied Nonparametric Regression, Cambridge University Press, New York, 1990.
[17] L. A. Zadeh, “Fuzzy sets”, Information and Control, vol. 8, pp. 338-353, 1965.
[18] S. Yasunobu, S. Miyamoto, “Automatic train operation system by predictive
fuzzy control” in Industrial Applications of Fuzzy Control M. Sugeno Ed. Ams-
terdam: North Holland, 1985.
[19] E. H. Mamdani, “Application of Fuzzy Logic to Approximate Reasoning Using
Linguistic Synthesis”, IEEE Trans. Computers, vol. 26, pp. 1182-1191, 1977
[20] E. H. Mamdani, S. Assilian, “An experiment in linguistic synthesis with a fuzzy logic controller”, International Journal of Man-Machine Studies, vol. 7, pp. 1-13, 1975.
[21] T. Takagi, M. Sugeno, “Fuzzy identification of systems and its applications to
modeling and control”, IEEE Trans. on Systems, Man and Cybernetics, vol. 15,
pp. 116-132, Jan. 1985
[22] J.S. Jang, “ANFIS: Adaptive-Network-Based Fuzzy Inference System”, IEEE
Trans. on Systems, Man and Cybernetics, vol. 23, pp. 665-685, May/June 1993.
[23] C.-T. Lin, Y.-C. Lu, “A Neural Fuzzy System with Linguistic Teaching Signals”,
IEEE Trans. on Fuzzy Systems, vol. 3, pp. 169-189, May 1995.
[24] C. Li, C.-Y. Lee, K. H. Cheng, “Pseudoerror-Based Self-Organizing Neuro-
Fuzzy System”, IEEE Trans. on Fuzzy Systems, vol. 12, pp. 812-819, Dec. 2004.
[25] L. Rutkowski, K. Cpalka, “Flexible Neuro-Fuzzy Systems”, IEEE Trans. on
Neural Networks, vol. 14, pp. 554-574, May. 2003.
[26] D. Chakraborty, N. R. Pal, “A Neuro-Fuzzy Scheme for Simultaneous Fea-
ture Selection and Fuzzy Rule-Based Classification”, IEEE Trans. on Neural
Networks, vol. 15, pp. 110-123, Jan. 2004.
[27] C. S. Velayutham, S. Kumar, “Asymmetric Subsethood-Product Fuzzy Neural
Inference System (ASuPFuNIS)”, IEEE Trans. on Neural Networks, vol. 16, pp.
160-174, Jan. 2005.
[28] W.L. Tung, C. Quek, “GenSoFNN: A Generic Self-Organizing Fuzzy Neural
Network”, IEEE Trans. on Neural Networks, vol. 13, pp. 1075-1086, Sept. 2002.
[29] L.-X. Wang, A Course in Fuzzy Systems and Control, Prentice Hall, 1997.
[30] R. C. Berkan, S. L. Trubatch, Fuzzy Systems Design Principles, Building Fuzzy
IF-THEN Rule Bases, IEEE Press, 1997.
[31] R. Beale, T. Jackson, Neural Computing: an Introduction, IOP Publishing
Ltd., 1990.
[32] G. B. Arfken, H. J. Weber, Mathematical Methods for Physicists, Academic Press, Fourth Edition, 1995.
[33] R. Larson, B. H. Edwards, D. C. Falvo, Elementary Linear Algebra, Houghton
Mifflin, Fifth Edition, 2004.
[34] R. Larson, R. Hostetler, B. H. Edwards, Calculus of a single variable, Houghton
Mifflin, Seventh Edition, 2002.
[35] R. Reed, “Pruning Algorithms-A Survey”, IEEE Trans. on Neural Networks,
vol. 4, pp. 740-747, Sept. 1993.
[36] P. P. Kanjilal, D. N. Banerjee, “On the Application of Orthogonal Transfor-
mation for the Design and Analysis of Feedforward Networks”, IEEE Trans. on
Neural Networks, vol. 6, pp. 1061-1070, Sept. 1995.
[37] K. I. Diamantaras, S. Y. Kung, Principal component neural networks: theory
and applications, John Wiley & Sons Inc, Fifth Edition, 1996.
[38] J. H. Mathews, K. Fink, Numerical Methods Using MATLAB, Prentice Hall, Fourth Edition, 2003.
[39] R. J. Schilling, J. J. Carrol, A. Al-Ajlouni, “Approximation of Nonlinear sys-
tems with Radial Basis Function Neural Networks”, IEEE Trans. on Neural
Networks, vol. 12, pp. 1-15, Jan. 2001.
[40] T. Kohonen, Self-Organizing Maps, Springer-Verlag, Second Edition, 1997.
[41] R. Agrawal, R. Srikant, “Fast Algorithms for Mining Association Rules”, Pro-
ceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
[42] W. Chu, S. S. Keerthi, and C. J. Ong, “Bayesian Support Vector Regression
Using a Unified Loss Function”, IEEE Trans. on Neural Networks, vol. 15, pp.
29-44, Jan. 2004.
[43] A. Nurnberger, C. Borgelt, and A. Klose, “Improving Naive Bayes Classifiers
Using Neuro-Fuzzy Learning”, Dept. of Knowledge Processing and Language
Engineering, Otto-von-Guericke-University of Magdeburg, Germany
[44] R.R. Yager and D.P. Filev, “Approximate Clustering Via the Mountain
Method”, IEEE Trans. on Systems, Man and Cybernetics, vol. 24, pp. 1279-
1284, Aug. 1994
[45] R. L. Burden, J. D. Faires, Numerical Analysis, Thomson Learning Inc.,
Seventh Edition, 2001.
[46] J. C. Principe, N. R. Euliano, W. C. Lefebvre, Neural and Adaptive Systems:
Fundamentals through Simulations, John Wiley & Sons Inc, First Edition, 2000.
[47] P. P. Kanjilal, D. N. Banerjee, “On the Application of Orthogonal Transfor-
mation for the Design and Analysis of Feedforward Networks”, IEEE Trans. on
Neural Networks, vol. 6, pp. 1061-1070, Sept. 1995.
[48] R. Reed, “Pruning Algorithms-A Survey”, IEEE Trans. on Neural Networks,
vol. 4, pp. 740-747, Sept. 1993.
[49] J. Han and M. Kamber, Data Mining, Concepts and Techniques, Morgan Kauf-
mann, 2000.
[50] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[51] R. Beale, T. Jackson, Neural Computing: An Introduction, Institute of Physics Publishing, 1994.
[52] S. C. Chapra, R. P. Canale, Numerical methods for engineers : with program-
ming and software applications, McGraw-Hill, c1998.
[53] K. P. Bennett, O. L. Mangasarian, “Robust linear programming discrimination
of two linearly inseparable sets”, Optimization Methods and Software 1, 1992,
23-34 (Gordon and Breach Science Publishers).
[54] David H. Hathaway, The Sunspot Cycle, 02 Sept., 2006.
http://solarscience.msfc.nasa.gov/SunspotCycle.shtml
[55] B. Flury, A First Course in Multivariate Statistics, Springer-Verlag, 1997.
[56] S. Z. Li, J. Lu, “Face Recognition Using the Nearest Feature Line Method”,
IEEE Trans. on Neural Networks, vol. 10, pp. 439-443, Mar. 1999.
[57] B.D. Ripley, Pattern recognition and neural networks, Cambridge University
Press, 1996.
[58] S. Balakrishnama, A. Ganapathiraju, Linear Discriminant Analysis - A Brief Tutorial, Mississippi State University, 2002.
[59] R. A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems”, Annals of Eugenics, vol. 7, pp. 179-188, 1936.
[60] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. Fish-
erfaces: Recognition Using Class Specific Linear Projection”, IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 19, pp. 711-720, July. 1997.
[61] M.-H. Yang, N. Ahuja, D. Kriegman, “Face Recognition using kernel eigen-
faces”, IEEE International Conference on Image Processing, vol. 1, pp. 37-40,
Sept. 2000.
[62] G. J. McLachlan and T. Krishnan, The EM algorithm and extensions, John
Wiley, 1996.
[63] B. W. Silverman, Density Estimation for Statistics and Data Analysis. Chap-
man and Hall: London, 1986.
[64] J. E. Moody, C. Darken, “Fast learning in networks of locally-tuned processing units”, Neural Computation, vol. 1, pp. 281-294, 1989.
[65] H. Tong, Nonlinear time series: a dynamic system approach Clarendon Press,
Oxford, 1990.
[66] M. J. Morris, “Forecasting the sunspot cycle” Journal of the Royal Statistical
Society, Series A, Vol. 140, pp 437 - 468, 1977.
[67] M. Casdagli, “Nonlinear prediction of chaotic time series” Physica D, Vol 35,
pp. 335-356, 1989.
[68] R. K. Aggarwal, Q. Y. Xuan, A. T. Johns, F. R. Li, A. Bennett, “A novel approach to fault diagnosis in multi-circuit transmission lines using fuzzy ARTmap neural networks”, IEEE Trans. on Neural Networks, vol. 10, no. 5, pp. 1214-1221, 1999.
[69] C. Potter, M. Negnevitsky, “ANFIS application to competition on artificial time series (CATS)”, Proceedings of the IEEE International Conference on Fuzzy Systems, vol. 1, pp. 469-474, July 2004.
[70] A. Berizzi, C. Bovo, M. Delfanti, M. Merlo, “A Neuro-Fuzzy Inference System for the Evaluation of Voltage Collapse Risk Indices”, Bulk Power System Dynamics and Control - VI, Cortina d'Ampezzo, Italy, 22-27 August 2004.
[71] Duane Hanselman and Bruce Littlefield, The Student Edition of Matlab: ver-
sion 5, user’s guide, Prentice Hall, 1997.
[72] Hung T. Nguyen, Nadipuram R. Prasad, Carol L. Walker and Elbert A. Walker,
A First Course in Fuzzy and Neural Control, CRC Press, 2002.
[73] Sushmita Mitra and Yoichi Hayashi, “Neuro-Fuzzy Rule Generation: Survey
in Soft Computing Framework”, IEEE Trans. on Neural Networks, vol. 11, pp.
748-765, May, 2000
Appendix A
Network Training Algorithm
A.1 RBFN Network Training Algorithm
The aim of the steepest descent algorithm is to minimize the accumulated errors for
the whole network with respect to the underlying parameters. The total error for
the network is shown in Equation (A.1.1).
\[ E = \frac{1}{2}\sum_i e_i^2 \tag{A.1.1} \]
where the error $e_i$ is defined as follows:
\[ e_i = \delta_i - y_i \tag{A.1.2} \]
where $\delta_i$ is the desired value of the $i$th output and $y_i$ is the output from the network.
To minimize the error, we determine the derivative of the total error function with
respect to the weights (parameters). Applying the chain rule [34] we obtain Equation
(A.1.3).
\[ \frac{\partial E}{\partial w_r} = \frac{\partial E}{\partial e_i}\cdot\frac{\partial e_i}{\partial y_i}\cdot\frac{\partial y_i}{\partial w_r} = -e_i B_r(x_i) \tag{A.1.3} \]
Applying $\frac{\partial E}{\partial w_r}$ to the delta rule [10] we obtain Equation (A.1.4).
\[ w_r^{\mathrm{new}} = w_r^{\mathrm{old}} - \eta\frac{\partial E}{\partial w_r} = w_r^{\mathrm{old}} + \eta\, e_i B_r(x_i) \tag{A.1.4} \]
where $\eta$ is a learning constant, $x \in \mathbb{R}^{I\times D}$, $i = 1, 2, \ldots, I$, $d = 1, 2, \ldots, D$.
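A minimal sketch of this training loop in Python follows. The Gaussian form of the basis functions $B_r$, the fixed centres and the shared width are assumptions made for illustration; only the output weights are trained, following Equations (A.1.1)-(A.1.4).

    import numpy as np

    def rbf_outputs(X, centers, width):
        """Gaussian radial basis outputs B_r(x_i), shape (I, R);
        the Gaussian form is an assumption for this sketch."""
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * width ** 2))

    def train_rbfn_weights(X, delta, centers, width, eta=0.1, epochs=100):
        """Pattern-by-pattern steepest descent on the output weights:
        w_r <- w_r + eta * e_i * B_r(x_i), i.e. Equation (A.1.4)."""
        B = rbf_outputs(X, centers, width)
        w = np.zeros(B.shape[1])
        for _ in range(epochs):
            for B_i, delta_i in zip(B, delta):
                e_i = delta_i - B_i @ w       # e_i = delta_i - y_i, (A.1.2)
                w += eta * e_i * B_i          # delta rule, (A.1.4)
        return w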
A.2 ANFIS Network Training Algorithm
In this section, we will consider the derivation of the parameter adaptation method
for the ANFIS architecture. The task is to minimize the total error (shown in
Equation (A.2.5)).
\[ E = \frac{1}{2}\sum_i e_i^2 \tag{A.2.5} \]
where
\[ e_i = \delta_i - y_i \tag{A.2.6} \]
where $\delta_i$ and $y_i$ are respectively the desired output and the output of the network.
The derivative of the total error function with respect to the weights in the
aggregation layer is given in Equation (A.2.7).
\[ \frac{\partial E}{\partial \alpha_{rd}} = \frac{\partial E}{\partial e_i}\cdot\frac{\partial e_i}{\partial y_i}\cdot\frac{\partial y_i}{\partial w_{ir}}\cdot\frac{\partial w_{ir}}{\partial \alpha_{rd}} = -e_i \pi_{ir} x_{id} \tag{A.2.7} \]
Applying Equation (A.2.7) to the delta rule we obtain Equation (A.2.8).
\[ \alpha_{rd}^{\mathrm{new}} = \alpha_{rd}^{\mathrm{old}} - \eta\frac{\partial E}{\partial \alpha_{rd}} = \alpha_{rd}^{\mathrm{old}} + \eta\, e_i \pi_{ir} x_{id} \tag{A.2.8} \]
where $\eta$ is a learning constant, $x \in \mathbb{R}^{I\times D}$, $i = 1, 2, \ldots, I$, $d = 1, 2, \ldots, D$.
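The corresponding update step can be sketched in the same way; the layout of the weight matrix alpha and of the firing strengths pi_i below are assumptions based only on the notation of this appendix.

    import numpy as np

    def update_alpha(alpha, x_i, e_i, pi_i, eta=0.1):
        """One delta-rule step for the aggregation-layer weights:
        alpha_rd <- alpha_rd + eta * e_i * pi_ir * x_id, Equation (A.2.8).
        alpha has shape (R, D), pi_i shape (R,), x_i shape (D,)."""
        return alpha + eta * e_i * np.outer(pi_i, x_i)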
A.3 Gradient determination in EANFIS Layer 4 for continuous output values
In this section, we will consider the derivation of the learning rules for the EANFIS
architecture when the output is a continuous value. The total error is
$E = \sum_i e_i^2$, where $e_i = \delta_i - y_i$, and $\delta_i$ and $y_i$ are respectively
the desired output and the output of the EANFIS architecture. The parameter
estimation of the weight $\gamma_r$ (please see Chapter 4 for the definition of the
parameters) depends on the evaluation of the following partial derivatives:
\[ \frac{\partial E}{\partial \gamma_r} = \frac{\partial E}{\partial E_i}\sum_{i=1}^{I}\frac{\partial E_i}{\partial e_i}\cdot\frac{\partial e_i}{\partial y_i}\cdot\sum_{r=1}^{R}\frac{\partial y_i}{\partial \bar{\pi}_{ir}}\cdot\frac{\partial \bar{\pi}_{ir}}{\partial \pi_{ir}}\cdot\frac{\partial \pi_{ir}}{\partial \gamma_r} \tag{A.3.9} \]
\[ \frac{\partial E}{\partial E_i} = \frac{1}{I} \tag{A.3.10} \]
\[ \frac{\partial E_i}{\partial e_i} = e_i \tag{A.3.11} \]
\[ \frac{\partial e_i}{\partial y_i} = -1 \tag{A.3.12} \]
\[ \frac{\partial y_i}{\partial \bar{\pi}_{ir}} = w_{ir} \tag{A.3.13} \]
\[ \frac{\partial \bar{\pi}_{ir}}{\partial \pi_{ir}} = \frac{\left(\sum_r \pi_{ir}\right) - \pi_{ir}}{\left(\sum_r \pi_{ir}\right)^2} \tag{A.3.14} \]
\[ \frac{\partial \pi_{ir}}{\partial \gamma_r} = (1 - \pi_{ir})(10\pi_{ir} - 5\pi_{ir}) \tag{A.3.15} \]
By substituting Equations (A.3.10)-(A.3.15) into Equation (A.3.9) we obtain
Equation (A.3.16).
\[ \frac{\partial E}{\partial \gamma_r} = \frac{1}{I}\sum_{i=1}^{I} e_i \sum_{r=1}^{R} w_{ir}\cdot\frac{\left(\sum_r \pi_{ir}\right) - \pi_{ir}}{\left(\sum_r \pi_{ir}\right)^2}\cdot(\pi_{ir} - 1)(10\pi_{ir} - 5\pi_{ir}) \tag{A.3.16} \]
A.4 Gradient determination in EANFIS Layer 4 for discrete output types
In this case, the derivation is quite similar to the one derived for the continuous
output values in the previous section. The total error is
$E = \sum_i e_i^2 = \sum_i (\delta_i - y_i)^2$, where $\delta_i$ and $y_i$ are respectively
the desired output and the output from the EANFIS architecture. The learning rule
then depends on the evaluation of the following partial derivative (please refer to
Chapter 4 for the notation).
\[ \frac{\partial E}{\partial \gamma_r^{\tau}} = \frac{\partial E}{\partial E_i}\sum_{i=1}^{I}\frac{\partial E_i}{\partial e_i^{\tau}}\cdot\frac{\partial e_i^{\tau}}{\partial \bar{\pi}_{ir}^{\tau}}\cdot\frac{\partial \bar{\pi}_{ir}^{\tau}}{\partial \pi_{ir}}\cdot\frac{\partial \pi_{ir}}{\partial \gamma_r^{\tau}} \tag{A.4.17} \]
and we have:
\[ \frac{\partial E}{\partial E_i} = \frac{1}{I} \tag{A.4.18} \]
\[ \frac{\partial E_i}{\partial e_i^{\tau}} = e_i^{\tau} \tag{A.4.19} \]
\[ \frac{\partial e_i^{\tau}}{\partial \bar{\pi}_{ir}^{\tau}} = -1 \tag{A.4.20} \]
\[ \frac{\partial e_i^{\neg\tau}}{\partial \bar{\pi}_{ir}^{\tau}} = -1 \tag{A.4.21} \]
\[ \frac{\partial \bar{\pi}_{ir}^{\tau}}{\partial \pi_{ir}} = \frac{\left(\sum_r \pi_{ir}\right) - \pi_{ir}}{\left(\sum_r \pi_{ir}\right)^2} \tag{A.4.22} \]
\[ \frac{\partial \pi_{ir}}{\partial \gamma_r^{\tau}} = (1 - \pi_{ir})(10\pi_{ir} - 5\pi_{ir}) \tag{A.4.23} \]
By substituting Equations (A.4.18)-(A.4.23) into Equation (A.4.17) we obtain
Equation (A.4.24).
\[ \frac{\partial E}{\partial \gamma_r^{\tau}} = \frac{1}{I}\sum_{i=1}^{I} e_i^{\tau}\cdot\frac{\left(\sum_r \pi_{ir}\right) - \pi_{ir}}{\left(\sum_r \pi_{ir}\right)^2}\cdot(\pi_{ir} - 1)(10\pi_{ir} - 5\pi_{ir}) \tag{A.4.24} \]
Appendix B
An example of Linear Discriminant Analysis transformation
In this section, we provide a detailed worked example of Fisher's linear
discriminant analysis based on the iris data set. The input data shown in Table B.1
contain two discrete output classes. Figure B.1 shows the original input data.
Assume that two rule sets are formed by the Apriori algorithm. The first rule set
contains data labels {1, 2, 3, 4} and the second rule set contains data labels
{5, 6, 7, 8}. The covariances
\[ S_r = \frac{1}{n-1}\left(X^{(r)} - \bar{X}^{(r)}\right)^T\left(X^{(r)} - \bar{X}^{(r)}\right), \quad r = 1, 2 \]
are formed as follows:
Table B.1: Input data

Data Label        d1        d2        δ
1                 0.2520    0.2036    1
2                 0.2982    0.1018    1
3                 0.2828    0.2962    1
4                 0.2571    0.2314    1
mean X̄(1)         0.2725    0.2083
5                 0.3908    0.3703    2
6                 0.3959    0.4813    2
7                 0.4217    0.4443    2
8                 0.4628    0.4906    2
mean X̄(2)         0.4178    0.4466
global mean X̄     0.3452    0.3274
Figure B.1: Input data (scatter of d1 versus d2; legend: Type 1, Type 2)
• $S^{(1)}$: the covariance matrix of rule set 1:
\[ S^{(1)} = \frac{1}{4-1}\left(\begin{bmatrix} 0.2520 & 0.2036 \\ 0.2982 & 0.1018 \\ 0.2828 & 0.2962 \\ 0.2571 & 0.2314 \end{bmatrix} - \begin{bmatrix} 0.2725 & 0.2083 \\ 0.2725 & 0.2083 \\ 0.2725 & 0.2083 \\ 0.2725 & 0.2083 \end{bmatrix}\right)^T \left(\begin{bmatrix} 0.2520 & 0.2036 \\ 0.2982 & 0.1018 \\ 0.2828 & 0.2962 \\ 0.2571 & 0.2314 \end{bmatrix} - \begin{bmatrix} 0.2725 & 0.2083 \\ 0.2725 & 0.2083 \\ 0.2725 & 0.2083 \\ 0.2725 & 0.2083 \end{bmatrix}\right) \]
\[ S^{(1)} = \frac{1}{4-1}\begin{bmatrix} -0.0206 & -0.0046 \\ 0.0257 & -0.1064 \\ 0.0103 & 0.0879 \\ -0.0154 & 0.0231 \end{bmatrix}^T \begin{bmatrix} -0.0206 & -0.0046 \\ 0.0257 & -0.1064 \\ 0.0103 & 0.0879 \\ -0.0154 & 0.0231 \end{bmatrix} = \begin{bmatrix} 0.0004760 & -0.0006981 \\ -0.0006981 & 0.006540 \end{bmatrix} \tag{B.0.1} \]
• $S^{(2)}$: the covariance matrix of rule set 2:
\[ S^{(2)} = \frac{1}{4-1}\left(\begin{bmatrix} 0.3908 & 0.3703 \\ 0.3959 & 0.4813 \\ 0.4217 & 0.4443 \\ 0.4628 & 0.4906 \end{bmatrix} - \begin{bmatrix} 0.4178 & 0.4466 \\ 0.4178 & 0.4466 \\ 0.4178 & 0.4466 \\ 0.4178 & 0.4466 \end{bmatrix}\right)^T \left(\begin{bmatrix} 0.3908 & 0.3703 \\ 0.3959 & 0.4813 \\ 0.4217 & 0.4443 \\ 0.4628 & 0.4906 \end{bmatrix} - \begin{bmatrix} 0.4178 & 0.4466 \\ 0.4178 & 0.4466 \\ 0.4178 & 0.4466 \\ 0.4178 & 0.4466 \end{bmatrix}\right) \]
\[ S^{(2)} = \frac{1}{4-1}\begin{bmatrix} -0.0270 & -0.0764 \\ -0.0219 & 0.0347 \\ 0.0039 & -0.0023 \\ 0.0450 & 0.0440 \end{bmatrix}^T \begin{bmatrix} -0.0270 & -0.0764 \\ -0.0219 & 0.0347 \\ 0.0039 & -0.0023 \\ 0.0450 & 0.0440 \end{bmatrix} = \begin{bmatrix} 0.001082 & 0.001091 \\ 0.001091 & 0.002992 \end{bmatrix} \tag{B.0.2} \]
• $S_b$: the between-class scatter:
\[ S_b = \sum_{r=1}^{R}\left(\bar{X}^{(r)} - \bar{X}\right)^T\left(\bar{X}^{(r)} - \bar{X}\right) \]
\[ S_b = \left(\begin{bmatrix} 0.2725 & 0.2083 \end{bmatrix} - \begin{bmatrix} 0.3452 & 0.3274 \end{bmatrix}\right)^T\left(\begin{bmatrix} 0.2725 & 0.2083 \end{bmatrix} - \begin{bmatrix} 0.3452 & 0.3274 \end{bmatrix}\right) + \left(\begin{bmatrix} 0.4178 & 0.4466 \end{bmatrix} - \begin{bmatrix} 0.3452 & 0.3274 \end{bmatrix}\right)^T\left(\begin{bmatrix} 0.4178 & 0.4466 \end{bmatrix} - \begin{bmatrix} 0.3452 & 0.3274 \end{bmatrix}\right) = \begin{bmatrix} 0.01055 & 0.01731 \\ 0.01731 & 0.02841 \end{bmatrix} \tag{B.0.3} \]
• $S_w$: the within-class scatter:
\[ S_w = \sum_{r=1}^{R}\frac{N_r}{N} S_r = \frac{4}{8}\begin{bmatrix} 0.0004760 & -0.0006981 \\ -0.0006981 & 0.006540 \end{bmatrix} + \frac{4}{8}\begin{bmatrix} 0.001082 & 0.001091 \\ 0.001091 & 0.002992 \end{bmatrix} = \begin{bmatrix} 0.0007789 & 0.0001963 \\ 0.0001963 & 0.0047661 \end{bmatrix} \tag{B.0.4} \]
There are two types of transformations: class independent and class dependent. The
normalized eigenvectors and eigenvalues of the class dependent transformation,
$S_r^{-1}S_b$, and of the class independent transformation, $S_w^{-1}S_b$, are shown
in Equations (B.0.5) and (B.0.7) respectively.

• Class dependent LDA transformation
\[ \lambda^{r} v^{r} = S_r^{-1} S_b v^{r} \]
\[ V^{(1)} = \begin{bmatrix} 0.7546 & -0.6562 \\ 0.3413 & 0.9400 \end{bmatrix}, \quad \lambda^{(1)} = \begin{bmatrix} 40.6390 \\ 0 \end{bmatrix}; \qquad V^{(2)} = \begin{bmatrix} 0.7133 & -0.7009 \\ 0.6891 & 0.7247 \end{bmatrix}, \quad \lambda^{(2)} = \begin{bmatrix} 11.9840 \\ 0 \end{bmatrix} \tag{B.0.5} \]
\[ \hat{X}^{(r)} = X^{(r)} V^{(r)}_{\{1 \ldots j\}} V^{(r)T}_{\{1 \ldots j\}}, \quad j = 1 \]
\[ \hat{X}^{(1)} = \begin{bmatrix} 0.2520 & 0.2036 \\ 0.2982 & 0.1018 \\ 0.2828 & 0.2962 \\ 0.2571 & 0.2314 \end{bmatrix}\begin{bmatrix} 0.7546 \\ 0.3413 \end{bmatrix}\begin{bmatrix} 0.7546 \\ 0.3413 \end{bmatrix}^T = \begin{bmatrix} 0.1959 & 0.0886 \\ 0.1961 & 0.0887 \\ 0.2373 & 0.1073 \\ 0.2060 & 0.0932 \end{bmatrix} \]
\[ \hat{X}^{(2)} = \begin{bmatrix} 0.3908 & 0.3703 \\ 0.3959 & 0.4813 \\ 0.4217 & 0.4443 \\ 0.4628 & 0.4906 \end{bmatrix}\begin{bmatrix} 0.7133 \\ 0.6891 \end{bmatrix}\begin{bmatrix} 0.7133 \\ 0.6891 \end{bmatrix}^T = \begin{bmatrix} 0.3808 & 0.3679 \\ 0.4380 & 0.4231 \\ 0.4329 & 0.4182 \\ 0.4766 & 0.4604 \end{bmatrix} \tag{B.0.6} \]
• Class independent LDA transformation
\[ \lambda v = S_w^{-1} S_b v \]
\[ V = \begin{bmatrix} 0.7511 & -0.6601 \\ 0.4137 & 0.9104 \end{bmatrix}, \quad \lambda = \begin{bmatrix} 17.8600 \\ 0 \end{bmatrix} \tag{B.0.7} \]
\[ \hat{X}^{(r)} = X^{(r)} V_{\{1 \ldots j\}} V^{T}_{\{1 \ldots j\}}, \quad j = 1 \]
\[ \hat{X}^{(1)} = \begin{bmatrix} 0.2520 & 0.2036 \\ 0.2982 & 0.1018 \\ 0.2828 & 0.2962 \\ 0.2571 & 0.2314 \end{bmatrix}\begin{bmatrix} 0.7511 \\ 0.4137 \end{bmatrix}\begin{bmatrix} 0.7511 \\ 0.4137 \end{bmatrix}^T = \begin{bmatrix} 0.2055 & 0.1132 \\ 0.1999 & 0.1101 \\ 0.2516 & 0.1386 \\ 0.2170 & 0.1195 \end{bmatrix} \]
\[ \hat{X}^{(2)} = \begin{bmatrix} 0.3908 & 0.3703 \\ 0.3959 & 0.4813 \\ 0.4217 & 0.4443 \\ 0.4628 & 0.4906 \end{bmatrix}\begin{bmatrix} 0.7511 \\ 0.4137 \end{bmatrix}\begin{bmatrix} 0.7511 \\ 0.4137 \end{bmatrix}^T = \begin{bmatrix} 0.3356 & 0.1848 \\ 0.3730 & 0.2054 \\ 0.3760 & 0.2071 \\ 0.4136 & 0.2278 \end{bmatrix} \tag{B.0.8} \]
(B.0.8)
The transformed data X using class dependent of rule set 1 and rule set 2 respec-
tively are respectively shown in Equation (B.0.6) and the transformed data X using
Appendix B. An example of Linear Discriminant Analysis transformation 254
class independent of rule set 1 and rule set 2 are respectively shown in Equation
(B.0.8).
• Class dependent LDA transformation
X(1) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.2520 0.2036
0.2982 0.1018
0.2828 0.2962
0.2571 0.2314
⎤⎥⎥⎥⎥⎥⎥⎥⎦→ X(1) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.1959 0.0886
0.1961 0.0887
0.2373 0.1073
0.2060 0.0932
⎤⎥⎥⎥⎥⎥⎥⎥⎦
X(2) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.3908 0.3703
0.3959 0.4813
0.4217 0.4443
0.4628 0.4906
⎤⎥⎥⎥⎥⎥⎥⎥⎦→ X(2) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.3808 0.3679
0.4380 0.4231
0.4329 0.4182
0.4766 0.4604
⎤⎥⎥⎥⎥⎥⎥⎥⎦
• Class Independent LDA transformation
X(1) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.2520 0.2036
0.2982 0.1018
0.2828 0.2962
0.2571 0.2314
⎤⎥⎥⎥⎥⎥⎥⎥⎦→ X(1) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.2055 0.1132
0.1999 0.1101
0.2516 0.1386
0.2170 0.1195
⎤⎥⎥⎥⎥⎥⎥⎥⎦
X(2) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.3908 0.3703
0.3959 0.4813
0.4217 0.4443
0.4628 0.4906
⎤⎥⎥⎥⎥⎥⎥⎥⎦→ X(2) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.3356 0.1848
0.3730 0.2054
0.3760 0.2071
0.4136 0.2278
⎤⎥⎥⎥⎥⎥⎥⎥⎦
Figures B.2 and B.3 respectively show the transformed data by class-dependent
LDA transformation and class-independent LDA transformation.
Figure B.2: Class-dependent LDA transformation (transformed data, d1 versus d2; legend: Type 1, Type 2)
Figure B.3: Class-independent LDA transformation (transformed data, d1 versus d2; legend: Type 1, Type 2)
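This worked example can be reproduced with a few lines of NumPy, as in the sketch below, which follows the equations of this appendix. Note that the mapped-back coordinates in Equations (B.0.6) and (B.0.8) depend on the scaling and sign convention used for the eigenvectors, so the printed values match the matrices above only up to that normalisation.

    import numpy as np

    X1 = np.array([[0.2520, 0.2036], [0.2982, 0.1018],
                   [0.2828, 0.2962], [0.2571, 0.2314]])
    X2 = np.array([[0.3908, 0.3703], [0.3959, 0.4813],
                   [0.4217, 0.4443], [0.4628, 0.4906]])

    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    m = np.vstack([X1, X2]).mean(axis=0)

    # per-rule-set covariances, S_r = (X_r - mean)^T (X_r - mean) / (n - 1)
    S1 = (X1 - m1).T @ (X1 - m1) / (len(X1) - 1)
    S2 = (X2 - m2).T @ (X2 - m2) / (len(X2) - 1)

    # between-class and within-class scatter, Equations (B.0.3)-(B.0.4)
    Sb = np.outer(m1 - m, m1 - m) + np.outer(m2 - m, m2 - m)
    Sw = 0.5 * S1 + 0.5 * S2

    # class-independent transformation: leading eigenvector of Sw^{-1} Sb
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    v = vecs[:, np.argmax(vals.real)].real
    Xhat1 = X1 @ np.outer(v, v)   # project onto v and map back, j = 1
    Xhat2 = X2 @ np.outer(v, v)   # class-dependent case: use S1 or S2 in place of Sw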