Extended adaptive neuro-fuzzy inference systems
Chun Yin Lau, University of Wollongong
Lau, Chun Yin, Extended adaptive neuro-fuzzy inference systems, PhD thesis, School of Information Technology and Computer Science, University of Wollongong, 2006. http://ro.uow.edu.au/theses/564
This thesis is posted at Research Online: http://ro.uow.edu.au/theses/564
Extended Adaptive Neuro-Fuzzy Inference Systems
A thesis submitted in fulfillment of the requirements for the
award of the degree:
Doctor of Philosophy
UNIVERSITY OF WOLLONGONG
Lau Chun Yin, Dip.CompSc., B.CompSc., M.CompSc.
Faculty of Informatics
School of Information Technology & Computer Science
2006
Certification
I, Lau Chun Yin, declare that this thesis, submitted in partial fulfillment of the
requirements for the award of the degree of Doctor of Philosophy, in the School of
Information Technology & Computer Science, Faculty of Informatics, University of
Wollongong, is wholly my own work unless otherwise referenced or acknowledged.
This document has not been submitted for qualifications at any other academic institution.
Lau Chun Yin
3rd October 2006
Acknowledgements
Pursuing a PhD degree has been an unforgettable experience in my life. After I received a Master of Computer Science degree at the University of Wollongong, I worked at Chu Hai College of Higher Education in Hong Kong. The College is in the process of upgrading its standing to become a university. The accreditation process by the Hong Kong Council for Academic Accreditation provided the impetus for the College to upgrade the qualifications of its staff, and its facilities, accordingly. I am one of the staff members who participated in the staff development program to study for a PhD degree.
I would like to thank my wife Freda and my family for their full support. I would also like to express my sincere gratitude to my supervisor, Prof. Ah Chung Tsoi, formerly of the University of Wollongong and now at Monash University, and my co-supervisor, Prof. Tommy Chow Wai Shing of the City University of Hong Kong, for their profound knowledge of my research area, and for their inspiration and expert guidance. Finally, I would like to thank Dr. Kong Yau Pak of Chu Hai College, and Prof. Ah Chung Tsoi again, for providing me with the opportunity to study for a PhD degree at the University of Wollongong.
Abstract
This thesis presents a novel extension to the Adaptive Neuro-Fuzzy Inference System (ANFIS), which we call the extended ANFIS (EANFIS). The extension includes: the introduction of an output-class-based membership function architecture, in which each output class in a discrete output situation has its own membership function (in the case of a continuous output, there is only one class); the possibility of determining the structure of the rule base from the underlying structure of the input variables; the determination of a possibly non-symmetric membership function whose parameters can be determined automatically from the given input variables; and the possibility of incorporating global information on the input variables, through a Linear Discriminant Analysis, in combination with the local input variable structure as represented by the membership functions. The possibility of determining the structure of the rule section before the training process commences means that the proposed EANFIS architecture can be applied to possibly large-scale practical problems, as it does not require the formation of all possible combinations of rules before the training process commences. In other words, the EANFIS architecture, together with its structure-determining procedures, overcomes the current limitation facing the ANFIS architecture when it is applied to systems with a large number of inputs. The possibility of determining a membership function from the input variables means the user no longer needs to select a membership function from a set of candidate membership functions. The possibility of incorporating global information on the input variables, in addition to the local information, means that the EANFIS architecture can take advantage of situations in which such global information is useful in improving the performance of the neuro-fuzzy system. The new EANFIS architecture is evaluated on a number of standard benchmark problems, and has been found to have superior performance. In addition, as this is an EANFIS, rules can be extracted from the trained system, thus providing information on the way in which the underlying system operates. The proposed EANFIS recommends itself readily for applications in practical systems.
Abbreviations
ANFIS Adaptive Network Based Fuzzy Inference System
ANN Artificial Neural Network
COA Centroid Of Area
FIS Fuzzy Inference System
FLD Fisher Linear Discriminant
LDA Linear Discriminant Analysis
MLP Multilayer Perceptron
N2Lmap Nonlinear to Linear mapping
NOAA National Oceanic and Atmospheric Administration
PCA Principal Component Analysis
QRcp QR factorization with column pivoting
RBFN Radial Basis Function Network
RMS Root Mean Square
SOM Self Organizing Map
SVD Singular Value Decomposition
SVM Support Vector Machines
TSK Takagi-Sugeno-Kang Fuzzy Model
Notation
$[\,]$: continuous set
$\{\}$: discrete set
plain text: a plain typeface indicates a scalar variable
bold: a bold face indicates a vector
BOLD: a bold face with capital letters indicates a matrix
CAPITAL: a capital letter indicates the upper bound of a variable
$\alpha_{rd}$: weight in the Takagi-Sugeno-Kang (TSK) fuzzy model of rule $r$ in dimension $d$
$\delta$: a desired value
$\Delta$: a small constant
$\eta$: a learning constant
$\theta_r$: output from the $r$-th Linear Discriminant Analysis (LDA) node
$\pi_r$: output from the probability layer of rule $r$
$\bar{\pi}_r$: normalised output from the probability layer of rule $r$
$\varphi_r$: output from the rule layer of rule $r$
$\bar{\varphi}_r$: normalised output from the rule layer of rule $r$
$\tau$: output cluster type, where $\tau \in \{1, \ldots, T\}$; $T$ is the upper bound of $\tau$
$\sigma$: the spread of a Gaussian function
$\phi$: an activation function
$B_r$: the $r$-th basis function of the Radial Basis Function Network (RBFN)
$\mathbf{c}_r$: membership function centre of rule $r$, where $\mathbf{c}_r \in \mathbb{R}^D$
$d$: data dimension, where $d \in \{1, \ldots, D\}$; $D$ is the upper bound of $d$
$e$: an error value
$g$: index of a grid point, where $g \in \{1, \ldots, G\}$; $G$ is the upper bound of $g$
$i$: index of an input vector, where $i \in \{1, \ldots, I\}$; $I$ is the upper bound of $i$
$r$: index of a fuzzy rule, where $r \in \{1, \ldots, R\}$; $R$ is the upper bound of $r$
$\mathbf{x}_i$: the $i$-th input vector, where $\mathbf{x}_i \in \mathbb{R}^D$
$w_r$: a weight attached to rule $r$
$y_i$: the system output value of the $i$-th input pair
$z_{gd}$: the $g$-th nonlinear grid centre in dimension $d$
Contents
Certification
Acknowledgements
Abstract
Abbreviations
Notation
1 Introduction
1.1 Introduction
1.2 Neuro-fuzzy systems
1.3 Objectives
1.4 The contribution of this thesis
1.5 Structure of the thesis
2 Neuro-Fuzzy Systems
2.1 Background on Fuzzy concepts
2.2 Fuzzy rules
2.2.1 Reasoning with fuzzy rules
2.3 Fuzzy inference System
2.4 Neuro-Fuzzy Inference System
2.5 Adaptive Neuro Fuzzy Inference System (ANFIS)
2.5.1 A Feed-Forward Network
2.5.2 Network Training
2.5.3 Network Pruning
2.6 Radial Basis Function Networks (RBFN)
2.7 Nonlinear Approximation Method proposed by Schilling et al.
2.8 Kohonen Self-Organising Map (SOM)
3 Non-uniform Grid Construction in a Radial Basis Function Neural Network
3.1 Motivation
3.2 Method of obtaining non-linear grid points
3.3 Application Examples
3.3.1 Van der Pol Oscillator
3.3.2 Currency exchange rate between the US Dollar and the Indonesian Rupiah
3.3.3 Sunspot Cycle Time Series
3.3.4 Experiments with the Iris Dataset
3.4 Conclusions
4 Extended Adaptive Neuro-Fuzzy Inference Systems
4.1 Motivation
4.2 Introduction to Extended Adaptive Neuro-Fuzzy Inference System
4.3 Architecture of the Extended Adaptive Neuro-Fuzzy Inference System
4.3.1 Remarks
4.4 Structure determination of the proposed neuro-fuzzy architecture
4.4.1 A proposed algorithm for rule formation
4.5 Determination of candidate membership function and the required number of membership functions in each input variable
4.6 Parameter estimation
4.6.1 Error Correction Layer (Layer 4) parameter learning
4.6.2 Output layer parameter learning
4.6.3 Summary
4.7 Experimental Results
4.7.1 Exclusive OR problem
4.7.2 Sunspot Cycle Time Series
4.7.3 Mackey-Glass Time Series example
4.7.4 Iris Dataset
4.7.5 Wisconsin Breast Cancer example
4.7.6 Inverted Pendulum on a cart
4.8 Conclusions
5 Combining Local and Global Input Structures for the Extended Adaptive Neuro-Fuzzy Inference System
5.1 Motivation
5.2 Introduction
5.3 Possible architectures for combining the local and global methods
5.4 Possible Global methods
5.4.1 Principal Component Analysis (PCA)
5.4.2 Linear Discriminant Analysis
5.4.3 Selection of the combined architecture
5.5 Application Examples
5.5.1 Sunspot Cycle
5.5.2 Iris Dataset
5.5.3 Wisconsin Breast Cancer
5.6 Conclusions
6 Conclusions and Recommendations
6.1 Conclusions
6.2 Future areas of research
References
Appendix
A Network Training Algorithm
A.1 RBFN Network Training Algorithm
A.2 ANFIS Network Training Algorithm
A.3 Gradient determination in EANFIS Layer 4 for continuous output values
A.4 Gradient determination in EANFIS Layer 4 for discrete output types
B An example of Linear Discriminant Analysis transformation
List of Figures
2.1 A block diagram of a Fuzzy Inference System [12].
2.2 Mamdani fuzzy inference system using min and max operators [12].
2.3 An alternative Mamdani fuzzy inference system using product and max operators [12].
2.4 Tsukamoto fuzzy inference system [12].
2.5 Sugeno fuzzy inference system [12].
2.6 Architecture of a Neuro Fuzzy System.
2.7 Architecture of ANFIS.
2.8 Architecture of RBFN.
2.9 An example of the distribution of Gaussian function centers in a uniformly distributed fashion.
2.10 An example of a non-uniformly distributed scheme of Gaussian function centers.
2.11 The pseudo-code implementation of Schilling et al.'s mapping function [39].
2.12 Block diagram of Schilling et al.'s mapping method.
2.13 A flow chart showing the implementation of a non-linear grid in a Radial Basis Function Neural Network.
2.14 SOM Mexican hat update function.
3.1 A pseudocode representation of the proposed turning point detection algorithm.
3.2 Uniform grid point distribution in the d-th dimension of a given signal.
3.3 Updating algorithm for finding the set of non-linear grid points.
3.4 A diagram illustrating a triangular function on uniformly distributed grid points.
3.5 A diagram illustrating the determination of the centers and spreads of a non-uniformly distributed set of grid points.
3.6 The magnitude update using triangular functions in the grid point location updating algorithm.
3.7 The magnitude update using Gaussian functions in the grid point location updating algorithm.
3.8 The magnitude of the update using one grid point on either side of the function center.
3.9 The magnitude of the update in the updating algorithm using two grid points on either side of grid function center b.
3.10 Performance comparisons of RBFN using three different regimes: linear grid, nonlinear grid, and the transformation of the nonlinear grid to the linear grid (N2Lmap) method.
3.11 The set of turning points superimposed on the original signal for the van der Pol equation.
3.12 The distribution of the set of grid points. The upper graph shows the distribution using a linear grid, while the lower graph shows the location of the grid points using a nonlinear grid regime. The total number of grid points used is 15.
3.13 The actual output of a RBFN using 15 grid points in a linear grid regime. It is observed that the output is significantly different from the original output of the van der Pol equation.
3.14 The differences in the output of the van der Pol equation and the reconstructed one using a RBFN with 12 grid points in a linear grid regime.
3.15 The output of a RBFN using 15 grid points and a nonlinear grid regime.
3.16 The output differences between the original signal from the van der Pol equation and a reconstructed signal using a RBFN with 15 grid points in a nonlinear grid regime.
3.17 The output of a RBFN with 15 grid points using a nonlinear grid regime. In this case, we use the nonlinear grid mapped onto a linear grid using the method proposed by Schilling et al. [39].
3.18 The output differences between the original signal and the reconstructed signal using a nonlinear to linear grid mapping as proposed by Schilling et al. [39]. The number of grid points used is 15.
3.19 The distribution of the grid points. The upper graph shows the linear grid point distribution, while the lower graph shows the distribution using a nonlinear grid. The number of grid points used is 40.
3.20 The output of a RBFN with 40 grid points using a linear grid regime.
3.21 The output differences between the original signal and the reconstruction using a RBFN in a linear grid regime with 40 grid points.
3.22 The output of a RBFN using a nonlinear grid regime with 40 grid points.
3.23 The output differences between the original signal and the reconstruction using a RBFN with a nonlinear grid regime using 40 grid points.
3.24 The output of a RBFN using a nonlinear grid mapped onto a linear grid with 40 grid points.
3.25 The output differences between the original signal and a reconstructed signal using a RBFN with a nonlinear grid mapped onto a linear grid using 40 grid points.
3.26 The currency exchange time series between the US dollar and the Indonesian Rupiah, between 1st January, 1994 and 31st December, 1999. Note that the vertical axis of this graph is normalised, with 0 denoting 1 USD to 2,160 IDR and the maximum 1 denoting 1 USD to 16,475 IDR.
3.27 The variation of the root mean square error values as a function of the number of grid points used.
3.28 The set of turning points in the time series of USD to IDR. Note that we have connected the points so as to make it easier to discern where the turning points are.
3.29 The distribution of the grid points. The upper graph shows the distribution of the linear grid points, while the lower graph shows the distribution of the nonlinear grid points.
3.30 The output of a RBFN using 25 grid points with a linear grid regime.
3.31 The output differences between the original signal and the reconstruction using a RBFN with 25 grid points in a linear grid regime.
3.32 The output of a RBFN using 25 grid points with a nonlinear grid regime.
3.33 The output differences between the original signal and the reconstruction using a RBFN with 25 grid points in a nonlinear grid regime.
3.34 The output of a RBFN using 25 grid points but with a mapping from the nonlinear grid to a linear grid.
3.35 The output differences between the original signal and the reconstruction using 25 grid points with a mapping from the nonlinear grid to a linear grid.
3.36 The distribution of the grid points. The upper graph shows the distribution of the linear grid points, while the lower graph shows the distribution of the nonlinear grid points. The number of grid points used is 100.
3.37 The output of a RBFN using 100 grid points with a linear grid regime.
3.38 The output differences between the original signal and the one reconstructed from a RBFN using 100 grid points with a linear grid regime.
3.39 The output of a RBFN using a nonlinear grid regime with 100 grid points.
3.40 The output differences between the original signal and the one reconstructed from a RBFN with a nonlinear grid regime using 100 grid points.
3.41 The output of a RBFN using a mapping from the nonlinear grid to a linear grid regime using 100 grid points.
3.42 The output differences between the original signal and the one reconstructed from a RBFN with a mapping from the nonlinear grid to a linear grid regime using 100 grid points.
3.43 The monthly average sunspot number time series from January 1749 to July 2004. The x-axis is normalised to lie between 0 and 1; similarly, the y-axis is also normalised to lie between 0 and 1.
3.44 The set of turning points for the NOAA sunspot number time series.
3.45 The variation of the RMS error values as a function of the number of grid points.
3.46 Grid point distribution using 50 grid points. The upper graph shows the distribution of the linear grid points, while the lower graph shows the distribution of the nonlinear grid points.
3.47 The output and the differences between the original signal and the reconstructed one using a RBFN with 50 linear grid points.
3.48 The output and the differences between the original signal and the reconstructed one using a RBFN with a nonlinear grid regime using 50 grid points.
3.49 The output and the differences between the original signal and the reconstructed one using a RBFN with a mapping from the nonlinear grid to a linear grid regime with 50 grid points.
3.50 Grid point distribution using 100 grid points for the sunspot cycle time series. The upper graph shows the linear grid distribution, while the lower graph shows the distribution of grid points using a nonlinear grid regime.
3.51 The actual output and the differences between the original signal and the output reconstructed using a RBFN with 100 linear grid points.
3.52 The actual output and the differences between the original signal and the reconstructed output using a RBFN with 100 grid points in a nonlinear grid regime.
3.53 The actual output and the differences between the original signal and the reconstructed output using a RBFN with a mapping of the nonlinear grid to a linear grid regime with 100 grid points.
3.54 The RMS error values as a function of the number of grid points per dimension.
3.55 Grid point distributions of the iris data set. The upper graphs in each sub-graph show the distribution of the linear grid points, while the lower graphs show the nonlinear grid points.
3.56 Basis function coverage in two dimensions.
3.57 Number of neurons used in a RBFN when the input dimension is four.
4.1 EANFIS architecture.
4.2 An example illustrating the determination of the maximum itemset in the Apriori algorithm.
4.3 An example to illustrate the proposed rule formation algorithm.
4.4 Illustration of the distribution of grid points in the d-th dimension of the input x_d.
4.5 The pseudo-code implementation of the proposed algorithm for the formation of clusters. The small diagram in the top right-hand corner illustrates when a new cluster is formed, and when grid points are merged together to form a cluster.
4.6 Example of finding the membership function using the proposed self-organising mountain clustering membership function method.
4.7 A diagram to illustrate the training of the EANFIS for the case of discrete output classes. The notation ¬ denotes the negative of the output class τ.
4.8 The resulting data clusters for the exclusive OR problem after the application of the mountain clustering membership function; '*' denotes the first cluster, 'o' denotes the second cluster.
4.9 The Gaussian membership functions for the ANFIS architecture for the exclusive OR problem. There are two membership functions per dimension.
4.10 The architecture of the XOR problem as found by the proposed EANFIS architecture.
4.11 The monthly average sunspot number. The training data (55 years) are shown in dark, while the testing data (200 years) are shown in a lighter colour.
4.12 The self-organising mountain clustering membership functions of the sunspot cycle time series; '*' denotes the first cluster, 'o' denotes the second cluster.
4.13 The combination of the fuzzy rules for the EANFIS architecture.
4.14 The prediction results of the monthly average sunspot number time series using an EANFIS architecture with a linear grid regime and the self-organising mountain clustering membership functions.
4.15 The prediction results of the monthly average sunspot number time series using an ANFIS architecture with Gaussian membership functions.
4.16 The architecture found by using the proposed EANFIS architecture.
4.17 The output of the Mackey-Glass equation.
4.18 The Gaussian membership function for the Mackey-Glass equation example.
4.19 The outputs of a neuro-fuzzy network using the ANFIS architecture with 16 Gaussian membership functions for the Mackey-Glass equation.
4.20 The outputs of the EANFIS architecture with 12 Gaussian membership functions for the Mackey-Glass equation.
4.21 The difference between the output of the ANFIS architecture with 16 Gaussian membership functions and the original signal for the Mackey-Glass equation.
4.22 The difference between the output of the EANFIS architecture with 12 Gaussian membership functions and the original signal for the Mackey-Glass equation.
4.23 The architecture found by using the proposed EANFIS architecture.
4.24 The membership function (mountain clustering) for the Iris data set.
4.25 The architecture found using the proposed EANFIS architecture.
4.26 The self-organising mountain clustering membership functions for the Wisconsin breast cancer example. The solid line denotes "benign", and the dashed line denotes "malignancy".
4.27 The architecture found using the proposed EANFIS architecture.
4.28 Inverted pendulum on a cart.
4.29 Block diagram of the inverted pendulum on a cart control system.
4.30 Control force of the training system.
4.31 Input status of the inverted pendulum on a cart control system.
4.32 Control force of the EANFIS system.
4.33 Input status of the inverted pendulum on a cart control system using EANFIS.
4.34 Control force of the ANFIS system.
4.35 Input status of the inverted pendulum on a cart control system using ANFIS.
4.36 The architecture found using the proposed EANFIS architecture.
5.1 A block diagram to show the preprocessing method of combining the EANFIS architecture and a global method.
5.2 A block diagram to show the preprocessing method of combining the ANFIS architecture and a global method.
5.3 A block diagram showing the parallel connection of the global module with the membership function module in an extended EANFIS architecture.
5.4 A block diagram showing the series-parallel connection of the global module and the series connection of the membership function module and the competitive and normalisation layers in the EANFIS architecture.
5.5 The raw data of the first and second dimensions of the iris data set.
5.6 The iris flower dataset projected down to one transformed dimension using the PCA algorithm.
5.7 The iris flower dataset projected from three dimensions to two transformed dimensions using the PCA algorithm.
5.8 The iris flower data set projected down onto one transformed dimension using the class-dependent LDA method.
5.9 The iris flower data set projected down to two transformed dimensions with a class-dependent LDA method.
5.10 The iris flower data set projected down to one transformed dimension with class-independent LDA.
5.11 The iris flower data set projected down to two transformed dimensions with class-independent LDA.
5.12 The extended adaptive neuro-fuzzy inference system with the LDA method.
5.13 Monthly average sunspot number time series, in which the first 55 years of data are used for training and the remaining 200 years of data are used for testing. The training data are shown as a continuous line, while the testing data are shown as a dotted line.
5.14 The sunspot number time series data set using a class-dependent LDA technique to project it to one transformed dimension.
5.15 The sunspot number time series data set using a class-independent LDA technique to project it to one transformed dimension.
5.16 The prediction outputs of the class-dependent LDA method combined with the EANFIS architecture.
5.17 The prediction output errors of the class-dependent LDA method combined with the EANFIS architecture.
5.18 The membership function formed using the self-organizing mountain clustering membership function method. '.' denotes iris-setosa, 'o' denotes iris-versicolor and '+' denotes iris-virginica.
5.19 The breast cancer dataset projected down onto one transformed dimension with the class-dependent LDA method.
5.20 The breast cancer dataset projected down to one transformed dimension with the class-independent LDA method.
B.1 Input data.
B.2 Class-dependent LDA transformation.
B.3 Class-independent LDA transformation.
List of Tables
3.1 Output results comparisons for the van der Pol equation example.
3.2 Comparison of the root mean square values between using 25 grid points and 100 grid points for the currency exchange time series.
3.3 Output results comparisons.
4.1 The input-output pairs of the exclusive OR problem.
4.2 The fuzzy rules found for the exclusive OR problem using our proposed method for rule formation.
4.3 The results of the XOR problem, comparing three methods: the EANFIS architecture with the mountain clustering membership function, the EANFIS architecture with Gaussian membership functions, and the ANFIS architecture with Gaussian membership functions.
4.4 The fuzzy rules found for the sunspot cycle time series.
4.5 The RMS errors of applying various methods to the sunspot number time series. Please see the text for an explanation of the experimental conditions.
4.6 The RMS errors on the Mackey-Glass example, comparing ANFIS, EANFIS with the self-organising mountain clustering membership function, and EANFIS with Gaussian membership functions.
4.7 The outcomes of applying the EANFIS architecture and the ANFIS architecture to the Iris data set. The values reported in this table are obtained from an average of 100 experiments using randomly selected sets of 99 training data samples and 51 testing data samples.
4.8 The extracted fuzzy rules for the Iris dataset. These rules are used in the EANFIS architecture.
4.9 Comparison of the prediction capabilities of the ANFIS architecture with membership functions and the EANFIS architecture with linear and nonlinear grid regimes, using a single-weight output layer.
4.10 Comparison of the prediction capabilities of the ANFIS architecture with membership functions and the EANFIS architecture with linear and nonlinear grid regimes, using a TSK output layer.
4.11 The extracted fuzzy rules for the Wisconsin breast cancer example. These rules are used in the EANFIS architecture.
4.12 The RMS errors of ANFIS compared with the different EANFIS architectures.
5.1 Using the LDA methods as pre-processing methods in combination with the ANFIS or EANFIS architectures.
5.2 Prediction RMS errors for the sunspot number time series.
5.3 Prediction classification accuracy comparison on the Iris data set.
5.4 Output accuracy comparison of the Wisconsin breast cancer data set using various architectures.
5.5 Output accuracy comparison on the Wisconsin breast cancer data set.
B.1 Input data.
Chapter 1
Introduction
1.1 Introduction
The dream of a machine which can mimic human thinking has always been a research topic in computer science. Even with the advent of more and more powerful computers, this dream has remained elusive. In the 1970s much research was carried out on what are commonly called "expert systems", as part of the overall idea of having a computing machine which can mimic human thoughts and human actions. An expert system normally has two components: a knowledge base and an inference engine. The knowledge base stores relevant facts about the system under study in a machine-understandable manner, while the inference engine is capable of reasoning based on the knowledge stored in the knowledge base, in response to queries by the user. Indeed, there were a
number of widely publicised expert systems, among them one called MYCIN, which was capable of medical diagnosis, and another called PROSPECTOR, which was capable of mineral prospecting [2]. These systems are alternatively called expert systems or rule based systems.
However, it soon transpired that while expert systems like MYCIN and PROSPECTOR worked well in their respective domains, it was quite difficult to use the methodology on new domains. It also soon transpired that to implement one of these expert systems, a knowledge engineer (an expert in acquiring the necessary knowledge, and able to "translate" it into a format compatible with expert systems) would be required to solicit the knowledge from domain experts. The domain experts are people who have knowledge of the domains of interest, but who may not be well versed in the expert system methodology. This knowledge solicitation effort is often called knowledge acquisition. It was discovered that the knowledge acquisition process is far from trivial and requires extensive collaboration between the domain expert and the knowledge engineer. It was also soon discovered that expert systems are relatively "brittle". In other words, these systems can provide inference from user queries; however, if the query is "vague", or if the knowledge stored in the knowledge base is "vague", then the expert system may behave in a relatively erratic manner. Here "vague" may mean imprecise knowledge in the knowledge encoding process, or that the queries differ slightly from the knowledge which was encoded in the knowledge base.
On the other hand, there was another innovation which can be dated even earlier
than expert systems, or rule based systems. This is the study of artificial neural
networks. Artificial neural networks were first proposed by McCulloch and Pitts [4] in the 1940s, based on their observation that the human brain consists of many interconnected neurons. These neurons have seemingly simple mechanisms, and yet the interconnection of these seemingly simple neurons constitutes what we know as "human intelligence". Hence, if we can model these neurons sufficiently well, then by interconnecting them the resulting system will be able to infer from given facts. Indeed, Rosenblatt [5] studied such a concept in the 1950s, and provided impressive examples to show that such a system is capable of mimicking human thought. However, Minsky and Papert, in their famous book Perceptrons [3] in the late 1960s, showed that such a network of simple interconnected artificial neurons cannot even solve the simple exclusive OR problem. This dampened the enthusiasm in this area for a considerable time.
In the meantime, Zadeh in the 1960s proposed a concept called the fuzzy set [17]. He argued that humans do not reason using numerical values; instead, we reason using "categories" which are not based on numerical values. For example, we may say "today is hot". The "hotness" depends on the person's perception. Thus, to someone living in the tropics, "hotness" may mean 40 degrees C, while to someone living in more temperate zones, "hotness" may mean 20 degrees C. We then reason using such concepts, as in: "if it is hot and if it rains heavily, then I do not play golf". Here, the concepts of "hotness" and "raininess" are both "fuzzy"; in other words, different people may have different conceptualisations of these qualities. Thus, the main feature of a "fuzzy" set is that it is capable of providing information in the continuum between 0 and 1, or between "black" and "white", etc.
This concept of a "fuzzy" set soon found its way into rule based systems, in what are called "Fuzzy Rule Based Systems" [19]. The concept was greatly promoted when Japanese companies revealed that some of their most advanced railway systems were controlled by "fuzzy control" systems [18]. This generated much enthusiasm among researchers exploring the concept of fuzzy systems. However, it was soon discovered that for a fuzzy control system to work properly, one needs to find an appropriate fuzzy membership function [12]. The effort of designing an appropriate fuzzy membership function is comparable to the effort of knowledge acquisition: in both cases, someone needs to discuss the concepts with the domain experts, solicit the knowledge, and then represent such knowledge either in fuzzy membership functions or in the knowledge base.
In the meantime, there was a revival of studies in artificial neural networks through the seminal work of Rumelhart, McClelland, and Hinton [13] in 1986, which provided what they called the backpropagation training algorithm for the training of a multilayer perceptron. This opened up a new era of artificial neural network studies, as it was shown that such multilayer perceptrons can solve nonlinear problems, can learn from a set of training examples, and are capable of generalising to unseen examples.
Thus, it is observed from this brief description that there were two main streams
of thought in mimicking human reasoning by machines:
• Rule based systems. These are systems based on a set of rules. Often the rules
take on the following form: “If expression 1 is true, and expression 2 is true,
..., and expression N is true, then consequence". These rules are found by humans, and may be based on human experience mined by knowledge engineers in the process of knowledge acquisition. The rules express our understanding of the way that the system works. The rules may be expressed using fuzzy membership functions, in what are called fuzzy rule based systems, or expressed using Bayesian reasoning, which assigns a conditional probability to each piece of evidence and each hypothesis and reasons using Bayes' theorem, in what is commonly called a Bayesian rule based system. It requires a significant amount of work to elucidate the rules; however, once the rules are obtained, they provide a way in which the human operator can understand the operation of the system, and the operator may be able to provide more rules as time progresses, as more and more operating experience is accumulated.
• Artificial neural networks. These are systems which intend to mimic human reasoning by providing a set of interconnected artificial neurons. These artificial neurons are relatively simple devices; the main inference power of the artificial neural network comes from the way in which the artificial neurons are interconnected. Once the architecture (the way the neurons are interconnected) is decided, the weights (synaptic weights) can be learned by presenting a set of examples to the artificial neural network. Once the weights are trained, the artificial neural network can be used to generalise (infer) to unseen examples. The issue here is that humans cannot easily understand the way in which such a system works, as it is quite difficult to extract human-understandable rules from such a trained network. However, such networks have been shown to have good potential. For example, one particular configuration of such neurons is called the multilayer perceptron [13]. The multilayer perceptron has been shown to be a universal approximator, i.e., it is capable of approximating any given nonlinear function arbitrarily closely, provided that there is a sufficient number of hidden layer neurons [14].
Thus it is observed that there is considerable tension between the study of artificial neural networks and that of rule based systems, in their basic premises and their applications. Both methods have a considerable following, and both claim to provide good mimicking of human reasoning capabilities.
An innovative idea was engendered in the 1990s, when researchers asked the obvious question: is it possible to combine the best capabilities of rule based systems and artificial neural networks to provide machines with human-like reasoning capabilities? In other words, it would be ideal to have a reasoning system which allows us to train the parameters based on a set of training examples, and to use rule based reasoning to infer on user queries, so that the human operator may understand the operation of the system. In other words, we wish to have a system which takes away the difficulties related to crafting or acquiring the knowledge from experts, by relying instead on a set of training examples, but whose actions are transparent enough for the human operator to understand. One such system is called the neuro-fuzzy system.
1.2 Neuro-fuzzy systems
As indicated in the previous section, neuro-fuzzy systems attempt to combine the
best features of an artificial neural network and a fuzzy rule based system. Before
we can appreciate what the neuro-fuzzy system proposes, we will need to provide sufficient details¹ about a fuzzy inference system.
¹ The concept of a neuro-fuzzy system will be discussed in more detail in the next chapter. In this chapter, sufficient details will be provided so as to allow us to discuss the concepts, and to understand the background to this thesis.
A fuzzy inference system is an inference system which is based on fuzzy concepts
in representing its underlying variables, and it will make its inference based on such
fuzzy variables. Thus, a fuzzy inference system consists of the following components:
• A fuzzy membership section. In this component, the knowledge or experience
is represented by a set of fuzzy membership functions. Such membership func-
tions may take a number of typical forms, e.g., triangular function, trapezoidal
function, or Gaussian function. Each of these candidate membership functions
will have a number of parameters. These parameters can be determined by an
expert.
• Rules section. In this component the outputs from the membership functions
are combined in various ways to form rules. There are usually no parameters
associated with the rule section. There are however different ways in which
the outputs of the fuzzy membership section can be combined.
• Inference engine. Inferences are made based on the rules. The simplest inference mechanism is: the output is a linear combination of the outputs of the rule section. There are parameters associated with the inference engine. (A minimal code sketch of these three components follows this list.)
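The following minimal sketch puts the three components together in code, assuming Gaussian membership functions, a product T-norm for the rules, and a linear combination as the inference engine. It is an illustration of the generic FIS described above, not the thesis's EANFIS; all function names and parameter values are invented.

```python
from itertools import product
import numpy as np

def gaussian(x, c, s):
    """Degree of membership of x in a Gaussian fuzzy set (centre c, spread s)."""
    return np.exp(-((x - c) ** 2) / (2.0 * s ** 2))

def fis_forward(x, centres, spreads, weights):
    """Forward pass of a minimal FIS.

    x: input vector of length D.
    centres, spreads: (D, M) arrays giving M membership functions per dimension.
    weights: one output weight per rule (M**D rules in total).
    """
    D, M = centres.shape
    # Fuzzy membership section: membership degree of x[d] in each of the M sets.
    mu = gaussian(x[:, None], centres, spreads)            # shape (D, M)
    # Rule section: all combinations of one set per dimension, product T-norm.
    firing = np.array([np.prod([mu[d, idx[d]] for d in range(D)])
                       for idx in product(range(M), repeat=D)])
    # Inference engine: the output is a linear combination of the rule outputs.
    return firing @ weights

# Two inputs with two membership functions each -> 2**2 = 4 rules.
centres = np.array([[0.0, 1.0], [0.0, 1.0]])
spreads = np.full((2, 2), 0.5)
weights = np.array([0.0, 1.0, 1.0, 0.0])    # an XOR-like rule table
print(fis_forward(np.array([0.0, 1.0]), centres, spreads, weights))
```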
The way in which the features of a fuzzy rule based system and artificial neural
network are exploited can be summarised as follows:
• Since artificial neural networks can learn from a set of training examples, one way in which such capabilities can be deployed is to use some kind of learning algorithm to learn the output weights. Note that often the fuzzy membership parameters are not learned using the set of training examples, as the relationships between the outputs and the fuzzy membership function parameters are highly nonlinear. On the other hand, if a linear combination is used for the inference engine section, then there is a very simple linear relationship between the output and the linear combination weights (a least-squares sketch of this is given after this list).
• The rule section. This component combines the outputs of the fuzzy membership functions. There are no trainable weights in the rule section, apart from the choice of the ways in which the outputs of the fuzzy membership functions are combined.
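Because the output depends linearly on the combination weights, those weights can be fitted in closed form by least squares. The sketch below assumes the rule firing strengths for all training examples have been collected into a matrix (e.g. with a forward pass such as the one sketched earlier); the names `firing_matrix` and `fit_output_weights` are ours, not the thesis's.

```python
import numpy as np

def fit_output_weights(firing_matrix, targets):
    """Least-squares estimate of the linear inference-engine weights.

    firing_matrix: (N, R) array, row i holding the R rule firing strengths
                   for training example i.
    targets:       (N,) array of desired outputs.
    """
    # Solve min_w ||firing_matrix @ w - targets||^2 via the pseudo-inverse.
    w, *_ = np.linalg.lstsq(firing_matrix, targets, rcond=None)
    return w
```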
Thus provided that the parameters of the fuzzy membership functions can be
found somehow, such a neuro-fuzzy system combines the best features of both the
artificial neural networks, and a fuzzy rule based system. In general, it was found
that the neuro-fuzzy system is not too sensitive to the shape of the fuzzy membership
functions, or their parameters, as long as there is a sufficient number of them.
A particular neuro-fuzzy system is called the adaptive neuro-fuzzy inference system (ANFIS), which has been quite popular among practitioners [68, 69]. Indeed, it was shown that if the fuzzy membership function is a Gaussian function, and the inference mechanism is a simple linear combination of the outputs of the rule section, then ANFIS is similar to a multilayer perceptron with a single hidden layer whose hidden layer neuron activation function is Gaussian [70]. Obviously, if the ANFIS uses a different inference mechanism, e.g. the Takagi-Sugeno-Kang (TSK) mechanism (details concerning this method are given in the next chapter), then the ANFIS is not equivalent to a multilayer perceptron with Gaussian functions as the hidden layer neuron activation functions.
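To see this equivalence concretely, the forward pass can be written out using the symbols of the Notation section. The derivation below is our own illustration of the statement above, not a passage reproduced from the thesis; $c_{rd}$ and $\sigma_{rd}$ denote the centre and spread of the Gaussian membership function of rule $r$ in dimension $d$:

```latex
\varphi_r(\mathbf{x})
  = \prod_{d=1}^{D} \exp\!\left(-\frac{(x_d - c_{rd})^2}{2\sigma_{rd}^2}\right)
  = \exp\!\left(-\sum_{d=1}^{D} \frac{(x_d - c_{rd})^2}{2\sigma_{rd}^2}\right),
\qquad
y = \sum_{r=1}^{R} w_r\,\varphi_r(\mathbf{x}).
```

Each rule activation $\varphi_r$ is exactly a multivariate Gaussian unit, so the linear-output ANFIS coincides with a single-hidden-layer network of Gaussian neurons with output weights $w_r$.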
However, there are a number of issues which the ANFIS faces:
• The fuzzy membership function. As indicated previously, ANFIS is relatively insensitive to the shape and the number of fuzzy membership functions used. However, intellectually this is unsatisfactory, in that we somehow do not make use of all the information provided in the problem. Thus, the question is: would it be possible to model the input variables somehow, and what effect would such modelling have on the performance of the ANFIS?
• The number of rules. As no a priori information is provided to ANFIS, all possible combinations of rules need to be formed in the rule section. This creates problems if the number of inputs or the number of fuzzy membership functions is high. For example, if the number of inputs is 10, and there are 3 fuzzy membership functions per input variable, then the total number of rules that needs to be formed in the rule section is $3^{10} = 59049$, which is a large number. However, without any a priori information on the way the fuzzy membership function outputs are combined, it is difficult to see how the number of rules can be reduced. This is a common bottleneck in ANFIS, preventing its application to large scale problems which may include many input variables.
• While it is not immediately apparent, the ANFIS exploits only the local neighbourhood of the input variables. The issue is: if there is information on the global nature of the input variables, would such information improve the performance of ANFIS?
1.3 Objectives
The objectives of this thesis are
• to design a novel neuro-fuzzy system that will have the following capabilities:
– automatic determination of the structure of the neuro-fuzzy system, in terms of the number of rules required for a particular problem;
– automatic determination of the shape of the membership functions;
– the capability of taking into account both local information and global information in the structure of the input variables.
1.4 The contribution of this thesis
There are a number of contributions of this thesis.
• The thesis proposes an extension of the ANFIS in what we call an extended ANFIS (EANFIS) architecture. This architecture is based on the observation that it is possible to improve the performance of the ANFIS, in cases where the outputs are discrete variables, by incorporating some kind of logistic function², in a manner inspired by the incorporation of a logistic operation in artificial neural networks [15]. From this insight we propose an EANFIS architecture which can be used for both discrete and continuous output variables.
• A way to reduce the number of rules formed in the EANFIS a priori. In other words, we propose a novel way of determining the appropriate architecture from the input structure of the problem. In this architecture, only the rules which will be used for inference are formed; the rules which will not be used in the inference process are not formed. The proposed method is inspired by the Apriori algorithm [41] from associative rule data mining techniques, though it is not exactly the same as the popular Apriori algorithm. Thus we find a way to produce an architecture which is appropriate for a particular problem at hand, based on an analysis of the input structure of the problem. This is significant in that it allows our proposed architecture to be used for practical problems of possibly high input dimension, as the limiting factor is no longer the large number of rules which must be formed: only the set of rules necessary to the inference process is formed.
² The sigmoid function is an example of the logistic function [10].
• A way to use the input variables to form possibly non-symmetric fuzzy membership functions. In other words, instead of pre-supposing the shape of the fuzzy membership functions, in our proposed method the shape of the fuzzy membership functions is determined from the input variables. The parameters of the fuzzy membership functions are determined in the process as well. Thus, using our proposed method, the user does not need to supply the parameters of the fuzzy membership functions by hand.
• By further analysing the input structure, it is found that if the global structure
of the input variables is known, then such information can lead to improved
performance of the EANFIS. Obviously there are situations in which such
global information may not help, as the problem may only be dependent on
the local structure of the input variables. However, where such information is
useful, it is found that the combination of local and global structures of the
input can lead to better performance of the EANFIS.
• A minor contribution of the thesis is that we find a way in which a nonlinear grid decomposition can be obtained from input variables. This method allows us to find an appropriate nonlinear grid structure for a set of inputs. This can be applied to the radial basis function neural network in finding the centres and spreads of the radial basis functions; a small code sketch after this list illustrates the role of these centres and spreads.
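As a concrete illustration of where such a grid is used, the sketch below evaluates a one-dimensional RBFN whose centres come from a non-uniform grid, with each spread set from the gap to the neighbouring grid points. The grid values and the nearest-neighbour spread rule are invented for the example; the thesis's own method derives the grid from the turning points of the signal (Chapter 3).

```python
import numpy as np

def rbfn_eval(x, centres, spreads, weights):
    """Evaluate a 1-D RBFN: a weighted sum of Gaussian basis functions
    whose centres/spreads come from a (possibly non-uniform) grid."""
    phi = np.exp(-((x[:, None] - centres[None, :]) ** 2)
                 / (2.0 * spreads[None, :] ** 2))
    return phi @ weights

# A non-uniform grid: points crowd where the signal varies quickly.
centres = np.array([0.00, 0.05, 0.15, 0.40, 0.80, 1.00])
# One simple choice: spread proportional to the gap to the nearest neighbour.
gaps = np.diff(centres)
spreads = np.concatenate([[gaps[0]], np.minimum(gaps[:-1], gaps[1:]), [gaps[-1]]])
weights = np.random.default_rng(0).normal(size=centres.size)

x = np.linspace(0.0, 1.0, 5)
print(rbfn_eval(x, centres, spreads, weights))
```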
1.5 Structure of the thesis
The structure of the thesis will be as follows:
• In Chapter 2, we will provide more details about the ANFIS architecture as
a background to the thesis. In addition, we will show how the parameters
of the inference section of the ANFIS architecture can be determined using a
backpropagation type of learning algorithm.
• In Chapter 3, we will give a description of the nonlinear grid determination method. This chapter can be read as a standalone contribution.
• In Chapter 4, we will describe the proposed EANFIS architecture. We will further describe our proposed method of customising the EANFIS architecture for a particular problem, based on the input variable structure as provided in the set of training examples. We will further describe the ways in which the fuzzy membership function can be determined from the set of inputs, and how the associated parameters may be determined.
• In Chapter 5, we will describe a method for combining global and local infor-
mation concerning input variables in the EANFIS architecture. It is shown
that where appropriate such combination can improve the performance of the
EANFIS architecture.
• Chapter 6 will draw some conclusions from the work presented in this thesis,
and will provide some directions for future research.
Chapter 2
Neuro-Fuzzy Systems
In this chapter we will give some background on neuro-fuzzy systems. We will first describe a fuzzy inference system, followed by a description of a particular adaptive neuro-fuzzy system, viz. the adaptive neuro-fuzzy inference system (ANFIS). Note that we make a distinction in this thesis between a neuro-fuzzy system and a fuzzy system. A fuzzy system is one in which the membership functions are pre-assigned and the consequent part is also pre-assigned, while a neuro-fuzzy system is one in which the membership functions are unknown and need to be determined, and the parameters of the consequent part also need to be determined.
2.1 Background on Fuzzy concepts
Instead of giving a detailed background on fuzzy concepts, we direct the reader to books on fuzzy systems such as [7]. We provide only sufficient material in this chapter to understand the concepts required in this thesis.
2.2 Fuzzy rules
One of the main concepts in a fuzzy system is a rule which expresses the relationship
of entities in fuzzy terms, viz., a fuzzy rule. A fuzzy rule can be expressed as follows:
IF $x$ is $A$ THEN $y$ is $B$ $\qquad$ (2.1)
where $x$ and $y$ are linguistic variables, and $A$ and $B$ are linguistic values determined
by fuzzy sets on the universes of discourse $X$ and $Y$ respectively. The IF statement
is sometimes referred to as the "premise", while the THEN statement is sometimes
referred to as the "consequence".
A fuzzy IF-THEN rule uses natural language to express the premise and the
consequence. The premise may have more than one conditional expression joined by
a logical operator <logical OP>: "AND" or "OR".
$R_i$: IF $x_1$ is $\mu_{i1}$ <logical OP> $\cdots$ <logical OP> $x_d$ is $\mu_{id}$ THEN $y$ is $\mu_i$

where $\mu_{id}$ represents the membership function of rule $i$ in dimension $d$.
2.2.1 Reasoning with fuzzy rules
Reasoning with fuzzy rules typically involves two parts:
• Evaluation of the rule antecedent (the IF part of the rule)
• Applying the result of the evaluation of the antecedent to the consequent (the
THEN part of the rule).
In a fuzzy rule reasoning situation, when the antecedent is partially true, all rules
fire to some extent, dependent on the degree of incursion of the input into each rule's
membership functions (i.e. on the overlap among the membership functions).
2.3 Fuzzy inference System
A Fuzzy Inference System (FIS) mimics the human reasoning process by implementing
fuzzy sets and an approximate reasoning mechanism which uses numerical values
instead of logical values.
A Fuzzy Inference System (FIS) consists of three conceptual components:
• a rule base containing a set of fuzzy rules,
• a database which defines the membership functions used in the fuzzy rules,
and,
• a reasoning mechanism which performs the inference procedures.
Figure 2.1 gives a block diagram representation of a FIS [12].
Figure 2.1: A block diagram of a Fuzzy Inference System [12]
Fuzzy inference may be defined as the process of mapping from a given input
to an output using the theory of fuzzy sets. Fuzzy reasoning is also known as
approximate reasoning, in that the process draws a conclusion provided that the
fuzzy implication A → B is true. Fuzzy reasoning includes four steps [12]:
• Fuzzification: find the degree of similarity between the input and each membership function.

• Inference: combine the degrees of similarity from different membership functions using fuzzy AND or OR operators to form the firing strength of each rule.

• Aggregation: apply the firing strengths to the consequent membership functions to generate qualified consequent membership functions.

• Defuzzification: aggregate all the qualified consequent membership functions to produce a crisp output.
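To make the first two steps concrete, the following minimal sketch (in Python; the Gaussian membership shape, its parameters and the product T-norm are illustrative assumptions, not choices prescribed in this thesis) shows the fuzzification and inference steps for a single two-input rule:

    import numpy as np

    def gauss_mf(x, c, sigma):
        # Fuzzification: degree of similarity between the crisp input x and
        # a Gaussian membership function with centre c and spread sigma.
        return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

    # Rule: IF x1 is "low" AND x2 is "high" THEN ... (illustrative values)
    x1, x2 = 0.3, 0.8
    mu_low = gauss_mf(x1, c=0.0, sigma=0.5)    # fuzzification of x1
    mu_high = gauss_mf(x2, c=1.0, sigma=0.5)   # fuzzification of x2
    firing = mu_low * mu_high                  # inference: product as fuzzy AND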
There are three types of popular fuzzy models used in various applications. The
main difference among these models is in the aggregation and defuzzification stages.
Mamdani Fuzzy Model
The Mamdani fuzzy inference system was proposed in [19, 20]. The original design
uses a min-max composition as shown in Figure 2.2. One possible variation is
shown in Figure 2.3. The difference between the original Mamdani model and
the alternative one is as follows: the original Mamdani method uses a min (T-norm) operator
for the implication operation (AND) and a max (T-conorm) operator for the
aggregation operation (OR), while the alternative uses the algebraic product
($\mu_a\mu_b$) for the T-norm operation and the probabilistic sum ($\mu_a + \mu_b - \mu_a\mu_b$) for the T-conorm operation.
Figures 2.2 and 2.3 respectively show two crisp values (x, y), the input variables
to two fuzzy rules, and the degrees of similarity obtained as a result. The rule used is:
($R_i$: IF $x$ is $\mu_{i1}$ AND $y$ is $\mu_{i2}$ THEN $z$ is $\mu_{ic}$). The inference process combines the firing
strengths of the rules. The aggregator generates a qualified consequent membership
Figure 2.2: Mamdani fuzzy inference system using min and max operators [12].
Figure 2.3: An alternative Mamdani fuzzy inference system using product and max operators [12].
function by using these firing strengths. The defuzzification process aggregates all
qualified consequent membership functions with a max operation and extracts a
crisp output value from the fuzzy set. Generally, there are seven commonly used
methods to extract a crisp output value from a fuzzy set [30]. The most commonly
used method is the Centroid of Area (COA) technique, shown in Equation 2.2;
this is analogous to computing the expected value of a probability distribution [12].
$$Z_{COA} = \frac{\int_z \mu_{C'}(z)\, z\, dz}{\int_z \mu_{C'}(z)\, dz} \qquad (2.2)$$

where $\mu_{C'}$ is the aggregated output membership function.
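As a concrete illustration of Equation (2.2), the centroid may be computed numerically from a sampled aggregated membership function; the discretisation and the example fuzzy set below are assumptions for illustration only:

    import numpy as np

    z = np.linspace(0.0, 10.0, 1001)                    # sampled output universe
    mu = np.maximum(0.0, 1.0 - np.abs(z - 4.0) / 2.0)   # an example aggregated fuzzy set
    z_coa = np.sum(mu * z) / np.sum(mu)                 # discrete COA; here approx. 4.0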
Tsukamoto Fuzzy Model
In the Tsukamoto Fuzzy Model (Figure 2.4), the consequent of each fuzzy rule is
a monotonic membership function. The output of each fuzzy rule is a crisp value
induced by a firing strength. The overall output is the weighted average of each rule’s
output. Because the output of each fuzzy rule is a crisp value, the monotonic
membership function avoids a time-consuming defuzzification process [12].
Sugeno (TSK) Fuzzy Model
In the Sugeno Fuzzy Model, alternatively referred to as the Takagi-Sugeno-Kang
(TSK) model, the consequence membership function is replaced by a polynomial
equation as shown in Figure 2.5 [12]. The output of each rule is a linear combination
Figure 2.4: Tsukamoto fuzzy inference system [12].
of input variables plus a bias, as shown in Equation 2.3. The weights $\alpha_{ri}$, $i =
1, 2, \ldots, d$, for a particular rule $r$ ($r$ is the rule number) can be adjusted by a
steepest descent algorithm [32] or other similar methods. The overall output is the
weighted average of each rule's output.

$$z_r = \alpha_{r0} + \alpha_{r1} x_1 + \cdots + \alpha_{rd} x_d \qquad (2.3)$$

where the $\alpha_{ri}$ are the weights and $\alpha_{r0}$ is the bias of rule $r$.
The TSK model allows a polynomial inference mechanism. In the TSK model [21],
the linguistic value in the consequence is replaced by a functional unit with inputs
taken from the premise. Usually, the unit performs a first order polynomial
operation on the inputs:
Figure 2.5: Sugeno fuzzy inference system [12].
$R_i$: IF $x_1$ is $\mu_{i1}$ <logical OP> $\cdots$ <logical OP> $x_d$ is $\mu_{id}$ THEN $y_i = f_i(x_1, \ldots, x_d)$

where $f_i$ is a first order polynomial.
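A minimal sketch of the TSK output mechanism follows; the rule weights and firing strengths below are illustrative assumptions. Each rule output follows Equation (2.3), and the overall output is the firing-strength weighted average of the rule outputs:

    import numpy as np

    def tsk_output(x, alphas, firing):
        # x: (d,) input; alphas: (R, d+1) rows [alpha_r0, alpha_r1 .. alpha_rd];
        # firing: (R,) rule firing strengths.
        z = alphas[:, 0] + alphas[:, 1:] @ x          # z_r for every rule (Eq. 2.3)
        return np.sum(firing * z) / np.sum(firing)    # weighted average output

    y = tsk_output(np.array([0.3, 0.8]),
                   np.array([[0.1, 1.0, -0.5],
                             [0.4, 0.2, 0.3]]),
                   np.array([0.7, 0.2]))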
Depending on the reasoning method, whether it is a Mamdani, a Tsukamoto,
or a TSK model, there are a number of parameters which need to be determined.
These parameters can be determined by either experts or through a learning process
from the training data set. If they are determined by experts then this is known as
a fuzzy system. On the other hand if they are determined using a learning process,
then it is known as a neuro-fuzzy system.
2.4 Neuro-Fuzzy Inference System
The neuro-fuzzy inference system lies between a fuzzy inference system and an adaptive
neuro-fuzzy inference system. It consists of five feed-forward layers [23].
Figure 2.6: Architecture of a Neuro Fuzzy System.
• Layer 1 (Input Layer): In this layer the input x is passed to the corresponding membership functions in the next layer.

• Layer 2 (Matching): In this layer each node is a membership function. It calculates the similarity between the input and the membership function. The membership function can be a logistic function as shown below.

$$\varphi_{d,c}(x) = \frac{2}{1 + \exp\left(-a\,(x - c)^2\right)} \qquad (2.4)$$

where $c$ is the function center.
• Layer 3 (MIN): This layer performs a fuzzy AND operation.

$$\phi_r = \min\left(\varphi_{1,c}(x),\, \varphi_{2,c}(x),\, \ldots,\, \varphi_{d,c}(x)\right) \qquad (2.5)$$
• Layer 4 (MAX): This layer performs a fuzzy OR operation to integrate the rules which have the same consequence.

$$\bar{\phi}_t = \max\left(\phi_1,\, \phi_2,\, \ldots,\, \phi_t\right) \qquad (2.6)$$
• Layer 5 (Defuzzification): This layer aggregates the outputs from the different rules.

$$y = \frac{\sum_t \bar{\phi}_t\, w_t\, \mu_t(\bar{\phi}_t)}{\sum_t \bar{\phi}_t} \qquad (2.7)$$

where $w_t$ and $\mu_t(\bar{\phi}_t)$ are the weights and output membership functions respectively.
The weights and the parameters in the membership function can be trained
by a back-propagation type of algorithm.
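The following sketch traces one forward pass through these five layers, using the matching function of Equation (2.4) and the operations of Equations (2.5)-(2.7); the rule index structure, the grouping of rules by consequence, and all parameter values are illustrative assumptions:

    import numpy as np

    def membership(x, a, c):
        # Layer 2 matching function (Equation 2.4); c is the centre.
        return 2.0 / (1.0 + np.exp(-a * (x - c) ** 2))

    def forward(x, centres, a, rule_index, groups, w, mu):
        # Layers 1-2: each input dimension meets its membership functions.
        phi = [[membership(x[d], a, c) for c in centres[d]] for d in range(len(x))]
        # Layer 3 (MIN, Eq. 2.5): each rule selects one membership per dimension.
        rules = [min(phi[d][g[d]] for d in range(len(x))) for g in rule_index]
        # Layer 4 (MAX, Eq. 2.6): combine rules sharing the same consequence.
        phi_bar = [max(rules[i] for i in grp) for grp in groups]
        # Layer 5 (Eq. 2.7): mu[t] stands in for the output membership value.
        return sum(p * wt * mt for p, wt, mt in zip(phi_bar, w, mu)) / sum(phi_bar)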
2.5 Adaptive Neuro Fuzzy Inference System (ANFIS)
A neuro-fuzzy system is functionally equivalent to a fuzzy inference system (FIS). A
FIS requires a domain expert to define the membership functions and to determine
the associated parameters both in the membership functions, and the reasoning sec-
tion. However, there is no standard for the knowledge acquisition process and thus
the results may be different if a different knowledge engineer is at work in acquir-
ing the knowledge from experts. A neuro-fuzzy system can replace the knowledge
acquisition process by humans using a training process with a set of input-output
training data set. Thus, instead of depending on human experts, the neuro-fuzzy
system will determine its associated parameters through
a training process, by minimising an error criterion. A popular neuro-fuzzy system
is called an adaptive neuro-fuzzy inference system (ANFIS) [22]. It consists of five
feed-forward layers as shown in Figure 2.7. The ANFIS is functionally equivalent
to the Sugeno (TSK) Fuzzy Model presented in Section 2.3. It can also express its
knowledge in the IF-THEN rule format as follows [12]:
Rule1 : IF x1 is μ11 and x2 is μ21 then y1 = α10 + α11x1 + α12x2
Rule2 : IF x1 is μ11 and x2 is μ22 then y2 = α20 + α21x1 + α22x2
Rule3 : IF x1 is μ12 and x2 is μ21 then y3 = α30 + α31x1 + α32x2
Rule4 : IF x1 is μ12 and x2 is μ22 then y4 = α40 + α41x1 + α42x2
2.5.1 A Feed-Forward Network
In this section, the basics of the ANFIS architecture are briefly described. This
will form the background to the proposed extended ANFIS architecture. An ANFIS
architecture in general consists of three sections:
Figure 2.7: Architecture of ANFIS.
• The input section – in this section, the input variables are modeled by using
fuzzy membership functions. There are many possible candidate fuzzy mem-
bership functions, e.g. triangular, trapezoidal, or Gaussian functions. We will
call this collection of possible membership functions the set of candidate
membership functions. A membership function is chosen by the user from this
set of candidate membership functions.
• The rules section – in this section, rules are formed from the set of membership
functions. Since there is no a priori reason to exclude any possible combination
of membership functions, all possible combinations of the membership
functions are formed.
• The output section – in this section, the outputs are formed by a combination
of the outputs of the rules. There are a number of possibilities. However, in
practice there are two popular choices:

(1) a zeroth order output function, and

(2) the Takagi-Sugeno-Kang (TSK) output function.

The TSK mechanism allows a direct influence by the input on the output,
while the zeroth order output function consists of a linear combination of the
internal signals of the ANFIS architecture as the output.
We will defer the architectural details until Chapter 4, where we will give a
description of each section. For other varieties of neuro-fuzzy systems, please
refer to [73].
The ANFIS architecture, while popular, is known to have a number of shortcomings.
These include: the lack of scalability to a large number of input variables;
the almost ad hoc manner in which a membership function is chosen; and the number
of membership functions required for each input variable. Fortunately there is some
empirical evidence that the prediction of the output is largely insensitive to the
choice of membership function type. There is also some empirical evidence that
as long as a sufficient number of membership functions is chosen for each input
variable, the architecture appears to be able to handle the modeling of the input
variables. However, there is no remedy for the number of rules needed by ANFIS.
Indeed, as the ANFIS architecture does not make any a priori assumption on the
rule structure, it is required to form all possible combinations of the input
variables. Hence, if the number of input variables is high, then it is quite difficult
to implement the architecture in practice.
In Chapter 4, we will provide an extension of the ANFIS architecture which
alleviates these deficiencies. We claim that the EANFIS architecture
overcomes some of these issues sufficiently well that the neuro-fuzzy architecture, of
which ANFIS is one variant, can be recommended for practical applications.
2.5.2 Network Training
The adaptive neuro-fuzzy network can be trained using a steepest descent algo-
rithm as shown in Equation 2.8. The detailed derivation of the training algorithm
will be shown in Appendix A.2. Alternatively we may use a normal equation [33]
formulation if the parameter dependence is linear 1.
$$\alpha_{rd}^{new} = \alpha_{rd}^{old} + \eta\, e\, \phi_r\, x_d \qquad (2.8)$$
where η is a learning constant, e is the error between the desired output and the
output of the system, φr is the set of internal states of the system, while xd is the
input. For further details please see Section A.2.
1There are a number of equivalent formulations in this case. The problem can be solved in
one shot by solving the normal equation. Alternatively it can be solved in a recursive fashion
using a recursive least squares training technique.
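A minimal sketch of one steepest descent update step, following Equation (2.8), is given below; the error e and the internal states φ are assumed to be available from a forward pass, and the function name is our own:

    def update_alpha(alpha, eta, e, phi, x):
        # One steepest descent step of Equation (2.8) on the consequent
        # weights: alpha_rd <- alpha_rd + eta * e * phi_r * x_d.
        for r in range(len(phi)):
            for d in range(len(x)):
                alpha[r][d] += eta * e * phi[r] * x[d]
        return alpha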
2.5.3 Network Pruning
The membership functions in ANFIS should cover the entire input space. However, if the
model includes as many rules as possible, it is the number of membership functions
per dimension which determines the number of rules. Because the
input data is usually not distributed uniformly in the input space, not every rule in
the network is useful or "fired". A pruning process can be used to shrink the set of
generated rules according to their significance (in terms of firing strength). A survey
of various pruning algorithms is provided in [35]. An orthogonal transformation
method proposed in [36] can determine the less important nodes, i.e. those
which have little or no firing strength, which can hence be eliminated.
An orthogonal transformation method for Network Pruning
This method is proposed by Partha and Debarag in [36]. It uses a singular value
decomposition (SVD) method [37] and a QR [38] decomposition method with column
pivoting factorization (QRcp) for the transformation. The SVD mainly serves as a
null space detector, and the QRcp coupled with the SVD is used for subset selection [36],
which can identify the less important nodes in the network.
The SVD is given in Equation 2.9, where $\phi \in \mathbb{R}^{I \times R}$, $i = 1, 2, \ldots, I$, $r = 1, 2, \ldots, R$, is
the normalized firing strength matrix coming from the RBFN or ANFIS, $U \in \mathbb{R}^{I \times I}$ and
$V \in \mathbb{R}^{R \times R}$ are the left and right orthogonal matrices respectively, with the properties
$U^T U = I$ and $V^T V = I$, and $\Sigma \in \mathbb{R}^{I \times R} = \mathrm{diag}[\sigma_1, \ldots, \sigma_R]$ is a pseudodiagonal matrix
in which the singular values are sorted in descending order, $\sigma_1 \ge \ldots \ge \sigma_R$. The
singular values in $\Sigma$ are the square roots of the eigenvalues of $\phi\phi^T$ or $\phi^T\phi$, and they reflect
the number of important nodes in $\phi$.

$$\phi = U \Sigma V^T \qquad (2.9)$$
The QR factorization is given in Equation 2.10. Let $V_q \in \mathbb{R}^{R \times q}$ constitute the
first $q$ columns of $V$ from Equation 2.9, where $q$ is the number of important columns
of $\Sigma$. $P \in \mathbb{R}^{R \times R}$ is a permutation matrix, $Q \in \mathbb{R}^{q \times q}$ is an orthogonal matrix and
$R \in \mathbb{R}^{q \times R}$ is an upper triangular matrix.

$$V_q^T P = Q R \qquad (2.10)$$

The matrix $F \in \mathbb{R}^{I \times q}$ in Equation 2.11 is the constructed normalized output firing
strength matrix, constituted of the $q$ most important columns of the input normalized firing
strength matrix $\phi$; $P_q \in \mathbb{R}^{R \times q}$ consists of the first $q$ columns of the permutation matrix $P$.

$$F = \phi P_q \qquad (2.11)$$
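A minimal sketch of this subset selection, using standard SVD and QR-with-column-pivoting routines, is given below; the interpretation of the pivot order as a rule ranking follows [36], while the function name and interface are our own:

    import numpy as np
    from scipy.linalg import qr

    def select_important_rules(phi, q):
        # phi: (I, R) normalized firing strength matrix; q: rules to keep.
        # SVD: singular values reveal the effective number of useful nodes.
        U, s, Vt = np.linalg.svd(phi, full_matrices=False)
        Vq = Vt[:q, :]                       # V_q^T of Equation (2.10), shape (q, R)
        # QR with column pivoting: the pivot order ranks the R rule columns.
        _, _, piv = qr(Vq, pivoting=True)
        keep = piv[:q]                       # indices of the q most important rules
        F = phi[:, keep]                     # reduced firing matrix, Equation (2.11)
        return keep, F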
In the ANFIS system, the network includes as many rules as possible to cover the
input space. Although the pruning method can detect the less important nodes in the
network, this approach is not recommended: when the network is fully expanded,
it requires a huge amount of memory if the network is large. Another approach is to use a
data mining method that determines the important rules before they are
generated. A possible method to achieve this will be shown in Section 4.4.
2.6 Radial Basis Function Networks (RBFN)
Radial basis function networks play a crucial role in adaptive neuro-fuzzy systems. They
incorporate an adaptive feature into the system, so that a domain expert is not required to
specify the membership functions. The membership functions used are radial basis
functions. A RBFN is a linear combination of these basis functions
(see Equation 2.12).

$$y(x) = \sum_{r=1}^{R} w_r B_r(x) + w_0 \qquad (2.12)$$

where $x \in \mathbb{R}^D$ is an input vector, $B_r(x)$, $r = 1, 2, \ldots, R$, is the basis function of the $r$-th
neuron, $w_r$ is a constant weight attached to each neuron and $w_0$ is a bias. Figure 2.8 shows
the RBFN architecture.
A typical basis function is a Gaussian function (please refer to Equation 2.13)
or a logistic function (please refer to Equation 2.14).
• Gaussian function

$$B_r(x) = \exp\left(-\frac{\|x - c_r\|^2}{2\sigma_r^2}\right) \qquad (2.13)$$

where $\|\cdot\|$ denotes the Euclidean norm, and $c_r$ and $\sigma_r$ are respectively the centre
and spread.
Figure 2.8: Architecture of RBFN.
• Logistic function

$$B_r(x) = \frac{1}{1 + \exp\left(\dfrac{\|x - c_r\|^2}{\sigma_r^2}\right)} \qquad (2.14)$$

where $c_r$, $\sigma_r$ have the same meaning as in Equation (2.13).
Generally speaking, a basis function consists of a center cr, and a spread of the
effective area σr. These parameters will need to be determined from the inputs.
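A minimal sketch of the RBFN forward pass of Equations (2.12) and (2.13) is given below, assuming the centres and spreads have already been determined; the helper name is our own:

    import numpy as np

    def rbfn_output(x, centres, spreads, w, w0):
        # x: (D,); centres: (R, D); spreads: (R,); w: (R,); w0: bias.
        # Gaussian firing strengths (Equation 2.13) ...
        B = np.exp(-np.sum((centres - x) ** 2, axis=1) / (2.0 * spreads ** 2))
        # ... combined linearly with a bias (Equation 2.12).
        return w @ B + w0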
An improved version of the RBFN uses the weighted average shown in Equation
(2.15), instead of the weighted sum of each neuron's firing strength.

$$y(x) = \frac{\sum_{r=1}^{R} w_r B_r(x)}{\sum_{r=1}^{R} B_r(x)} + w_0 \qquad (2.15)$$
The final output is obtained by a linear combination of the normalized firing
strengths of the neurons. When the overlap between two or more receptive
fields is large, the weighted average method produces a well-interpolated overall
output between the outputs of the overlapping receptive fields [12]. In this thesis
we will not use this version of the RBFNs and hence it will not be considered any
further.
The weights $w_r$ are usually adjusted by using either a steepest descent algorithm
[32] (as shown in Equation (2.16)) or a normal equation [33] (as shown in
Equation (2.17)).
• Steepest descent learning algorithm:

$$w_r^{new} = w_r^{old} + \eta\, e_i\, B_r(x_i) \qquad (2.16)$$

where $\eta$ is a learning constant, $x_i$ is the $i$-th input vector, $i = 1, 2, \ldots, I$, and $e_i$ is
the error, defined as $e_i \triangleq \delta_i - y_i$, where $\delta_i$ is the desired value of the $i$-th output $y_i$.
$B_r(x_i)$ is the basis function of the $r$-th neuron, $r = 0, 1, \ldots, R$. The detailed learning
algorithm is derived in Appendix A.1.
• Normal equation

$$w = \left(B B^T\right)^{-1} B\, \delta^T \qquad (2.17)$$

where $w \in \mathbb{R}^{R+1}$ is the weight vector, $B \in \mathbb{R}^{(R+1) \times I}$, $i = 1, 2, \ldots, I$, is the basis
function matrix which contains the firing strengths computed from the input matrix
$X \in \mathbb{R}^{I \times D}$, $d = 1, 2, \ldots, D$, and $\delta \in \mathbb{R}^{1 \times I}$ is the desired output value vector.
The normal equation can be solved in a recursive manner if desired.
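The following sketch trains the weights via the formulation of Equation (2.17); for numerical robustness it solves the associated least squares problem rather than forming the matrix inverse explicitly. The helper names are our own:

    import numpy as np

    def gaussian_basis(X, centres, spreads):
        # X: (I, D) inputs; centres: (R, D); spreads: (R,).
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)  # (I, R)
        return np.exp(-d2 / (2.0 * spreads ** 2))

    def train_weights(X, delta, centres, spreads):
        I = X.shape[0]
        # Rows of B are basis functions plus a row of ones for the bias w0,
        # matching w = (B B^T)^{-1} B delta^T in Equation (2.17).
        B = np.vstack([np.ones((1, I)), gaussian_basis(X, centres, spreads).T])
        # lstsq on B^T is numerically safer than inverting (B B^T) directly.
        w, *_ = np.linalg.lstsq(B.T, delta, rcond=None)
        return w  # w[0] is the bias w0, w[1:] the neuron weights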
In the application of the RBFN neural network, the center and the spread for
various neurons would need to be determined before the network training process.
Normally, the basis functions should cover the entire input space and be distributed
uniformly, as shown in Figure 2.9. Very often, for convenience, it is assumed that the
centres are distributed uniformly over the input space. Under certain circumstances,
a non-uniformly distributed center scheme may be used instead, as shown in Figure
2.10.
Figure 2.9: An example of the distribution of Gaussian function centers in a uniformly distributed fashion.
This non-uniformly distributed centre scheme will be considered in
Chapter 3.
Figure 2.10: An example of a non-uniformly distributed scheme of Gaussian function centers.
2.7 Nonlinear Approximation Method proposed
by Schilling et al.
Schilling et al. [39] considered both a zeroth order and a first order radial basis
function neural network (RBFN) as an approximation of continuous signals using a
raised cosine function. In Section 3.3, we will consider explicitly the case of a zeroth
order RBFN. In a zeroth order RBFN [39]:

$$y_i = w^T B(x_i) \qquad (2.18)$$

where $y_i$ is a scalar output, and $B(x_i) \in \mathbb{R}^r$ is a vector denoting the outputs of $r$
radial basis function neurons. $x_i \in \mathbb{R}^D$, $i = 1, 2, \ldots, I$, is the normalized input. A
radial basis function neuron is an artificial neuron with a radial basis function as
the activation function [12]. There are a number of possible radial basis function
activation functions. A popular basis function is the Gaussian function shown in Equa-
tion (2.19). The one recommended in [39] is a raised cosine function as shown in
Equation (2.20). In this thesis, we will use the Gaussian function
$$B_r(x_i) = \exp\left(-\frac{\|x_i - c_r\|^2}{2\sigma_r^2}\right) \qquad (2.19)$$
where $c_r$ is the centre and $\sigma_r$ is the spread. The vector $w \in \mathbb{R}^r$ denotes the constant
weights connecting the outputs of the radial basis function neurons to the output.
The weights w can be obtained using a normal equation [33] as shown in Equation
(2.17).
$$B_r(x_i) = \begin{cases} \dfrac{1 + \cos\left(\pi \|x_i - c_r\|^2\right)}{2} & \|x_i - c_r\|^2 \le 1 \\[4pt] 0 & \|x_i - c_r\|^2 > 1 \end{cases} \qquad (2.20)$$
It is assumed that there are $G_d$ non-linear grid points, obtained as in Section 3.2,
along each dimension of the input $x_i \in \mathbb{R}^D$, $i = 1, 2, \ldots, I$. The nonlinear grid
points along the $d$-th dimension are denoted by $z_{d,1} \le z_{d,2} \le \ldots \le z_{d,G}$. The mapping
function $p_d : [z_{d,1}, z_{d,G}] \to [1, G]$ is shown in Equation (2.21). By construction, the
mapping $p_d$ maps the $g$-th grid point $z_{d,g}$ into the array index $g$. The pseudocode
implementing the algorithm is shown in Figure 2.11, in which the input $X$ is mapped to
$\bar{X}$. Here $x_{d,i}$ is an element of $X \in \mathbb{R}^{D \times I}$, $d = 1, 2, \ldots, D$, $i = 1, 2, \ldots, I$;
$\bar{x}_{d,i}$ is an element of $\bar{X} \in \mathbb{R}^{D \times I}$; and $z_{d,g}$ is an element of $Z \in \mathbb{R}^{D \times G}$,
$g = 1, 2, \ldots, G$.
$$p_d(x_{d,i}) = \begin{cases} 1 + \dfrac{x_{d,i} - z_{d,1}}{z_{d,2} - z_{d,1}} & x_{d,i} < z_{d,1} \\[4pt] g + \dfrac{x_{d,i} - z_{d,g}}{z_{d,g+1} - z_{d,g}} & z_{d,g} \le x_{d,i} < z_{d,g+1} \\[4pt] G + \dfrac{x_{d,i} - z_{d,G-1}}{z_{d,G} - z_{d,G-1}} & x_{d,i} \ge z_{d,G} \end{cases} \qquad (2.21)$$
X̄ = mapping(X, Z)
for each dimension d
    for each data point i
        if (x_{d,i} < z_{d,1})
            x̄_{d,i} = 1 + (x_{d,i} − z_{d,1}) / (z_{d,2} − z_{d,1})
        elseif (x_{d,i} ≥ z_{d,G})
            x̄_{d,i} = G + (x_{d,i} − z_{d,G−1}) / (z_{d,G} − z_{d,G−1})
        else
            x̄_{d,i} = g + (x_{d,i} − z_{d,g}) / (z_{d,g+1} − z_{d,g}),  where z_{d,g} ≤ x_{d,i} < z_{d,g+1}
        end
    end
end

Figure 2.11: The pseudo-code implementation of Schilling et al.'s mapping function [39].
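A compact implementation of this mapping for one input dimension is sketched below; for simplicity the upper boundary is handled by extending the interior branch linearly, which differs from the literal third branch of Equation (2.21) only in its indexing convention at the boundary:

    import numpy as np

    def n2l_map(x, z):
        # Map values x onto fractional (1-based) array indices of the sorted
        # grid z, following Figure 2.11 / Equation (2.21).
        x = np.asarray(x, dtype=float)
        G = len(z)
        # g is the interval containing each x, clipped so values below z_1
        # or above z_G are extrapolated from the nearest interval.
        g = np.clip(np.searchsorted(z, x, side="right") - 1, 0, G - 2)
        return (g + 1) + (x - z[g]) / (z[g + 1] - z[g])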
Schilling et al. [39] assume that there already exists a way to find the nonlinear
grid points. Hence they only provide a method for transforming the nonlinear grid
points to a set of linear grid points. After the transformation from the nonlinear
grid to the linear grid, the transformed data can be put into a linear grid supported
RBFN for training purposes. The basis function centres are given by the indices $g$,
where $g \in \{1, \ldots, G_d\}$. Figure 2.12 shows a block diagram implementing Schilling et
al.'s mapping method [39]. This method allows the transformation from a nonlinear
grid to a linear grid, whose output is then used in the training of a RBFN. We will call this method
the nonlinear to linear map (N2Lmap) method.
Thus we have three possibilities:
• Use a linear grid regime. The linear grid regime will provide the centers and
spreads of the radial basis functions used in a RBFN.
• Use a nonlinear grid regime. The nonlinear grid regime will provide the centers
and spreads of the radial basis functions used in a RBFN.
• Use a mapping from the nonlinear grid to a linear grid regime using Schilling et
al.'s method [39]. In this case, the transformed linear grid provides the centers
and spreads of the radial basis functions used in a RBFN. This method is called
the N2Lmap method.
It would be interesting to investigate the performance of the RBFN using these
three methods of obtaining the centers and spreads. This will be carried out in
Section 3.3.
Figure 2.12: Block diagram of Schilling et al.'s mapping method.
In summary, there are two alternatives after the set of non-linear grid points is
obtained.
(1) Put the non-linear grid points into a non-linear grid supported RBFN, and
(2) Put the non-linear grid points into Schilling et al.'s mapping function [39], which
maps the input data $x$ to $\bar{x}$, and then use a linear grid supported RBFN.
A flowchart of the main system incorporating Schilling et al.'s method [39] and
the proposed method indicated in this chapter is shown in Figure 2.13.

Figure 2.13: A flow chart showing the implementation of a non-linear grid in a Radial Basis Function Neural Network.
2.8 Kohonen Self-Organising Map (SOM)
Kohonen’s Self-Organising Map is proposed in [40] which is known as an unsuper-
vised learning method. It does not require a desired output value in its training
process. During the training process, similar patterns would gather together. If a
new pattern which is close to a pattern that is laready stored in the network then
it is classified as the stored class.
The algorithm can be expressed as follows:
• Step 1: Select a 2D map size; each node (at the intersections of the grid in the
2D map) contains a random feature vector. The feature vector has the same
length as the input vector and is initialized with values lying between 0 and
1.
• Step 2: Apply an input vector to each node to find a node which has the
smallest Euclidean distance; this node is the winner node.

$$\min_j \|X_i - W_j\| = \min_j \sqrt{\sum_{d=1}^{D} (x_{d,i} - w_{d,j})^2} \qquad (2.22)$$

where $d = 1, 2, \ldots, D$ indexes the input dimensions, $i = 1, 2, \ldots, I$ indexes the
input vectors, and $j = 1, 2, \ldots, J$ indexes the neurons in the map.
• Step 3: Update the feature vector of the winner node and those of the neurons in its
immediate neighbourhood.

$$w_{dj}^{new} = w_{dj}^{old} + \Delta w_{dj} \qquad (2.23)$$

$$\Delta w_{dj} = \begin{cases} \eta\,(x_d - w_{dj}) & j \in \Lambda_j \\ 0 & j \notin \Lambda_j \end{cases} \qquad (2.24)$$

where $\Lambda_j$ is a neighborhood function which usually takes the form of a Mexican
hat shape around the winner node $j$, as shown in Figure 2.14.
Figure 2.14: SOM Mexican hat update function
Steps 2 and 3 are repeated until all input vectors have been fed into the network. The
training process should run until the network has converged.
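A minimal sketch of the SOM training loop follows; for brevity a square neighbourhood of fixed radius is used in place of the Mexican hat function, and all parameter values and names are illustrative assumptions:

    import numpy as np

    def train_som(X, map_h=8, map_w=8, eta=0.1, epochs=20, radius=1):
        I, D = X.shape
        W = np.random.rand(map_h, map_w, D)      # Step 1: random features in [0, 1]
        rows, cols = np.indices((map_h, map_w))
        for _ in range(epochs):
            for x in X:
                # Step 2: winner node = smallest Euclidean distance (Eq. 2.22)
                d2 = ((W - x) ** 2).sum(axis=2)
                r, c = np.unravel_index(np.argmin(d2), d2.shape)
                # Step 3: update winner and its neighbourhood (Eqs. 2.23-2.24)
                hood = (np.abs(rows - r) <= radius) & (np.abs(cols - c) <= radius)
                W[hood] += eta * (x - W[hood])
        return W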
Chapter 3
Non-uniform Grid Construction in
a Radial Basis Function Neural
Network
3.1 Motivation
Basically, a neuro-fuzzy system employs a radial basis function network (RBFN)
for the premise and the Takagi-Sugeno-Kang (TSK) method for the consequent in
a fuzzy system. Improving the performance of the RBFN is thus equivalent to "tuning up"
the performance of the neuro-fuzzy system. In an adaptive neuro-fuzzy inference
system (ANFIS) the basis functions should span the entire input space. The normal
implicit assumption is that the input is distributed uniformly over the entire input
space. However, in general, the input data may not be distributed uniformly in the
input space. A sparse or flat area in the input space requires fewer neurons
to represent it adequately. On the contrary, a dense or "bumpy" area
requires more neurons. Experiments carried out in this thesis (in
Section 3.3) show that using a non-uniform grid distribution (this will be alterna-
tively referred to as a “non-linear grid”) outperforms one which uses a linear grid.
Such a method for obtaining the nonlinear grid will complement the method proposed
by Schilling et al. [39]. In [39], they proposed a method for transforming a nonlinear
grid to a linear grid. However, their method assumes that the nonlinear grid
is obtained by a domain expert; they did not provide any explicit method for the
determination of the non-uniform grid from the inputs. Our proposed method will
fill this gap. The following sections provide a detailed discussion of our proposed
method for generating a non-linear grid. The experimental results will be shown in
Section 3.3 which will demonstrate the potential of the proposed method.
3.2 Method of obtaining non-linear grid points
Given a signal, x ∈ RD, the problem is to find a set of nonlinear grid points which
will “adequately” approximate the signal. Here “adequately” means the error with
respect to an error criterion is small, or smaller than a prescribed threshold. The
elements of $x$ are denoted by $x_{d,i}$; $d = 1, 2, \ldots, D$; $i = 1, 2, \ldots, I_d$. In other
words, we allow the $D$-dimensional signal to have a different number of points in each
dimension. Generally $I_d = I$, i.e. each dimension will have the same number of
points.
The reasons why we wish to find a set of nonlinear grid points can be stated as
follows:
• As a preprocessing method for the method proposed by Schilling et al. [39].
• As a standalone method which will provide an approximate signal for a given
signal.
The method of obtaining a set of non-linear grid points involves two steps.
1. Find the turning points of the given signal, and
2. Find a set of non-linear grid points by non-uniform sampling of the given
signal.
The first step in the method involves finding the set of turning points in the
given signal. A turning point is defined as the point where the gradient of the
signal changes. There are a number of ways in which a change in the gradient of a
signal can be detected. For example, we can compute the approximate change in the
gradient of the signal and test $|(x_{d,i} - x_{d,i-1}) - (x_{d,i+1} - x_{d,i})| > \tau$, where $\tau$ is a given
threshold and xd,i is the ith value of the d-th dimension of the given signal. If the
input data is noisy then the value of τ should be higher. Otherwise the unwanted
noise will be included. On the other hand, if the signal is not noisy, then the value
of τ would be lower. In general, the value of τ is determined by a trial and error
method 1. The turning point set is the set of points at which the change in gradient
is larger than the threshold. The intermediate points, i.e. the points which are
not part of the set of turning points, will not be used in the second step. The turning
point sampling algorithm is shown in Figure 3.1; it maps $X \to \bar{X}$, where the
elements of $X$ are denoted by $x_{d,i}$, $d = 1, 2, \ldots, D$, $i = 1, 2, \ldots, I$, and the elements
of $\bar{X}$ are denoted by $\bar{x}_{d,j}$, $d = 1, 2, \ldots, D$, $j = 1, 2, \ldots, J_d$.
X̄ = turning_point_sampling(X)
for each input dimension d
    j = 1
    for each data point i
        if |(x_{d,i} − x_{d,i−1}) − (x_{d,i+1} − x_{d,i})| > τ
            x̄_{d,j} = x_{d,i}
            j = j + 1
        end
    end
end

Figure 3.1: A pseudocode representation of the proposed turning point detection algorithm.
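For one signal dimension, the turning point detection can be sketched as follows; the second difference computed by the code corresponds to the gradient-change test above:

    import numpy as np

    def turning_points(x, tau):
        x = np.asarray(x, dtype=float)
        # |(x[i] - x[i-1]) - (x[i+1] - x[i])| for the interior points
        change = np.abs(np.diff(x, 2))
        idx = np.where(change > tau)[0] + 1   # +1: second difference starts at i = 1
        return x[idx], idx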
Usually, the grid points are uniformly distributed in the $d$-th dimension as shown
in Figure 3.2, with $z_{d,1} = x_{\min_d}$ and $z_{d,G} = x_{\max_d}$, where $x_{\min_d}$ and $x_{\max_d}$ are respectively
the minimum and maximum values of the input signal in the $d$-th dimension. In
1Note that the issue of determination of the threshold value of τ is related to the prevention of
information loss from the signal reconstruction process as explained in this chapter. If a large value
of τ is chosen, then this may lead to fewer grid points being chosen, thus may lead to information
loss in the reconstruction of the signal. On the other hand, if a smaller value of τ is chosen, this
may lead to noise being allowed to pass through to the signal reconstruction. As the value of τ is
found by a trial and error method, in general the issue of information loss will depend on the
judgement of the user.
other words, $z_{d,g} - z_{d,g-1} = \Delta$, where $g = 2, 3, \ldots, G$, and $\Delta$ is a constant [1].

Figure 3.2: Uniform grid point distribution in the d-th dimension of a given signal.
However, intuitively, it can be surmised that there may be two practical situa-
tions:
(1) where the given signal varies rapidly over a certain region, it makes sense to
allocate more grid points over the rapidly varying region, and
(2) in a region where the signal varies slowly, fewer grid points are required.
In this case, it will be advantageous to use a non-constant value of Δ.
In the second step, Equation (3.1) is used to determine the non-uniform grid
points. Let the values of the nominal grid in the $d$-th dimension be denoted by the vector $z_d$,
a row of the matrix $Z \in \mathbb{R}^{D \times G}$ which stores the values of the nominal grid; it is
assumed that there are $G_d$ grid points in the $d$-th dimension, with $G = \max_d G_d$.
Consider an element of zd, denoted by zd,g, d = 1, 2, . . . , D and g = 1, 2, . . . , Gd.
Initially, the points zd, d = 1, 2, . . . , D are assumed to be uniformly distributed on
the zd axis. We wish to have an algorithm which will adjust the grid points such
that the original signal is approximated. There are a number of possibilities. For
example, one may use a simple algorithm which approximates the gradient of the
underlying signal, as represented by the set of turning points. However, such an
algorithm may not be optimal, as when the gradients change rapidly, or when the
value of τ is set too small, it may result in too many grid points. In this thesis we
propose an updating algorithm which is inspired by the updating rule of the
self-organising map algorithm [40]. The updating equation for $z_{d,g}$ is given in Equation
(3.1) and a pseudo-code representation of the proposed updating algorithm is shown
in Figure 3.3.
$$z_{d,g}^{new} = z_{d,g}^{old} - \eta\, \xi_{d,g}(\bar{x}_{d,j};\, a, b, c)\, e_{d,g} \qquad (3.1)$$

where $\bar{x}_{d,j}$ is an element of $\bar{X} \in \mathbb{R}^{D \times J_d}$, $d = 1, 2, \ldots, D$, $j = 1, 2, \ldots, J_d$, the
set of turning points obtained in Step 1; $\eta$ is a learning constant, usually selected
such that $0 \le \eta \le 1$; and $\xi(x; a, b, c)$ is a triangular function given as follows [12]:
$$\xi(x; a, b, c) = \max\left(\min\left(\frac{x - a}{b - a},\; \frac{c - x}{c - b}\right),\; 0\right) \qquad (3.2)$$
The triangular function shown in Figure 3.4 has a height of 1 at the point $b$.
The base of the triangle is located at the points $a$ and $c$ respectively. By allowing $b$
to be different from $\frac{1}{2}(a + c)$, we allow for a non-symmetrical triangular function.
Z = non_uniform(X̄, Z, η, a, b, c)
for each iteration t
    for each input dimension d
        for each data point j
            for each grid point g
                z_{d,g} = z_{d,g} − η ξ_{d,g}(x̄_{d,j}; a, b, c) e_{d,g}
            end
        end
    end
end

Figure 3.3: Updating algorithm for finding the set of non-linear grid points.
Figure 3.4: A diagram illustrating a triangular function on uniformly distributed grid points.
The constants $a$, $b$, $c$ are chosen a priori. The error is given by $e_{d,g} = \bar{x}_{d,j} - z_{d,g}$,
where $\bar{x}_{d,j}$ is a turning point found in Step 1 of the proposed algorithm for
the $d$-th dimension of the input, and $g = 1, 2, \ldots, G_d$. When the sum of the errors $e_{d,g}$
computed over all grid points $g = 1, 2, \ldots, G_d$ is smaller than a prescribed
threshold, or the number of iterations has reached a preset constant, the algorithm
stops. The converged values $z_{d,g}$, $d = 1, 2, \ldots, D$, $g = 1, 2, \ldots, G_d$, form the
set of nonlinear grid points in the $d$-th dimension. This set of nonlinear grid points will
approximate the original signal.
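A minimal sketch of the updating algorithm for one input dimension is given below; the symmetric triangular neighbourhood centred on each grid point and its half-width are our own assumptions, and the sign of the update is taken so that grid points move towards nearby turning points, as described above:

    import numpy as np

    def triangular(x, a, b, c):
        # Triangular neighbourhood function of Equation (3.2).
        return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

    def non_uniform_grid(x_turn, z, eta=0.05, iters=200, half_width=0.1):
        z = np.array(z, dtype=float)
        for _ in range(iters):            # fixed iteration count as the stopping rule
            for x in x_turn:              # turning points from Step 1
                for g in range(len(z)):
                    # neighbourhood (a, b, c) centred on grid point g (assumed)
                    a, b, c = z[g] - half_width, z[g], z[g] + half_width
                    # update of Equation (3.1), signed so that z[g] is
                    # attracted towards the turning point x
                    z[g] += eta * triangular(x, a, b, c) * (x - z[g])
        return np.sort(z)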
Now since the set of nonlinear grid points found using the proposed algorithm
is a set of discrete points, to approximate the original signal, it will be necessary to
use either interpolation algorithms, or a set of radial basis functions to interpolate
the signal over these nonlinear grid points. In this thesis we will only consider
the radial basis function method.
There are two parameters associated with each radial basis function, viz. the
center and the spread. Once these two parameters are determined the shape of
the radial basis function is determined. There are many ways in which these two
parameters associated with the radial basis function can be determined. For exam-
ple, [64] provided an offline method which uses a clustering algorithm to cluster the
given data into clusters, and from such clusters, the centers and spreads of the set
of radial basis functions can be determined accordingly. In [1] a method is proposed
for determining the centers and spreads of a set of radial basis functions if a grid is
provided. In this case, Tsoi and Tan [1] suggested that one simple way to find the
centers and spreads is to assume that the centers are located at the intersection of
the grid points, and the spread is determined by the interval between the grid points.
This is a very simple method if a grid on the input space is provided. This scheme of
determining the centre and spread of the radial basis function, given a grid, will be
the method which we will use in this thesis. Hence once a grid is obtained then the
centres and spreads of the radial basis functions can be determined. The question
is: how do we find the grid in the first instance.
The centre and spread of the radial basis function shown in Figure 3.5 will be
denoted respectively by $z_{d,g}$ and $z_{d,g+1} - z_{d,g}$, $g = 1, 2, \ldots, G_d$, $d = 1, 2, \ldots, D$.
Figure 3.5: A diagram illustrating the determination of the centers and spreads of a non-uniformly distributed set of grid points.
This algorithm can be conceptualised as follows: if the error is small, and the
turning point lies within the range $[a, c]$, then the grid point $z_{d,g}$ is updated. The
magnitude of the update depends on the relative position of the turning point $\bar{x}_{d,j}$
with respect to the apex $b$ of the triangular function with parameters $a$, $b$, $c$:
whether $\bar{x}_{d,j}$ lies to the left or to the right of $b$, the grid point $z_{d,g}$ moves a
little closer to it, by an amount determined by the constant $\eta$. If the turning point
lies outside the region of interest $[a, c]$, then there is no update to the value of $z_{d,g}$.
Here in this exposition, we have chosen a triangular function to determine the
region of interest in the updating algorithm. It is obviously possible to use other
functions, e.g. a Gaussian function. We found from a number of preliminary
experiments that the Gaussian function performs worse than the triangular function.
Figure 3.7 shows the magnitude of the update using a Gaussian function and a
triangular function respectively. Because the magnitude of the update is the product of the
function $\xi_{d,g}$ and the error $e_{d,g}$, and noting that the Gaussian function is non-zero
at the boundary points $a$ and $c$, the Gaussian function produces a "leakage"
effect outside the region bracketed by the points $a$ and $c$. Hence, we will only
consider the deployment of $\xi$ as a triangular function, as shown in Figure 3.6.
This algorithm is in the spirit of the self organizing map (SOM) technique pro-
posed by Kohonen [40] in the sense that it updates the weights associated with the
winning node and the neighborhood nodes respectively. The SOM update equation
is shown in Equation (3.3). Here the SOM uses $\Omega_c$, a neighborhood function, to control
the update around the winning node. On the other hand, our proposed algorithm
uses a triangular function as a neighborhood function. Because the triangular neigh-
borhood function has zero values if the input is outside the region bracketed by the
points a and c, this algorithm is not required to find the winning node. Secondly,
in the SOM updating algorithm [40], the size of the neighborhood function shrinks
as the algorithm progresses. On the other hand, in the proposed algorithm, we use
a neighborhood function with a constant cover. The end result of our proposed
Figure 3.6: The magnitude of the update using triangular functions in the grid point location updating algorithm.
Figure 3.7: The magnitude of the update using Gaussian functions in the grid point location updating algorithm.
algorithm is that the grid points will move towards the turning points, thus creating
a set of non-linear grid points.
$$z_k^{new} = z_k^{old} - \eta\, \Omega_c(x_i)\, e_k \qquad (3.3)$$

where $\eta$ is a learning constant, $\Omega_c$ is a neighborhood function and $c$ denotes the winning
node, $k = 1, 2, \ldots, K$.
Once the set of non-linear grid points zd,g is obtained, we can then use the classic
radial basis function neural network architecture with the non-linear grid support as described
in Section 2.6, or use it as a preprocessing step to obtain the non-linear grid input for
Schilling et al.'s method as shown in Section 2.7, to construct the approximate
signal.
The behaviour of our proposed algorithm is governed by:
• The number of grid points. We can control the number of grid points used in
the algorithm. The number of grid points provides an indirect control on the
goodness of fit on the original signal. If we use relatively few grid points, then
it is found that only the coarse nature of the signal is approximated. On the
other hand, if we use a relatively large number of grid points, then some finer
features of the signal will be approximated. In this respect, it would be difficult
to characterize the exact nature of the approximation. We cannot, as with
wavelet functions, state the approximation capabilities in terms of the coarse
and fine features of the signal. It is also not possible to give a statement on
how well the method approximates the given signal as a function of the number
of grid points used. However, we have employed this method on a number of
practical signals, and it is found that the method works well. It is also found
that this method is able to filter out some of the noise content in the signal
by controlling the number of grid points.
• The parameters a, b, c. These parameters interact with the number of grid
points. The parameters a and c control the effective area, the parameter b
is the center. When the number of grid points along the input dimension is
large it would be advisable to increase the effective area of the neighborhood
function. On the other hand, if the number of grid points is small along the
input dimension it would be advisable to decrease the effective area covered
by the neighborhood function. Figures 3.8 and 3.9 respectively show
the magnitudes of the updates when using one or two grid points in the effective
area on either side of the center b. It is observed that with more grid
points on either side of the center, the triangular function works well.
• The learning constant η. It is a small constant or a monotonically decreasing
function. It is found that if η is relatively large, then the algorithm exhibits
an oscillatory behaviour. On the other hand if η is relatively small, it will take
considerable time for the algorithm to converge.
• The stopping criterion. There are a number of possible stopping criteria: an
accumulated error criterion, or a prescribed maximum number of iterations. In
this thesis, we opt to use a fixed maximum number of iterations.
Figure 3.8: The magnitude of the update using one grid point on either side of the function center.
Figure 3.9: The magnitude of the update using two grid points on either side of the function center b.
3.3 Application Examples
In this section, we will demonstrate the effectiveness of the proposed method using
a number of examples: one based on the van der Pol equation [71], one based on real
data: the currency exchange rate between US Dollar and Indonesian Rupiah during
the tumultuous days of Asian economic meltdown between 1997 and 1998, one based
on the famous Sunspot cycle time series, and one multidimensional example based
on the iris dataset. These examples are chosen because they represent different types
of problems, some artificially generated, e.g. the van der Pol equation, and some
practical problems, e.g. sunspot cycle time series, the currency exchange problem.
In addition, the iris dataset provides a good example of a multi-dimensional dataset.
These datasets would help us to evaluate the effectiveness of the proposed algorithm.
3.3.1 Van der Pol Oscillator
The Van der Pol Oscillator is a nonlinear oscillatory model exhibiting a limit cycle
behaviour. In this section, we will demonstrate the application of our proposed
method on the system by varying the number of grid points. The response of the van
der Pol equation is simulated using a random initial value. This response is stored
in a file which represents the input to this investigation. The implementation of the
algorithms is in Matlab. This will allow rapid evaluation of various concepts without
being bogged down in details of code development. We performed experiments which
use different values of the total number of grid points using our proposed method.
We plotted the root mean square (RMS) errors as a function of the number of grid
points as shown in Figure 3.10. The upper graph corresponds to the performance of
the linear grid regime, the middle one corresponds to that of a nonlinear grid regime
and the lowermost one corresponds to that of the N2Lmap method. It is observed
that from a RMS error’s point of view, the RBFN using the nonlinear grid and the
one using the N2Lmap both perform better than the RBFN using a linear grid.
Figure 3.10: Performance comparisons of RBFN using three different regimes: linear grid, nonlinear grid, and the transformation of the nonlinear grid to the linear grid (N2Lmap) method.
This result is interesting in that it shows that the nonlinear grid regime outperforms
a linear grid regime. In addition, it also shows that if we use a small number
of grid points, then the differences in the RMS error values are accentuated. On the other hand, if a
sufficiently large number of grid points is used, then there is hardly any difference
between the three methods. This result is hardly surprising. When a large number
of grid points is used, there will be a sufficient number of grid points in the linear grid
regime to cover the rapidly varying portion of the signal. Hence, the advantage
of the nonlinear grid regime is lost. On the other hand, if only a small number
of grid points is used, then in the linear grid regime, it can be surmised that the
rapidly varying portion of the signal does not have a sufficient number of grid points
to represent it adequately, and hence the RMS error values would be worse
than those found using the nonlinear grid counterpart. It is observed from Figure 3.10 that if
we use 40 grid points then the performances of the linear grid and nonlinear grid are
comparable. On the other hand, if we use only 15 grid points, then there is a large
difference in performance between the linear grid and non-linear grid methods.
Figure 3.11 shows the set of turning points obtained for the van der Pol equation
and Figure 3.12 shows the non-linear grid point distribution using 15 neurons while
in Figure 3.13 the actual output of the linear grid method using 15 neurons is shown.
Figure 3.15 shows the actual output of the nonlinear grid method using 15 neurons
and Figure 3.17 shows the actual output of the N2Lmap method using 15 neurons.
Figure 3.14 shows the differences in outputs of the original signal and the recon-
structed signal using 15 linear grid points.
Figure 3.16 shows the differences between the original signal and the reconstruction using a
nonlinear grid with 15 points, and Figure 3.18 shows the differences in output of
the N2Lmap method using 15 grid points.
It is observed from Figures 3.14, 3.16 and 3.18 that the linear grid method
Figure 3.11: The set of turning points superimposed on the original signal for the van der Pol equation.
Figure 3.12: The distribution of the set of grid points. The upper graph shows the distribution using a linear grid while the lower graph shows the location of the grid points using a nonlinear grid regime. The total number of grid points used is 15.
Figure 3.13: The actual output of a RBFN using 15 grid points in a linear grid regime. It is observed that the output is significantly different from that of the original output of the van der Pol equation.
Figure 3.14: The differences in the output of the van der Pol equation and the reconstructed one using a RBFN with 15 grid points in a linear grid regime.
Figure 3.15: The output of a RBFN using 15 grid points and a nonlinear grid regime.
Figure 3.16: The output differences between the original signal from the van der Pol equation and a reconstructed signal using a RBFN with 15 grid points and using a nonlinear grid regime.
Figure 3.17: The output of a RBFN with 15 grid points using a nonlinear grid regime. In this case, we use the nonlinear grid mapped onto a linear grid using the method proposed by Schilling et al. [39].
Figure 3.18: The output differences between the original signal and the reconstructed signal using a nonlinear to linear grid mapping as proposed in Schilling et al. [39]. The number of grid points used is 15.
produces the highest discrepancies between the original signal and the reconstructed
signal which means the nonlinear grid methods outperform the linear grid method.
This set of figures using 15 grid points confirms the information provided in
Figure 3.10 that the grid may be too coarse to represent the van der Pol equation
output adequately. In particular it is observed that where the signal is rapidly
changing (near the peak), the error is most pronounced. On the other hand, where
the signal is relatively flat, the error is not as pronounced. It is further observed
that there are some differences between the two nonlinear grid methods: one using
the nonlinear grid and the other one using a mapping from a nonlinear grid to a
linear grid. This result is somewhat surprising, as one would have assumed that
the nonlinear mapping as determined in Equation (2.21) is lossless. However, on
further examination of the equations, it is observed that the method is not lossless.
In other words, there is an information loss mapping from a nonlinear grid to a
linear grid. Hence the results are not surprising. What is more surprising is that
the mapping from the nonlinear grid to the linear grid appears to perform better
than the nonlinear grid method. This difference may be surmised to be caused by
the fact that the nonlinear grid is tuned to the signal, while the mapping from a
nonlinear grid to a linear grid is not tuned to the signal, and hence would have a better
generalisation capability.
To further confirm our intuition, we will repeat the same set of experiments
except this time we will use 40 grid points. The number 40 is chosen because from
Figure 3.10, it is observed that the RMS values using either the linear grid regime,
or the nonlinear grid regime are sufficiently close to one another. This implies that
we do not anticipate finding much difference in the errors when we use either
the linear grid or the nonlinear grid method.
Figure 3.19 shows the nonlinear grid distribution using 40 grid points.
Figure 3.19: The distribution of the grid points. The upper graph shows the linear grid point distribution, while the lower graph shows the distribution using a nonlinear grid. The number of grid points used is 40.
Figure 3.20 shows the actual output of the linear grid method using 40 neurons,
Figure 3.22 shows the actual output of the nonlinear grid method using 40 neurons, and
Figure 3.24 shows the actual output of the N2Lmap method using 40 neurons.
Figure 3.21 shows the differences in output from the original signal using 40 linear grid
neurons, Figure 3.23 shows the differences in output using
40 nonlinear grid neurons, and Figure 3.25 shows the differences in output of the N2Lmap method
Figure 3.20: The output of a RBFN with 40 grid points using a linear grid regime.
Figure 3.21: The output differences between the original signal and the reconstruction using a RBFN with a linear grid regime and 40 grid points.
Figure 3.22: The output of a RBFN using a nonlinear grid regime with 40 grid points.
Figure 3.23: The output differences between the original signal and the reconstruction using a RBFN with a nonlinear grid regime using 40 grid points.
using 40 neurons.
Figure 3.24: The output of a RBFN using a nonlinear grid mapped onto a linear grid with 40 grid points.
Again, the linear grid method has the largest error magnitudes, as shown in Figures
3.21, 3.23 and 3.25, which implies that the nonlinear grid methods outperform the
linear grid method.
Figure 3.25: The output differences between the original signal and a reconstructed signal using a RBFN with a nonlinear grid mapped onto a linear grid using 40 grid points.
                             Linear RBFN   Nonlinear RBFN   N2Lmap
  RMS error, 15 grid points    0.0658          0.0480       0.0430
  RMS error, 40 grid points    0.0073          0.0048       0.0040

Table 3.1: Comparison of the RMS error results for the van der Pol equation example.
It is noted that from Figures 3.13, 3.15 and 3.17, the approximation using 15 grid
points is not as good as one using a higher number of grid points, e.g., using 40 grid
points. Using more grid points produces better approximation of the original signal.
It is nevertheless observed that the nonlinear grid method outperforms the linear grid
method (please see Figure 3.10 and Table 3.1) when between 10 and 50 grid points are used.
Once the distribution of grid points is sufficiently dense to cover the input space,
then there is no significant difference between the performance of the linear grid and
the nonlinear grid.
3.3.2 Currency exchange rate between the US Dollar and
the Indonesian Rupiah
We use the data on the currency exchange rate between the US Dollar (USD) and the Indonesian Rupiah (IDR) between 01/01/1994 and 31/12/1999. The minimum was 1 USD to 2160 IDR, while the maximum was 1 USD to 16,475 IDR. The input time series is first normalized to lie between 0 (0 denotes 1 USD to 2160 IDR) and 1 (1 denotes 1 USD to 16,475 IDR). There are a total of 2191 data points in this time series. Weekend and holiday values are not included in the total number of data points, as the recorded value on these days is zero. The currency exchange time series is shown in Figure 3.26.
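The normalisation described above is an ordinary min-max scaling. A minimal sketch, assuming the raw daily rates (with weekends and holidays already removed) sit in a NumPy array named rate, a name introduced here purely for illustration:

    import numpy as np

    # Illustrative values only; the real series has 2191 daily observations.
    rate = np.array([2160.0, 2405.0, 2398.0, 9875.0, 16475.0])

    # Min-max scaling: 0 maps to the minimum (2160 IDR per USD),
    # 1 maps to the maximum (16475 IDR per USD).
    normalised = (rate - rate.min()) / (rate.max() - rate.min())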
Figure 3.26: The currency exchange time series between the US Dollar and the Indonesian Rupiah, between 1st January, 1994, and 31st December, 1999. Note that the vertical axis of this graph is normalised, with 0 denoting 1 USD to 2160 IDR, while the maximum 1 denotes 1 USD to 16,475 IDR.
We wish to apply the proposed method discussed in this chapter to study this
time series. It is observed that this time series is quite challenging to approximate.
It has a major peak around day 1700. In addition, the time series appears quite
noisy. As indicated previously, the performance of the proposed method depends on
how many grid points are used. As a result, we first compute the RMS error as a
function of the number of grid points used. The variation of the RMS error values
as a function of the number of grid points is shown in Figure 3.27. There are three
graphs in Figure 3.27: the uppermost one represents the performance of a RBFN
using a linear grid, the middle graph shows the performance of a RBFN using a
nonlinear grid but mapped to a linear grid (N2Lmap), while the lowermost graph
shows the performance of a RBFN using a nonlinear grid regime.
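Curves such as these are obtained by sweeping the number of grid points. The sketch below assumes a hypothetical training routine fit_rbfn(x, y, n_grid, regime) returning the reconstructed signal; the actual training procedure is described in the text rather than in code, so only the shape of the experiment is shown:

    import numpy as np

    def rms(a, b):
        return np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

    def sweep(x, y, fit_rbfn, grid_counts=range(20, 101, 10)):
        # Record the RMS error for each grid size and each grid regime.
        results = {"linear": [], "nonlinear": [], "n2lmap": []}
        for n in grid_counts:
            for regime in results:
                y_hat = fit_rbfn(x, y, n_grid=n, regime=regime)  # hypothetical trainer
                results[regime].append(rms(y, y_hat))
        return results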
Figure 3.27: The variation of the root mean square error values as a function of the number of grid points used.
It is observed that with 100 grid points all methods will yield good results. In
order to show the effect of the number of grid points, we will give two sets of results,
one with 25 grid points only, and the other one with 100 grid points. These two
values are chosen with the assistance of Figure 3.27. It is observed from Figure 3.27
that at 25 grid points, the performances of the three methods appear to be quite
well separated. On the other hand, at 100 grid points the performances of all three
methods appear to be approximately the same. The RMS error values for the case
of 25 grid points and 100 grid points respectively are shown in Table 3.2. Note that these results are also presented graphically in Figure 3.27; Table 3.2 gives them in numerical form.
RMS error using 25 grid points:
    Linear RBFN     Nonlinear RBFN     N2Lmap
    0.0483          0.0264             0.0380

RMS error using 100 grid points:
    Linear RBFN     Nonlinear RBFN     N2Lmap
    0.0212          0.0196             0.0186

Table 3.2: The comparison of the root mean square error values between using 25 grid points and 100 grid points for the currency exchange time series.
It is observed that with 25 grid points, the RBFN using a nonlinear grid performs best, giving the lowest RMS values, while the RBFN using a nonlinear grid mapped to a linear grid performs slightly worse. This may be due to the effect of noise: as the time series is quite noisy, mapping the nonlinear grid to a linear grid may reduce its resistance to the underlying noise. Note that this conclusion is different from that of the van der Pol equation. In the van der Pol equation, there is no noise; hence we surmise that the nonlinear grid mapped to a linear grid performs better there, as the mapping allows a better generalisation result. Here, as the underlying data is noisy, the mapping from a nonlinear grid to a linear grid does not perform as well as a RBFN using a nonlinear grid.
Figure 3.28 shows the set of turning points. Figure 3.29 shows the grid point distributions using 25 grid points; the upper graph shows the distribution of the grid points under a linear grid regime, while the lower graph shows the grid point distribution under a nonlinear grid regime.
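One simple way to locate turning points of the kind shown in Figure 3.28 is to look for sign changes in the first difference of the series; this is a sketch of the general idea, not necessarily the exact algorithm proposed earlier in this chapter:

    import numpy as np

    def turning_points(y):
        # Indices where the first difference changes sign, i.e. local maxima/minima.
        y = np.asarray(y, dtype=float)
        d = np.diff(y)
        return np.where(d[:-1] * d[1:] < 0)[0] + 1

    # Example: a single smooth bump has one turning point, at its peak.
    t = np.linspace(0.0, 1.0, 201)
    bump = np.exp(-((t - 0.5) ** 2) / 0.01)
    print(turning_points(bump))   # -> [100]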
Figure 3.28: The set of turning points in the time series of USD to IDR. Note that we have connected the points so as to make it easier to discern where the turning points are.
Figure 3.30 shows the output of a RBFN using 25 grid points with a linear grid regime, Figure 3.32 shows the output of a RBFN using 25 grid points but with a nonlinear grid regime, and Figure 3.34 shows the output of a RBFN using 25 grid points with the mapping of the nonlinear grid to a linear grid regime.

Figure 3.29: The distribution of the grid points. The upper graph shows the distribution of the linear grid points, while the lower graph shows the distribution of the nonlinear grid points.
Figure 3.30: The output of a RBFN using 25 grid points with a linear grid regime.
Figure 3.31 shows the differences in the outputs of the original signal and the reconstruction using 25 grid points with a linear grid regime, Figure 3.33 shows the differences of the outputs of the original signal and the reconstruction using a RBFN with a nonlinear grid regime, and Figure 3.35 shows the differences of the outputs of the original signal and the reconstruction using the nonlinear grid mapped to a linear grid regime. It is observed that even though the differences between the outputs of the original signal and the reconstruction using a RBFN with a mapping from the nonlinear grid to the linear grid are smaller than the counterpart using a RBFN with a nonlinear grid, nevertheless the overall cumulative root mean square error is larger than the one using a nonlinear grid.
Figure 3.31: The output differences between the original signal and the reconstruction using a RBFN with 25 grid points with a linear grid regime.
Figure 3.32: The output of a RBFN using 25 grid points with a nonlinear grid regime.
Figure 3.33: The output differences between the original signal and the reconstruction using a RBFN with 25 grid points in a nonlinear grid regime.
Figure 3.34: The output of a RBFN using 25 grid points but with a mapping from the nonlinear grid to a linear grid.
Figure 3.35: The output differences between the original signal and the reconstruction using 25 grid points with a mapping from the nonlinear grid to a linear grid.
This supports our argument that the difficulty in the use of the mapping from a nonlinear grid to a linear grid regime lies in the noise in the time series. However, it is observed from both Figures 3.32 and 3.34 that the reconstructions appear to have captured the essence underlying the original signal, including a minor narrow peak at day 1485. In contrast, the one using a linear grid cannot capture this narrow peak.
We will repeat the set of experiments, this time using a larger number of grid points, viz. 100. It is observed that with this number of grid points there is very little difference between the three methods, as they all yield similar RMS values. The main reason is that there is a sufficient number of grid points for the reconstruction of the underlying system; hence there is not much difference in their performance. Figure 3.36 shows the grid point distributions using 100 grid points. The upper graph shows the grid point distribution of a linear grid, while the lower graph shows the distribution of the nonlinear grid points.
Figure 3.36: The distribution of the grid points. The upper graph shows the distribution of the linear grid points, while the lower graph shows the distribution of the nonlinear grid points. The number of grid points used is 100.
Figure 3.37 shows the output using a RBFN with 100 grid points with a linear grid
regime, Figure 3.39 shows the output of a RBFN using a nonlinear grid regime, and
Figure 3.41 shows the output reconstruction using a RBFN with a mapping from
the nonlinear grid to a linear grid regime.
Figure 3.38 shows the output differences between the original signal and the one
reconstructed using a RBFN with a linear grid regime using 100 grid points, Figure
3.40 shows the output differences of the original signal and the one reconstructed
Figure 3.37: The output of a RBFN using 100 grid points with a linear grid regime.
Figure 3.38: The output differences between the original signal and the reconstructed one from a RBFN using 100 grid points with a linear grid regime.
Figure 3.39: The output of a RBFN using a nonlinear grid regime with 100 grid points.
Figure 3.40: The output differences between the original signal and the one reconstructed from a RBFN with a nonlinear grid regime using 100 grid points.
from a RBFN using a nonlinear grid regime, and Figure 3.42 shows the output differences of the original signal and the reconstruction using a RBFN with a mapping from the nonlinear grid to a linear grid regime.
Figure 3.41: The output of a RBFN using a mapping from the nonlinear grid to a linear grid regime using 100 grid points.
It is observed, as shown in Figures 3.30, 3.32 and 3.34, that the details exhibited in the original signal are not well approximated using only 25 grid points. However, if we use a larger number of grid points, say 100, the peaks are approximated better, as shown in Figures 3.37, 3.39 and 3.41. This implies that the number of grid points provides an indirect control on the extent to which the fine features of the original signal can be approximated. It is observed that with a finer grid, the linear grid is able to approximate the narrow peak occurring at day 1485, while using 25 grid points it was not able to capture this fine feature. With reference to Figures 3.29 and 3.36, it is observed that more grid points are allocated
Figure 3.42: The output differences between the original signal and the reconstructed one from a RBFN with a mapping from the nonlinear grid to a linear grid regime using 100 grid points.
to the “bumpy” areas, which allows the nonlinear grid methods to perform better than the linear grid method in all cases (please see Figure 3.27), especially when fewer grid points are used.
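The thesis's nonlinear grid construction is given earlier in this chapter; purely to illustrate the effect of allocating grid points to “bumpy” regions, here is one plausible quantile-based sketch (all names and the quantile scheme are our own assumptions, not the proposed algorithm):

    import numpy as np

    def nonlinear_grid(anchors, n_grid):
        # Grid points at equally spaced quantiles of the anchors: regions dense
        # in anchors (e.g. turning points) receive more grid points.
        anchors = np.sort(np.asarray(anchors, dtype=float))
        return np.quantile(anchors, np.linspace(0.0, 1.0, n_grid))

    # Example: anchors clustered near 0.2 pull most of the grid there.
    rng = np.random.default_rng(0)
    anchors = np.concatenate([rng.normal(0.2, 0.02, 80), rng.normal(0.8, 0.1, 20)])
    print(nonlinear_grid(anchors, 10))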
3.3.3 Sunspot Cycle Time Series
The sunspot cycle time series is compiled by the US National Oceanic and Atmospheric Administration (NOAA). The sunspot number has been collected daily since January 1749 at the Zurich Observatory [54]. This time series appears to exhibit a cyclic behaviour, with a cycle of approximately 11 years. Note that there has been some discussion as to whether the sunspot cycle time series “really” has a cycle of 11 years, as it appears that there may be other cycles within the time series. We will not consider this aspect in this thesis. We take the average sunspot number for each month, and the data set consists of data from January 1749 to July 2004. The sunspot data is shown in Figure 3.43.
The sunspot values have been normalised to lie in the range from 0 to 1. It is
observed that there is a considerable amount of noise in the time series, especially
around the peak values. This makes it challenging to use the proposed methods to
approximate the time series. Figure 3.44 shows the set of turning points as obtained
using the proposed algorithm.
It is noted that there are many turning points, as the time series is rather “peaky”. What this implies is that a larger number of grid points will be required in order to represent the time series more “faithfully”. This observation is backed up by the behaviour of the RMS error values as a function of the number of grid points, shown in Figure 3.45.
Figure 3.43: The monthly average sunspot number time series from January 1749 to July 2004. The x-axis is normalised to lie between 0 and 1. Similarly, the y-axis is also normalised to lie between 0 and 1.
Figure 3.44: The set of turning points for the NOAA sunspot number time series.
Figure 3.45: The variation of the RMS error values as a function of the number of grid points.
It is noted that there is a considerable difference in the RMS error values among the three methods when using 50 grid points. However, there are no significant differences among the three methods beyond 100 grid points. It is noted that the mapping of the nonlinear grid to the linear grid regime appears to perform worse than the nonlinear grid regime. This is in line with our intuition as observed in the currency exchange time series.
Figure 3.46: Grid point distribution using 50 grid points. The upper graph shows the distribution of the linear grid points, while the lower graph shows the distribution of the nonlinear grid points.
Here the difference in the performance of the mapping from the nonlinear grid to the linear grid regime, when compared with the nonlinear grid regime, may be attributed to the noise content of the time series. It may be surmised that in mapping the nonlinear grid to a linear grid, the generalisation capability of the method might have been degraded. We will choose two different values for the number of grid points, viz., 50 and 100, to carry out our investigations. 50 grid points was chosen because, from Figure 3.45, it appears that all three methods show a significant difference in their behaviours, while at 100 grid points all three methods show little difference in their RMS error values.
Figure 3.46 shows the grid point distribution using 50 grid points.
Figures 3.47, 3.48 and 3.49 respectively show the output, and the differences between the original signal and the outputs reconstructed, using a RBFN with the linear grid regime, the nonlinear grid regime, and the mapping of the nonlinear grid to the linear grid regime.
Figure 3.47: The output and differences of outputs between the original signal and the reconstructed one using a RBFN with 50 linear grid points.
It is observed that the linear grid regime produces relatively larger errors. In addition, it is observed that while the mapping from a nonlinear grid to a linear grid regime produces in general errors of smaller magnitude than the nonlinear grid regime, overall the mapping method suffers from contamination by the noise content in the signal. This may be the reason why the overall RMS error value is larger than that of the nonlinear grid.
We will repeat the set of experiments with 100 grid points. Figure 3.50 shows
the grid point distribution using 100 grid points.
Figures 3.51, 3.52 and 3.53 respectively show the actual output and the differences between the original signal and the reconstructed output using a RBFN with a linear grid regime, a nonlinear grid regime, and a mapping of the nonlinear grid to a linear grid regime.
Figure 3.48: The output and differences of the original signal and the reconstructed one using a RBFN with a nonlinear grid regime using 50 grid points.
Figure 3.49: The output and differences in the original signal and the reconstructed one using a RBFN with a mapping from the nonlinear grid to a linear grid regime with 50 grid points.
Figure 3.50: Grid point distribution using 100 grid points for the sunspot cycle time series. The upper graph shows the linear grid distribution, while the lower graph shows the distribution of grid points using a nonlinear grid regime.
Table 3.3 shows the performance of the RBFN using 50 and 100 grid points
respectively.
Figure 3.51: The actual output and the differences in the original signal and the output reconstructed using a RBFN with 100 linear grid points.
Figure 3.52: The actual output and the differences in the original signal and the reconstructed output using a RBFN with 100 grid points with a nonlinear grid regime.
Figure 3.53: The actual output and the differences in the original signal and the reconstructed output using a RBFN with a mapping of the nonlinear grid to a linear grid regime with 100 grid points.
RMS error using 50 grid points:
    Linear RBFN     Nonlinear RBFN     N2Lmap
    0.1194          0.0866             0.1059

RMS error using 100 grid points:
    Linear RBFN     Nonlinear RBFN     N2Lmap
    0.0628          0.0630             0.0626

Table 3.3: Output results comparisons for the sunspot cycle time series.
It is observed that the nonlinear grid methods outperform the linear grid method
as one would have expected. The results of this experiment confirm our conclusion
from the currency exchange time series experiments.
So far, our experiments have been conducted on one-dimensional signals. In the
next subsection, we will consider an example with higher dimensions.
3.3.4 Experiments with the Iris Dataset
The Iris dataset consists of three species of iris flower. These are: Iris-Virginica, Iris-
Versicolor and Iris-Setosa. Each species measures sepal length, sepal width, petal
length, and petal width. Iris-Virginica and Iris-Versicolor are linearly inseparable.
We randomly select 51 data points for testing and 99 data points for training. Once
the training and testing data is chosen, the same data sets will be normalized and
put into different methods for performance evaluation.
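A split of this kind can be reproduced as in the following sketch, which uses scikit-learn's copy of the dataset; the 99/51 split matches the text, while the random seed and the choice to normalise with training-set statistics are our own assumptions:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)   # 150 samples, 4 features, 3 classes

    # 99 training and 51 testing points, as described above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=99, test_size=51, random_state=0)

    # Min-max normalisation (here using the training set's ranges).
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    X_train = (X_train - lo) / (hi - lo)
    X_test = (X_test - lo) / (hi - lo)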
Figure 3.54 shows the variation of the RMS errors as a function of the number of grid points per dimension.
It is noted that the RMS error values appear to behave quite “oddly”, in that they increase with the number of grid points beyond 6 grid points per dimension. This is odd because one would have expected the RMS error values to be a monotonically decreasing function of the number of grid points. This “odd” behaviour is explained below.
Figure 3.55 shows the grid point distribution using three grid points per dimension.
It may be observed from Figure 3.54 that the nonlinear grid methods outperform the linear grid method when using 2 to 4 grid points per dimension. However, the results for numbers of grid points greater than 3 may be of concern: basically, there is an insufficient number of data points to support any concrete conclusions.
Figure 3.54: The RMS error values as a function of the number of grid points per dimension.
As a rule of thumb, the number of training data points should be larger than the number of parameters.
When using 4 grid points per dimension the number of parameters is larger than
the number of examples in the training data set. For 6 grid points per dimension,
the system becomes unstable. This is the main reason why the behaviour of the
RMS error curve seems to behave “oddly”: there is an insufficient number of data points in the training data set to support any concrete conclusions beyond 3 grid points
per dimension. Since we are dealing with a signal of higher dimensions, one way in which we can process the signal is to assume that the signal can be processed individually in each dimension first, with the results combined in the product space. Hence, in order for the basis functions to cover all input dimensions, the basis functions should be “joined” across all dimensions. For example, if there are two input dimensions and three grid points per dimension, then the total number of neurons will be 9, as illustrated in Figure 3.56.
Figure 3.55: Grid point distributions of the Iris data set. The upper graphs in each sub-graph show the distribution of the linear grid points, while the lower graphs show the nonlinear grid points.
Figure 3.56: Basis function coverage in two dimensions.
Figure 3.57 shows the number of neurons used for the Iris data set as the number of grid points increases.
In general, for D-dimensional inputs where the maximum number of grid points on any input dimension is G, the total number of neurons is given by G^D. Thus, if D is large, the number of neurons grows exponentially. This is one of the main reasons why the current approach has limited applications; as pointed out in [1], such an approach can rarely be applied when the number of input dimensions is high. In the next chapter, we will show that by using an extended
Figure 3.57: Number of neurons used in a RBFN when the input dimension is four.
ANFIS approach such limitations may be overcome.
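The G^D growth is easy to make concrete. The sketch below builds the tensor-product grid with itertools.product and prints the neuron counts for D = 4 (the Iris input dimension), reproducing the shape of the curve in Figure 3.57:

    from itertools import product

    D = 4  # input dimensions, e.g. the four Iris features
    for G in range(2, 11):
        centres = list(product(range(G), repeat=D))  # one RBF neuron per grid cell
        assert len(centres) == G ** D
        print(f"G = {G:2d} -> {G ** D:6d} neurons")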
3.4 Conclusions
In this chapter, we have provided a simple method for generating the nonlinear
grid which can be used in association with the technique proposed by [39] or used
in a non-linear grid supported RBFN architecture. It is demonstrated where the
performance of the linear grid and the nonlinear grid differs 2. The performances of
the nonlinear grid in all cases exceed those of the corresponding linear grid. The method can be extended to multi-dimensional inputs. However, the RBFN type of neural network suffers from an exponential growth in the number of neurons required to represent the signal. This will be overcome in Chapter 4 using an extended ANFIS approach. It is also demonstrated that the extent to which fine features can be approximated is indirectly controlled by the number of grid points used: the more grid points are used, the better the approximation of the fine features in the signal.

It is found that when the signal has no or little noise, the mapping from a nonlinear grid to a linear grid seems to outperform the nonlinear grid. However, when there is noise in the inputs, the nonlinear grid outperforms the mapping from a nonlinear grid to a linear grid.
2There are three classes of algorithms: the linear grid algorithm and the nonlinear grid algorithm, which are used as standalone algorithms without any need for the radial basis function neural network algorithm; and the nonlinear to linear mapping algorithm, which is used to map a nonlinear grid to a linear grid before it is input into the radial basis function neural network. Hence, it is expected that there will be a difference in the performance of the linear grid method and the nonlinear grid method, if the proposed algorithm (nonlinear grid determination) has any benefits at all. It is also expected that the nonlinear to linear mapping algorithm will have performance close to the nonlinear grid method, as it is essentially a nonlinear grid method, except that it is transformed so that it can be used in association with the radial basis function neural network. These intuitions are confirmed in the experiments conducted in this chapter and demonstrated in the results shown in Tables 3.1, 3.2 and 3.3.
Chapter 4
Extended Adaptive Neuro-Fuzzy
Inference Systems
4.1 Motivation
In the last chapter, we have shown that a RBFN with a nonlinear grid regime outperforms one with a linear grid regime. However, the RBFN uses a Gaussian function which requires the determination of the center and spread. Furthermore, it was found that the proposed method discussed in the last chapter cannot be used for inputs of high dimensions. Hence, it would be useful if we can find a method which can be applied to higher input dimensions, and to investigate whether there are methods which do not require the determination of the centers and spreads of Gaussian functions. Since it is known that a RBFN is equivalent to a
neuro-fuzzy system with Gaussian membership functions [12], this problem may be
transformed into one which concerns the determination of the membership function
in a neuro-fuzzy system. In other words, would it be possible to find other types of membership functions which “automatically” adjust their shapes to the inputs? In this chapter, we will introduce a novel neuro-fuzzy system which we call an extended adaptive neuro-fuzzy inference system (EANFIS) 1.
The EANFIS allows us to handle higher input dimensions by avoiding the need to form all the rules in the first instance. The EANFIS is an extension of the common adaptive neuro-fuzzy inference system (ANFIS) through the incorporation of additional layers. Secondly, we will deploy a self organising mountain clustering function as a membership function, which has the property of “automatically” adjusting its shape to the inputs.
4.2 Introduction to Extended Adaptive Neuro-
Fuzzy Inference System
The adaptive neuro-fuzzy inference system (ANFIS) [12] is a popular neuro-fuzzy
system. It consists of a number of layers implementing the premises, and the conse-
quences of a fuzzy system. It accepts various membership functions in the premises.
However, it is known that ANFIS cannot be applied to inputs with high dimensions.
The reason is that the ANFIS forms the pairwise combination of the inputs at the
premises part. Thus, if there are many inputs, there will be an explosion in the number of rules required in the premises part of the ANFIS. This has been a limiting factor in the application of ANFIS to practical systems with high input dimensionality.

1The concepts of ANFIS and extended ANFIS will be explained in this chapter.
In this chapter, we will use the following intuition: if it is possible to “prune” the
number of rules before they are formed in the premises part of an ANFIS, then it
might be possible to apply the ANFIS to a higher input dimension. In other words,
we recognise that the limiting factor in the ANFIS is due to its requirement to form
the pairwise combination of the inputs, whether they are required or not. However,
if there is a way in which we can determine what rules need to be formed at the
premises part, then we may avoid this explosion of the rules if the input dimension is
high. In this case, the limiting factor will be the complexity of the problem at hand.
If the underlying system is very complex and requires many rules being formed, then
this will be a limiting factor, as there is only a limited memory available to allow
the formation of rules. On the other hand, if the underlying system is not complex, even
though it may contain a high input dimension, then by avoiding the formation of
the rules at the premises part, we may still utilise the idea of ANFIS to model the
system. The problem is that it is not immediately obvious how one may determine
which rule to form before the adaptation process begins.
In this chapter, we will discuss a way of augmenting the ANFIS by the incorporation of two additional layers, based on the observation that logistic functions appear to perform well in classification problems. Hence we extend the ANFIS architecture by incorporating two extra layers: one involving neurons with logistic functions, and the other performing a normalisation process. Then, inspired
by the a priori algorithm in the data mining literature [41], we introduce a method
whereby the number of rules can be “pruned” thus avoiding the need to form all the
unnecessary ones (those that would not fire sufficiently even if they are formed).
This architecture may deploy various membership functions, e.g., the triangular membership function, the trapezoidal membership function, or the Gaussian membership function. Most of these membership functions require the determination of their parameters in a separate step. They would have difficulties adapting to an input which may require a non-symmetric membership function 2. In this chapter, we will use the self organising mountain clustering membership function as originally suggested in [44]. This membership function is in reality an approximate implementation of the kernel density estimation technique common in non-parametric statistics [16]. Hence, the membership function, not necessarily symmetric, can adapt to any input shape.
Thus the EANFIS architecture consists of the following elements:
1. Membership function generation,
2. Rule formulation,
3. Parameter learning related to the parameters associated with the layers, and
4. Output layer parameter learning.
2They can adapt to non-symmetric input functions by increasing the number of membership
functions required to represent the inputs.
The first two steps are related to the structure of the architecture, while the second two steps are related to parameter estimation once the structure is determined. We will describe the parameter estimation algorithm in Section 4.6.2. We will discuss the structure determination aspects of the architecture in Section 4.4, after a discussion of the architecture of the proposed EANFIS in Section 4.3.
4.3 Architecture of the Extended Adaptive Neuro-
Fuzzy Inference System
In this section, we will describe the extended ANFIS architecture in Figure 4.1.
One of the insights derived from working with ANFIS is that when the architecture is applied to a discrete output, the performance could be improved if we
use a softmax-type of output 3. This convinced us that perhaps a softmax type
of formulation may help in extending an ANFIS architecture. However, as may be
observed below, formulating the problem as a softmax-type problem dramatically increases the complexity of the problem. Indeed, this greatly increases the number
of rules which need to be formed. This motivates us to look for ways in which
the structure of the architecture can be determined before the determination of the
parameters associated with the inference engine. Inspired by the ways that associative rules can be found for data mining problems, we worked out a way in which
the structure of the architecture can be determined before the parameters need to be estimated. Thus, using such a method, we find that there is no need to implement all the possible combinations of membership function outputs. We only need to implement those few combinations of membership function outputs which, using our proposed method, are required for the particular practical problem at hand. Thus, this alleviates the problem of needing to form all combinations of membership function outputs in the ANFIS architecture. Next, we turned our attention to the issue of membership functions. We note that most membership function determination approaches are essentially “open-loop”. In other words, one postulates that a candidate membership function can be used, and applies the function, through a trial and error method, in the determination of the number of membership functions required. There does not appear to be any attention paid to “what the input data is trying to tell us”. Thus the idea of a possibly non-symmetric membership function began to germinate in our minds. We find that a particular method, called the self organising mountain clustering method, allows us to find the shape of the membership function required. The self organising mountain clustering method is essentially an approximation of the well known kernel based method for finding the probability density function of given data. However, a straightforward application of such a method leads to a large number of membership functions. Hence, we modify the concept of the self organising mountain clustering method so that it can be applied to our situation.

3Softmax is a device used by neural network researchers to provide discriminative training for discrete valued outputs. It essentially normalises the outputs by taking the exponential of each output and normalising it by the sum of the exponentials of all outputs. In the neural network literature, this has been found to be useful for training discrete valued outputs.
In this section, we will describe the proposed general extended ANFIS architecture. In Section 4.4 we will describe the proposed method for determining the structure of the architecture, and in Section 4.5 we will describe the modified version of the self organising mountain clustering method for determining the membership function from the given training data.
In the EANFIS architecture, we need to expand the formulation of the ANFIS architecture depending on the output. In particular, we need to consider two different situations: when the output is discrete, and when the output is continuous. We will deal with the discrete output situation first before we deal with the continuous output situation. Let us assume that the output is discrete and there are T output classes. We will denote the desired output as $d_i^{\tau}$, where $i = 1, 2, \ldots, I$, $I$ being the total number of training instances, and $\tau = 1, 2, \ldots, T$, $T$ being the number of output classes. To model such discrete output classes, we assume the output of the EANFIS architecture to be discrete, i.e., $y_i^{\tau}$, where $y_i^{\tau}$ is the output for the $i$-th input, with a corresponding desired output $d_i^{\tau}$. From this, we will need to build a “separate” ANFIS for each output class $\tau$. We will describe this more formally following the same approach as the classic ANFIS architecture, albeit with some modifications.
Figure 4.1: EANFIS architecture

Layer 1 (Input Layer): In this layer the input is a vector $x \in \mathbb{R}^D$ with components $x_d$, $d = 1, 2, \ldots, D$. Each input $x_d$ is fed into a membership module which consists of C membership function nodes. The membership function node can be a traditional Gaussian function, a triangular function, a trapezoidal function, a bell shaped membership function [12], or the proposed modified self-organizing mountain clustering membership function (for more details please see Section 4.5). There are C membership functions for each input $x_d$ and for each output class $\tau$. Let the output of the membership module be
$\varphi_{dc}^{\tau}(x_d)$; $d = 1, 2, \ldots, D$, $c = 1, 2, \ldots, C$ and $\tau = 1, 2, \ldots, T$. In effect, the outputs of the membership functions $\varphi_{dc}^{\tau}(x_d)$ can be considered as a measure of the similarity between the input $x_d$ and the $c$-th membership function for a particular output class $\tau$. If they are close, then the output of $\varphi_{dc}^{\tau}(x_d)$ will be high. On the other hand, if the match between the $c$-th membership function and the input $x_d$ is low, then the corresponding output $\varphi_{dc}^{\tau}(x_d)$ will be low.
Layer 2 (Rule Layer): In this layer the membership function outputs are multiplied together according to a specific scheme, as follows. Assume that for a particular output class $\tau$ each input dimension has C membership functions. We start with $d = 1$, which has membership functions $\varphi_{1,1}^{\tau}, \varphi_{1,2}^{\tau}, \ldots, \varphi_{1,C}^{\tau}$. For $d = 2$, there are the variables $\varphi_{2,1}^{\tau}, \varphi_{2,2}^{\tau}, \ldots, \varphi_{2,C}^{\tau}$. We need to form the pairwise combinations of these variables $\varphi_{2,i}^{\tau}$, $i = 1, 2, \ldots, C$, with those of $\varphi_{1,j}^{\tau}$, $j = 1, 2, \ldots, C$, as follows: $\varphi_{1,1}^{\tau}\varphi_{2,1}^{\tau}, \varphi_{1,1}^{\tau}\varphi_{2,2}^{\tau}, \ldots, \varphi_{1,1}^{\tau}\varphi_{2,C}^{\tau}, \ldots, \varphi_{1,C}^{\tau}\varphi_{2,1}^{\tau}, \ldots, \varphi_{1,C}^{\tau}\varphi_{2,C}^{\tau}$, a total of $C^2$ terms. Then for $d = 3$, we form the terms $\varphi_{3,1}^{\tau}, \varphi_{3,2}^{\tau}, \ldots, \varphi_{3,C}^{\tau}$ with each of the products formed by concatenating input dimensions 1 and 2 together. There will be a total of $C^3$ terms, with the general form $\varphi_{1,i}^{\tau}\varphi_{2,j}^{\tau}\varphi_{3,k}^{\tau}$. The method can be generalised to a general value of $d$, until $d = D$. Thus there will be in general $C^D$ rules for each value of $\tau$, giving a total of $TC^D$ rules. We will denote each rule by $\phi_r^{\tau}$, with the general form $\varphi_{1,i}^{\tau}\varphi_{2,j}^{\tau}\cdots\varphi_{D,k}^{\tau}$. Each membership output product is equivalent to performing the fuzzy T-norm operation, representing the firing strength of this rule [22].

Note that the total number of rules is $R = TC^D$.
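A direct, unpruned implementation of this rule layer is a tensor product over the input dimensions. The following sketch (our own illustration, for a single output class $\tau$) forms all $C^D$ products:

    import numpy as np
    from itertools import product

    def rule_layer(phi):
        # phi: array of shape (D, C), the membership outputs for one class tau.
        # Returns the C**D rule firing strengths (products across dimensions).
        D, C = phi.shape
        strengths = [np.prod([phi[d, c] for d, c in enumerate(combo)])
                     for combo in product(range(C), repeat=D)]
        return np.array(strengths)

    # Example: D = 2 inputs with C = 3 membership functions each gives 3**2 = 9 rules.
    phi = np.random.rand(2, 3)
    print(rule_layer(phi).shape)   # -> (9,)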
Layer 3 (Normalized Layer): This layer calculates the ratio of each rule's firing strength to the total firing strength. In other words, this layer computes the normalized outputs of the rule layer:

\bar{\phi}_r^{\tau} = \frac{\phi_r^{\tau}}{\sum_{r=1}^{R} \phi_r^{\tau}} \qquad (4.1)
Note that layers 1 to 3 are the layers of the classic ANFIS [22], except that in this case we separate out each output class into a separate strand. Note also that $\bar{\phi}_r \in [0, 1]$, which denotes the normalised similarity of the rule corresponding to the $d$-th input and the $c$-th membership function.
Layer 4 (Error Correction Layer): This layer is used to “fine tune” the output of layer 3 by using a logistic function:

\ell(\bar{\phi}_r, \gamma) = \frac{2}{1 + \exp\left(-(1 - \gamma)(\bar{\phi}_r - 1)\right)} \qquad (4.2)

where $\gamma \in \mathbb{R}$ is an adjustable parameter. Thus the logistic function is one in which the slope $\gamma$ can be adjusted 4. The output of this layer is $\pi_r = \ell(\bar{\phi}_r, \gamma_r)\bar{\phi}_r$.

4We will show in a later section how this parameter may be adjusted.
In general, the fuzzy rule node outputs (in layer 3) in an ANFIS architecture
may contain contradictions, overlaps or inconsistencies which may be attributed to
the noise in the training data set or to the blurred cluster regions among different
output clusters. The proposed error correction layer (layer 4) presents one way
to solve these problems. Thus, this layer will become effective if the output of
the neuro-fuzzy system is ambiguous, e.g., if two rules are giving rise to similar
outputs. In this case, it will be difficult to distinguish the effectiveness of the rules; using this layer, however, we will be able to do so. Obviously, this layer will not be required if all the rules give rise to well separated outputs.
The logic of this layer can be understood as follows:

1. If the degree of similarity $\bar{\phi}_r$ is close to 1 and $\ell(\bar{\phi}_r, \gamma_r)$ is high, then the output $\pi_r$ is high.

2. If the degree of similarity $\bar{\phi}_r$ is close to 1 and $\ell(\bar{\phi}_r, \gamma_r)$ is small, then the output $\pi_r$ is still high. This can be thought of as a rarity situation, in which only very few samples of the case exist.

3. If the degree of similarity $\bar{\phi}_r$ is close to 0.5 and $\ell(\bar{\phi}_r, \gamma_r)$ is low, then the output $\pi_r$ is low. This fuzzy rule has a low $\ell(\bar{\phi}_r, \gamma_r)$, which means it contributed many errors during the training process; the output from this rule is untrustworthy, and the output strength $\pi_r$ is lowered accordingly.

4. If the degree of similarity $\bar{\phi}_r$ is close to 0.5 and $\ell(\bar{\phi}_r, \gamma_r)$ is high, then the output $\pi_r$ is medium: a high $\ell(\bar{\phi}_r, \gamma_r)$ does not apply any discount to the output $\pi_r$.
This layer may be formally represented as follows:

Rule r: If $x_1$ is $\varphi_{1,c}^{\tau}$ and ... and $x_D$ is $\varphi_{D,c}^{\tau}$ then $\pi_r = \bar{\phi}_r \, \ell(\bar{\phi}_r, \gamma_r)$.
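Equation (4.2) and the rule output $\pi_r$ translate directly into code. A minimal sketch, with $\gamma$ treated as a vector of per-rule parameters (array names are ours):

    import numpy as np

    def error_correction(phi_bar, gamma):
        # Layer 4: logistic fine-tuning of the normalised firing strengths, Eq. (4.2).
        # phi_bar, gamma: arrays of length R, one entry per rule.
        ell = 2.0 / (1.0 + np.exp(-(1.0 - gamma) * (phi_bar - 1.0)))
        return phi_bar * ell   # pi_r = phi_bar_r * ell(phi_bar_r, gamma_r)

    # A rule with phi_bar close to 1 keeps a high output; one near 0.5 is discounted.
    print(error_correction(np.array([1.0, 0.5]), np.array([0.0, 0.0])))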
Layer 5 (Normalized Layer): This layer performs the normalisation of the outputs of Layer 4:

\bar{\pi}_r = \frac{\pi_r}{\sum_{r=1}^{R} \pi_r} \qquad (4.3)
The output of this layer is normalised to lie between 0 and 1.
Layer 6 (Output Layer): The output layer can accept two possible forms, viz.
the zeroth order output and the TSK scheme respectively. This is exactly the same
as shown in Section 2.5.1.
For continuous outputs, we will assume that τ = 1. Thus the continuous output
case can be considered as a special case of the more general discrete output case.
The adjustable parameters of this architecture are: $\gamma_r$, $r = 1, 2, \ldots, R$, and $w_r$, $r = 0, 1, 2, \ldots, R$, in the case of the zeroth order output function; in addition, $\alpha_{rd}$ and $\alpha_{r0}$, for $r = 1, 2, \ldots, R$ and $d = 1, 2, \ldots, D$, in the case of the TSK output mechanism. Furthermore, there are parameters associated with the membership functions as well; these depend on the membership function used. For example, if we use a Gaussian membership function, then there will be two parameters associated with each membership function. On the other hand, if we use a triangular membership function, then there will be three parameters associated with each membership function.
4.3.1 Remarks
1. What is the difference between this architecture and the classic multilayer per-
ceptron architecture? One may collapse Layers 1, 2, and 3 into an aggregate
input layer, and Layers 5 and 6 (assuming, for simplicity, that the output layer is a
zeroth order output function) into one aggregate output layer, then we have a
general input layer followed by a logistic function layer and then followed by
a general output layer. It is tempting to try to compare this with the MLP
architecture. However, this architecture is different from the classic MLP ar-
chitecture in that in the classic MLP architecture, the output is formed by a
combination of the outputs of hidden layer neurons. On the other hand, in the
proposed architecture, the output is formed by a single weight $w_r$ associated with a logistic function with an adjustable slope. In other words, in the proposed architecture, the output is formed from the output of a single “hidden layer neuron” (in the language of neural networks), and the hidden layer neuron in this case has an adjustable slope. This is different from the classic MLP
architecture in which the logistic function normally has a fixed slope.
Indeed, there has been research work which associates a MLP architecture with the output of the neuro-fuzzy architecture [12]. However, in that case, it was shown that the performance of the cascaded ANFIS and MLP architecture is not good. This is one reason why we choose the proposed architecture, which draws an individual output from an individual logistic function for each output of the “hidden layer neurons” (in the language of neural networks). As will be shown in later sections, this provides much better outputs.
2. The consequence of the above remark is that we cannot say anything about the universal approximation capability of the proposed EANFIS architecture. Indeed, it is known that a neuro-fuzzy architecture does not have universal approximation capability; hence it is suspected that the proposed architecture does not have universal approximation capability either. On the other hand, it is rather difficult to extract “meaningful” rules from a classic MLP architecture, whereas with a neuro-fuzzy architecture it is possible to do so. Indeed, as will be shown in a later section, the proposed architecture can extract meaningful rules from the trained architecture. Hence this can be considered as a tradeoff between the universal approximation capability and the ability to extract rules from the trained architecture.
3. It is known in the literature [2] that there may be situations when the outputs of the neuro-fuzzy architecture are ambiguous, in that the outputs are similar and yet they belong to different classes. In this case, one solution is to use the concept of a “certainty factor” [2], which manually assigns some weighting to the outputs to reflect what the user considers important. In the proposed architecture, we have introduced an automatic method of assigning the importance of the outputs through the deployment of an adjustable slope of a logistic function. Indeed, it is possible to derive the adjustable slope of a logistic function through Bayes' theorem, very much in the same spirit as some of the work in the classic certainty factor approach. However, we decided to introduce the concept as indicated above, as we believe this may give a more direct insight into the proposed architecture.
4.4 Structure determination of the proposed neuro-
fuzzy architecture
In this section, we will consider the issue of “pruning” the number of rules. This issue arises because in a neuro-fuzzy network, in the antecedent, the normal approach is to form all possible pairwise combinations of inputs, as indicated in Layer 2 in Section 2.5.1. As might be surmised, this leads to an explosion in the number of rules if the input dimension is high. For example, if we have an input dimension of 10, two membership functions per dimension and a continuous output, then the total number of rules will be $2^{10} = 1024$. The formation of these rules is undertaken whether or not the rules are required in the neuro-fuzzy architecture. Note that while some pruning process may be applied, e.g. see [47, 48],
nevertheless, this is only applied after the rules are formed. In other words, whether
the rules are required or not, they need to be generated first. This explosion effect is
a limiting factor for the neuro-fuzzy architecture to be applied to a practical system
with high input dimensionality. In this section, we will propose a method guiding
the formation of the rules. The method is inspired by the Apriori algorithm in data
mining using associative rules [41]. However, as it will become clear, the proposed
method is not the same as the Apriori algorithm [41]. It is only in the spirit of the
Apriori algorithm. Note that in this case, only the rules which are required will be
formed rather than forming all possible combinations of the rules. In other words,
the proposed method will form the required rules (those that will be fired by the
neuro-fuzzy network), and will not form those rules which will not be “fired” by
the network, in response to inputs. Obviously, the rules which need to be formed
are dependent on the input set. This method will facilitate the use of the proposed neuro-fuzzy architecture for practical problems, as the method is only bounded by the total number of rules required for the problem for a particular set of inputs. Obviously, if the underlying characteristics of the input set change, the rule set will need to be changed as well. This could be incorporated in a “sentinel” module, which monitors the underlying characteristics of the inputs, and then decides when to change the rule sets. In this thesis, we will assume that the input set characteristics are relatively static, and hence we do not need a “sentinel” unit to monitor their characteristics, nor multiple rule sets to characterize the problem. This is reserved as a problem for future research.
We will first briefly describe the Apriori algorithm before describing the proposed method.
The Apriori algorithm (a shortened form of “a priori algorithm”) is a popular algorithm used in data mining with associative rules [41]. It was originally proposed in [41] to study the “shopping basket” problem, which may be stated briefly as follows: if a customer is purchasing a certain group of items, what is the likelihood of the person buying another group of items in the same shopping session? The set of items that a customer purchases is called an itemset. Assume for convenience that each customer's transaction has $k$ items. The algorithm finds all itemsets $L_k$ whose occurrence counts in the transactions exceed a threshold. The set $L_k$ is then used to generate the candidate set $C_k$, formed from the union $L_k \cup L_{k-1}$. The candidate set is used to form another, larger itemset by removing all those itemsets which are below the threshold. The algorithm repeats until $L_k$ is empty [41].
We will illustrate this algorithm using a simple example. The procedure is shown
graphically in Figure 4.2.
Transactions (assume threshold = 1):

    TID     Items
    T100    1,2,4
    T101    2,3
    T102    2,4
    T103    1,2
    T104    1,2,3
    T105    1,3

Table L1 (itemsets remaining after removing all itemsets at or below the threshold):

    Itemset   {1}   {2}   {3}   {4}
    Counts     4     5     3     2

Table C2 (join):

    Itemset   {1,2}  {1,3}  {1,4}  {2,3}  {2,4}  {3,4}
    Counts      3      2      1      2      2      0

Table L2 (C2 with all itemsets at or below the threshold removed):

    Itemset   {1,2}  {1,3}  {2,3}  {2,4}
    Counts      3      2      2      2

Table C3 (join):

    Itemset   {1,2,3}  {1,2,4}  {1,3,4}  {2,3,4}
    Counts       1        1        0        0
Figure 4.2: An example illustrating the determination of the maximum itemset in the Apriori algorithm.
The input data to the Apriori algorithm consists of a set of transaction records. The TID (Transaction Identifier) column gives the transaction ID, and the Items column lists the item numbers involved in each transaction. For example, in the transaction ID T100, the
items purchased are 1, 2, and 4. In this example, there are six transactions. We
assume that a threshold of 1 has been set. Thus, we look at the TID column, and
find out if the occurrence of items 1, 2, 3, or 4 is less than or equal to one. In this
example, all items occurred at least twice. Hence we cannot eliminate any item.
In this case, we will form an itemset denoted as Table L1 in Figure 4.2. Since all
items are present (as none of them is below the threshold), and hence we have four
itemsets in this set: L1. Since there is only one candidate item in each itemset, hence
the count column in this table essentially counts the occurrence of each item. From
the TID table we find that there are four occurrences of the item 1 (in T100, T103,
T104, and T105), and hence the entry in Table L1 for itemset set {1} is 4. The
Table C2 is obtained by joining the itemsets in Table L1 together. Here the joining
is performed in lexicographical order, with no repeats. Thus, for example, from itemset {1} in Table L1, we can form the following itemsets: {1, 2}, {1, 3}, {1, 4}. Now we can compare the pattern of the itemset {1, 2} with the TID column and
find out the number of occurrence of this pattern. In this case we find that there are
three occurrences (T100, T103, and T104). Hence the entry in the column Count in
Table C2 is 3. We can remove all itemsets in Table C2 which are below the threshold.
In this case, we have itemsets {1, 4} and {3, 4} which are below the threshold, and
hence they will not be considered further. Table L2 is formed by removing all these
itemsets which are below the threshold with the corresponding count of occurrences.
Table C3 can be formed by joining (concatenating) the itemsets in Table L2 with those in Table L1. Thus the first candidate itemset would be {1, 2} from Table L2 joined with itemset {1}. However, this cannot happen, as item 1 already exists in the itemset {1, 2}. Hence the next possibility would be {1, 2} in Table L2 concatenated with itemset {3}, which results in the candidate itemset {1, 2, 3}. Then we compare this pattern with those in Table TID, and find that there is only one such occurrence (T104). It is found from Table C3 that all occurrences are less than or equal to the threshold. Hence the process stops.
In this “shopping basket” example, we find that not every item combination
exists in the transaction record. The Apriori algorithm removes such a combination
if it does not exist or if it is below a prescribed threshold. The procedure may be
extended to provide information on the support and confidence of a particular rule
found.
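As an illustration of these steps, the following minimal Python sketch reproduces the itemset counting on the transactions of Figure 4.2. The join step here (extending each surviving itemset by one lexicographically larger item) is our simplified rendering of the join, not the thesis implementation, which was carried out in Matlab.

from itertools import chain

# Transactions of Figure 4.2.
transactions = [{1, 2, 4}, {2, 3}, {2, 4}, {1, 2}, {1, 2, 3}, {1, 3}]
threshold = 1              # itemsets occurring <= threshold times are pruned

def count(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set(chain.from_iterable(transactions)))
L = [frozenset([i]) for i in items if count(frozenset([i])) > threshold]
k = 1
while L:
    print(f"L{k}:", {tuple(sorted(s)): count(s) for s in L})
    # Join: extend each surviving itemset by one larger item, then prune.
    candidates = {s | {i} for s in L for i in items if i > max(s)}
    L = [c for c in candidates if count(c) > threshold]
    k += 1

Running this reproduces the tables L1 and L2 of Figure 4.2 and terminates once no candidate of Table C3 exceeds the threshold.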
4.4.1 A proposed algorithm for rule formation
In this section, we will give details of a proposed algorithm for rule formation. This
algorithm is inspired by the way the maximum itemset is found in the Apriori
algorithm. However, our proposed algorithm is different from the one used in the
Apriori algorithm.
In a way the proposed algorithm is like running the maximum itemset determi-
nation algorithm backwards. Instead of considering each item by itself, we will start
with the clusters identified in the self organising mountain clustering membership
function method (for details please see Section 4.5.). In the self organising mountain
clustering membership function technique, as will be described in Section 4.5, we
identify sets of clusters. These clusters are formed by the closeness of a particular
set of data points with respect to the underlying grid points. For example, in the
d-th dimension, there are Gd grid points. Then we can evaluate the closeness of a
particular data point xd,i with respect to each grid point using the following:
\omega^\tau_{d,i} = \arg\max_{g \in G_d} \left( \exp\left( -\,\frac{\left( x^\tau_{d,i} - z^\tau_{d,g} \right)^2}{2 \left( z^\tau_{d,(g+1)} - z^\tau_{d,g} \right)^2} \right) \right) \qquad (4.4)
This operation will identify the association of the data points xd,i, i = 1, 2, . . . , I
with particular grid points in the underlying grid. Note that it is possible that each
grid point may be associated with more than one data point x_{d,i}, i = 1, 2, . . . , I.
Assume that each grid point g, g = 1, 2, . . . , G_d, is associated with η_g data points.
Note that η_g may be 0, as it may occur that none of the data points is close to a
particular grid point. Then the user can decide to group a number of grid points
together to form segments. Such segments will be the equivalent of the itemsets in
the associative rule mining techniques. Let us denote each segment as S_j. Thus, for
example, in one particular dimension, we may have, say, data points {1, 2, 3} which
are closest to, say, grid point 2; data points {4, 5, 6} closest to grid point 4;
and data points {7, 8} closest to grid point 7. Then the user decides that there
should be two segments (where two is an arbitrary decision; it could easily have been
three segments), one stretching from grid point 1 to grid point 5, while the second
segment covers grid points 6 to 10. In this case, segment 1 will contain
data labels S_1 = {1, 2, 3, 4, 5, 6} while segment 2 will contain data labels S_2 = {7, 8}. The data in this dimension is then represented as S_1 ∪ S_2. Thus the segments S_j
play the same role as the itemsets in associative rule formation.
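To make Eq. (4.4) and the subsequent segmenting concrete, the following minimal Python sketch assigns samples in one dimension to their closest grid points and then groups the winning grid points into two user-chosen segments. The grid, the sample values and the two-segment split are illustrative assumptions, not values from the thesis.

import numpy as np

grid = np.linspace(0.0, 1.0, 10)     # grid points z_{d,g}, uniform regime
spacing = grid[1] - grid[0]          # z_{d,(g+1)} - z_{d,g}, constant here
samples = np.array([0.05, 0.12, 0.15, 0.31, 0.33, 0.35, 0.72, 0.78])

# Eq. (4.4): closeness of each sample to each grid point, then the winner.
closeness = np.exp(-(samples[:, None] - grid[None, :]) ** 2
                   / (2.0 * spacing ** 2))
winners = closeness.argmax(axis=1)   # omega_{d,i}: winning grid index per sample

# Group the first and second halves of the grid into two segments,
# collecting the data labels (1-based sample indices) won by each half.
segments = {1: [], 2: []}
for label, g in enumerate(winners, start=1):
    segments[1 if g < 5 else 2].append(label)
print(segments)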
These will be the starting point of the proposed algorithm for rule formation.
Assume that there are clusters of grid points formed on each input dimension.
Assume that there are D input dimensions, each dimension consists of Cd, d =
1, 2, . . . , D, clusters. We will consider each dimension in turn. We will consider
first d = 1. In this dimension there are C1 clusters. These will be the itemsets
for the next iteration. We will call this set of itemsets L1. Then we concatenate
the clusters of dimension d = 1 with those in dimension d = 2 by forming the
operation \{\omega^\tau_{1,1}, \omega^\tau_{1,2}, \ldots, \omega^\tau_{1,\kappa^\tau_1}\} \cap \{\omega^\tau_{2,1}, \omega^\tau_{2,2}, \ldots, \omega^\tau_{2,\kappa^\tau_2}\}, where \omega^\tau_{d,g} denotes the data
labels stored in the g-th grid point of the d-th dimension, and \kappa^\tau_d denotes the
number of clusters found in this process for input dimension x_d. The value of \omega^\tau_{d,i}
is determined as shown in Eq. (4.4). From this concatenation operation, we detect
the common elements. If the number of common elements in the concatenation is above
a prescribed threshold, this will be passed on to the next iteration as an element
in the itemset. On the other hand, if the number of common elements in the
concatenation is smaller than the prescribed threshold, this will be eliminated from
further consideration. It is this discarding of elements from further consideration that
is the key to reducing the number of rules which need to be formed. The process
repeats until d = D. Then the rules will be obtained from an inspection of the final
table Lk.
We will consider a simple example to illustrate the proposed rule formation
method. We will first consider the d = 1 axis. There are two clusters: {1, 2, 3} and
{4, 5, 6} respectively. This is shown in the table called Clustered Data in Figure
4.3. For convenience we will label {1, 2, 3} as cluster 1, and {4, 5, 6} as cluster
2. Each cluster consists of three data labels. This information is displayed in the
table called L1 in Figure 4.3. The threshold is 1. The column “Itemset” has two
entries, denoting cluster {1, 2, 3} and cluster {4, 5, 6} respectively. The
column “Combination of clusters” gives the labels we assigned to these two clusters,
i.e., cluster 1 and cluster 2 respectively. The column “Count of elements” gives
the number of elements in the cluster. In both cases, there are three elements in
the cluster. Then we concatenate the clusters in dimension d = 1 with those in
dimension d = 2 using a join operation. As there are two clusters in the d = 2
dimension: {1, 2} and {3, 4, 5, 6}, we can concatenate the clusters of the dimension
d = 1 with those in dimension d = 2 and find the common elements. Thus, in
the table called C2 in Figure 4.3, the first element in the column “itemset” shows
the concatenation of the first cluster {1, 2, 3} in dimension d = 1 with the first
cluster {1, 2} in dimension d = 2, written {1, 2, 3} ∩ {1, 2}; in the column
“Combination of clusters” this is denoted by (1), 1. In this case,
we find that there are two common elements, 1 and 2, and hence the entry of 2 in the
column “Counts of common elements”. In a similar manner, we can find the values
on all the columns of the table: C2. Since the threshold is 1, we can remove the
entry corresponding to {4, 5, 6} ∩ {1, 2}, as there are no common elements in this
concatenation. Then we transfer this information into the table: L2. The entries
in the column “Itemset” are the concatenation of the clusters and the common
elements. Thus for example, the first entry of column “itemset” is obtained by
the concatenation of cluster 1 {1, 2, 3} in d = 1 dimension and cluster 1 {1, 2} in
dimension d = 2. The common elements in these two clusters are {1, 2}. This will
be the itemset. There are only two common elements, and hence the entry in the
column “Counts of common elements” is 2. The entry in the column “Combination
of clusters” denoted by (1), 1 signifies the result is obtained by the concatenation
of cluster 1 in d = 1 dimension with that of cluster 1 in the d = 2 dimension. The
entries in the table called C3 denote the join operation of the results of Table L2
with those clusters on the d = 3 dimension. Thus the first element is formed by
concatenation of the common elements found by concatenating cluster 1 in dimension
d = 1 and cluster 1 in dimension d = 2 with cluster 1 in the d = 3 dimension.
This is denoted by {1, 2} ∩ {1, 2, 3, 4}; hence the first entry in
the column “Combination of clusters” reads (1, 1), 1. Here we find that there are two
common elements, viz. {1, 2}. Hence the first entry in the column “Counts of
common elements” is 2. This process is repeated for the other clusters, and Table
C3 is fully populated. As the threshold is 1, we can eliminate the entries
{1, 2} ∩ {5, 6}, and {4, 5, 6} ∩ {1, 2, 3, 4}. The remaining information is transferred
to the table called L3. Here there are only two values which are above the threshold
{1, 2} ∩ {1, 2, 3, 4}, and {4, 5, 6} ∩ {5, 6}. Hence the entries in the final column of
Table L3 are both 2 denoting that there are only two common elements. The entries
in the column “Combination of clusters” denote the way in which the clusters are
formed. For example, the first element is formed by the concatenation of cluster 1
in d = 1 dimension, cluster 1 in d = 2 dimension and cluster 1 in d = 3 dimension.
The “Itemset” column denotes the common elements as a result of the concatenation
process. Since there are only three input dimensions, the process stops.
In this example, we finally conclude that there are two fuzzy rules (as in Table
L3 in Figure 4.3 there are only two remaining entries):
Clustered Data:
    d = 1: {1, 2, 3}, {4, 5, 6}
    d = 2: {1, 2}, {3, 4, 5, 6}
    d = 3: {1, 2, 3, 4}, {5, 6}

L1 (Itemset / Combination of clusters / Count of elements):
    {1,2,3}   (1)   3
    {4,5,6}   (2)   3

C2 (Itemset / Combination of clusters / Counts of common elements):
    {1,2,3} ∩ {1,2}       (1),1   2
    {1,2,3} ∩ {3,4,5,6}   (1),2   1
    {4,5,6} ∩ {1,2}       (2),1   0
    {4,5,6} ∩ {3,4,5,6}   (2),2   3

L2 (Itemset / Combination of clusters / Counts of common elements):
    {1,2}     (1),1   2
    {4,5,6}   (2),2   3

C3 (Itemset / Combination of clusters / Counts of common elements):
    {1,2} ∩ {1,2,3,4}     (1,1),1   2
    {1,2} ∩ {5,6}         (1,1),2   0
    {4,5,6} ∩ {1,2,3,4}   (2,2),1   1
    {4,5,6} ∩ {5,6}       (2,2),2   2

L3 (Itemset / Combination of clusters / Counts of common elements):
    {1,2}   (1,1),1   2
    {5,6}   (2,2),2   2

Assume threshold = 1.

Figure 4.3: An example to illustrate the proposed rule formation algorithm.
Rule1: IF cluster1 in d = 1 dimension ∧ cluster1 in d = 2 dimension ∧ cluster1 in
d = 3 dimension THEN Consequence1.
Rule2: IF cluster2 in d = 1 dimension ∧ cluster2 in d = 2 dimension ∧ cluster2 in
d = 3 dimension THEN Consequence2.
From this description it can be observed that the proposed procedure is quite
different from the maximum itemset determination in the Apriori algorithm. It
seeks to find the combination of clusters such that there are common elements in the
clusters. Note that these common elements are represented by the data labels stored
in the clusters. Nevertheless the proposed algorithm is inspired by the maximum
itemset determination algorithm in the Apriori algorithm.
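The following minimal Python sketch replays this intersection-and-threshold iteration on the clusters of the example in Figure 4.3; the cluster contents and the threshold come from that example, while the data structures are our own illustrative choice.

# Clusters of data labels per input dimension, as in Figure 4.3.
clusters = [
    [{1, 2, 3}, {4, 5, 6}],      # d = 1
    [{1, 2}, {3, 4, 5, 6}],      # d = 2
    [{1, 2, 3, 4}, {5, 6}],      # d = 3
]
threshold = 1

# L1: each cluster of dimension 1 together with its combination label.
L = [(c, (j,)) for j, c in enumerate(clusters[0], start=1)]
for d in range(1, len(clusters)):
    # Join step: intersect every surviving itemset with every cluster of
    # the next dimension, extending the combination of cluster indices.
    C = [(common & c, combo + (j,))
         for common, combo in L
         for j, c in enumerate(clusters[d], start=1)]
    # Keep only combinations whose count of common elements exceeds the threshold.
    L = [(common, combo) for common, combo in C if len(common) > threshold]

for common, combo in L:
    print("Rule from cluster combination", combo, "common elements", sorted(common))

Running this prints exactly the two surviving combinations (1, 1, 1) and (2, 2, 2) of Table L3, corresponding to the two fuzzy rules above.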
It is possible, as with the Apriori algorithm [49], to compute the support and
the confidence of the rules formed. Assume X is the input data, A is the premise
and B is the consequence. The support measures the number of data items that have
both the premise and the consequence in the whole data set:

S_r(A \Rightarrow B) = \frac{\sum_{i \in I^\tau} \pi^\tau_{i,r}}{\#(\forall X)} \qquad (4.5)

where S denotes the support and \# denotes the number of incidences, I = I_1 \cup I_2 \cup \ldots \cup I_T, where I^\tau, \tau = 1, 2, \ldots, T, denotes the training data set belonging to the class \tau, and \pi^\tau_{i,r}
denotes the response of the normalised rule \pi_r to the i-th input⁵. The confidence

⁵Each rule has a dependency on the input value. This is not explicitly denoted in the symbol \pi_r.
measures the number of data items that have both the premise and the consequence
among the data items that satisfy the premise:
C_r(A \Rightarrow B) = \frac{\sum_{i \in I^\tau} \pi^\tau_{i,r}}{\sum_{i \in I} \pi^\tau_{i,r}} \qquad (4.6)
where C denotes the confidence. High support and high confidence indicate high
strength of the particular rule. Thus, the measures of support and confidence provide
a means of ascertaining the importance of a particular rule found.
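For illustration, the following minimal Python sketch evaluates Eqs. (4.5) and (4.6) for one rule; the response values and data set size are made up for the example and are not taken from any experiment in this thesis.

import numpy as np

# Normalised responses pi_{i,r} of rule r: first on the samples of the rule's
# own class tau, then on the whole training set (illustrative values).
pi_class = np.array([0.9, 0.8, 0.7])
pi_all = np.array([0.9, 0.8, 0.7, 0.1, 0.05])
n_total = len(pi_all)                       # #(for all X): size of the data set

support = pi_class.sum() / n_total          # Eq. (4.5)
confidence = pi_class.sum() / pi_all.sum()  # Eq. (4.6)
print(round(support, 3), round(confidence, 3))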
This section provides a method whereby it is possible to form only the rules
which are important to the problem, rather than forming all the rules upfront,
irrespective of whether they will be necessary to the problem or not. By using this rule
formation algorithm the proposed architecture can be applied to practical problems,
as it is not bounded by the input dimension D.
4.5 Determination of candidate membership function and the required number of membership functions in each input variable
The self organising mountain clustering membership function [44] is a fully data-driven
method for finding a suitable membership function for the neuro-fuzzy network.
It does not have a pre-defined shape. The final shape of the membership function
is determined by the data.
We have D inputs, xd. Each dimension of the input xd is sampled. Assume
that we have the input data sampled as xd,i, i = 1, 2, . . . , I, and d = 1, 2, . . . , D.
We further assume that there is an underlying grid in which the closeness of the
input samples, xd,i is measured. With this underlying grid it will be possible to
consider how well the input samples are grouped into clusters. It is assumed that
each input dimension is endowed with a grid with Gd samples in each dimension for
each output class τ. In general, I ≠ G_d. There are two situations, depending on
whether the output is discrete or continuous. We will denote each grid point as z^τ_{d,g},
d = 1, 2, . . . , D, and g = 1, 2, . . . , G_d, where τ = 1, 2, . . . , T and T is the number of
output classes; T = 1 for continuous outputs.
z^\tau_{d,g} = x^{\min}_d + \left( \frac{x^{\max}_d - x^{\min}_d}{G_d - 1} \right) (g - 1) \qquad (4.7)
where x^{\min}_d and x^{\max}_d are respectively the minimum and the maximum values of
x_d in the d-th dimension. In other words, we create a grid for each output class τ in the
case of discrete outputs, and in the case of continuous outputs, we will have only
one underlying grid. Figure 4.4 illustrates the grid point distribution in the d-th
dimension.
In many cases, a uniform grid regime is used, as shown in Equation (4.7).
However, a non-uniform grid regime may be useful for solving some particular prob-
lems when it is intuitively clear that a non-uniform grid regime will be beneficial
to the approximation of the inputs. In this case, we can use a simple method for
[Diagram: grid points z^τ_{d,1}, z^τ_{d,2}, …, z^τ_{d,G_d} spanning the range from x^{min}_d to x^{max}_d.]

Figure 4.4: Illustration of the distribution of grid points in the d-th dimension of the input x_d.
determining a non-uniform grid as proposed in Chapter 3. When the non-uniform
grid regime is used, each output type τ has its own grid.
Once the grid points are obtained, whether using a uniform grid regime or a
non-uniform grid regime, we wish to investigate how well the input data cluster
into groups. For this, the self organising mountain clustering algorithm [44] is
deployed to calculate the input data density distribution using the following:
\Upsilon^\tau_{d,g} = \frac{1}{N^\tau} \left\{ \frac{ \sum_{i \in I^\tau} \exp\left( -\frac{ \left( x^\tau_{d,i} - z^\tau_{d,g} \right)^2 }{ 2 \left( z^\tau_{d,(g+1)} - z^\tau_{d,g} \right)^2 } \right) }{ \sum_{g \in G_d} \sum_{i \in I^\tau} \exp\left( -\frac{ \left( x^\tau_{d,i} - z^\tau_{d,g} \right)^2 }{ 2 \left( z^\tau_{d,(g+1)} - z^\tau_{d,g} \right)^2 } \right) } \right\} \qquad (4.8)
where N τ is the number of points which correspond to the output class τ , I =
{I1 ∪ I2 ∪ . . . ∪ IT}, Iτ , τ = 1, 2, . . . , T denotes the training data set belonging to
the class τ . Note that this self organising mountain clustering algorithm differs
slightly from the one proposed in [44]. In Equation (4.8) the accumulated sum is
normalised, while in [44], they use the un-normalised version. The value of Υτd,g
may be interpreted as follows: the numerator computes, for each output class τ , the
accumulated “strength” of each grid point zτd,g measured with respect to the input
x_d. For a uniform grid regime, z^\tau_{d,(g+1)} - z^\tau_{d,g} is a constant. The denominator is a
normalizing factor which will normalise the accumulated “strength” with respect to
the grid points so that the total contribution will sum to 1. In a way, the numerator
or the normalized value Υτd,g already provides information on the distribution of the
data density along particular dimension d. This can be used as a fuzzy membership
function. However, this does not take into account the distribution of the classes
at a particular grid point. Hence the correct formation of the membership function
will be the value of Υτd,g modulated by the distribution of the classes at each grid
point. This modified self organising membership function will provide the correct
membership function for the inputs.
The value of \Upsilon^\tau_{d,g} can be normalised across all values of \tau at each grid point as
follows:

\bar{\Upsilon}^\tau_{d,g} = \frac{\Upsilon^\tau_{d,g}}{\sum_{\tau=1}^{T} \Upsilon^\tau_{d,g}} \qquad (4.9)

For continuous outputs, \bar{\Upsilon}^\tau_{d,g} = 1, as \tau = 1.
In addition, \Upsilon^\tau_{d,g} can be normalised with respect to the input strength in
each input dimension d as follows:

\tilde{\Upsilon}^\tau_{d,g} = \frac{\Upsilon^\tau_{d,g}}{\|\Upsilon_d\|} = \frac{\Upsilon^\tau_{d,g}}{\max(\mathrm{svd}(\Upsilon_d)) + \beta} \quad \forall g \text{ and } \forall \tau, \text{ for a particular value of } d \qquad (4.10)
where ‖Υd‖ denotes the matrix norm of Υd, and the T ×Gd matrix Υd is formed by
collecting the rows corresponding to the d-th dimension of values Υτd,g for all values
of τ and g. The algorithm will run for d = 1, 2, . . . , D. β is a small number and
SVD stands for singular value decomposition of a matrix. The evaluation of the
matrix norm ‖ · ‖ using singular value decomposition is a standard procedure [33].
The value β is included to prevent the situation when the largest singular value is
close to zero.
Thus the membership function can be obtained as follows:

\varphi^\tau_{d,g} = \bar{\Upsilon}^\tau_{d,g}\, \tilde{\Upsilon}^\tau_{d,g} \qquad (4.11)
Note that there will be T ×Gd membership functions associated with each input
dimension d.
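A minimal Python sketch of the chain from Eq. (4.7) to Eq. (4.11), for a single input dimension with two output classes, may be helpful; the sample values below are illustrative only, while the grid, the two normalisations and the final product follow the equations above.

import numpy as np

data = {0: np.array([0.49, 0.38, 0.38, 0.44, 0.33]),  # class tau = 1 samples
        1: np.array([0.11, 0.33, 0.22])}               # class tau = 2 samples
G, lo, hi = 5, 0.1, 0.5
z = np.linspace(lo, hi, G)      # Eq. (4.7): uniform grid z_{d,g}
dz = z[1] - z[0]                # z_{d,(g+1)} - z_{d,g}, constant here

# Eq. (4.8): per-class normalised density strength at each grid point.
U = np.zeros((2, G))            # rows: tau, columns: g
for t, x in data.items():
    k = np.exp(-(x[:, None] - z[None, :]) ** 2 / (2 * dz ** 2)).sum(axis=0)
    U[t] = k / (len(x) * k.sum())

U_bar = U / U.sum(axis=0, keepdims=True)   # Eq. (4.9): probability measure
beta = 1e-6                                # guards against a near-zero singular value
U_tilde = U / (np.linalg.svd(U, compute_uv=False).max() + beta)  # Eq. (4.10)
phi = U_bar * U_tilde                      # Eq. (4.11): membership values
print(np.round(phi, 4))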
In order to decrease the computational time, we combine the grid points to-
gether into a cluster if the grid points are close together. This can be performed
by examining the density strength Υ computed, where Υ is the D ×Gd × T matrix
formed by the normalised values Υτd,g. If the slope of the Υ along the grid direction
changes from positive to negative or if it is zero then we merge the data labels which
belong to the winning grid point ω to form a cluster. On the other hand, if the slope
along the grid point direction changes from negative to positive this indicates that
it is the beginning of a new cluster. Note that in this case, there will be one such
graph associated with each input dimension d and each output class τ . The detailed
algorithm is shown in Figure 4.5.
For each output type τ
    For each dimension d
        Initialization
        For each grid point g
            If (lastpoint > Υ^τ_{d,g})
                slope = -1
            elseif (lastpoint < Υ^τ_{d,g})
                slope = 1
            else
                slope = 0
            End if
            If (lastslope == -1 AND slope == 1)
                Store grid position κ^τ_{d,c} = floor((last grid point + g) / 2)
                Store the data labels in ω^τ_{d,g} to a new cluster
            else
                Merge the data labels in ω^τ_{d,g} into the current cluster
            End if
            lastpoint = Υ^τ_{d,g}
            If (slope != 0)
                Update lastslope = slope
                Update last grid point = g
            End if
        End
    End
End

Figure 4.5: The pseudo code implementation of the proposed algorithm for the formation of clusters. A change of slope of Υ^τ_{d,g} from negative to positive along the grid direction marks the beginning of a new cluster; otherwise grid points are merged into the current cluster.
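The following is a runnable Python rendering of the cluster-formation pass of Figure 4.5 for a single output class and dimension; the strength values and label assignments below are illustrative stand-ins for Υ^τ_{d,g} and ω^τ_{d,g}.

import numpy as np

def form_clusters(strength, labels):
    # Group the data labels won by consecutive grid points into clusters;
    # a slope change of the strength from negative to positive starts a
    # new cluster, as in Figure 4.5.
    clusters = [list(labels[0])]
    last_point, last_slope = strength[0], 0
    for g in range(1, len(strength)):
        if last_point > strength[g]:
            slope = -1
        elif last_point < strength[g]:
            slope = 1
        else:
            slope = 0
        if last_slope == -1 and slope == 1:
            clusters.append(list(labels[g]))   # beginning of a new cluster
        else:
            clusters[-1].extend(labels[g])     # merge into the current cluster
        last_point = strength[g]
        if slope != 0:
            last_slope = slope
    return clusters

strength = np.array([0.10, 0.30, 0.20, 0.10, 0.25, 0.30])
labels = [[1], [2, 3], [4], [], [5], [6, 7]]
print(form_clusters(strength, labels))         # [[1, 2, 3, 4], [5, 6, 7]]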
After the input data distribution is analyzed using the self organising mountain
clustering membership function as indicated, the outputs obtained are discrete, due
to the grouping of points together. Such discrete values will need to be interpolated
so that the intermediate values are available. There are many possible interpolation
schemes. In this thesis, we will use a particular scheme, the Hermite interpolation
function. We chose the Hermite interpolation function because it is a simple
one to apply. In Equations (4.12) and (4.13), there are three parameters which need
to be set: (a) the data input x_d; (b) an array \{z_{d,\kappa^\tau_c-\Delta}, \ldots, z_{d,\kappa^\tau_{c+1}+\Delta}\} that contains the
grid points from the beginning of a cluster to the end of the cluster for the output
class \tau, where \Delta is the discretization interval; and (c) the values \{\Upsilon^\tau_{d,\kappa^\tau_c-\Delta}, \ldots, \Upsilon^\tau_{d,\kappa^\tau_{c+1}+\Delta}\},
where the array \kappa(\cdot) stores the locations of the points which are in the same cluster
c. The range of the points which contains the cluster is extended by \Delta on either side
to prevent the input data from falling outside the kernel, i.e., to overcome “boundary” effects.
P_{d,c}(\tau) = f\left( x_d,\; \{ z^\tau_{d,\kappa^\tau_{d,c}-\Delta}, \ldots, z^\tau_{d,\kappa^\tau_{d,c+1}+\Delta} \},\; \{ \bar{\Upsilon}^\tau_{d,\kappa^\tau_{d,c}-\Delta}, \ldots, \bar{\Upsilon}^\tau_{d,\kappa^\tau_{d,c+1}+\Delta} \} \right) \qquad (4.12)

P_{d,c}(x_d|\tau) = f\left( x_d,\; \{ z^\tau_{d,\kappa^\tau_{d,c}-\Delta}, \ldots, z^\tau_{d,\kappa^\tau_{d,c+1}+\Delta} \},\; \{ \tilde{\Upsilon}^\tau_{d,\kappa^\tau_{d,c}-\Delta}, \ldots, \tilde{\Upsilon}^\tau_{d,\kappa^\tau_{d,c+1}+\Delta} \} \right) \qquad (4.13)
where f(·) is the Hermite interpolation function.
We will need a way to combine the probability measure and the density measure
together so as to give the desired membership function. In this case we will use
the naive Bayes classifier [46]. The naive Bayes classifier algorithm classifies a data
point x by using a density function P (x|τ) and a probability function P (τ):
P(\tau|x) = \frac{P(x|\tau)\, P(\tau)}{P(x)} \qquad (4.14)

where P(x) is a normalizing factor which can be omitted.
When an input x belongs to a particular output class τ , the Bayes’ function
belonging to that particular output class will have a large response.
Because the input dimension is partitioned into clusters in the proposed al-
gorithm, the prior probability P (τ) is the probability function within the cluster
as computed using Equation (4.12) and P (x|τ) is the density probability function
within the cluster as computed using Equation (4.13).
Hence from Bayes’ theorem, we have:
\varphi^\tau_{d,c}(x_d) = P_{d,c}(x_d|\tau)\, P_{d,c}(\tau) \qquad (4.15)
where d = 1, 2, . . . , D denotes the d-th dimension of the input, and c = 1, 2, . . . , C denotes
the c-th cluster in the membership function. Note that there will be as many membership
functions in each output class τ as required.
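For illustration, the following Python sketch smooths discrete probability and density measures over one cluster's grid points and combines them as in Eq. (4.15). SciPy's PCHIP interpolator is used here as a stand-in piecewise cubic Hermite-type scheme, since the thesis does not spell out the exact Hermite variant used; the grid and measure values are illustrative.

import numpy as np
from scipy.interpolate import PchipInterpolator  # piecewise cubic Hermite-type

z = np.array([0.1, 0.2, 0.3, 0.4, 0.5])          # grid points of one cluster
prob = np.array([0.97, 0.82, 0.51, 0.26, 0.10])  # probability measure samples
dens = np.array([0.42, 0.48, 0.39, 0.21, 0.05])  # density measure samples

P_tau = PchipInterpolator(z, prob)    # stands in for f(.) in Eq. (4.12)
P_x_tau = PchipInterpolator(z, dens)  # stands in for f(.) in Eq. (4.13)

def phi(x):
    # Eq. (4.15): membership value as the product of the two measures.
    return P_x_tau(x) * P_tau(x)

print(float(phi(0.25)))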
An example of the Self-organizing mountain clustering membership function
smoothed with Hermite interpolation function is shown in Figure 4.6. In this ex-
ample, there is a one-dimensional input, with two output classes. The values of the
input corresponding to each output class are shown in the table on the top left hand
corner. We create a five point grid for each class. The values of Υτd,g are computed
as shown in the top right hand corner. Then, the values of \bar{\Upsilon}^\tau_{d,g} and \tilde{\Upsilon}^\tau_{d,g} (the
probability and density measures) are shown respectively in the two tables in the
middle row. Then the values of the membership function for each output class are
computed using the naive Bayes theorem, as
indicated previously in Equation (4.15). These membership functions are shown as
graphs in Figure 4.6.
The self organising mountain clustering membership function is a general tech-
nique for constructing a membership function from the given data. Hence this
may be applied to a general ANFIS architecture. In this case, the self organising
mountain clustering membership function will provide an alternative for finding the
centers and spreads of the Gaussian function membership function commonly used
in the ANFIS architecture as follows:
\varphi^\tau_{d,c}(x_d) = \exp\left( -\,\frac{\left( x_d - \frac{z^\tau_{d,\kappa^\tau_{d,c}} + z^\tau_{d,\kappa^\tau_{d,c+1}}}{2} \right)^2}{\left( z^\tau_{d,\kappa^\tau_{d,c+1}} - z^\tau_{d,\kappa^\tau_{d,c}} \right)^2} \right) \qquad (4.16)
Thus this may be a viable alternative to finding the centers and spreads of the
Gaussian functions.
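As a sketch of Eq. (4.16), the center and spread of the Gaussian are taken directly from the boundaries of a cluster of grid points; z_lo and z_hi below are illustrative stand-ins for z^τ_{d,κ^τ_{d,c}} and z^τ_{d,κ^τ_{d,c+1}}.

import numpy as np

def gaussian_mf(x, z_lo, z_hi):
    # Eq. (4.16): center at the cluster midpoint, spread set by its width.
    center = 0.5 * (z_lo + z_hi)
    spread = z_hi - z_lo
    return np.exp(-((x - center) ** 2) / (spread ** 2))

x = np.linspace(0.0, 1.0, 5)
print(np.round(gaussian_mf(x, 0.2, 0.6), 3))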
4.6 Parameter estimation
The adjustable parameters in the architecture can be estimated from the training
data. There are two approaches. The first approach is to assume that all parameters
can be simultaneously adjusted by minimising an output error function, and the second
approach is a phased approach, which adjusts the parameters in each section of
the architecture individually by holding the parameters in other sections of the ar-
chitecture constant. There are three natural sections in the architecture: (1) the
Raw Data (value / type):
    0.49(1)  0.38(1)  0.38(1)  0.44(1)  0.33(1)  0.11(2)  0.33(2)  0.22(2)

Strength Υ from the mountain function (5 grid points in the dimension):
    Type 1: 0.2703  0.3091  0.2513  0.1342  0.0351
    Type 2: 0.0087  0.0702  0.2406  0.3807  0.2998

Probability measure \bar{Υ}:
    Type 1: 0.9687  0.8150  0.5109  0.2606  0.1049
    Type 2: 0.0313  0.1850  0.4891  0.7394  0.8951

Density measure \tilde{Υ}:
    Type 1: 0.4160  0.4756  0.3867  0.2065  0.0541
    Type 2: 0.0134  0.1080  0.3702  0.5859  0.4613

[Plot: the final membership functions for Class 1 and Class 2, output vs. grid.]

Figure 4.6: Example of finding the membership function using the proposed self organising mountain clustering membership function method.
input section (the formation of the membership function), (2) the parameters γ_r,
r = 1, 2, . . . , R, in the error correction layer (Layer 4), and (3) the output layer
(Layer 6). The simultaneous adjustment of all parameters together is quite difficult,
in that it is a highly complex optimisation problem. Hence in this thesis, we
will consider the second approach, viz., to adjust the parameters of a section of the
architecture, while holding the parameters of the other sections constant.
The adjustment of the parameters of the membership function can be carried
out once the type of membership function is chosen. For example, if we choose
the Gaussian function as the membership function, then the parameters pertaining
to the Gaussian membership function may be determined from the training data.
On the other hand, if a self organising mountain clustering membership function is
chosen, then the method shown in Section 4.5 can be used to automatically find the
parameters of the membership function. In this section, we will only be considering
the issues of training the parameters in the error correction layer, and the parameters
in the output layer.
4.6.1 Error Correction Layer (Layer 4) parameter learning
We are given the following training data: xd,i, i = 1, 2, . . . , I, and d = 1, 2, . . . , D,
with an associated desired output set dτi , i = 1, 2, . . . , I; and τ = 1, 2, . . . , T . Here
for simplicity we assume that there is only a scalar output. The training method for
a continuous output and for discrete outputs (with output classes τ = 1, 2, . . . , T)
are different.
1. Continuous output
In this case, the training minimizes the output errors e_i = d_i − y_i, while for
discrete outputs it minimizes the type misclassification errors.
\gamma_i^{new} = \gamma_i^{old} - \eta \frac{\partial E}{\partial \gamma_i} \qquad (4.17)
The parameters to be adjusted in the error correction layer (Layer 4) are those
described⁶ by Equation (4.2). The gradients are shown in Equation (4.18); this
is no more than the common chain rule of differentiation.
\frac{\partial E}{\partial \gamma_r} = \frac{\partial E}{\partial E_i} \sum_{i=1}^{I} \frac{\partial E_i}{\partial e_i}\, \frac{\partial e_i}{\partial y_i} \sum_{r=1}^{R} \frac{\partial y_i}{\partial \bar{\pi}_{ir}}\, \frac{\partial \bar{\pi}_{ir}}{\partial \pi_{ir}}\, \frac{\partial \pi_{ir}}{\partial \gamma_r} \qquad (4.18)
Note that the values of π_{ir} etc. are produced from a particular choice of membership
functions with particular parameters. Hence ∂E/∂γ_r depends directly on
the parameters of the membership function. The gradient descent algorithm
then provides a way to adjust the parameters of the membership functions.
2. Discrete output classes
⁶Note that Eq. (4.2) describes the signals which are generated from the deployment of the mem-
bership functions used in the antecedent. The parameters associated with these membership func-
tions are the ones to be adjusted. As there are many possibilities for the membership functions,
e.g., Gaussian membership function, self organising mountain clustering membership function, we
decided to refer to the equation, instead of providing an indication of the parameters to be adjusted.
The training of the case with discrete output classes is different from that of
the continuous output case. In this case it is necessary to provide the negative
output class as well. In other words, instead of considering only the particular
output class, we need to consider the possibility of the negative counterpart:
the possibility of the output not in the output class. This situation is shown in
Figure 4.7. It is observed that in this case, the positive part and the negative
part are “intertwined” together. In the case of the continuous output, as
τ = 1, there is no need for the introduction of the negative part of
the architecture.
[Diagram: the six-layer EANFIS (Layers 1–6), with the input x feeding the membership functions ϕ and the rule signals π flowing through both the positive (τ) and negative (¬τ) output class paths of the error correction layer, with rule weights w_r at the output.]

Figure 4.7: A diagram to illustrate the training of the EANFIS for the case of discrete output classes. The notation ¬ denotes the negative of the output class τ.
Thus the training for the discrete output classes is to minimize the misclassifi-
cation errors of the output cluster classes in the rule layer. In the classification
process, each fuzzy rule should give an output which belongs to one output
class. Hence, the normalized output for the correct output cluster from the
summation of different fuzzy rule sets in layer 4 should be 1. On the other
hand, the normalized output for the other output clusters should be 0. The
error function for the correct output cluster is shown in Equation (4.19) and
for all the other output clusters is shown in Equation (4.20).
e^\tau_i = 1 - \sum_{r \in \tau} \pi^\tau_{ir} \qquad (4.19)

e^{\neg\tau}_i = 0 - \sum_{r \in \tau} \pi^{\neg\tau}_{ir} \qquad (4.20)
Applying the gradient descent algorithm to minimize these errors results in
Equation (4.21).
\frac{\partial E}{\partial \gamma^\tau_r} = \frac{\partial E}{\partial E_i} \sum_{i=1}^{I} \frac{\partial E_i}{\partial e^\tau_i} \cdot \frac{\partial e^\tau_i}{\partial \pi^\tau_{ir}} \cdot \frac{\partial \pi^\tau_{ir}}{\partial \pi_{ir}} \cdot \frac{\partial \pi_{ir}}{\partial \gamma^\tau_r} \qquad (4.21)
Again the values of π_{ir} etc. depend on the parameters of the membership func-
tions, and hence, it would not be possible to provide a more explicit formula
for the gradient function. Once a particular membership function is chosen,
then it is possible to provide a gradient descent algorithm for the adjustment
of the parameters so as to minimize the error function.
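Because the analytic gradients in Eqs. (4.18) and (4.21) depend on the particular membership functions chosen, the following Python sketch illustrates only the update rule of Eq. (4.17), using numerical gradients and a hypothetical logistic error-correction model; none of these modelling choices is prescribed by the thesis.

import numpy as np

def rule_outputs(gamma, firing):
    # Hypothetical error-corrected, normalised rule responses pi_{ir}.
    corrected = firing / (1.0 + np.exp(-gamma))
    return corrected / corrected.sum(axis=1, keepdims=True)

def loss(gamma, firing, target):
    # Sum of squared errors between rule responses and class targets,
    # in the spirit of Eqs. (4.19)-(4.20).
    return ((rule_outputs(gamma, firing) - target) ** 2).sum()

rng = np.random.default_rng(0)
firing = rng.uniform(0.1, 1.0, size=(8, 4))  # firing strengths, 8 samples x 4 rules
target = np.eye(4)[rng.integers(0, 4, 8)]    # one-hot class targets per sample
gamma = np.zeros(4)
eta, eps = 0.5, 1e-6
for _ in range(200):
    grad = np.array([(loss(gamma + eps * np.eye(4)[r], firing, target)
                      - loss(gamma, firing, target)) / eps for r in range(4)])
    gamma -= eta * grad                      # Eq. (4.17)
print(np.round(gamma, 3))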
4.6.2 Output layer parameter learning
Training in the output layer is straightforward, as it is minimising a cost function
with respect to a linearly parameterized equation. The weights are calculated using
a pseudo-inverse method by using the common normal equation [33].
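A minimal sketch of this step: with the normalised rule responses held fixed, the output weights solve a linear least squares problem, which the pseudo-inverse handles directly; the synthetic data below are illustrative.

import numpy as np

rng = np.random.default_rng(1)
Pi = rng.uniform(size=(100, 2))   # normalised rule responses, 100 samples x 2 rules
true_w = np.array([0.7, -0.3])
d = Pi @ true_w + 0.01 * rng.standard_normal(100)  # desired outputs

# Normal equation solution w = (Pi^T Pi)^{-1} Pi^T d, computed via the
# pseudo-inverse for numerical robustness.
w = np.linalg.pinv(Pi) @ d
print(np.round(w, 3))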
4.6.3 Summary
To summarise the development up to this point, the system consists of four main
parts:
(1) Membership Function. The membership function can be any classic mem-
bership function such as Gaussian function or the proposed self organising
mountain clustering membership function. If we are using the self organising
mountain clustering membership function, the grid over the input space
can be either a linear grid or a non-uniform grid. For the determination of the
non-uniform grid regime we can use the method described in Chapter 3. The
self organising mountain clustering membership function can also be used to
determine the spreads and centers of the Gaussian functions if they are used
as membership functions.
(2) Rule Layer. The rule formation method is inspired by the Apriori Algorithm
which provides a rule formation method. The reduction in the number of rules
which need to be formed would save system memory as it is not necessary
to generate the redundant rules (rules that would not have sufficient firing
strength even if they are formed).
(3) Error Correction Layer. This layer is used for error correction; it uses a
logistic function for each rule. This can lead to an improvement of the
overall performance of the EANFIS architecture.
(4) Output Layer. There are three alternatives in the output layer. They are
(a) discrete output training method, (b) Zeroth order single weight method
and (c) First order Takagi-Sugeno-Kang method. The weight in each rule is
adjusted by using either a steepest descent algorithm or a normal equation.
The proposed EANFIS architecture is very flexible. The error correction layers
(layers 4 and 5) can be switched off, leaving us with the classic ANFIS architecture.
On the other hand, if layers 4 and 5 are switched on then this can lead to improve-
ments in the performance of the overall network. This claim is supported by the
application of such techniques to a number of practical problems.
4.7 Experimental Results
In this section, we will apply the proposed methods to a number of examples⁷. These
include the classic, not linearly separable, Exclusive OR gate problem, the Sunspot cycle
time series based on real world data, the Mackey Glass chaotic time series which is a
7The implementation of the algorithms has been carried out using Matlab, as this allows rapid
evaluation of the various parameters in the algorithm. The attached weights in the output layer
are trained by normal equation.
computer generated example, the iris dataset which is known to be a nonlinearly
separable problem, a high-dimensional real world problem: the Wisconsin Breast
Cancer dataset and a well known neuro-fuzzy control system “Inverted Pendulum
on a cart”. These examples are chosen because they exhibit various characteristics
which are useful in illustrating the properties of the proposed architecture. The
Exclusive OR problem is a classic benchmark problem which will reveal some of
the underlying properties of the proposed architecture, e.g. the number of rules and
their values. The sunspot cycle time series is a classic benchmark problem which
will indicate how well the proposed architecture can estimate a set of rules for the
problem. The Mackey Glass time series is a chaotic time series. This will show how
well the proposed architecture will approximate such a chaotic time series. The iris
dataset will allow us to find out how well the proposed architecture can estimate
the rules. The Wisconsin breast cancer problem is a multi-dimensional problem.
This will provide a demonstration of how well the proposed architecture handles a
high dimensional dataset. The inverted pendulum problem is a well known control
problem which is used to benchmark fuzzy control algorithms. Hence by applying
our proposed architecture on this classic problem we will be able to show how well
it works as compared to other algorithms. In addition, we will compare the results
with those obtained using an ANFIS architecture.
4.7.1 Exclusive OR problem
The first experiment is the classic exclusive OR (XOR) problem. This is the simplest
possible nonlinear problem which cannot be linearly separated [51]. In other words,
it cannot be solved using a linear classifier. By applying the proposed EANFIS
architecture to this very simple problem, it is hoped that it will shed some insight
into the operation of the architecture. For the XOR problem, there are two discrete
output classes: 0 and 1 respectively. Applying the proposed self organising moun-
tain clustering membership function method to this problem results in two clusters
for each input dimension (there are two input dimensions altogether) as shown in
Figure 4.8. The results are shown in Table 4.1.
Table 4.1: The input-output pairs of the exclusive OR problem.

Training Data   Output     Testing Data   Output
0 0             0          0.4 0.4        0
1 0             1          0.6 0.4        1
0 1             1          0.4 0.6        1
1 1             0          0.6 0.6        0
Note that in the exclusive OR problem, we are normally given the training data set,
but not the testing data set, as the testing data set is the same as the training data
set. In Table 4.1 we have created a testing data set (so that the training data set
and the testing data set do not have any overlaps) which consists of four values as
shown. It is observed that the trained model correctly identified the output classes
of the testing data set.
Further analysis of the self organising mountain clustering membership function
together with the error correction layers reveal that the problem is solved by using
four fuzzy rules. This is shown in Table 4.2.
[Four panels: MF of output type 0 in dimension 1; MF of output type 0 in dimension 2; MF of output type 1 in dimension 1; MF of output type 1 in dimension 2.]

Figure 4.8: The resulting data clusters for the exclusive OR problem after the application of the mountain clustering membership function; ‘*’ denotes the first cluster, ‘o’ denotes the second cluster.
Table 4.2: The fuzzy rules found for the exclusive OR problem using our proposed method for rule formation.
R1: If x1 is ϕ11 and x2 is ϕ21 then
Output class is type 0 (Support:25%, Confidence:100%)
R2: If x1 is ϕ12 and x2 is ϕ22 then
Output class is type 0 (Support:25%, Confidence:100%)
R3: If x1 is ϕ11 and x2 is ϕ22 then
Output class is type 1 (Support:25%, Confidence:100%)
R4: If x1 is ϕ12 and x2 is ϕ21 then
Output class is type 1 (Support:25%, Confidence:100%)
Here the functions ϕ_{ij}(·) denote the membership functions as found by the self
organising mountain clustering membership function method (their shapes are as
shown in Figure 4.8), where i denotes the i-th dimension and j denotes the j-th
cluster, i = 1, 2; and j = 1, 2.
In addition, we compare the performance of the following three methods:
• The EANFIS architecture with the self organising mountain clustering mem-
bership function method.
• The EANFIS architecture but with Gaussian membership functions. The cen-
ters and spreads of the Gaussian functions are determined using the self or-
ganising mountain clustering membership function method.
• The ANFIS architecture with Gaussian membership functions. Again the
centers and spreads of the Gaussian membership functions are determined
using the self organising mountain clustering membership function method.
The membership functions are shown in Figure 4.9.
The results of comparison of these three methods are shown in Table 4.3. For
simplicity, the outputs of the three methods all use the single weight mechanism as
discussed in Section 2.5.1.
The architecture found for this problem is shown in Figure 4.10.
The following observations can be made from the results of this experiment:
[Two panels: Gaussian membership functions (output vs. input) in dimension 1 and dimension 2.]

Figure 4.9: The Gaussian membership functions for the ANFIS architecture for the exclusive OR problem. There are two membership functions per dimension.
Table 4.3: The results of the XOR problem comparing three methods: the EANFIS architecture with mountain clustering membership function, the EANFIS architecture with Gaussian membership functions, and the ANFIS architecture with Gaussian membership functions.
Desired Output EANFIS EANFIS ANFIS
(Mountain MF) (Gaussian MF) (Gaussian MF)
Output Layer Single weight Single weight Single weight
0 0 0.463422 0.476742
1 1 0.534846 0.523258
1 1 0.534846 0.523258
0 0 0.466803 0.476742
[Diagram: network with inputs X1 and X2 and output y.]

Figure 4.10: The architecture of the XOR problem as found by the proposed EANFIS architecture.
• The EANFIS architecture with self organising mountain clustering membership functions performs the best.
• The outputs of the EANFIS architecture with Gaussian membership functions
and the ANFIS architecture are comparable. They both yield the correct
classifications if we incorporate a threshold unit to their outputs, so that an
output value above 0.5 will be classified as 1, while a value below 0.5 will
be classified as 0.
• In all cases, four fuzzy rules are required. In this instance the EANFIS ar-
chitecture with the self organising mountain clustering membership function
method is not effective in reducing the number of rules required.
4.7.2 Sunspot Cycle Time Series
In this section, we will apply the proposed methods to study the sunspot cycle
time series considered in the last chapter. We will take the average of the sunspot
numbers of each month and use 55 years of data for training and the remaining 200 years
for prediction. For convenience the sunspot number time series is as shown in Figure
4.11. The training data is shown in bold while the testing data is shown in lighter
colour. This arrangement of the training data and testing data conforms with the
standard ways [65, 66] in which this time series has been applied in benchmarking
various estimation algorithms.
[Plot: sunspot number vs. date, 1750 to 2000.]

Figure 4.11: The monthly average sunspot number. The training data (55 years) are shown in dark, while the testing data (200 years) are shown in lighter colour.
We arrange the data in the format of input-output pair as follows:
[x(t-4), x(t-3), x(t-2), x(t-1); x(t)]
In other words, we will express the nonlinear relationship between the current
data point, as a function of the immediate past four data points.
\hat{x}(t) = f(x(t-1), x(t-2), x(t-3), x(t-4)) \qquad (4.22)

where \hat{x}(t) is the predicted value of x(t) as a function of the immediate past four
data points, and f(·) is a nonlinear function. The aim is to minimize the error
function E:
E = \sum_{t=1}^{N} e(t)^2 = \sum_{t=1}^{N} \left( x(t) - \hat{x}(t) \right)^2 \qquad (4.23)
where N is the total number of data points in the testing data set. As indicated
previously this time series is a standard benchmark for various estimation algorithms
[65,66]. In this thesis we are not interested in obtaining the best results as compared
to those found in the literature [65,66]. Our aim instead is to use the sunspot cycle
time series to illustrate the versatility of the proposed methods in this thesis. In
particular, we wish to illustrate its potential in reducing the number of rules which
need to be formed. We will however be comparing the results using the proposed
method with those obtained using ANFIS architectures.
We apply the proposed methods using the self organising mountain clustering
membership function on the training data. The resulting self organising mountain
clustering membership function is as shown in Figure 4.12.
It is noted that the membership functions obtained are asymmetrical. This may be
[Four panels: membership functions in dimensions 1 to 4.]

Figure 4.12: The self organising mountain clustering membership functions of the sunspot cycle time series; ‘*’ denotes the first cluster, ‘o’ denotes the second cluster.
one reason why so far the other prediction methods [65,66] have not been perform-
ing too well. They have not considered the possibility that the input membership
function may be asymmetrical. Note that if we use Gaussian functions to approxi-
mate the asymmetrical shape, this would require quite a large number of Gaussian
functions for adequate approximation.
It is noted that there are two clusters per dimension (there are four dimensions).
The final output which combines the output from Rule 1 and Rule 2 is shown in
Figure 4.13. The details of the fuzzy rules are shown in Table 4.4.
[Diagram: inputs X1, X2, X3, X4 feed Rule 1 and Rule 2, whose outputs are summed to form the output.]

Figure 4.13: The combination of the fuzzy rules for the EANFIS architecture.
It is noted that for the fuzzy rules, the first three dimensions x(t − 1), x(t − 2),
and x(t − 3) pertain to the same condition, while x(t − 4) is the discriminating
dimension in that its change “triggers” the change of the rules. This “triggers” the
change in the linear combination weights in forming the output of layer 1.
The prediction outputs using a linear grid regime with self organising mountain
clustering membership function is shown in Figure 4.14. The prediction outputs
using an ANFIS architecture with Gaussian membership functions is as shown in
Figure 4.15.
The architecture found by using the EANFIS architecture is shown in Figure 4.16.
We will now compare the results of using the following architectures
Table 4.4: The fuzzy rules found for the sunspot cycle time series.
R1: If x1 >= −34.13 and x1 <= 273.03 AND
x2 >= −34.13 and x2 <= 273.03 AND
x3 >= −34.13 and x3 <= 273.03 AND
x4 >= −21.72 and x4 <= 43.43 THEN
y = −0.076x1 + 0.246x2 + 0.417x3 + 0.505x4 − 1.866
(Support: 23.98%, Confidence:100%)
R2: If x1 >= −34.13 and x1 <= 273.03 AND
x2 >= −34.13 and x2 <= 273.03 AND
x3 >= −34.13 and x3 <= 273.03 AND
x4 >= 0 and x4 <= 260.62 THEN
y = 0.147x1 + 0.116x2 + 0.121x3 + 0.552x4 + 3.839
(Support: 76.02%, Confidence:100%)
[Plot: sunspot number vs. date, 1750 to 2000.]

Figure 4.14: The prediction results of the monthly average sunspot number time series using an EANFIS architecture with linear grid regime using the self organising mountain clustering membership functions.
[Plot: sunspot number vs. date, 1750 to 2000.]

Figure 4.15: The prediction results of the monthly average sunspot number time series using an ANFIS architecture with Gaussian membership functions.
[Diagram: network mapping the inputs X(t−4), X(t−3), X(t−2), X(t−1) to the output X(t).]

Figure 4.16: The architecture found by using the proposed EANFIS architecture.
(see Table 4.5):
• An ANFIS architecture using two ways of forming the outputs: (1) the single
weight method, and (2) the TSK mechanism. In this case, there are a total of
16 rules required. It is observed that while the one trained using the TSK mechanism
has a smaller training error, the prediction error on the testing data set is
much worse than the one using a single weight output regime. This may
be attributed to the fact that the one using the TSK mechanism exhibits an over-
training phenomenon. In other words, while the training error might be small,
the model is trained such that it “accommodates” all the noise content in
the training data set. Hence when it is used to evaluate its generalisation
capabilities, the prediction error on the testing data set is much worse. On
the other hand, the one trained using a single weight output mechanism is
much better in that the error value for the training data set is comparable to
that in the testing data set.
• An EANFIS architecture with Gaussian membership functions. Here the EANFIS
Table 4.5: The RMS errors of applying various methods on the sunspot number time series. Please see the text for explanation of the experimental conditions.
ANFIS
Output Layer Single weight Output Layer TSK
Rule 16 Rule 16
Training Error Prediction Error Training Error Prediction Error
16.0084 16.7495 14.1094 163.0768
EANFIS (Gaussian MF)
linear setup grid Nonlinear setup grid
Output Layer TSK Output Layer TSK
Rule 2 Rule 2
Training Error Prediction Error Training Error Prediction Error
16.1170 16.0396 16.1612 16.0400
EANFIS (Mountain MF)
linear setup grid Nonlinear setup grid
Output Layer TSK Output Layer TSK
Rule 2 Rule 2
Training Error Prediction Error Training Error Prediction Error
16.1568 15.9939 16.1611 15.9901
architecture finds that only two rules will be sufficient. We have evaluated
the performances of using a linear grid regime and a nonlinear grid regime re-
spectively. It is found that the linear grid regime appears to perform slightly
better than the nonlinear grid regime. If the underlying problem is rather
uniform, then the nonlinear grid method may not be as efficient as the linear
grid method. The nonlinear grid method tunes the neuron centers according
to the training data. It may lead to overtraining.
• An EANFIS architecture with self organising mountain clustering membership
functions. Here again we investigate both the linear grid and nonlinear grid
regimes respectively. Again it is found that the EANFIS architecture informs
us that only two rules are required in each case. We also investigated the
performance of the architecture if we use the single weight and TSK output
mechanisms. It is found that there is a slight difference in the performance of
the two methods in that it appears the one using the TSK mechanism has a
slight edge over the one using a single weight regime.
In general, from Table 4.5 we can make the following observations:
• The EANFIS architecture is capable of using a reduced number of rules. Note
that the number of rules is found automatically using our proposed method.
• The EANFIS architecture performs better than the ANFIS architecture.
• There is virtually no difference in the performance of the EANFIS architecture
using a linear grid or a nonlinear grid.
• There is almost no difference in the performance of the EANFIS architecture
using a single weight or a TSK output mechanism.
From this experiment, we can confirm that our proposed methods appear to
work well for this example.
4.7.3 Mackey-Glass Time Series example
Another well known benchmark problem for the evaluation of various prediction
algorithms is the Mackey-Glass chaotic equation (see Equation (4.24)).
\dot{x}(t) = \frac{0.2\, x(t-\tau)}{1 + x^{10}(t-\tau)} - 0.1\, x(t) \qquad (4.24)
where x(t) is the output of the equation. Note that this is a nonlinear delay differen-
tial equation with a delay of τ in the argument. As this equation will produce a con-
tinuous time output, we will use a fourth-order Runge-Kutta method [52] to integrate
the equation and obtain the solution. The initial condition is x (0) = 1.2, τ = 17 and
the step size used is 0.1. In order to eliminate the effect due to the initial condition,
we only extract the data from t = 118 to 1117. The first 500 pairs are used for
training and the rest are used for testing. The input-output data format is arranged
as follow:
[x(t-18), x(t-12), x(t-6), x(t); x(t+6)]
In other words, we wish to estimate a model of the following form:
x(t + 6) = f(x(t), x(t − 6), x(t − 12), x(t − 18)) (4.25)
where f is a nonlinear function. Note that in this case we are only sampling once
every 6 samples. This is the standard way in which data processing for this equation
has been carried out [67]. Hence we will follow this standard approach. Note that
again we are not interested in “pitching” our proposed algorithm against the other
prediction methods [67]. But instead we wish to use this equation to illustrate the
potential of our proposed method. We will however compare the results of our
proposed methods with those obtained using the ANFIS architecture. The results
are shown in Table 4.6.
Table 4.6: The RMS errors on the Mackey-Glass equation, comparing the ANFIS architecture, the EANFIS architecture with the self organising mountain clustering membership function, and the EANFIS architecture with Gaussian membership functions.
ANFIS
Output Layer TSK
Training Error Prediction Error Rule
0.00245 0.00234 16
EANFIS (Mountain MF) EANFIS (Gaussian MF)
Output Layer TSK Output layer TSK
Rule 12 Rule 12
Training Error Prediction Error Training Error Prediction Error
0.01146 0.01136 0.00279 0.00275
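For concreteness, the following Python sketch generates the series and arranges the training pairs under the stated conditions; holding the delayed term fixed within each Runge-Kutta step is a simplification of this sketch, not necessarily the procedure used in the thesis.

import numpy as np

tau, h, steps = 17, 0.1, 11300
lag = int(round(tau / h))
x = np.zeros(steps)
x[0] = 1.2                                  # initial condition x(0) = 1.2

def f(xt, xlag):
    # Right hand side of Eq. (4.24).
    return 0.2 * xlag / (1.0 + xlag ** 10) - 0.1 * xt

for n in range(steps - 1):                  # fourth-order Runge-Kutta, step h
    xlag = x[n - lag] if n >= lag else 0.0  # history before t = 0 taken as 0
    k1 = f(x[n], xlag)
    k2 = f(x[n] + 0.5 * h * k1, xlag)
    k3 = f(x[n] + 0.5 * h * k2, xlag)
    k4 = f(x[n] + h * k3, xlag)
    x[n + 1] = x[n] + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

# Arrange [x(t-18), x(t-12), x(t-6), x(t); x(t+6)] for t = 118 ... 1117.
xt = lambda t: x[int(round(t / h))]
pairs = np.array([[xt(t - 18), xt(t - 12), xt(t - 6), xt(t), xt(t + 6)]
                  for t in range(118, 1118)])
train, test = pairs[:500], pairs[500:]
print(train.shape, test.shape)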
The results show that the self organising mountain clustering membership function
performs worse than the Gaussian membership function. So for this experiment,
we propose to use the self organising mountain clustering membership function
method to determine the spreads and centers of the Gaussian membership func-
tion to be used in the EANFIS architecture and compare them with the ANFIS
using the same Gaussian membership functions. The EANFIS architecture using
Gaussian membership functions requires 12 fuzzy rules while the ANFIS architec-
ture requires 16 fuzzy rules. It is observed from Table 4.6 that the performance of
the EANFIS architecture and the ANFIS architecture are comparable.
Figure 4.17 shows the output of the Mackey-Glass equation, while Figure 4.18
shows the membership function using the EANFIS architecture with the Gaussian
membership functions.
[Plot: amplitude vs. time, t = 700 to 1100.]

Figure 4.17: The output of the Mackey-Glass equation.
[Four panels: Gaussian membership functions in dimensions 1 to 4.]

Figure 4.18: The Gaussian membership function for the Mackey-Glass equation example.
Figure 4.19 shows the output using an ANFIS architecture with 16 Gaussian
membership functions, while Figure 4.20 shows the output using an EANFIS archi-
tecture with 12 Gaussian membership functions. It may be noted that there is very
little difference in the two output signals from different architectures.
[Plot: amplitude vs. time, t = 700 to 1100.]

Figure 4.19: The outputs of a neuro-fuzzy network using the ANFIS architecture with 16 Gaussian membership functions for the Mackey-Glass equation.
As a result it is more convenient to consider the differences of the outputs of
these networks and compare them with the original output of the Mackey-Glass
equation. These are shown in Figure 4.21 and Figure 4.22 respectively.
It is observed from Figures 4.21 and 4.22 that there is little difference between
these two error signals.
The architecture found using the EANFIS architecture is shown in Figure 4.23.
[Plot: amplitude vs. time, t = 700 to 1100.]

Figure 4.20: The outputs of the EANFIS architecture with 12 Gaussian membership functions for the Mackey-Glass equation.
[Plot: error vs. time, t = 700 to 1100.]

Figure 4.21: The difference between the output of the ANFIS architecture with 16 Gaussian membership functions and the original signal for the Mackey-Glass equation.
[Plot: error vs. time, t = 700 to 1100.]

Figure 4.22: The difference between the output of the EANFIS architecture with 12 Gaussian membership functions and the original signal for the Mackey-Glass equation.
These results naturally raise a question: what are the added advantages of using
an EANFIS architecture? An obvious advantage would be that the EANFIS architecture
uses a smaller number of rules. However, this may not be sufficient to justify
the addition of two extra layers. The slightly unexpected observation of comparable
performance between the ANFIS architecture and the EANFIS architecture may
be explained as follows: the Mackey-Glass equation does not contain any noise, so
the advantage of having two extra layers in the case of the EANFIS architecture may
have been lost. In other words, the EANFIS architecture will work better if there is
noise in the output, e.g., the sunspot cycle time series. This
may be the reason why the ANFIS architecture performs slightly better than the
EANFIS architecture.
[Diagram: network mapping the inputs X(t−18), X(t−12), X(t−6), X(t) to the output X(t+6).]

Figure 4.23: The architecture found by using the proposed EANFIS architecture.
4.7.4 Iris Dataset
In this section, we will apply the EANFIS architecture to the Iris example considered
in the last chapter. This dataset consists of three types of iris flowers. Two of them
are linearly inseparable. We will randomly select 51 data points for testing and the
remaining 99 data points for training purposes. Once the training and testing
data sets are chosen, they will remain fixed for one experiment cycle using both
the ANFIS architecture and the EANFIS architecture. This evaluation will run
100 times, each time using a different set of training and testing data sets, and the
average of the performances of these two architectures is then used for comparison.
These are the values reported in Table 4.7.
We carried out the following experiments:
(1) ANFIS architecture with either single weight output scheme or TSK output
scheme. In this case, we use 16 rules. The membership function is Gaussian
membership function.
(2) EANFIS architecture without the error correction layer using the self organ-
ising mountain clustering membership functions and using either the single
weight output scheme or the TSK output scheme.
(3) EANFIS architecture with the error correction layer using the self organising
mountain clustering membership function and using either the single weight
output scheme or the TSK output scheme.
The main reason why we carried out experiment (2) is that we wish to isolate the
effect of having the error correction layer versus not having the error correction
layer. Note that without an error correction layer the main difference between
the ANFIS architecture and the EANFIS architecture is the membership function.
Hence by comparing the results of experiments (1) and (2) we will be able to conclude
the effectiveness of the self organising mountain clustering membership function
as compared with the Gaussian membership function. The other difference is the
number of rules used. In the case without an error correction layer for the EANFIS
architecture, the number of rules used will be 16, while with a probability layer, the
number of rules used will be as observed in the table, only 3. Hence by comparing
the results of experiments (2) and (3) it will inform us on the effectiveness of using
the reduced number of rules.
From Table 4.7, we may make the following observations:
• For the ANFIS architecture with 16 rules and Gaussian membership functions, the single weight output scheme performs significantly better than the TSK output scheme. This implies that the outputs are not strongly dependent on the inputs (as the TSK output scheme incorporates influences from the inputs directly).
• For the EANFIS architecture without the error correction layer using the self
organising mountain clustering membership functions with 16 rules, the TSK
output scheme works better than the single weight output scheme. This result
is at variance with the one using the ANFIS architecture.
• Similarly with an error correction layer, the EANFIS architecture performs
Table 4.7: The outcomes of applying the EANFIS architecture and the ANFIS architecture on the Iris data set. The values reported in this table are obtained from an average of 100 experiments using randomly selected 99 training data samples and 51 testing data samples.
ANFIS:
  Single weight output layer:  Accuracy 98.1569%, Variance 3.48,  Rules 16
  TSK output layer:            Accuracy 74.0392%, Variance 52.90, Rules 16
EANFIS (Mountain MF), TSK output layer:
  Without probability layer:   Accuracy 98.4314%, Variance 3.11,  Rules 16
  With probability layer:      Accuracy 98.4314%, Variance 3.11,  Rules 3
EANFIS (Mountain MF), single weight output layer:
  Without probability layer:   Accuracy 83.1176%, Variance 15.06, Rules 16
  With probability layer:      Accuracy 82.7451%, Variance 14.45, Rules 3
better using a TSK output scheme instead of the single weight output scheme.
• Comparing the results of the EANFIS architecture and the ANFIS architecture, the EANFIS architecture with the TSK output scheme performs better than the ANFIS architecture with the single weight output scheme.
This result is quite difficult to explain in isolation from the other results. Hence we will delay the discussion of these results until later, when we have considered the other results on the Wisconsin breast cancer example.
The following figures show some of the details of the results. Figure 4.24 shows the membership functions used in the EANFIS architecture. It is observed that the membership functions show asymmetry, and hence it is not surprising that the ANFIS architecture requires 16 rules in order to “approximate” these shapes adequately.
Table 4.8 shows the fuzzy rules in the EANFIS architecture for the Iris dataset. It is interesting to note that the three rules are all based on (essentially) the same bracketed region:

0.0429 ≤ x1 ≤ 0.1425
0.0388 ≤ x2 ≤ 0.1553
−0.0177 ≤ x3 ≤ 0.2066
−0.0403 ≤ x4 ≤ 0.2358

Only the consequence part differs among the three rules.
[Four panels: membership functions in dimensions 1 to 4; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 4.24: The membership functions (mountain clustering) for the Iris data set.
Table 4.8: The extracted fuzzy rules for the Iris dataset. These rules are used in the EANFIS architecture.

R1: If x1 is φ^1_{1,1}(x1, 0.0429, 0.1425) AND
    x2 is φ^1_{2,1}(x2, 0.0388, 0.1553) AND
    x3 is φ^1_{3,1}(x3, −0.0177, 0.2066) AND
    x4 is φ^1_{4,1}(x4, −0.0403, 0.2358) THEN
    y = 0.1377x1 − 1.5479x2 + 9.8673x3 + 2.1638x4 + 0.8182
    (Support: 33%, Confidence: 96%)

R2: If x1 is φ^2_{1,1}(x1, 0.0429, 0.1425) AND
    x2 is φ^2_{2,1}(x2, 0.0388, 0.1553) AND
    x3 is φ^2_{3,1}(x3, −0.0177, 0.2066) AND
    x4 is φ^2_{4,1}(x4, −0.0403, 0.2358) THEN
    y = −30.1001x1 − 17.5581x2 + 5.7272x3 + 16.4500x4 + 3.9185
    (Support: 21%, Confidence: 70%)

R3: If x1 is φ^3_{1,1}(x1, 0.0429, 0.1425) AND
    x2 is φ^3_{2,1}(x2, 0.0388, 0.1553) AND
    x3 is φ^3_{3,1}(x3, −0.0177, 0.1692) AND
    x4 is φ^3_{4,1}(x4, −0.0403, 0.2358) THEN
    y = −23.4493x1 − 15.1980x2 + 32.4464x3 + 2.5574x4 + 2.6153
    (Support: 24%, Confidence: 69%)

where φ^τ_{d,c}(input, start of cluster, end of cluster) is a mountain clustering membership function.
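The support and confidence figures attached to each rule can be read in the association-rule sense. A minimal sketch of that computation over crisp antecedent boxes is shown below; the thesis's Apriori-inspired procedure operates on the fuzzy memberships, so the crisp box here is a simplifying assumption.

    import numpy as np

    def rule_support_confidence(X, y, lower, upper, target_class):
        # Antecedent: lower[d] <= x_d <= upper[d] in every dimension d.
        inside = np.all((X >= lower) & (X <= upper), axis=1)
        support = inside.mean()                          # fraction of data covered
        confidence = (y[inside] == target_class).mean()  # covered data in the class
        return support, confidence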
The architecture found using the proposed EANFIS architecture is shown in Figure 4.25.
[Diagram: inputs x1 to x4; output y.]
Figure 4.25: The architecture found using the proposed EANFIS architecture.
4.7.5 Wisconsin Breast Cancer example
The Wisconsin breast cancer data set [53] consists of 699 instances in 9 dimensions; 16 instances have missing attributes. We randomly select 200 instances for training and the remaining 499 instances for testing.
In this example, we have carried out the following experiments:
• The ANFIS architecture with 512 rules (two Gaussian membership functions per dimension over 9 dimensions, giving 2^9 = 512 rules).
• The EANFIS architecture without the error correction layer, with self organising mountain clustering membership functions under either a linear grid regime or a nonlinear grid regime.
• The EANFIS architecture with the error correction layer, with self organising mountain clustering membership functions under either a linear grid regime or a nonlinear grid regime.
We evaluate both the single weight output scheme and the TSK output scheme. The ANFIS architecture uses two membership functions per dimension, for a total of 512 fuzzy rules; the EANFIS architecture uses 2 fuzzy rules. The results are shown in Tables 4.9 and 4.10 for the single weight output scheme and the TSK output scheme respectively.
The following observations may be made from the results:
• The ANFIS architecture with 512 rules and Gaussian membership functions gives an accuracy of about 90%.
• The EANFIS architecture without the error correction layer, but with the self organising mountain clustering membership functions under the linear and nonlinear grid regimes, gives about 94% prediction accuracy in both cases.
• The EANFIS architecture with the error correction layer and the self organising mountain clustering membership functions under the linear and nonlinear grid regimes gives about 96% prediction accuracy in both cases.
Table 4.9: Comparison of the prediction capabilities of the ANFIS architecture with Gaussian membership functions and the EANFIS architecture with linear and nonlinear grid regimes, using the single weight output layer.
EANFIS (mountain clustering MF), single weight output layer:
  Without probability layer, linear setup grid:    Rules 2,   Accuracy 94.44%, Variance 0.59
  With probability layer, linear setup grid:       Rules 2,   Accuracy 96.24%, Variance 0.20
  Without probability layer, nonlinear setup grid: Rules 2,   Accuracy 94.70%, Variance 0.36
  With probability layer, nonlinear setup grid:    Rules 2,   Accuracy 96.80%, Variance 0.16
ANFIS (Gaussian MF), single weight output layer:
  Rules 512, Accuracy 88.28%, Variance 4.76
Table 4.10: Comparison of the prediction capabilities of the ANFIS architecture with Gaussian membership functions and the EANFIS architecture with linear and nonlinear grid regimes, using the TSK output layer.
EANFIS (mountain clustering MF), TSK output layer:
  Without probability layer, linear setup grid:    Rules 2,   Accuracy 96.24%, Variance 0.38
  With probability layer, linear setup grid:       Rules 2,   Accuracy 96.27%, Variance 0.44
  Without probability layer, nonlinear setup grid: Rules 2,   Accuracy 96.27%, Variance 0.39
  With probability layer, nonlinear setup grid:    Rules 2,   Accuracy 96.37%, Variance 0.43
ANFIS (Gaussian MF), TSK output layer:
  Rules 512, Accuracy 90.32%, Variance 3.95
Figure 4.26 shows the membership functions used in the EANFIS architecture. Again, it is quite clear from their shapes that a significant number of Gaussians would be required to approximate them adequately.
The architecture found by using the proposed EANFIS architecture is shown in Figure 4.27.
4.7.6 Inverted Pendulum on a cart
The EANFIS is applied to a classic fuzzy control system, viz. the control of an inverted pendulum sitting on a moving cart. A rod is hinged on top of the cart, which is free to move in the horizontal plane; the objective is to balance the rod in the upright position while keeping the cart in the center position. The mechanical system is shown in Figure 4.28. The system takes four inputs: θ, the angle the rod makes with the vertical axis; θ̇, the angular velocity of the rod; y, the cart position with respect to the center position; and ẏ, the cart velocity. The aim is to use these four inputs to calculate the required control force u [72].
This model is simulated in software. The initial parameters are M (mass of cart) = 2 kg, m (mass of rod) = 0.1 kg, L (length of rod) = 0.5 m and g (gravity) = 9.81 m/sec². The cart and rod should return to the desired angle and position within 2 seconds. The control law is given by Equation 4.26. A block diagram of the control system is shown in Figure 4.29. The state-space equation of the system is given in Equation 4.27.
[Nine panels: membership functions in dimensions 1 to 9.]
Figure 4.26: The self organising mountain clustering membership functions for the Wisconsin breast cancer example. The solid line denotes “benign”, and the dashed line denotes “malignant”.
Table 4.11: The extracted fuzzy rules for the Wisconsin breast cancer dataset. These rules are used in the EANFIS architecture.

R1: If x1 is φ^1_{1,1}(x1, 1, 10) AND x2 is φ^1_{2,1}(x2, 1, 10) AND
    x3 is φ^1_{3,1}(x3, 1, 10) AND x4 is φ^1_{4,1}(x4, 1, 10) AND
    x5 is φ^1_{5,1}(x5, 1, 10) AND x6 is φ^1_{6,1}(x6, 1, 10) AND
    x7 is φ^1_{7,1}(x7, 1, 10) AND x8 is φ^1_{8,1}(x8, 1, 10) AND
    x9 is φ^1_{9,1}(x9, 1, 10) THEN
    y = 0.0310x1 + 0.0428x2 + 0.0475x3 + 0.0531x4 + 0.0433x5 + 0.1129x6 − 0.0108x7 + 0.0574x8 − 0.0785x9 − 1.4208
    (Support: 48.76%, Confidence: 82.70%)

R2: If x1 is φ^2_{1,1}(x1, 1, 10) AND x2 is φ^2_{2,1}(x2, 1, 10) AND
    x3 is φ^2_{3,1}(x3, 1, 10) AND x4 is φ^2_{4,1}(x4, 1, 10) AND
    x5 is φ^2_{5,1}(x5, 1, 10) AND x6 is φ^2_{6,1}(x6, 1, 10) AND
    x7 is φ^2_{7,1}(x7, 1, 10) AND x8 is φ^2_{8,1}(x8, 1, 10) AND
    x9 is φ^2_{9,1}(x9, 1, 10) THEN
    y = 0.0178x1 − 0.0216x2 − 0.0007x3 + 0.0107x4 − 0.0020x5 − 0.0043x6 + 0.0177x7 − 0.0093x8 + 0.0108x9 + 0.9290
    (Support: 39.80%, Confidence: 96.98%)

where φ^τ_{d,c}(input, start of cluster, end of cluster) is a mountain clustering membership function.
[Diagram: inputs x1 to x9; output y.]
Figure 4.27: The architecture found using the proposed EANFIS architecture.
u = −kx   (4.26)

where k is the desired feedback gain vector and x is the input vector x = [x1 x2 x3 x4]^T. In this example k = [−298.15 −60.697 −163.099 −73.394].

\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \\ \dot{x}_3 \\ \dot{x}_4 \end{bmatrix} =
\begin{bmatrix} 0 & 1 & 0 & 0 \\ \frac{M+m}{ML}g & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ -\frac{m}{M}g & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} +
\begin{bmatrix} 0 \\ -\frac{1}{ML} \\ 0 \\ \frac{1}{M} \end{bmatrix} u \qquad (4.27)

where x1 = θ, x2 = θ̇, x3 = y and x4 = ẏ.
Figure 4.28: Inverted pendulum on a cart.
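A minimal closed-loop simulation of Equations (4.26) and (4.27) with the quoted parameters might look as follows; the forward-Euler integrator and step size are simplifying assumptions, not the thesis's simulation code.

    import numpy as np

    M, m, L, g = 2.0, 0.1, 0.5, 9.81
    A = np.array([[0.0, 1.0, 0.0, 0.0],
                  [(M + m) / (M * L) * g, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0],
                  [-(m / M) * g, 0.0, 0.0, 0.0]])
    B = np.array([0.0, -1.0 / (M * L), 0.0, 1.0 / M])
    k = np.array([-298.15, -60.697, -163.099, -73.394])

    x = np.array([0.3, 0.0, 0.0, 0.0])   # theta, theta_dot, y, y_dot (training case)
    dt = 0.001
    for _ in range(int(4 / dt)):          # 4 seconds, as in Figures 4.30 and 4.31
        u = -k @ x                        # state feedback, Equation (4.26)
        x = x + dt * (A @ x + B * u)      # Equation (4.27)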
We train the system with initial conditions θ = 0.3, θ̇ = 0, y = 0 and ẏ = 0. In order to bring the rod to the desired angle and position, we apply the control force
[Diagram: neural fuzzy system with inputs θ, θ̇, y, ẏ and output u.]
Figure 4.29: Block diagram of the inverted pendulum on a cart control system.
shown in Figure 4.30. After the control force is applied, it induces a new position, velocity, rod angle and angular velocity. This is shown in Figure 4.31.
We test the system with another initial condition: θ = −0.3, θ̇ = 0, y = −1 and ẏ = 0. The control force using EANFIS is shown in Figure 4.32 and the input status is shown in Figure 4.33. We can observe that there is no difference between the EANFIS control force and the desired control force. While the ANFIS generates a different control force, it can still balance the rod (as shown in Figure 4.34). The input status using ANFIS is shown in Figure 4.35. Table 4.12 shows that the EANFIS architecture performs better than the ANFIS architecture.
The architecture found using the proposed EANFIS architecture is shown in Figure 4.36.
Figure 4.30: Control force of the training system.
Table 4.12: The RMS errors of the ANFIS architecture and the different EANFIS architectures.
EANFIS (mountain clustering MF), TSK output layer:
  Without probability layer, linear setup grid:    Rules 7,  RMS 0.0000001775
  With probability layer, linear setup grid:       Rules 7,  RMS 0.0000001816
  Without probability layer, nonlinear setup grid: Rules 7,  RMS 0.0000001106
  With probability layer, nonlinear setup grid:    Rules 7,  RMS 0.0000002000
ANFIS (Gaussian MF), TSK output layer:
  Rules 16, RMS 16.4280339280
[Four panels: angle (rad), angular velocity (rad/sec), position (m) and velocity (m/s) versus time.]
Figure 4.31: Input status of the inverted pendulum on a cart control system.
[Plot: control force (N) versus time; legend: desired input, EANFIS.]
Figure 4.32: Control force of the EANFIS system.
4.8 Conclusions
The proposed EANFIS architecture is a robust and efficient adaptive neuro-fuzzy inference system. It has six feed-forward layers and four learning phases, and its network structure is flexible. The self-organizing mountain clustering membership function, the Apriori rule formulation method and the error correction layer are independent components which can be replaced with other classic methods. The self-organizing mountain clustering membership function is completely data driven: it does not have a pre-defined shape and can handle complex data shapes that traditional membership functions find difficult to describe. The determination of the number of fuzzy rules is inspired by the data mining ‘Apriori algorithm’, which establishes the number of required rules before they are generated, so that only useful rules are created. This method is different from other neuro-
[Four panels: angle (rad), angular velocity (rad/sec), position (m) and velocity (m/s) versus time; legend: desired input, EANFIS.]
Figure 4.33: Input status of the inverted pendulum on a cart control system using EANFIS.
[Plot: control force (N) versus time; legend: desired input, ANFIS.]
Figure 4.34: Control force of the ANFIS system.
fuzzy systems, which expand the network first and then shrink it using a pruning process. Moreover, this algorithm can extract the rule support and the rule confidence, information which indicates the rule strength and helps the domain expert implement the neuro-fuzzy system. A fault tolerant mechanism is also built into the network: the error correction layer can improve the accuracy in a noisy environment. The output layer is independent of the architectural considerations; it can be a single weight scheme, a TSK scheme or any other method. We cannot conclude that the TSK mechanism always performs better than the single weight output layer; the TSK mechanism can easily lead the network to be over-trained on the training data set (see Section 4.7.2 and Section 4.7.4, the ANFIS architecture with TSK). The choice of output mechanism therefore depends on the data set and the practical situation. In practice it is best to try both the single weight output mechanism and the TSK output mechanism and see which one provides better results.
[Four panels: angle (rad), angular velocity (rad/sec), position (m) and velocity (m/s) versus time; legend: desired input, ANFIS.]
Figure 4.35: Input status of the inverted pendulum on a cart control system using ANFIS.
[Diagram: inputs θ, θ̇, y, ẏ; output u.]
Figure 4.36: The architecture found using the proposed EANFIS architecture.
In the above examples, the EANFIS architecture achieves good prediction and classification accuracies when applied to some practical problems. However, it is essentially a local method, in that it only considers one dimension at a time in the formation of membership functions using the self organising mountain clustering algorithm. On the other hand, if this self organising mountain clustering algorithm is combined with a method which can extract global information from the training data set, then the combined architecture may further improve the performance. This topic will be covered in the next chapter.
Chapter 5
Combining Local and Global
Input Structures for the Extended
Adaptive Neuro-Fuzzy Inference
System
5.1 Motivation
The self organising mountain clustering membership function determination algo-
rithm in the extended adaptive neuro-fuzzy inference system (EANFIS) proposed in
the last chapter considers one dimension of the input at a time in a multi-dimensional
input situation. This stems from the kernel density method upon which the self organising mountain clustering membership function is based. In the kernel density method, it is common to tackle a multi-dimensional input by dealing with one dimension at a time. This approach, while very useful and convenient, ignores possible interactions among the dimensions when the multi-dimensional nature of the input is taken into account. On the other hand, there are a number of methods, e.g. the principal component analysis method [55] and Fisher's discriminant analysis method [57], which can extract the type of information contained in a multi-dimensional input context. In this chapter, we will consider the possibility of combining a local method, e.g. the self organising mountain clustering membership function method, and a global method, e.g. a principal component analysis method, with a view to enhancing the performance of the combined architecture.
5.2 Introduction
In the last chapter we introduced the concept of extending the traditional neuro-fuzzy network [22] with the addition of two extra layers: one implementing a simple competitive learning paradigm in the discrete output situation, and the second a common normalisation layer so that the outputs are normalised. In addition, we considered the self organising mountain clustering membership function method, which is an approximation of the classical kernel density method. It uses the same approach as kernel density methods when dealing with multi-dimensional inputs, that is, it considers one input dimension at a time.
However, there are a number of methods commonly used in statistics which take
account of the multi-dimensional aspects of the inputs. For example, principal component analysis (PCA) [55] considers the covariance matrix of the multi-dimensional inputs and projects them down onto a lower dimension in an attempt to “extract” the “essence” of the input structure. Here in PCA, the “essence” has a specific meaning: it determines the number of dimensions in which most of the “energy” is conserved. It is this lower dimensional input which is used for further analysis.
There are many other such “global” methods, e.g. Fisher's linear discriminant analysis (LDA) algorithm [57], in which the global structure of the inputs is determined first, and a “reduced input” is used instead for further analysis.
In this chapter, we will explore the possibility of extending the EANFIS architecture to encompass some of the information which might be contained in the “global” structure of the inputs, and to use it as additional information to “guide” the adaptation process.
The structure of this chapter is as follows: in Section 5.3, we propose a general architecture in which a “global” method, e.g. PCA [55], and a “local” method, e.g. EANFIS, can be combined. We then describe some possible global methods, e.g. PCA and LDA, in Section 5.4, before applying these global methods in combination with the EANFIS to some practical examples.
5.3 Possible architectures for combining the local
and global methods
There are three possible ways to combine the EANFIS with a global method.
• Preprocessing stage. In this case, the global method can be considered as a preprocessing stage for the EANFIS architecture. This arrangement is shown in Figure 5.1 for the EANFIS architecture and Figure 5.2 for the ANFIS architecture.
[Diagram: Global Method → EANFIS.]
Figure 5.1: A block diagram to show the preprocessing method of combining the EANFIS architecture and a global method.
[Diagram: Global Method → ANFIS.]
Figure 5.2: A block diagram to show the preprocessing method of combining the ANFIS architecture and a global method.
Thus, in this case, the global method, e.g. PCA or LDA, acts as a preprocessing stage: it extracts the “essence” of the inputs and feeds the transformed inputs into the EANFIS architecture or the ANFIS architecture.
• Parallel architecture. There are two ways in which the global processing module may be connected in parallel with the EANFIS architecture, because there are two possible connection points: one at the output of the membership function module, and the other at the end of the EANFIS architecture. We show the first possibility in Figure 5.3, in which the global module is connected in parallel with the membership function module.
[Diagram: the global method in parallel with the membership function module, both feeding the normalization and error correction layers.]
Figure 5.3: A block diagram showing the parallel connection of the global module with the membership function module in an extended EANFIS architecture.
In this case, the intuitive idea is to take the transformed inputs (from the global module), connect them in parallel with the outputs of the membership functions, and present the combined inputs to the competitive and normalisation layers. In this manner, one may adjust the relative influence of the global module and the membership module on the competitive and normalisation modules.
• Series-parallel architecture. In this case, the global module is connected with the EANFIS architecture as shown in Figure 5.4: it is connected in parallel with the series connection of the membership function module and the competitive and normalisation layers.
[Diagram: the global method in parallel with the series connection of the membership function module and the normalization and error correction layers.]
Figure 5.4: A block diagram showing the series-parallel connection of the global module and the series connection of the membership function module and the competitive and normalisation layers in the EANFIS architecture.
In this approach, the intuitive idea is that we extract the “essence” of the input and use it in parallel with the outputs of the competitive and normalisation module. In other words, we wish to combine the relative influence of the transformed inputs (through the global module) and the outputs of the competitive and normalisation modules. This is in the same spirit as the TSK mechanism, which influences the outputs of the network by the inputs.
Obviously there are other possibilities. For example, instead of using a global method as a preprocessing module, it is possible to use it as a post-processing module. However, we will not consider this further, as it is at the input end, rather than at the output end, that we wish to extract the “essence” of the inputs.
Given these choices, one may ask which one will provide the best performance. There is no a priori reason to select one way or the other to combine the outputs of the global module with those of the competitive and normalisation
modules. This can only be ascertained from practical examples. We will carry out
such an experiment after we have described some possible global methods.
5.4 Possible Global methods
In this section, we will describe two simple global methods: the principal component analysis (PCA) method and the linear discriminant analysis (LDA) method. Note that these are only examples of possible global methods. There are many others, e.g. methods considering the manifold of the input space, or positive matrix factorisation. However, in this thesis we will only explore simple global methods, and hence we will not be concerned with the more advanced ones.
5.4.1 Principal Component Analysis (PCA)
Principal component analysis (PCA) is a commonly used global method. It works by considering the covariance matrix of the inputs and finding the dimensions in which the maximum variances occur. The dimensions with lower variances are ignored, and the high variance dimensions are preserved through a transformation of the inputs.
Consider that we are given a set of inputs X which may have a number of classes, c = 1, 2, ..., C. Let the inputs associated with class c be denoted by X^c. This is a D × N^c matrix, where N^c is the number of input vectors associated with class c, and D is the dimension of the input vector. We form the D × D covariance matrix S^c for class c as shown in Equation (5.1):

S^c = \frac{1}{N^c - 1} \left( X^c - \bar{X}^c \right) \left( X^c - \bar{X}^c \right)^T \qquad (5.1)

where \bar{X}^c is the mean of X^c over the samples of class c, and superscript T denotes the transpose of a matrix.
Since the matrix S^c is symmetric, it is possible to find the eigenvalues of the matrix as well as its corresponding eigenvectors. Let this be represented compactly as shown in Equation (5.2):

V^c \Lambda^c = S^c V^c \qquad (5.2)

where \Lambda^c is a diagonal matrix with diagonal values \lambda^c_1, \lambda^c_2, ..., \lambda^c_D, the eigenvalues of the matrix S^c, and each column of V^c \in \mathbb{R}^{D \times D} is a normalized eigenvector of the covariance matrix S^c. For convenience the eigenvalues are sorted such that \lambda^c_1 \ge \lambda^c_2 \ge \cdots \ge \lambda^c_D. Because the matrix is a covariance matrix, and hence positive semidefinite, the eigenvalues are all non-negative. In addition, we have V^c V^{cT} = V^{cT} V^c = I; in other words, the eigenvectors are orthonormal.
The eigenvalue \lambda^c_i can be thought of as related to the “energy” level associated with that particular dimension i. Since the eigenvalues are sorted in descending order, the “importance” of the dimensions is also sorted in descending order. Hence it may be possible to ignore dimensions which do not have a high energy content. This can be performed by associating an energy index \Theta^c_i with the normalised energy level of dimension i:

\Theta^c_i = \frac{\lambda^c_i}{\sum_{i=1}^{D} \lambda^c_i} \times 100\% \qquad (5.3)

By construction \Theta^c_1 \ge \Theta^c_2 \ge \cdots \ge \Theta^c_D. If we say that a dimension with normalised energy level less than \tau, a preset threshold, can be ignored, then we will have \Theta^c_1 \ge \Theta^c_2 \ge \cdots \ge \Theta^c_{n^c}, where n^c is the total number of significant dimensions for class c. In this case, the diagonal of the eigenvalue matrix \Lambda^c may be expressed as [\lambda^c_1, \lambda^c_2, ..., \lambda^c_{n^c}, 0, ..., 0]. It is possible to reconstruct an approximation of the covariance matrix S^c:

\tilde{S}^c = V^c \tilde{\Lambda}^c V^{cT} \qquad (5.4)

where \tilde{S}^c is the approximation of the covariance matrix S^c and \tilde{\Lambda}^c is a D × D diagonal matrix with diagonal elements [\lambda^c_1, \lambda^c_2, ..., \lambda^c_{n^c}, 0, ..., 0]. Equation (5.4) can be written equivalently as:

\tilde{S}^c = \hat{V}^c \hat{\Lambda}^c \hat{V}^{cT} \qquad (5.5)

where the D × n^c matrix \hat{V}^c consists of the first n^c columns of the matrix V^c, and \hat{\Lambda}^c is an n^c × n^c diagonal matrix with diagonal elements \lambda^c_1, \lambda^c_2, ..., \lambda^c_{n^c}.
Now if we define a transformed input \hat{X}^c as follows:

\hat{X}^c = \bar{X}^c + \left( X^c - \bar{X}^c \right) V^c_{\{1...j\}} V^{cT}_{\{1...j\}} \qquad (5.6)

where \bar{X}^c is the mean of X^c, V^c \in \mathbb{R}^{D \times D} is the normalized eigenspace, j, 1 \le j \le D, is the number of transformed dimensions, and V^c_{\{1...j\}} = [v^c_1, v^c_2, ..., v^c_j]. Each dimension is orthonormal.
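A minimal numerical sketch of Equations (5.1)-(5.6) is shown below. It assumes the data matrix is laid out with rows as samples (N^c × D), which is the transpose of the convention above; everything else follows the equations directly.

    import numpy as np

    def pca_transform(Xc, j):
        # Per-class PCA reconstruction of Equation (5.6): centre the data,
        # project onto the first j eigenvectors of the class covariance
        # (Equations (5.1)-(5.2)), and map back around the mean.
        mean = Xc.mean(axis=0)
        S = np.cov(Xc, rowvar=False)            # D x D covariance, Equation (5.1)
        lam, V = np.linalg.eigh(S)              # eigendecomposition, Equation (5.2)
        order = np.argsort(lam)[::-1]           # sort eigenvalues descending
        theta = 100.0 * lam[order] / lam.sum()  # energy index, Equation (5.3)
        Vj = V[:, order[:j]]                    # first j eigenvectors
        return mean + (Xc - mean) @ Vj @ Vj.T, theta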
We can gain a deeper insight into the operation of PCA through an example. Figure 5.5 plots the first and second dimensions of the raw iris flower data set.
[Scatter plot: d1 versus d2; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 5.5: The raw data of the first and second dimensions of the iris data set.
It is noted that the three classes are intermingled together in this projection onto a two dimensional space.
Figure 5.6 shows the iris flower dataset projected down to one transformed di-
mension with PCA.
[Scatter plot: d1 versus d2; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 5.6: The iris flower dataset projected down to one transformed dimension using the PCA algorithm.
It is noted that in this case the data cluster very well along the one transformed dimension. Figure 5.7 shows the iris flower dataset projected down to two transformed dimensions using the PCA algorithm. Referring to Figure 5.6, the iris-versicolor and iris-virginica data sets are close together, whereas in Figure 5.7 these two data sets overlap considerably. This is because the PCA algorithm transforms the data so as to maximize the variance within each class regardless of the relationship between classes. This is a disadvantage of using PCA when significant differences in the between-class relationships are expected. One way in which this may be overcome is to use Fisher's linear discriminant analysis, which we will consider next.
[Scatter plot: d1 versus d2; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 5.7: The iris flower dataset projected down to two transformed dimensions using the PCA algorithm.
5.4.2 Linear Discriminant Analysis
Linear discriminant analysis (LDA) derives from principal component analysis (PCA) [55], but offers better between-class separation. LDA utilizes Fisher's linear discriminant (FLD) analysis [59] in place of the covariance matrix used in PCA, and maximizes the ratio of between-class scatter to within-class scatter [60]. There are two variants of the LDA method: the class-dependent transformation and the class-independent transformation.
The class-dependent transformation uses a different covariance matrix for each input class, while the class-independent transformation uses a single covariance matrix for all classes. The choice between them depends on the data set and the goals of the classification problem: if the generalization property of the resulting classifier is important, the class independent variety should be used; if good discrimination is important, the class dependent variety is a good choice [58].
In this section, we will first briefly discuss Fisher's linear discriminant analysis method before considering the class-dependent and class-independent LDA methods.
Let x_i, i ∈ c, be the inputs related to class c, and let \mu_c be the mean of the inputs related to class c, defined as follows:

\mu_c = \frac{1}{N_c} \sum_{i \in c} x_i \qquad (5.7)

where N_c is the total number of inputs belonging to class c. Moreover, we define the overall mean of the inputs as:

\bar{x} = \frac{1}{N} \sum_i x_i = \sum_c \frac{N_c}{N} \mu_c \qquad (5.8)
Then we can define the between-class covariance matrix and the within-class covariance matrix respectively as follows:

S_B = \sum_c N_c (\mu_c - \bar{x})(\mu_c - \bar{x})^T \qquad (5.9)

S_W = \sum_c \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^T \qquad (5.10)
Then Fisher's discriminant analysis is obtained by maximising the following cost criterion:

\max_w J(w) = \frac{w^T S_B w}{w^T S_W w} \qquad (5.11)

The maximisation problem can be transformed into the following problem:

\min_w -\frac{1}{2} w^T S_B w \qquad (5.12)

subject to the constraint

w^T S_W w = 1 \qquad (5.13)

Using a Lagrange multiplier, this constrained optimisation problem can be transformed into the minimisation of the following Lagrangian function:

L = -\frac{1}{2} w^T S_B w + \frac{1}{2} \lambda (w^T S_W w - 1) \qquad (5.14)
Differentiating this function with respect to w and setting the result to zero, we obtain:

S_B w = \lambda S_W w \qquad (5.15)

This equation can be transformed into the following generalised eigenvalue problem:

S_W^{-1} S_B w = \lambda w \qquad (5.16)

This can be solved by noting that S_B is symmetric and positive semidefinite. Hence it is possible to form S_B = U \Lambda U^T, where U is an orthonormal matrix and \Lambda is a diagonal matrix. Then S_B = S_B^{1/2} S_B^{1/2}, where S_B^{1/2} = U \Lambda^{1/2} U^T. Let v = S_B^{1/2} w; then we have:

S_B^{1/2} S_W^{-1} S_B^{1/2} v = \lambda v \qquad (5.17)

This is an ordinary eigenvalue problem which can be solved readily. The maximum eigenvalue found in this process corresponds to the solution of Fisher's discriminant analysis problem. Indeed, if we find all the eigenvalues \lambda_i and the corresponding eigenvectors v_i, i = 1, 2, ..., D, then it is possible to express the problem in matrix form as follows:

S V = V \Lambda \qquad (5.18)
where \Lambda is a diagonal matrix with diagonal elements \lambda_1, \lambda_2, ..., \lambda_D, arranged such that \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D, V is the corresponding eigenvector matrix, and S = S_B^{1/2} S_W^{-1} S_B^{1/2}. In this case, similarly to the PCA situation, it is possible to consider a reduced dimension problem, depending on the magnitudes of the eigenvalues \lambda_i, i = 1, 2, ..., D. Thus, if \lambda_j \gg \lambda_{j+1}, then it is possible to set \lambda_{j+1} = 0 and the dimension of the input space is reduced to j instead of D. In this case, the transformed vector is obtained as follows:

\hat{x}_i = \mu + (x_i - \mu) v_{[1,2,...,j]} v_{[1,2,...,j]}^T \qquad (5.19)

where \mu is the class independent mean of the inputs and v_{[1,2,...,j]} denotes the matrix formed by the first j eigenvectors.
This method is called the class independent LDA method.
In a very similar manner, it is possible to formulate a class dependent LDA method. In this case, the cost criterion to be maximised is given as follows:

\max_w J(w) = \frac{w^T S_B w}{w^T S_c w} \qquad (5.20)

where S_c = \frac{1}{N_c - 1} \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^T and \mu_c is as defined previously. Carrying out the same derivation as before, we obtain the following characterisation of the class dependent LDA method:

S_B^{1/2} S_c^{-1} S_B^{1/2} v = \lambda v \qquad (5.21)

In this case, in a very similar manner, the transformed input vector is given by

\hat{x}_i = \mu_c + (x_i - \mu_c) v_{[1,2,...,j]} v_{[1,2,...,j]}^T \qquad (5.22)

where \mu_c is the class dependent mean of the input vectors x_i, with x_i ∈ c, the class c.
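A minimal sketch of the eigenproblems (5.17) and (5.21) is given below; regularisation of near-singular scatter matrices, which a practical implementation would need, is deliberately omitted.

    import numpy as np
    from scipy.linalg import sqrtm, eigh

    def lda_directions(X, y, class_dependent=False, c=None):
        # Solve S_B^{1/2} S^{-1} S_B^{1/2} v = lambda v, where S is the
        # within-class scatter S_W (class independent case) or the
        # covariance S_c of class c (class dependent case).
        classes = np.unique(y)
        mu = X.mean(axis=0)
        SB = sum((y == k).sum() * np.outer(X[y == k].mean(0) - mu,
                                           X[y == k].mean(0) - mu)
                 for k in classes)                       # Equation (5.9)
        if class_dependent:
            S = np.cov(X[y == c], rowvar=False)          # S_c
        else:
            S = sum(((y == k).sum() - 1) * np.cov(X[y == k], rowvar=False)
                    for k in classes)                    # S_W, Equation (5.10)
        SBh = np.real(sqrtm(SB))                         # S_B^{1/2}
        lam, V = eigh(SBh @ np.linalg.inv(S) @ SBh)
        return lam[::-1], V[:, ::-1]                     # descending eigenvalues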
Thus the only difference between the class dependent and class independent LDA methods is the covariance matrix: the class independent LDA method uses the within-class covariance matrix S_W, while the class dependent LDA method uses the per-class covariance matrix S_c. The class independent LDA method maximises the overall separation between the classes with a single transformation, while the class dependent LDA method maximises the separation with a transformation tailored to each class.
To gain a deeper insight into the differences between the class dependent and class independent LDA methods, we apply both methods to the iris flower data set. Figure 5.8 shows the iris flower dataset projected down to one transformed dimension with the class-dependent LDA method, while Figure 5.9 shows the same dataset projected down to two transformed dimensions.
It is obvious from Figures 5.8 and 5.9 respectively that the projection onto one
[Scatter plot: d1 versus d2; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 5.8: The iris flower data set projected down onto one transformed dimension using the class-dependent LDA method.
[Scatter plot: d1 versus d2; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 5.9: The iris flower data set projected down to two transformed dimensions with the class-dependent LDA method.
dimension does not separate the various classes, while in the projection onto two dimensions (as shown in Figure 5.9) the various classes are separated into different clusters.
Using the class independent LDA method, we can carry out the same operations. The results are shown in Figures 5.10 and 5.11 for the projections onto one and two dimensions respectively.
[Scatter plot: d1 versus d2; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 5.10: The iris flower data set projected down to one transformed dimension with the class-independent LDA method.
It is observed that the projection onto one dimension does not separate the classes, while the projection onto two dimensions separates the various classes.
We note the differences in the performance of the class dependent and class independent LDA methods (referring to Figures 5.9 and 5.11 respectively): the class dependent LDA method appears to be able to separate classes better than
[Scatter plot: d1 versus d2; legend: Iris-setosa, Iris-versicolor, Iris-virginica.]
Figure 5.11: The iris flower data set projected down to two transformed dimensions with the class-independent LDA method.
the class independent LDA method. However, though it is not obvious from these diagrams, the class independent LDA method has a better generalisation capability than the class dependent LDA method.
Note that there is a basic assumption underlying the validity of the Fisher discriminant analysis methods: that the underlying distributions of the inputs are Gaussian in nature. If this assumption is not satisfied, then the performance of Fisher's linear discriminant analysis cannot be predicted. In this case it might be necessary to consider more complex discriminant methods, e.g. nonlinear discriminant analysis methods. In this thesis, we will not consider these more complex topics, but reserve them as topics for further research.
5.4.3 Selection of the combined architecture
As indicated in the previous section, there are a number of possible combinations of the LDA methods with the EANFIS architecture. In this subsection, we carry out some experiments to indicate which may be the better architecture for the combined methods. We run a number of experiments with different architectures, e.g. the ANFIS architecture and the EANFIS architecture, and with various combinations of LDA or PCA methods. We use a number of practical examples for these evaluations: the iris flower data set, the Wisconsin breast cancer data set, and the sunspot number time series.
The following architectures have been considered (together with the notations indicated in Table 5.1):
For the iris flower data set and the Wisconsin breast cancer data set, we carried out the following experiments:
• The ANFIS architecture (ANFIS).
• The LDA method in series with the ANFIS architecture (LDA ⇒ ANFIS). In this case, the LDA is used as a preprocessing algorithm.
• The EANFIS architecture (EANFIS).
• The LDA method in series with the EANFIS architecture (LDA ⇒ EANFIS). In this case the LDA method is used as a preprocessing algorithm.
• The LDA method in parallel with the self organising mountain clustering membership function method (LDA // MF).
• One LDA method used as a preprocessing method, followed by a second LDA method in parallel with the self organising mountain clustering membership function method (LDA ⇒ (LDA // MF)).
The LDA experiments are carried out using both the class dependent and class independent LDA methods.
For the sunspot number time series, we carried out the following experiments:
• The ANFIS architecture (ANFIS).
• The PCA method used as a preprocessing method for the ANFIS architecture (PCA ⇒ ANFIS).
• The EANFIS architecture (EANFIS).
• The PCA method used as a preprocessing method for the EANFIS architecture (PCA ⇒ EANFIS).
• The PCA method in series with a parallel architecture involving the LDA method and the self organising mountain clustering membership function method (PCA ⇒ (LDA // MF)).
Again, the LDA method in this case is used in both its class dependent and class independent versions. Note that the main reason why we carried out the time series experiments with a PCA rather than an LDA preprocessing method is that the proposed method treats a continuous output as a one-class system, which is not as conducive to the LDA method.
Table 5.1: Using the LDA (or PCA) methods as pre-processing methods in combination with the ANFIS or EANFIS architectures.

Breast Cancer (Accuracy / Variance / Rules):
  ANFIS:                                   88.40% / 4.05 / 512
  LDA ⇒ ANFIS:                             96.58% / 0.27 / 512
  EANFIS:                                  94.95% / 0.18 / 2
  LDA ⇒ EANFIS:                            95.84% / 0.43 / 2
  LDA // MF (class independent):           96.87% / 0.20 / 2
  LDA ⇒ (LDA // MF) (class independent):   96.39% / 0.33 / 2
  LDA // MF (class dependent):             96.44% / 0.51 / 2
  LDA ⇒ (LDA // MF) (class dependent):     96.39% / 0.33 / 2

Iris (Accuracy / Variance / Rules):
  ANFIS:                                   98.22% / 2.73 / 16
  LDA ⇒ ANFIS:                             95.75% / 7.54 / 16
  EANFIS:                                  98.43% / 2.80 / 3
  LDA ⇒ EANFIS:                            97.88% / 3.63 / 3
  LDA // MF (class independent):           98.18% / 2.82 / 3
  LDA ⇒ (LDA // MF) (class independent):   98.08% / 3.18 / 3
  LDA // MF (class dependent):             97.84% / 3.53 / 3
  LDA ⇒ (LDA // MF) (class dependent):     98.02% / 3.15 / 3

Sunspot (RMS error / Rules):
  ANFIS:                                   16.7495 / 16
  PCA ⇒ ANFIS:                             17.4060 / 16
  EANFIS:                                  15.9901 / 2
  PCA ⇒ EANFIS:                            16.3934 / 1
  LDA // MF (class independent):           15.9915 / 2
  PCA ⇒ (LDA // MF) (class independent):   N/A / 1
  LDA // MF (class dependent):             15.9894 / 2
  PCA ⇒ (LDA // MF) (class dependent):     N/A / 1
From Table 5.1, it is observed that
• The ANFIS architecture, with or without the preprocessing unit, has an inferior performance compared with the various EANFIS architectures.
• The preprocessing unit (using LDA or PCA), when connected in series with the EANFIS architecture, does not perform well compared with the other EANFIS combinations.
• The parallel connection between the preprocessing unit and the self organising mountain clustering membership function algorithm, by itself, appears to have superior performance when compared with the other EANFIS combinations.
• The addition of an LDA preprocessing unit in front of this parallel connection does not appear to improve the performance.
Based on these experiments, we conclude that the parallel connection between the LDA method and the self organising mountain clustering membership function algorithm appears to provide the best performance. This is then followed by the normalisation layer before the output layer. The combined architecture is shown in Figure 5.12, and it is the one we will use in the rest of this chapter.
Note that this combination of the LDA method and the EANFIS architecture can be explained using the following heuristics: the LDA method is very good at separating the input data into clusters, and the self organising mountain clustering membership function algorithm is one method for dealing with the clustering of the inputs. Hence, if we combine the two in parallel, they will aid one another. By combining them in series, e.g. using the LDA as a preprocessing unit for the EANFIS architecture, we are not using the best capability of the LDA method, and hence it gives a lower performance.
Thus, in this case, the similarity of the input data to cluster r can be calculated using a Gaussian function, as shown in Equation (5.23):

\theta_r = \exp \left( - \frac{\| \hat{x}_r - \bar{\hat{x}}_r \|^2}{2 \sigma_r^2} \right) \qquad (5.23)

where \hat{x}_r is the projected output of input x using the eigenspace of cluster r, \bar{\hat{x}}_r is the mean of the projected cluster r, and \sigma_r is the standard deviation of the projected output \hat{x}_r.
Note that in this combined architecture, the LDA method and the error correction layer in the EANFIS architecture work in concert with one another. The LDA method cannot solve some problems by itself, e.g. the exclusive OR problem, but combined with the error correction layer it can help the decision making in marginal cases. The two can be combined as shown in Equation (5.24), where \phi_r is the output from the error correction layer of the EANFIS architecture, \theta_r is the output from the LDA method, and \alpha is a weight which adjusts the balance between the two methods.

Layer 2A:

\epsilon_r = \alpha \cdot \frac{\phi_r}{\sum_r \phi_r} + (1 - \alpha) \cdot \frac{\theta_r}{\sum_r \theta_r} \qquad (5.24)
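A minimal sketch of Equations (5.23) and (5.24) follows; the array shapes and argument names are illustrative assumptions rather than the thesis's code.

    import numpy as np

    def combined_layer(phi, x_proj, proj_means, sigmas, alpha):
        # theta_r: Gaussian similarity of the projected input to cluster r,
        # Equation (5.23); x_proj[r] is the input projected onto eigenspace r.
        theta = np.exp(-np.sum((x_proj - proj_means) ** 2, axis=1)
                       / (2.0 * sigmas ** 2))
        # Layer 2A blend of error correction and LDA outputs, Equation (5.24).
        return alpha * phi / phi.sum() + (1.0 - alpha) * theta / theta.sum()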
[Diagram: Layers 1 to 6 of the combined architecture. The membership functions φ of the ANFIS part and the global method (LDA) outputs θ_r feed the error correction layer, whose outputs ε_r are weighted by α and β before the normalisation and output layers.]
Figure 5.12: The extended adaptive neuro-fuzzy inference system with the LDA method.
5.5 Application Examples
In this section, we illustrate the application of the concept of combining a local method, e.g. the EANFIS architecture, with a global method, e.g. LDA. We demonstrate it on the sunspot number cycle (a time series), the linearly inseparable Iris dataset, and the high dimensional Wisconsin breast cancer dataset.
5.5.1 Sunspot Cycle
The sunspot number time series is compiled by the US National Oceanic and Atmospheric Administration (NOAA). The number of sunspots has been measured daily since January 1749 at the Zurich Observatory [54]. The time series is quite “jagged”, and it is a common benchmark problem for time series prediction algorithms. Instead of using the daily figures, we take the average for each month and, as is common in time series benchmark studies, use 55 years of data for training and the remaining 200 years of data for prediction. The entire time series is shown in Figure 5.13. The time series appears to exhibit cyclic behaviour, although whether it truly does is a hotly debated topic. There are various estimates of the cycle, e.g. 11 years, while others have indicated that this time series may exhibit chaotic behaviour with no observable periodic component.
We rearrange the data into input-output pairs as follows:
Figure 5.13: Monthly average sunspot number time series, in which the first 55 years of data are used for training and the remaining 200 years for testing. The training data is shown as a continuous line and the testing data as a dotted line.
[x(t-4), x(t-3), x(t-2), x(t-1); x(t)]
In other words, we use the immediately preceding four data points x(t − i), i = 1, 2, 3, 4 as the inputs and attempt to predict the next data point, x(t). We then applied the combined architecture described in Section 5.3 to the time series.
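A minimal sketch of this rearrangement is shown below; the function name is an illustrative choice.

    import numpy as np

    def embed(series, lags=4):
        # Build [x(t-4), x(t-3), x(t-2), x(t-1)] -> x(t) pairs.
        X = np.stack([series[i:len(series) - lags + i] for i in range(lags)],
                     axis=1)
        y = series[lags:]
        return X, y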
Referring to the results in Table 5.2, the combined method using a class dependent LDA technique has the smallest RMS error, although the difference between the EANFIS architecture and the combined architecture is small. Here we are comparing the performance of an ANFIS architecture, which uses 16 rules and Gaussian functions, with those of the EANFIS architecture and the combined architectures, which use a TSK output mechanism and only 2 rules.
Table 5.2: Prediction RMS errors for the sunspot number time series.

  ANFIS:                                Single weight output, linear setup grid,  16 rules, RMS 16.7495
  EANFIS:                               TSK output, nonlinear setup grid,         2 rules,  RMS 15.9901
  Combined EANFIS (class dependent):    TSK output, nonlinear setup grid,         2 rules,  RMS 15.9894
  Combined EANFIS (class independent):  TSK output, nonlinear setup grid,         2 rules,  RMS 15.9915
Figures 5.14 and 5.15 show the effect of the LDA technique in projecting the data set to one transformed dimension using the class dependent method and the class independent method respectively. It is noted that in both cases the data align well within the projected dimension. This may be the reason why the prediction using the combined architecture works well: the LDA technique, either class dependent or class independent, appears to be able to align the data so that they fall within a transformed dimension.
Note that the ANFIS architecture using a TSK output mechanism shows instability (as indicated in Section 4.5), so we use a single weight output mechanism in its output layer. The EANFIS architecture and the combined architectures, on the other hand, do not show any instability using a TSK output mechanism, and hence we compare their results with those of the ANFIS architecture using a single weight output mechanism. The results show that the class dependent LDA method has a slightly better classification ability. The combined method can predict the trend of the sunspot number time series well, as shown in Figure 5.16.
[Scatter plot: d1 versus d2; legend: data points supporting Rule 1, data points supporting Rule 2.]
Figure 5.14: The sunspot number time series data set projected to one transformed dimension using a class dependent LDA technique.
[Scatter plot: d1 versus d2; legend: data points supporting Rule 1, data points supporting Rule 2.]
Figure 5.15: The sunspot number time series data set projected to one transformed dimension using a class independent LDA technique.
The predicted output errors are shown in Figure 5.17.
Figure 5.16: The prediction outputs of the class dependent LDA method combined with the EANFIS architecture.
The maximum error in Figure 5.17 is 93.4475 and the total RMS error is 15.9894, as shown in Table 5.2. Note that the prediction errors in Figure 5.17 appear reasonably balanced around the zero axis, as do the “undulations” around this axis.
5.5.2 Iris Dataset
Fisher's Iris dataset consists of three types of iris flowers: Iris-Virginica, Iris-Versicolor and Iris-Setosa. Each sample has measurements of the sepal length, the sepal width, the petal length, and the petal width. Iris-Virginica and Iris-Versicolor are linearly inseparable. We randomly select 51 data points for testing and 99 data points for training.
Figure 5.17: The prediction output errors of the class dependent LDA method combined with the EANFIS architecture.
The first two dimensions of the raw data are plotted in Figure 5.5; it is noted that the data are well “intertwined”.
Once the training and testing data sets are chosen, the same data sets are put into the different neuro-fuzzy architectures used for performance evaluation. The evaluation process is run 100 times and the average over the 100 runs is used for comparison purposes. In order to show the effect of the combined method, we use a zeroth order polynomial function (constant weights) in the output layer. Figure 5.18 shows the membership functions obtained using the self-organizing mountain clustering membership function method, and Figures 5.8 to 5.11 show the effect of the LDA transformation.
The combined architecture, using an EANFIS architecture and an LDA method, outperforms the EANFIS architecture alone. The classification accuracy is rel-
[Four panels: membership functions in dimensions 1 to 4.]
Figure 5.18: The membership functions formed using the self-organizing mountain clustering membership function method. ‘.’ denotes Iris-setosa, ‘o’ denotes Iris-versicolor and ‘+’ denotes Iris-virginica.
atively close to that obtained using an ANFIS architecture, even though the ANFIS architecture uses 16 rules and the EANFIS architecture only 3. Table 5.3 shows the classification accuracies and root mean square errors of the various architectures on the testing data set. It shows that the RMS error of the combined EANFIS architecture is smaller than that obtained using an ANFIS architecture. As shown in Table 5.3, we have evaluated the following architectures:
• The ANFIS architecture.
• An EANFIS architecture without the error correction mechanism.
• An EANFIS architecture with the error correction and normalisation layers turned on.
• An EANFIS architecture without the error correction and normalisation layers, with a class dependent LDA method.
• An EANFIS architecture without the error correction and normalisation layers, with a class independent LDA method.
• An EANFIS architecture with the error correction and normalisation layers turned on, and with a class dependent LDA method.
• An EANFIS architecture with the error correction and normalisation layers turned on, and with a class independent LDA method.
Table 5.3: Prediction classification accuracy comparison on the Iris data set (single weight output layer).

                                        Accuracy   Var. of Acc.   RMS Error   Var. of RMS   Rules
EANFIS, without error corr. layer:       87.92%     16.06          0.3068      0.00055       3
EANFIS, with error corr. layer:          87.82%     15.40          0.3076      0.00055       3
Combined, class dependent:               97.90%      3.90          0.1638      0.00110       3
Combined, class independent:             98.25%      3.18          0.1550      0.00082       3
Combined, error corr. & class dep.:      97.90%      3.90          0.1607      0.00114       3
Combined, error corr. & class indep.:    98.31%      3.03          0.1515      0.00086       3
ANFIS:                                   98.10%      3.53          0.1920      0.00035       16
5.5.3 Wisconsin Breast Cancer
The Wisconsin breast cancer dataset [53] consists of 699 instances in 9 dimensions; a total of 16 instances have missing attributes. We randomly select 200 instances for training and use the entire 499 remaining instances for testing, as is commonly done when this data set is used for benchmark purposes. The evaluation is run 100 times and we take the average of the 100 runs for comparison purposes. The ANFIS architecture uses two membership functions per dimension, for a total of 512 fuzzy rules; the EANFIS architecture and the combined method use only 2 fuzzy rules. Figure 5.19 shows the breast cancer dataset projected down to one transformed dimension with the class dependent LDA method, and Figure 5.20 shows the transformed data with the class independent LDA method.
[Scatter plot: d1 versus d2; legend: benign, malignant.]
Figure 5.19: The breast cancer dataset projected down onto one transformed dimension with the class dependent LDA method.
[Scatter plot: d1 versus d2; legend: benign, malignant.]
Figure 5.20: The breast cancer dataset projected down to one transformed dimension with the class independent LDA method.
The results shown in Tables 5.4 and 5.5 indicate that the accuracy of the combined method is much higher than that obtained using the ANFIS architecture.
5.6 Conclusions
An adaptive neuro-fuzzy inference system uses membership functions to measure the similarities of the inputs. The self organising mountain clustering membership function uses a local method to find the parameters of an asymmetrical membership function: it works on one dimension first, then on another dimension, and so on. Hence this process does not incorporate any global information on the input space. A method which can incorporate, to some extent, the global information
Table 5.4: The output accuracy comparison of the Wisconsin breast cancer data set using various architectures.

EANFIS (mountain MF), single weight output layer, 2 rules:
  Linear setup grid, no error corr.:     Accuracy 94.43%, Variance 0.45
  Linear setup grid, error corr.:        Accuracy 96.22%, Variance 0.21
  Nonlinear setup grid, no error corr.:  Accuracy 94.63%, Variance 0.28
  Nonlinear setup grid, error corr.:     Accuracy 96.77%, Variance 0.16
ANFIS, TSK output layer, 512 rules, linear setup grid:
  Accuracy 90.58%, Variance 3.87
Table 5.5: The output accuracy comparison on the Wisconsin breast cancer data set (combined EANFIS, 2 rules, output layer: single weight).

Global LDA Method   Class Dependent                       Class Independent
Setup Grid          Linear           Nonlinear            Linear           Nonlinear
Error Corr.         N       Y        N       Y            N       Y        N       Y
Accuracy            95.28%  96.13%   95.69%  96.59%       95.31%  96.59%   95.69%  96.81%
Variance            0.50    1.27     0.39    0.89         0.45    0.31     0.40    0.27
on the input space is the LDA method, which transforms the input data to another
space by maximizing the between-class scatter and minimizing the within-class
scatter. However, the LDA method assumes that the data distribution is Gaussian;
otherwise it may fail to provide a good classification, for instance in the case of
the exclusive OR problem. The combined architecture, involving a global processing
method (e.g. the LDA method) and an EANFIS architecture, can in some cases provide
much improved performance when compared with the EANFIS architecture by itself.
Hence a judicious application of this combined architecture could improve on the
performance of the EANFIS architecture.
Chapter 6
Conclusions and Recommendations
6.1 Conclusions
This thesis presents a novel extension to the architecture of the adaptive neuro-fuzzy
inference system (ANFIS). The extension is inspired by the observation that often in
using the ANFIS architecture on discrete output classification problems, the results
are improved if we include an additional logistic function layer (error correction) in
the architecture, in a manner similar to the use of a logistic function in multilayer
perceptrons when used as classifiers. Hence, in the extended architecture, each
output class is represented by a separate membership function structure for the
input variables. In the case of continuous output, there is only one membership
function structure. The total number of rules required is increased many-fold in
comparison with the ANFIS architecture. The idea is to insert additional sections
into the ANFIS architecture incorporating a logistic function for each rule
formation and its associated normalisation. This counter-intuitive approach (as it
greatly increases the number of rules) allows us to follow up with two innovative
ideas:
• Structural determination. By investigating the input variable structure, using
a method inspired by association rule determination in data mining, we are able
to determine which rules need to be formed prior to learning the system
parameters. The implicit assumption is that the rules which are not formed would
not be needed in the final architecture; in other words, even if these “unused”
rules were formed, their firing strength would be negligible from the point of
view of the training data. Hence there is no need to form them in the EANFIS
architecture. By determining which rules need to be formed before the learning
process commences, we are able to reduce the number of rules, allowing the
architecture to be used for problems with a potentially large number of inputs.
This overcomes one of the major difficulties of the ANFIS architecture, viz.
that it is difficult to use for practical problems with a large number of input
variables, as the number of rules which need to be formed grows explosively with
the number of input variables. By limiting rule formation to the rules necessary
for the inference, we are able to use the proposed architecture for problems with
a large number of input variables (a sketch of this selection criterion is given
after this list).
• Membership function determination. Traditionally in fuzzy system studies, the
idea is to select a membership function from a set of candidate membership
functions, and then to determine the parameters of the selected membership
function. In our proposed approach, we instead investigate the input variables
with a view to finding a suitable, possibly non-symmetric membership function
which is most appropriate to the problem at hand. In other words, we do not fix
the shape of the membership function a priori, but use the input data to
determine the required shape. This approach is called the self organising
mountain clustering membership function approach. If the resulting function
happens to be non-symmetric, so be it: the input data determine the required
membership function.
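To make the structural determination idea concrete, the sketch below (referred to in the first bullet point) illustrates the support-based selection criterion in Python, assuming that per-dimension membership degrees of the training samples have already been computed. The naive enumeration over the full grid is for illustration only; the Apriori algorithm [41] avoids it by never extending combinations whose support is already too low. The function names and the threshold value are illustrative assumptions.

    from itertools import product
    import numpy as np

    def select_rules(memberships, min_support=0.05):
        """memberships: one array per input dimension, of shape
        (n_samples, n_mfs_d), holding membership degrees.  A candidate
        rule is one membership function index per dimension; it is
        kept only if its mean firing strength over the training data
        exceeds the support threshold."""
        rules = []
        for combo in product(*(range(m.shape[1]) for m in memberships)):
            strength = np.prod([m[:, j] for m, j in zip(memberships, combo)],
                               axis=0)            # firing strength per sample
            if strength.mean() > min_support:
                rules.append(combo)
        return rules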
These new algorithms were evaluated on a number of benchmark problems, e.g. the
sunspot cycle problem and the Wisconsin breast cancer problem. It was found that,
in general, the results support our intuition, i.e. by including the softmax-type
layers in the ANFIS architecture, the results are improved. We compared the results
of the extended ANFIS architecture with those obtained by the ANFIS architecture,
and in all cases the EANFIS architecture provides an improved result.
We further enhance the proposed extended ANFIS architecture by asking the
question: what happens if additional information is available on the input
variables? This is based on the observation that the membership function
representation used in the EANFIS architecture, viz. the self organising mountain
clustering membership function approach, is based on “local” properties of the
input variables.
If some information on the “global” structure of the input variables is available,
would this additional information improve the performance of the EANFIS
architecture? The idea is simple: if there is additional global information on the
input variables, then the aggregate input may be a linear combination of the
“local” information, as represented by the self organising mountain clustering
membership functions, and the “global” information, as represented in our case by
linear discriminant analysis.
We have again evaluated the proposed algorithm on a set of benchmark prob-
lems, including the sunspot cycle problem, and the Wisconsin breast cancer problem.
Again we found that the proposed combination of local and global information im-
proves the performance of the proposed architecture, compared with those obtained
by using EANFIS alone.
En route to studying the ANFIS architecture, we have also considered a minor
problem: how to provide a non-uniform grid on the input structures. We have
devised a method which can be used to provide a non-uniform grid on a set of
input variables. A non-uniform grid, instead of a uniform grid, may represent the
inputs more “faithfully”. The application in this case is the radial basis
function neural network (or, equivalently, an ANFIS architecture with a Gaussian
membership function). We compared the results of applying this algorithm to a
number of practical cases, e.g. the currency exchange problem, and found that the
proposed algorithm performs better than the one using a linear grid. Thus, the
nonlinear grid determination idea may be used whenever there is a need to
interpolate a signal in a
nonlinear sampling fashion.
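As a simple illustration of the non-uniform grid idea (a sketch only, not necessarily the method devised in this thesis), grid points can be placed at equally spaced quantiles of the observed samples, so that the grid is denser where the data are denser:

    import numpy as np

    def quantile_grid(x, n_points):
        """Grid points at equally spaced quantiles of the samples x;
        the grid spacing adapts to the sample density."""
        return np.quantile(x, np.linspace(0.0, 1.0, n_points))

An RBF network (or, equivalently, a Gaussian-membership ANFIS) whose centres are placed on such a grid then allocates more basis functions to the densely sampled regions of the input space.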
6.2 Future areas of research
There are a number of areas of future research which may be of interest following
on from the work performed in this thesis. These include:
• We have used a stage-wise approach in training the proposed EANFIS
architecture: we first determine the membership function and its associated
parameters, and then the parameters of the inference mechanism. This decoupling
of the training process appears to work well. The question is: can we combine
the training of the membership function parameters and the inference mechanism
into a single training regime? This is a highly nonlinear process. However, if
it can be performed, it will make the proposed method more automatic from the
users' point of view, as they will not need to consider various steps, but can
instead use the method as a “black box” with little adjustment required.
• There are many ways to incorporate global information on input structures,
e.g. principal component analysis. The one chosen in this thesis is Fisher's
linear discriminant analysis. However, this approach assumes Gaussianity of the
input variables. There are other approaches which allow the extraction of the
global structure of the input variables. The question is: would the use of a
different “global” information method alter the behaviour of the proposed
EANFIS?
• In the rule “selection” method used to decide which rules are formed in the
EANFIS architecture, the assumption is that the characteristics of the training
and testing data sets are similar; hence the selected rules would be useful for
deployment on the testing data set. An interesting question arises: what if the
characteristics of the testing data set and the training data set are different?
Would it be possible to find a method which allows rules to be switched on or
off depending on the data? If this can be done, it will lead to a more adaptive
structure for modelling the data.
These questions will form fruitful areas of further research.
References
[1] A. C. Tsoi, S. Tan, “Recurrent neural networks: A constructive algorithm, and
its properties”, Neurocomputing, vol. 15, pp. 309-326, June. 1997
[2] J. Durkin, Expert Systems, Design and Development, Prentice Hall, 1994.
[3] M. Minsky, S. A. Papert, Perceptrons: An Introduction to Computational Geometry, MIT Press, Cambridge, MA, 1969.
[4] W. McCulloch, W. Pitts, “A logical calculus of the ideas immanent in nervous activity”, Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
[5] F. Rosenblatt, Principles of Neurodynamics Spartan, Washington DC, 1962.
[6] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
[7] M. Negnevitsky, Artificial Intelligence, A Guide to Intelligent Systems, Addison
Wesley, First Edition, 2002.
[8] E. Cox, The fuzzy systems handbook: a practitioner’s guide to building, using,
and maintaining fuzzy systems, Academic Press, 1994.
[9] R. J. Schalkoff, Artificial Neural Networks, McGraw-Hill, 1997.
[10] S. Haykin, Neural Networks, Prentice Hall, Second Edition, 1999.
[11] T. M. Mitchell, Machine Learning, McGraw Hill, 1997.
[12] J.S.R. Jang, C.T. Sun, E. Mizutani, Neuro-Fuzzy and Soft Computing, A Com-
putational Approach to Learning and Machine Intelligence, Prentice Hall, 1997.
[13] J. L. McClelland, D. Rumelhart and the PDP Research Group Parallel Dis-
tributed Processing Volumes 1 and 2, MIT Press, Cambridge, MA, 1986.
[14] K. Hornik, M. Stinchcombe, H. White, “Multilayer feedforward networks are
universal approximators” Neural Networks Vol. 2, pp.359-366, 1989.
[15] C. M. Bishop, Neural Networks for Pattern Recognition Oxford University
Press, 1995.
[16] W. Härdle, Applied Nonparametric Regression, Cambridge University Press, New York, 1990.
[17] L. A. Zadeh, “Fuzzy sets”, Information and Control, vol. 8, pp. 338-353, 1965.
[18] S. Yasunobu, S. Miyamoto, “Automatic train operation system by predictive
fuzzy control” in Industrial Applications of Fuzzy Control M. Sugeno Ed. Ams-
terdam: North Holland, 1985.
[19] E. H. Mamdani, “Application of Fuzzy Logic to Approximate Reasoning Using
Linguistic Synthesis”, IEEE Trans. Computers, vol. 26, pp. 1182-1191, 1977
[20] E. H. Mamdani, S. Assilian, “An experiment in linguistic synthesis with a fuzzy logic controller”, International Journal of Man-Machine Studies, vol. 7, pp. 1-13, 1975.
[21] T. Takagi, M. Sugeno, “Fuzzy identification of systems and its applications to
modeling and control”, IEEE Trans. on Systems, Man and Cybernetics, vol. 15,
pp. 116-132, Jan. 1985
[22] J.S. Jang, “ANFIS: Adaptive-Network-Based Fuzzy Inference System”, IEEE
Trans. on Systems, Man and Cybernetics, vol. 23, pp. 665-685, May/June 1993.
[23] C.-T. Lin, Y.-C. Lu, “A Neural Fuzzy System with Linguistic Teaching Signals”,
IEEE Trans. on Fuzzy Systems, vol. 3, pp. 169-189, May 1995.
[24] C. Li, C.-Y. Lee, K. H. Cheng, “Pseudoerror-Based Self-Organizing Neuro-
Fuzzy System”, IEEE Trans. on Fuzzy Systems, vol. 12, pp. 812-819, Dec. 2004.
[25] L. Rutkowski, K. Cpalka, “Flexible Neuro-Fuzzy Systems”, IEEE Trans. on
Neural Networks, vol. 14, pp. 554-574, May. 2003.
[26] D. Chakraborty, N. R. Pal, “A Neuro-Fuzzy Scheme for Simultaneous Fea-
ture Selection and Fuzzy Rule-Based Classification”, IEEE Trans. on Neural
Networks, vol. 15, pp. 110-123, Jan. 2004.
[27] C. S. Velayutham, S. Kumar, “Asymmetric Subsethood-Product Fuzzy Neural
Inference System (ASuPFuNIS)”, IEEE Trans. on Neural Networks, vol. 16, pp.
160-174, Jan. 2005.
[28] W.L. Tung, C. Quek, “GenSoFNN: A Generic Self-Organizing Fuzzy Neural
Network”, IEEE Trans. on Neural Networks, vol. 13, pp. 1075-1086, Sept. 2002.
[29] L.-X. Wang, A Course in Fuzzy Systems and Control, Prentice Hall, 1997.
[30] R. C. Berkan, S. L. Trubatch, Fuzzy Systems Design Principles, Building Fuzzy
IF-THEN Rule Bases, IEEE Press, 1997.
[31] R. Beale, T. Jackson, Neural Computing: an Introduction, IOP Publishing
Ltd., 1990.
[32] G. B. Arfken, H. J. Weber, Mathematical Methods for Physicists, Academic Press, Fourth Edition, 1995.
[33] R. Larson, B. H. Edwards, D. C. Falvo, Elementary Linear Algebra, Houghton
Mifflin, Fifth Edition, 2004.
[34] R. Larson, R. Hostetler, B. H. Edwards, Calculus of a single variable, Houghton
Mifflin, Seventh Edition, 2002.
[35] R. Reed, “Pruning Algorithms-A Survey”, IEEE Trans. on Neural Networks,
vol. 4, pp. 740-747, Sept. 1993.
[36] P. P. Kanjilal, D. N. Banerjee, “On the Application of Orthogonal Transfor-
mation for the Design and Analysis of Feedforward Networks”, IEEE Trans. on
Neural Networks, vol. 6, pp. 1061-1070, Sept. 1995.
[37] K. I. Diamantaras, S. Y. Kung, Principal component neural networks: theory
and applications, John Wiley & Sons Inc, Fifth Edition, 1996.
[38] J. H. Mathews, K. Fink, Numerical Methods Using MATLAB, Prentice Hall, Fourth Edition, 2003.
[39] R. J. Schilling, J. J. Carrol, A. Al-Ajlouni, “Approximation of Nonlinear sys-
tems with Radial Basis Function Neural Networks”, IEEE Trans. on Neural
Networks, vol. 12, pp. 1-15, Jan. 2001.
[40] T. Kohonen, Self-Organizing Maps, Springer-Verlag, Second Edition, 1997.
[41] R. Agrawal, R. Srikant, “Fast Algorithms for Mining Association Rules”, Pro-
ceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
[42] W. Chu, S. S. Keerthi, and C. J. Ong, “Bayesian Support Vector Regression
Using a Unified Loss Function”, IEEE Trans. on Neural Networks, vol. 15, pp.
29-44, Jan. 2004.
[43] A. Nurnberger, C. Borgelt, and A. Klose, “Improving Naive Bayes Classifiers
Using Neuro-Fuzzy Learning”, Dept. of Knowledge Processing and Language
Engineering, Otto-von-Guericke-University of Magdeburg, Germany
[44] R.R. Yager and D.P. Filev, “Approximate Clustering Via the Mountain
Method”, IEEE Trans. on Systems, Man and Cybernetics, vol. 24, pp. 1279-
1284, Aug. 1994
[45] R. L. Burden, J. D. Faires, Numerical Analysis, Thomson Learning Inc.,
Seventh Edition, 2001.
[46] J. C. Principe, N. R. Euliano, W. C. Lefebvre, Neural and Adaptive Systems:
Fundamentals through Simulations, John Wiley & Sons Inc, First Edition, 2000.
[47] P. P. Kanjilal, D. N. Banerjee, “On the Application of Orthogonal Transfor-
mation for the Design and Analysis of Feedforward Networks”, IEEE Trans. on
Neural Networks, vol. 6, pp. 1061-1070, Sept. 1995.
[48] R. Reed, “Pruning Algorithms-A Survey”, IEEE Trans. on Neural Networks,
vol. 4, pp. 740-747, Sept. 1993.
[49] J. Han and M. Kamber, Data Mining, Concepts and Techniques, Morgan Kauf-
mann, 2000.
[50] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[51] R. Beale, T. Jackson, Neural Computing: An Introduction, Institute of Physics Publishing, 1994.
[52] S. C. Chapra, R. P. Canale, Numerical methods for engineers : with program-
ming and software applications, McGraw-Hill, c1998.
[53] K. P. Bennett, O. L. Mangasarian, “Robust linear programming discrimination
of two linearly inseparable sets”, Optimization Methods and Software 1, 1992,
23-34 (Gordon and Breach Science Publishers).
[54] David H. Hathaway, The Sunspot Cycle, 02 Sept., 2006.
http://solarscience.msfc.nasa.gov/SunspotCycle.shtml
[55] B. Flury, A First Course in Multivariate Statistics, Springer-Verlag, 1997.
[56] S. Z. Li, J. Lu, “Face Recognition Using the Nearest Feature Line Method”,
IEEE Trans. on Neural Networks, vol. 10, pp. 439-443, Mar. 1999.
[57] B.D. Ripley, Pattern recognition and neural networks, Cambridge University
Press, 1996.
[58] S. Balakrishnama, A. Ganapathiraju, Linear Discriminant Analysis - A Brief Tutorial, Mississippi State University, 2002.
[59] R. A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems”, Annals of Eugenics, vol. 7, pp. 179-188, 1936.
[60] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. Fish-
erfaces: Recognition Using Class Specific Linear Projection”, IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 19, pp. 711-720, July. 1997.
[61] M.-H. Yang, N. Ahuja, D. Kriegman, “Face Recognition using kernel eigen-
faces”, IEEE International Conference on Image Processing, vol. 1, pp. 37-40,
Sept. 2000.
[62] G. J. McLachlan and T. Krishnan, The EM algorithm and extensions, John
Wiley, 1996.
[63] B. W. Silverman, Density Estimation for Statistics and Data Analysis. Chap-
man and Hall: London, 1986.
[64] J. E. Moody, C. Darken, “Fast learning in networks of locally-tuned processing units”, Neural Computation, vol. 1, pp. 281-294, 1989.
[65] H. Tong, Nonlinear time series: a dynamic system approach Clarendon Press,
Oxford, 1990.
[66] M. J. Morris, “Forecasting the sunspot cycle” Journal of the Royal Statistical
Society, Series A, Vol. 140, pp 437 - 468, 1977.
[67] M. Casdagli, “Nonlinear prediction of chaotic time series” Physica D, Vol 35,
pp. 335-356, 1989.
[68] R. K. Aggarwal, Q. Y. Xuan, A. T. Johns, F. R. Li, A. Bennett, “A novel approach to fault diagnosis in multi-circuit transmission lines using fuzzy ARTmap neural networks”, IEEE Trans. on Neural Networks, vol. 10, no. 5, pp. 1214-1221, 1999.
[69] C. Potter, M. Negnevitsky, “ANFIS application to competition on artificial time series (CATS)”, Proceedings of the IEEE International Conference on Fuzzy Systems, vol. 1, pp. 469-474, July 2004.
[70] A. Berizzi, C. Bovo, M. Delfanti, M. Merlo, “A Neuro-Fuzzy Inference System for the Evaluation of Voltage Collapse Risk Indices”, Bulk Power System Dynamics and Control - VI, Cortina d'Ampezzo, Italy, 22-27 August 2004.
[71] Duane Hanselman and Bruce Littlefield, The Student Edition of Matlab: ver-
sion 5, user’s guide, Prentice Hall, 1997.
[72] Hung T. Nguyen, Nadipuram R. Prasad, Carol L. Walker and Elbert A. Walker,
A First Course in Fuzzy and Neural Control, CRC Press, 2002.
[73] Sushmita Mitra and Yoichi Hayashi, “Neuro-Fuzzy Rule Generation: Survey
in Soft Computing Framework”, IEEE Trans. on Neural Networks, vol. 11, pp.
748-765, May, 2000
Appendix A
Network Training Algorithm
A.1 RBFN Network Training Algorithm
The aim of the steepest descent algorithm is to minimize the accumulated errors for
the whole network with respect to the underlying parameters. The total error for
the network is shown in Equation (A.1.1).
\[ E = \frac{1}{2}\sum_i e_i^2 \tag{A.1.1} \]
where the error $e_i$ is defined as follows:
\[ e_i = \delta_i - y_i \tag{A.1.2} \]
where $\delta_i$ is the desired value of the $i$th output and $y_i$ is the output from the network.
To minimize the error, we determine the derivative of the total error function with
respect to the weights (parameters). Applying the chain rule [34] we obtain Equation
(A.1.3).
\[ \frac{\partial E}{\partial w_r} = \frac{\partial E}{\partial e_i}\cdot\frac{\partial e_i}{\partial y_i}\cdot\frac{\partial y_i}{\partial w_r} = -e_i B_r(x_i) \tag{A.1.3} \]
Applying $\frac{\partial E}{\partial w_r}$ to the delta rule [10] we obtain Equation (A.1.4).
\[ w_r^{\mathrm{new}} = w_r^{\mathrm{old}} - \eta\frac{\partial E}{\partial w_r} = w_r^{\mathrm{old}} + \eta\, e_i B_r(x_i) \tag{A.1.4} \]
where $\eta$ is a learning constant, $x \in \mathbb{R}^{I\times D}$, $i = 1, 2, \ldots, I$, $d = 1, 2, \ldots, D$.
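A minimal sketch of this training loop in Python follows. The Gaussian form of the basis functions $B_r$, the fixed centres and the shared width are assumptions made for illustration; only the output weights are trained, following Equations (A.1.1)-(A.1.4).

    import numpy as np

    def rbf_outputs(X, centers, width):
        """Gaussian radial basis outputs B_r(x_i), shape (I, R);
        the Gaussian form is an assumption for this sketch."""
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * width ** 2))

    def train_rbfn_weights(X, delta, centers, width, eta=0.1, epochs=100):
        """Pattern-by-pattern steepest descent on the output weights:
        w_r <- w_r + eta * e_i * B_r(x_i), i.e. Equation (A.1.4)."""
        B = rbf_outputs(X, centers, width)
        w = np.zeros(B.shape[1])
        for _ in range(epochs):
            for B_i, delta_i in zip(B, delta):
                e_i = delta_i - B_i @ w       # e_i = delta_i - y_i, (A.1.2)
                w += eta * e_i * B_i          # delta rule, (A.1.4)
        return w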
A.2 ANFIS Network Training Algorithm
In this section, we will consider the derivation of the parameter adaptation method
for the ANFIS architecture. The task is to minimize the total error (shown in
Equation (A.2.5)).
\[ E = \frac{1}{2}\sum_i e_i^2 \tag{A.2.5} \]
where
\[ e_i = \delta_i - y_i \tag{A.2.6} \]
where $\delta_i$ and $y_i$ are respectively the desired output and the output of the network.
The derivative of the total error function with respect to the weights in the
aggregation layer is given in Equation (A.2.7).
\[ \frac{\partial E}{\partial \alpha_{rd}} = \frac{\partial E}{\partial e_i}\cdot\frac{\partial e_i}{\partial y_i}\cdot\frac{\partial y_i}{\partial w_{ir}}\cdot\frac{\partial w_{ir}}{\partial \alpha_{rd}} = -e_i \pi_{ir} x_{id} \tag{A.2.7} \]
Applying Equation (A.2.7) to the delta rule we obtain Equation (A.2.8).
\[ \alpha_{rd}^{\mathrm{new}} = \alpha_{rd}^{\mathrm{old}} - \eta\frac{\partial E}{\partial \alpha_{rd}} = \alpha_{rd}^{\mathrm{old}} + \eta\, e_i \pi_{ir} x_{id} \tag{A.2.8} \]
where $\eta$ is a learning constant, $x \in \mathbb{R}^{I\times D}$, $i = 1, 2, \ldots, I$, $d = 1, 2, \ldots, D$.
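The corresponding update step can be sketched in the same way; the layout of the weight matrix alpha and of the firing strengths pi_i below are assumptions based only on the notation of this appendix.

    import numpy as np

    def update_alpha(alpha, x_i, e_i, pi_i, eta=0.1):
        """One delta-rule step for the aggregation-layer weights:
        alpha_rd <- alpha_rd + eta * e_i * pi_ir * x_id, Equation (A.2.8).
        alpha has shape (R, D), pi_i shape (R,), x_i shape (D,)."""
        return alpha + eta * e_i * np.outer(pi_i, x_i)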
A.3 Gradient determination in EANFIS Layer 4 for continuous output values
In this section, we will consider the derivation of the learning rules for the EANFIS
architecture when the output is a continuous value. The total error is
$E = \sum_i e_i^2$, where $e_i = \delta_i - y_i$, and $\delta_i$ and $y_i$ are respectively
the desired output and the output of the EANFIS architecture. The parameter
estimation of the weight $\gamma_r$ (please see Chapter 4 for the definition of the
parameters) depends on the evaluation of the following partial derivatives:
\[ \frac{\partial E}{\partial \gamma_r} = \frac{\partial E}{\partial E_i}\sum_{i=1}^{I}\frac{\partial E_i}{\partial e_i}\cdot\frac{\partial e_i}{\partial y_i}\cdot\sum_{r=1}^{R}\frac{\partial y_i}{\partial \bar{\pi}_{ir}}\cdot\frac{\partial \bar{\pi}_{ir}}{\partial \pi_{ir}}\cdot\frac{\partial \pi_{ir}}{\partial \gamma_r} \tag{A.3.9} \]
\[ \frac{\partial E}{\partial E_i} = \frac{1}{I} \tag{A.3.10} \]
\[ \frac{\partial E_i}{\partial e_i} = e_i \tag{A.3.11} \]
\[ \frac{\partial e_i}{\partial y_i} = -1 \tag{A.3.12} \]
\[ \frac{\partial y_i}{\partial \bar{\pi}_{ir}} = w_{ir} \tag{A.3.13} \]
\[ \frac{\partial \bar{\pi}_{ir}}{\partial \pi_{ir}} = \frac{\left(\sum_r \pi_{ir}\right) - \pi_{ir}}{\left(\sum_r \pi_{ir}\right)^2} \tag{A.3.14} \]
\[ \frac{\partial \pi_{ir}}{\partial \gamma_r} = (1 - \pi_{ir})(10\pi_{ir} - 5\pi_{ir}) \tag{A.3.15} \]
By substituting Equations (A.3.10)-(A.3.15) into Equation (A.3.9) we obtain
Equation (A.3.16).
\[ \frac{\partial E}{\partial \gamma_r} = \frac{1}{I}\sum_{i=1}^{I} e_i \sum_{r=1}^{R} w_{ir}\cdot\frac{\left(\sum_r \pi_{ir}\right) - \pi_{ir}}{\left(\sum_r \pi_{ir}\right)^2}\cdot(\pi_{ir} - 1)(10\pi_{ir} - 5\pi_{ir}) \tag{A.3.16} \]
A.4 Gradient determination in EANFIS Layer 4 for discrete output types
In this case, the derivation is quite similar to the one derived for the continuous
output values in the previous section. The total error is
$E = \sum_i e_i^2 = \sum_i (\delta_i - y_i)^2$, where $\delta_i$ and $y_i$ are respectively
the desired output and the output from the EANFIS architecture. The learning rule
then depends on the evaluation of the following partial derivative (please refer to
Chapter 4 for the notation).
\[ \frac{\partial E}{\partial \gamma_r^{\tau}} = \frac{\partial E}{\partial E_i}\sum_{i=1}^{I}\frac{\partial E_i}{\partial e_i^{\tau}}\cdot\frac{\partial e_i^{\tau}}{\partial \bar{\pi}_{ir}^{\tau}}\cdot\frac{\partial \bar{\pi}_{ir}^{\tau}}{\partial \pi_{ir}}\cdot\frac{\partial \pi_{ir}}{\partial \gamma_r^{\tau}} \tag{A.4.17} \]
and we have:
\[ \frac{\partial E}{\partial E_i} = \frac{1}{I} \tag{A.4.18} \]
\[ \frac{\partial E_i}{\partial e_i^{\tau}} = e_i^{\tau} \tag{A.4.19} \]
\[ \frac{\partial e_i^{\tau}}{\partial \bar{\pi}_{ir}^{\tau}} = -1 \tag{A.4.20} \]
\[ \frac{\partial e_i^{\neg\tau}}{\partial \bar{\pi}_{ir}^{\tau}} = -1 \tag{A.4.21} \]
\[ \frac{\partial \bar{\pi}_{ir}^{\tau}}{\partial \pi_{ir}} = \frac{\left(\sum_r \pi_{ir}\right) - \pi_{ir}}{\left(\sum_r \pi_{ir}\right)^2} \tag{A.4.22} \]
\[ \frac{\partial \pi_{ir}}{\partial \gamma_r^{\tau}} = (1 - \pi_{ir})(10\pi_{ir} - 5\pi_{ir}) \tag{A.4.23} \]
By substituting Equations (A.4.18)-(A.4.23) into Equation (A.4.17) we obtain
Equation (A.4.24).
\[ \frac{\partial E}{\partial \gamma_r^{\tau}} = \frac{1}{I}\sum_{i=1}^{I} e_i^{\tau}\cdot\frac{\left(\sum_r \pi_{ir}\right) - \pi_{ir}}{\left(\sum_r \pi_{ir}\right)^2}\cdot(\pi_{ir} - 1)(10\pi_{ir} - 5\pi_{ir}) \tag{A.4.24} \]
Appendix B
An example of Linear Discriminant Analysis transformation
In this section, we provide a detailed worked example of Fisher's linear
discriminant analysis based on the iris data set. The input data shown in Table B.1
contain two discrete output classes. Figure B.1 shows the original input data.
Assume that two rule sets are formed by the Apriori algorithm. The first rule set
contains data labels {1, 2, 3, 4} and the second rule set contains data labels
{5, 6, 7, 8}. The covariances
\[ S_r = \frac{1}{n-1}\left(X^{(r)} - \bar{X}^{(r)}\right)^T\left(X^{(r)} - \bar{X}^{(r)}\right), \quad r = 1, 2 \]
are formed as follows:
Table B.1: Input data

Data Label        d1        d2        δ
1                 0.2520    0.2036    1
2                 0.2982    0.1018    1
3                 0.2828    0.2962    1
4                 0.2571    0.2314    1
mean X̄(1)         0.2725    0.2083
5                 0.3908    0.3703    2
6                 0.3959    0.4813    2
7                 0.4217    0.4443    2
8                 0.4628    0.4906    2
mean X̄(2)         0.4178    0.4466
global mean X̄     0.3452    0.3274
Figure B.1: Input data (scatter of d1 versus d2; legend: Type 1, Type 2)
• $S^{(1)}$: the covariance matrix of rule set 1:
\[ S^{(1)} = \frac{1}{4-1}\left(\begin{bmatrix} 0.2520 & 0.2036 \\ 0.2982 & 0.1018 \\ 0.2828 & 0.2962 \\ 0.2571 & 0.2314 \end{bmatrix} - \begin{bmatrix} 0.2725 & 0.2083 \\ 0.2725 & 0.2083 \\ 0.2725 & 0.2083 \\ 0.2725 & 0.2083 \end{bmatrix}\right)^T \left(\begin{bmatrix} 0.2520 & 0.2036 \\ 0.2982 & 0.1018 \\ 0.2828 & 0.2962 \\ 0.2571 & 0.2314 \end{bmatrix} - \begin{bmatrix} 0.2725 & 0.2083 \\ 0.2725 & 0.2083 \\ 0.2725 & 0.2083 \\ 0.2725 & 0.2083 \end{bmatrix}\right) \]
\[ S^{(1)} = \frac{1}{4-1}\begin{bmatrix} -0.0206 & -0.0046 \\ 0.0257 & -0.1064 \\ 0.0103 & 0.0879 \\ -0.0154 & 0.0231 \end{bmatrix}^T \begin{bmatrix} -0.0206 & -0.0046 \\ 0.0257 & -0.1064 \\ 0.0103 & 0.0879 \\ -0.0154 & 0.0231 \end{bmatrix} = \begin{bmatrix} 0.0004760 & -0.0006981 \\ -0.0006981 & 0.006540 \end{bmatrix} \tag{B.0.1} \]
• $S^{(2)}$: the covariance matrix of rule set 2:
\[ S^{(2)} = \frac{1}{4-1}\left(\begin{bmatrix} 0.3908 & 0.3703 \\ 0.3959 & 0.4813 \\ 0.4217 & 0.4443 \\ 0.4628 & 0.4906 \end{bmatrix} - \begin{bmatrix} 0.4178 & 0.4466 \\ 0.4178 & 0.4466 \\ 0.4178 & 0.4466 \\ 0.4178 & 0.4466 \end{bmatrix}\right)^T \left(\begin{bmatrix} 0.3908 & 0.3703 \\ 0.3959 & 0.4813 \\ 0.4217 & 0.4443 \\ 0.4628 & 0.4906 \end{bmatrix} - \begin{bmatrix} 0.4178 & 0.4466 \\ 0.4178 & 0.4466 \\ 0.4178 & 0.4466 \\ 0.4178 & 0.4466 \end{bmatrix}\right) \]
\[ S^{(2)} = \frac{1}{4-1}\begin{bmatrix} -0.0270 & -0.0764 \\ -0.0219 & 0.0347 \\ 0.0039 & -0.0023 \\ 0.0450 & 0.0440 \end{bmatrix}^T \begin{bmatrix} -0.0270 & -0.0764 \\ -0.0219 & 0.0347 \\ 0.0039 & -0.0023 \\ 0.0450 & 0.0440 \end{bmatrix} = \begin{bmatrix} 0.001082 & 0.001091 \\ 0.001091 & 0.002992 \end{bmatrix} \tag{B.0.2} \]
• $S_b$: the between-class scatter:
\[ S_b = \sum_{r=1}^{R}\left(\bar{X}^{(r)} - \bar{X}\right)^T\left(\bar{X}^{(r)} - \bar{X}\right) \]
\[ S_b = \left(\begin{bmatrix} 0.2725 & 0.2083 \end{bmatrix} - \begin{bmatrix} 0.3452 & 0.3274 \end{bmatrix}\right)^T\left(\begin{bmatrix} 0.2725 & 0.2083 \end{bmatrix} - \begin{bmatrix} 0.3452 & 0.3274 \end{bmatrix}\right) + \left(\begin{bmatrix} 0.4178 & 0.4466 \end{bmatrix} - \begin{bmatrix} 0.3452 & 0.3274 \end{bmatrix}\right)^T\left(\begin{bmatrix} 0.4178 & 0.4466 \end{bmatrix} - \begin{bmatrix} 0.3452 & 0.3274 \end{bmatrix}\right) = \begin{bmatrix} 0.01055 & 0.01731 \\ 0.01731 & 0.02841 \end{bmatrix} \tag{B.0.3} \]
• $S_w$: the within-class scatter:
\[ S_w = \sum_{r=1}^{R}\frac{N_r}{N} S_r = \frac{4}{8}\begin{bmatrix} 0.0004760 & -0.0006981 \\ -0.0006981 & 0.006540 \end{bmatrix} + \frac{4}{8}\begin{bmatrix} 0.001082 & 0.001091 \\ 0.001091 & 0.002992 \end{bmatrix} = \begin{bmatrix} 0.0007789 & 0.0001963 \\ 0.0001963 & 0.0047661 \end{bmatrix} \tag{B.0.4} \]
There are two types of transformations: class independent and class dependent. The
normalized eigenvectors and eigenvalues of the class dependent transformation,
$S_r^{-1}S_b$, and of the class independent transformation, $S_w^{-1}S_b$, are shown
in Equations (B.0.5) and (B.0.7) respectively.

• Class dependent LDA transformation
\[ \lambda^{r} v^{r} = S_r^{-1} S_b v^{r} \]
\[ V^{(1)} = \begin{bmatrix} 0.7546 & -0.6562 \\ 0.3413 & 0.9400 \end{bmatrix}, \quad \lambda^{(1)} = \begin{bmatrix} 40.6390 \\ 0 \end{bmatrix}; \qquad V^{(2)} = \begin{bmatrix} 0.7133 & -0.7009 \\ 0.6891 & 0.7247 \end{bmatrix}, \quad \lambda^{(2)} = \begin{bmatrix} 11.9840 \\ 0 \end{bmatrix} \tag{B.0.5} \]
\[ \hat{X}^{(r)} = X^{(r)} V^{(r)}_{\{1 \ldots j\}} V^{(r)T}_{\{1 \ldots j\}}, \quad j = 1 \]
\[ \hat{X}^{(1)} = \begin{bmatrix} 0.2520 & 0.2036 \\ 0.2982 & 0.1018 \\ 0.2828 & 0.2962 \\ 0.2571 & 0.2314 \end{bmatrix}\begin{bmatrix} 0.7546 \\ 0.3413 \end{bmatrix}\begin{bmatrix} 0.7546 \\ 0.3413 \end{bmatrix}^T = \begin{bmatrix} 0.1959 & 0.0886 \\ 0.1961 & 0.0887 \\ 0.2373 & 0.1073 \\ 0.2060 & 0.0932 \end{bmatrix} \]
\[ \hat{X}^{(2)} = \begin{bmatrix} 0.3908 & 0.3703 \\ 0.3959 & 0.4813 \\ 0.4217 & 0.4443 \\ 0.4628 & 0.4906 \end{bmatrix}\begin{bmatrix} 0.7133 \\ 0.6891 \end{bmatrix}\begin{bmatrix} 0.7133 \\ 0.6891 \end{bmatrix}^T = \begin{bmatrix} 0.3808 & 0.3679 \\ 0.4380 & 0.4231 \\ 0.4329 & 0.4182 \\ 0.4766 & 0.4604 \end{bmatrix} \tag{B.0.6} \]
• Class independent LDA transformation
\[ \lambda v = S_w^{-1} S_b v \]
\[ V = \begin{bmatrix} 0.7511 & -0.6601 \\ 0.4137 & 0.9104 \end{bmatrix}, \quad \lambda = \begin{bmatrix} 17.8600 \\ 0 \end{bmatrix} \tag{B.0.7} \]
\[ \hat{X}^{(r)} = X^{(r)} V_{\{1 \ldots j\}} V^{T}_{\{1 \ldots j\}}, \quad j = 1 \]
\[ \hat{X}^{(1)} = \begin{bmatrix} 0.2520 & 0.2036 \\ 0.2982 & 0.1018 \\ 0.2828 & 0.2962 \\ 0.2571 & 0.2314 \end{bmatrix}\begin{bmatrix} 0.7511 \\ 0.4137 \end{bmatrix}\begin{bmatrix} 0.7511 \\ 0.4137 \end{bmatrix}^T = \begin{bmatrix} 0.2055 & 0.1132 \\ 0.1999 & 0.1101 \\ 0.2516 & 0.1386 \\ 0.2170 & 0.1195 \end{bmatrix} \]
\[ \hat{X}^{(2)} = \begin{bmatrix} 0.3908 & 0.3703 \\ 0.3959 & 0.4813 \\ 0.4217 & 0.4443 \\ 0.4628 & 0.4906 \end{bmatrix}\begin{bmatrix} 0.7511 \\ 0.4137 \end{bmatrix}\begin{bmatrix} 0.7511 \\ 0.4137 \end{bmatrix}^T = \begin{bmatrix} 0.3356 & 0.1848 \\ 0.3730 & 0.2054 \\ 0.3760 & 0.2071 \\ 0.4136 & 0.2278 \end{bmatrix} \tag{B.0.8} \]
(B.0.8)
The transformed data X using class dependent of rule set 1 and rule set 2 respec-
tively are respectively shown in Equation (B.0.6) and the transformed data X using
Appendix B. An example of Linear Discriminant Analysis transformation 254
class independent of rule set 1 and rule set 2 are respectively shown in Equation
(B.0.8).
• Class dependent LDA transformation
X(1) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.2520 0.2036
0.2982 0.1018
0.2828 0.2962
0.2571 0.2314
⎤⎥⎥⎥⎥⎥⎥⎥⎦→ X(1) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.1959 0.0886
0.1961 0.0887
0.2373 0.1073
0.2060 0.0932
⎤⎥⎥⎥⎥⎥⎥⎥⎦
X(2) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.3908 0.3703
0.3959 0.4813
0.4217 0.4443
0.4628 0.4906
⎤⎥⎥⎥⎥⎥⎥⎥⎦→ X(2) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.3808 0.3679
0.4380 0.4231
0.4329 0.4182
0.4766 0.4604
⎤⎥⎥⎥⎥⎥⎥⎥⎦
• Class Independent LDA transformation
X(1) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.2520 0.2036
0.2982 0.1018
0.2828 0.2962
0.2571 0.2314
⎤⎥⎥⎥⎥⎥⎥⎥⎦→ X(1) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.2055 0.1132
0.1999 0.1101
0.2516 0.1386
0.2170 0.1195
⎤⎥⎥⎥⎥⎥⎥⎥⎦
X(2) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.3908 0.3703
0.3959 0.4813
0.4217 0.4443
0.4628 0.4906
⎤⎥⎥⎥⎥⎥⎥⎥⎦→ X(2) =
⎡⎢⎢⎢⎢⎢⎢⎢⎣
0.3356 0.1848
0.3730 0.2054
0.3760 0.2071
0.4136 0.2278
⎤⎥⎥⎥⎥⎥⎥⎥⎦
Figures B.2 and B.3 respectively show the transformed data by class-dependent
LDA transformation and class-independent LDA transformation.
Figure B.2: Class-dependent LDA transformation (transformed data, d1 versus d2; legend: Type 1, Type 2)
Figure B.3: Class-independent LDA transformation (transformed data, d1 versus d2; legend: Type 1, Type 2)
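This worked example can be reproduced with a few lines of NumPy, as in the sketch below, which follows the equations of this appendix. Note that the mapped-back coordinates in Equations (B.0.6) and (B.0.8) depend on the scaling and sign convention used for the eigenvectors, so the printed values match the matrices above only up to that normalisation.

    import numpy as np

    X1 = np.array([[0.2520, 0.2036], [0.2982, 0.1018],
                   [0.2828, 0.2962], [0.2571, 0.2314]])
    X2 = np.array([[0.3908, 0.3703], [0.3959, 0.4813],
                   [0.4217, 0.4443], [0.4628, 0.4906]])

    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    m = np.vstack([X1, X2]).mean(axis=0)

    # per-rule-set covariances, S_r = (X_r - mean)^T (X_r - mean) / (n - 1)
    S1 = (X1 - m1).T @ (X1 - m1) / (len(X1) - 1)
    S2 = (X2 - m2).T @ (X2 - m2) / (len(X2) - 1)

    # between-class and within-class scatter, Equations (B.0.3)-(B.0.4)
    Sb = np.outer(m1 - m, m1 - m) + np.outer(m2 - m, m2 - m)
    Sw = 0.5 * S1 + 0.5 * S2

    # class-independent transformation: leading eigenvector of Sw^{-1} Sb
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    v = vecs[:, np.argmax(vals.real)].real
    Xhat1 = X1 @ np.outer(v, v)   # project onto v and map back, j = 1
    Xhat2 = X2 @ np.outer(v, v)   # class-dependent case: use S1 or S2 in place of Sw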