

CHAPTER 5

ARTIFICIAL NEURAL NETWORKS (ANN) AND

SUPPORT VECTOR MACHINES (SVM)

5.1 AN OVERVIEW OF ANN

An artificial neural network (ANN) is a form of computing inspired by the functioning of the brain and nervous system. The ANN approach is based on the highly interconnected structure of brain cells (Margrave et al 1999). ANNs are based on the present understanding of biological nervous systems, though much of the biological detail is neglected (Freeman and Skapura 1991). Neural networks represent highly idealized mathematical models of our understanding of such complex systems (Jordan et al 1998). An ANN is an information processing paradigm inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system: it is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process (Roy et al 1995, Jordan et al 1998, Silverman and Noetzel 1990, Wendel and Dual 1996). Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons. This is true of ANNs as well.

ANNs are mathematical models whose purpose is to simulate the human brain in a simple and objective way. Such a model should therefore have the fundamental capacity of a brain, namely the capacity to learn, which permits carrying out tasks that are considered typical of the human brain, such as pattern recognition, creation of associations, system identification and clustering (Veiga et al 2005). Although they are less complex than the human brain, neural networks can process in a short period of time enormous amounts of data that could otherwise be analyzed only by a specialist. One of the most important characteristics of an artificial neural network is its ability to be trained, or to learn by example, much as the human brain does (Haykin 1994).

ANNs fall into the category of parametric models that are generally lumped (Masnata and Sunseri 1995). Applying an ANN does not require complete details about the catchment’s characteristics, because the ANN is a black box approach. The ANN technique can be applied to data that may be incomplete, noisy and ambiguous. ANNs are well suited to dynamic problems and are economical in storing information. They are simple and easy to adopt compared with other kinds of models (Berry et al 1991, Kim et al 2001 and Spanner et al 2000).

Artificial neural networks have been developed as generalizations

of mathematical models of human cognition or neural biology, based on the

assumptions that:

Information processing occurs at many simple elements called neurons.

Signals are passed between neurons over connection links.

Each connection link has an associated weight, which, in a typical neural net, multiplies the signal transmitted.

Each neuron applies an activation function (usually non-linear) to its net input (sum of weighted input signals) to determine its output signal.
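The four assumptions above can be illustrated with a minimal Python sketch of a single neuron. The function name, the sample values and the optional bias term are illustrative assumptions and are not taken from the study.

```python
import math

def neuron_output(inputs, weights, bias=0.0, activation=math.tanh):
    """Return the output signal of one artificial neuron."""
    # Net input: each incoming signal is multiplied by the weight of its link.
    # The bias term is an assumption added for generality (default 0).
    net_input = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The (usually non-linear) activation function maps net input to output.
    return activation(net_input)

signals = [0.5, -1.0, 2.0]   # illustrative incoming signals
weights = [0.4, 0.3, 0.1]    # illustrative connection weights
y = neuron_output(signals, weights)
```

Here the net input is 0.4(0.5) + 0.3(-1.0) + 0.1(2.0) = 0.1, and the neuron sends tanh(0.1) to every neuron it is connected to.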


A neural network is characterized by:

Its pattern of connections between the neurons (called its architecture)

Its method of determining the weights on the connections (called its training or learning algorithm)

Its activation function.

Commonly neural networks are adjusted or trained, so that a

particular input leads to a specific target output. Such a situation is shown in

Figure 5.1. There, the network is adjusted, based on a comparison of the

output and the target, until the network output matches the target (Wasserman

1989). Typically many such input/target pairs are used, in this supervised

learning, to train a network.

Figure 5.1 Basic function of ANN

A typical ANN consists of a large number of neurons (also called units, cells or nodes) that are organized according to a particular arrangement. Each neuron is connected to other neurons by means of directed communication links, each with an associated weight. The weights represent the information used by the net to solve the problem.


Each neuron has an internal state, called its activation or activity

level, which is a function of the inputs it has received. Typically a neuron

sends its activation as a signal to several other neurons. It is important to note

that a neuron can send only one signal at a time, although that signal is

broadcast to several other neurons.

5.2 OPTIMAL NETWORK ARCHITECTURE

Neural networks operate on the principle of learning from a training set. Applying an ANN to any classification problem requires a thorough knowledge of how to choose an appropriate network type and training algorithm, suitable values for parameters such as the initial weights, learning rate and momentum rate, an appropriate network structure, the training period, and the method of pre- and post-processing the input and output data (Baum and Haussler 1989). It must be noted that the whole exercise is based on a trial and error approach. A variety of neural network models and learning procedures exist. Figure 5.2 shows the architecture of the neural system.

Figure 5.2 Architecture of neural system

[Block diagram labels: Input patterns stored in a file; Neural network parameters stored in a file; Normalized target patterns stored in a file; Training the neural network; Final weight values, after training, stored in a file; Production mode; End user; Final result]


The determination of optimal network architecture as a part of the learning strategy was proposed by Houghton and Shen (1990). Determination of the appropriate neural network architecture is one of the most difficult tasks in the model building process (Tho et al 2004). The various types of network architecture available are feed forward networks, Jordan-Elman nets, Ward nets, jump connection nets, unsupervised Kohonen, probabilistic, general regression nets, GMDH (Polynomial set), Recurrent Neural Networks (RNN) and Radial Basis Function (RBF) networks.

The process of selecting a suitable architecture for a required problem can be broadly classified into three steps (Carling and Alison 1995).

1. Fixing the architecture

2. Training the network

3. Testing the network

The following procedure is used to determine the optimal network architecture by trial and error. First, the input/output parameters, training size and learning algorithm are decided, and a network is chosen with a trial number of nodes in the hidden layer. Hidden nodes perform a twofold function: first they compute a signal from all incoming information, and second they transform this signal using a non-linear activation function (Roy et al 1995). The network is trained for a fixed number of epochs, and the network error gradient is observed over these epochs. Then the network architecture is changed by increasing or decreasing the number of hidden nodes, and the training procedure is repeated for the new architecture. This procedure is continued for several different architectures. Eventually, the network architecture that resulted in the maximum error gradient over the training epochs is adopted as the optimal architecture (Gallent 1993).
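The trial and error procedure can be sketched in Python as a loop over candidate hidden-layer sizes. This is only an illustration under stated assumptions: the "error gradient" is taken here to mean the average drop in error per epoch, and `toy_train_fn` is a hypothetical stand-in that returns a made-up error curve instead of training a real network.

```python
def error_gradient(errors):
    """Average drop in error per epoch over one training run (assumed definition)."""
    return (errors[0] - errors[-1]) / (len(errors) - 1)

def select_hidden_nodes(candidates, train_fn):
    """Try each hidden-layer size and keep the one with the largest error gradient."""
    best_size, best_gradient = None, float("-inf")
    for n_hidden in candidates:
        errors = train_fn(n_hidden)      # error recorded after each epoch
        gradient = error_gradient(errors)
        if gradient > best_gradient:
            best_size, best_gradient = n_hidden, gradient
    return best_size, best_gradient

def toy_train_fn(n_hidden, epochs=50):
    """Hypothetical stand-in for real training: returns a made-up error curve."""
    floor = 0.02 * abs(n_hidden - 6) + 0.05   # pretend that 6 nodes fit best
    return [floor + (1.0 - floor) * 0.9 ** epoch for epoch in range(epochs + 1)]

best_size, best_gradient = select_hidden_nodes(range(2, 11), toy_train_fn)
```

In practice `train_fn` would train the network for the fixed number of epochs and return the recorded error history; the search loop itself stays the same.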

Finding a suitable network architecture can be a very time-consuming exercise. If the architecture is too small, the network may not have sufficient degrees of freedom to learn the process correctly. On the other hand, if the network is too large it may not converge during training or may overfit the data (Thavasimuthu et al 1996). Another way of determining the ‘optimum’ number of neurons in the hidden layer(s) is to add links and hidden nodes to a simple network until convergence occurs.

A different approach was attempted by Ogilvy (1993) by

progressively adding or removing nodes until an optimum structure is

attained. The number of hidden layer neurons required is much more difficult

to determine since no general methodology is available for its determination.

In general, the most popular way of determining the appropriate number of neurons in the hidden layer is the trial and error approach (Brown and DeNale 1991, Katragadda et al 1997 and McBride et al 2004).

Regardless of the approach used to optimize the number of neurons in the hidden layer, care needs to be taken, since too many neurons will increase training times unnecessarily by making it more difficult to estimate a suitable set of interconnection weights, while too few neurons can cause difficulties in mapping input to output in the training set. The Multi Layer Perceptron (MLP) is one of the most fundamental and suitable types of ANN architecture for practical applications of model identification. It is reported in the literature that other kinds of ANN architecture, such as Radial Basis Function networks and Recurrent Neural Networks, do not provide any major advantage over the MLP architecture. Both the accuracy of classification and a network’s learning ability can be severely affected if the architecture is not suitable.

Two well known classes of neural networks that can be used for

classification applications are feed forward networks and probabilistic

networks. Of the two, feed forward networks have been found to have the widest application and have thus been adopted in this study.


5.3 FEED FORWARD NETWORKS

In a feed forward network the weighted connections feed

activations only in the forward direction from the input layer to the output

layer. The input neurons receive and process the input signals and send the

output to other neurons in the network where this process is continued

(Margrave et al 1999). This type of network where information passes one

way through the network is known as a feed forward network. A three-layered feed forward ANN, also known as a Multi Layer Perceptron (MLP), along with a typical processing element, an activation function and a threshold function embedded in its body, is shown in Figure 5.3.

Figure 5.3 Three layer feed forward ANN along with processing element


The data passing through the connections from one neuron to

another are manipulated by weights that control the strength of a passing

signal. When these weights are modified, the data transferred through the

network change and the network output alters. In the feed forward network,

the nodes are generally arranged in layers, starting from the first input layer

and ending at the final output layer. There can be several hidden layers, with

each layer having one or more nodes. Each neuron has one or more inputs and one or more outputs. The output is computed from the weighted sum of all its inputs and a selected activation function. The activation functions available include linear, logistic, hyperbolic tangent and Gaussian functions. In most studies the logistic sigmoid or hyperbolic tangent function is adopted. The basic characteristics of the sigmoid functions are that they are continuous, differentiable everywhere and monotonically increasing. The number of neurons in the input, hidden and

output layer is specified by the user dealing with the problem to which the

network is applied (Moura et al 2001).
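The activation functions named above can be written in a few lines of Python. This is a sketch only: the Gaussian is shown in one common form, exp(-x^2), which is an assumption, and the function names are illustrative.

```python
import math

def linear(x):
    return x

def logistic(x):
    # Logistic sigmoid: output lies in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def hyperbolic_tangent(x):
    # Output lies in (-1, 1).
    return math.tanh(x)

def gaussian(x):
    # Bell-shaped: equals 1 at x = 0 and decays on both sides (not monotonic).
    return math.exp(-x * x)

def logistic_derivative(x):
    # The sigmoid is differentiable everywhere; its slope is s * (1 - s).
    s = logistic(x)
    return s * (1.0 - s)

xs = [-4.0, -2.0, 0.0, 2.0, 4.0]
logistic_values = [logistic(x) for x in xs]
tanh_values = [hyperbolic_tangent(x) for x in xs]
```

Both sigmoid-shaped functions give strictly increasing values over `xs`, and the logistic derivative is positive at every point, which matches the properties listed in the text.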

The number of input variables determines the number of input neurons, while the number of output variables determines the number of output neurons. An excessive number of neurons may hinder the training process by slowing it down. Information passes from the input to

the output side. The nodes in one layer are connected to those in the next, but

not to those in the same layer. Thus, the output of a node in a layer is only

dependent on the inputs it receives from the previous layer and the

corresponding weights. The multilayer feed forward networks have been

found to have the best performance with regard to input output function

approximation. Song et al (2002) have mentioned that three-layer feed forward ANNs can be used to model real world functional relationships that may be of unknown or poorly defined form and complexity. The feed

forward network is capable of nonlinear pattern recognition and memory


association (Bishop 1995). One of the most important types of feed forward

network is the Back Propagation Network.
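The layer-by-layer flow of information can be sketched as a forward pass in Python. The 2-3-1 layout, the weight values and the use of tanh in both layers are illustrative assumptions, not the network used in this study.

```python
import math

def layer_output(inputs, weights, biases, activation):
    """Outputs of one layer: each node sees only the previous layer's outputs."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def feed_forward(inputs, layers):
    """Pass a signal from the input side to the output side, one layer at a time."""
    signal = inputs
    for weights, biases, activation in layers:
        signal = layer_output(signal, weights, biases, activation)
    return signal

# Illustrative 2-3-1 network; the weight and bias values are arbitrary.
hidden = ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1], math.tanh)
output = ([[0.7, -0.5, 0.2]], [0.05], math.tanh)
prediction = feed_forward([1.0, 2.0], [hidden, output])
```

No node reads from its own layer or from a later one, so the whole computation is a single sweep from input to output.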

5.3.1 Back Propagation Networks (BPN)

Back propagation is a systematic method for training multi-layer artificial neural networks. It has a strong mathematical foundation, even if it is not always highly practical. A back propagation network is a multi-layer feed forward network trained with the extended gradient-descent-based delta learning rule, commonly known as the back propagation (of errors) rule. Back propagation provides a computationally efficient method for changing the weights in a feed forward network with differentiable activation function units so that it learns a training set of input-output examples. Being a gradient descent method, it minimizes the total squared error of the output computed by the net. The network is trained by a supervised learning method. The aim is to train the net to achieve a balance between the ability to respond correctly to the input patterns used for training and the ability to give good responses to inputs that are similar to them.

Back propagation networks (BPN) are multi-layer networks whose hidden layers use a sigmoid transfer function. The transfer function in the hidden layers should be differentiable, and thus either log-sigmoid or tan-sigmoid functions are typically used. In this study, the tan-sigmoid transfer function, ‘tansig’, is used for both the hidden layers and the output layer. A transfer function calculates a layer’s output from its net input. Each hidden layer and output layer is made of artificial neurons, which are connected through adaptive weights. The training function selected for the network is ‘trainlm’.

This type of neural network is trained using a process of supervised learning in which the network is presented with a series of matched input and output patterns, and the connection strengths, or weights, are automatically adjusted to decrease the difference between the actual and

desired outputs. The schematic diagram of feed forward back propagation

network structure is shown in Figure 5.4. After the training phase, the testing

data set is presented to the trained model, to see how well the network has

learnt and how well the network has performed.

Figure 5.4 Schematic diagram of BPN structure

The classification of data generally involves four steps:

1. Assemble the training data

2. Create the network object

3. Train the network

4. Simulate the network response to new inputs

Before any data has been run through the network, the weights for

the nodes are random, which has the effect of making the network much like a

newborn's brain – developed but without knowledge. When presented with an

input pattern, each input node takes the value of the corresponding attribute in


the input pattern. These values are then “fired”, at which time each node in

the hidden layer multiplies each attribute value by a weight and adds them

together. If this is above the node's threshold value, it fires a value of '1';

otherwise it fires a value of '0'. The same process is repeated in the output

layer with the values from the hidden layer, and if the threshold value is

exceeded, the input pattern is given the classification. When training the

network, once a classification has been given, it is compared to the actual

classification.

The error is then “back propagated” through the network, which causes the hidden and output layer nodes to adjust their weights in response to any error in classification. The weights are modified according to the gradient of the error curve, moving in the direction of the local minimum nearest the current weights. Unfortunately, the local minimum is not

always the global minimum, which causes the network to settle in a non-

optimal configuration. The network can sometimes be deterred from settling

in local minima by increasing or decreasing the number of hidden layer nodes

or even by rerunning the algorithm (this is because the weights will be

reinitialized to a different set of random numbers, which may keep them from

falling into a local minimum that is not the global minimum).
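The effect of rerunning from different random starting weights can be shown on a toy problem. The sketch below assumes a single weight and a made-up error curve with two minima; it is not the error surface of a real network.

```python
import random

def error(w):
    # Toy error curve with two minima; the one near w = -1.3 is the global minimum.
    return w ** 4 - 3 * w ** 2 + w

def error_slope(w):
    # Derivative of the toy error curve.
    return 4 * w ** 3 - 6 * w + 1

def gradient_descent(w, learning_rate=0.01, steps=2000):
    # Repeatedly step against the slope; ends in the minimum nearest the start.
    for _ in range(steps):
        w -= learning_rate * error_slope(w)
    return w

def train_with_restarts(n_restarts, seed=0):
    rng = random.Random(seed)
    results = []
    for _ in range(n_restarts):
        w0 = rng.uniform(-2.0, 2.0)          # fresh random initial weight
        w = gradient_descent(w0)
        results.append((error(w), w))
    return min(results), results             # keep the run with the lowest error

(best_error, best_w), all_runs = train_with_restarts(10)
```

Runs that start on the right of the hump settle near w = 1.13, a local minimum, while runs that start on the left reach the global minimum; keeping the best of several runs reduces the chance of accepting the poorer solution.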

Standard back propagation is a gradient descent algorithm, as is the

Widrow-Hoff learning rule, in which the network weights are moved along

the negative of the gradient of the performance function. The term back

propagation refers to the manner in which the gradient is computed for

nonlinear multilayer networks. There are a number of variations on the basic

algorithm that are based on other standard optimization techniques, such as

conjugate gradient and Newton methods.
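For a single linear unit, moving the weights along the negative gradient of the squared error gives the Widrow-Hoff (LMS) rule. The following Python sketch uses illustrative, noise-free data; the learning rate and epoch count are assumptions chosen for the example.

```python
def lms_train(samples, learning_rate=0.1, epochs=200):
    """Widrow-Hoff (LMS) rule for a single linear unit y = w * x + b."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - (w * x + b)
            # The squared-error gradient w.r.t. w is -2 * error * x, so stepping
            # against it adds a multiple of error * x (the factor 2 is absorbed
            # into the learning rate).
            w += learning_rate * error * x
            b += learning_rate * error
    return w, b

# Noise-free samples of the line y = 2x + 1 (illustrative data).
samples = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w, b = lms_train(samples)
```

After training, `w` and `b` are very close to 2 and 1. Back propagation applies the same idea to multilayer networks, with the chain rule supplying the gradient for the hidden-layer weights.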


5.4 TRAINING OF ANN

Connection weights of the network are learned through a process called ‘training’, in which a large number of input-output pattern pairs are presented to the network in a repetitive fashion designed to provide iterative corrections to the weights. Each iteration (‘epoch’) is a single pass through all

training pattern pairs. In most of the studies the MLP is trained using the

error back propagation algorithm (Wendel and Dual 1996). The final weight vector of a successfully trained neural network represents its knowledge about the problem. In general, it is assumed that the network does not have any prior

knowledge about the problem before it is trained. So, at the beginning of

training the network weights are initialized with a set of random values

(Selvakumar et al 2004). Learning in neural networks involves adjusting the

weights of interconnections.

The most commonly used training algorithm for feed forward

networks is the back propagation algorithm (Veiga et al 2005). The back propagation algorithm is a gradient descent method in which the weights of the connections are updated using the partial derivatives of the error with respect to the weights. However, the standard back propagation algorithm can train only

a network of predetermined size. In the BP algorithm, the weight associated

with the neuron is adjusted by an amount proportional to the strength of the

signal in the connection and the total measure of the error. The total error at

the output layer is then reduced by redistributing this error backward through

the hidden layers until the input layer is reached. The process continues for

the number of prescribed sweeps or until the prescribed error tolerance is

reached (Moura et al 2001).
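The weight update described above can be sketched as a small back propagation trainer in pure Python. It is a minimal illustration under stated assumptions: one hidden layer, tanh units in both layers, plain gradient descent (not the Levenberg-Marquardt ‘trainlm’ routine used in this study), and a toy logical AND task.

```python
import math
import random

def train_bpn(patterns, n_hidden=3, learning_rate=0.1, epochs=2000, seed=1):
    """Train a small tanh network by back propagation; return (predict, errors)."""
    rng = random.Random(seed)
    n_in = len(patterns[0][0])
    # Weights start as small random values; the last entry of each row is a bias.
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(x):
        h = [math.tanh(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
        y = math.tanh(sum(w * v for w, v in zip(w_out, h + [1.0])))
        return h, y

    errors = []
    for _ in range(epochs):                  # one epoch = one pass over all pairs
        total = 0.0
        for x, target in patterns:
            h, y = forward(x)
            total += 0.5 * (target - y) ** 2
            # Output delta: error times the derivative of tanh, (1 - y^2).
            delta_out = (target - y) * (1.0 - y * y)
            # Hidden deltas: the output delta redistributed backward through w_out.
            delta_hidden = [delta_out * w_out[j] * (1.0 - h[j] ** 2) for j in range(n_hidden)]
            # Each weight changes in proportion to its input signal and the delta.
            for j, v in enumerate(h + [1.0]):
                w_out[j] += learning_rate * delta_out * v
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hidden[j][i] += learning_rate * delta_hidden[j] * v
        errors.append(total)
    return (lambda x: forward(x)[1]), errors

# Illustrative task: logical AND with targets coded as -1 / +1.
patterns = [([0.0, 0.0], -1.0), ([0.0, 1.0], -1.0), ([1.0, 0.0], -1.0), ([1.0, 1.0], 1.0)]
predict, errors = train_bpn(patterns)
```

The list `errors` holds the total squared error of each epoch, so it can be checked against a prescribed tolerance or plotted to judge convergence.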


Some major limitations of the BP algorithm are:

It is easily trapped in local optima.

Convergence is a slow process.

The architecture is often ineffective when searching weight spaces of high dimension.

The performance of a BP-ANN simulator is quite sensitive to the initial starting point.

During training, the network is trained to associate outputs with

input patterns. When the network is trained, it identifies the input pattern and

tries to output the associated output pattern. The power of neural networks becomes apparent when a pattern that has no output associated with it is given as an input. In this case, the network gives the output that corresponds to the taught input pattern that is least different from the given pattern. The training process of a typical ANN is shown in Figure 5.5.

Figure 5.5 Training process of a typical ANN

The input layer is thus transparent and is a means of providing information to the network. The last, or output, layer consists of the values predicted by the network and thus represents the model output. The process consists of presenting an input pattern to the network, predicting the output, and then comparing this predicted output with the input pattern’s actual output. An excessive number of input variables to an ANN increases training time and decreases performance (Ogaji and Singh 2003). The goal of the training process is to present a sufficient number P of unique input-output pattern pairs which, when coupled with a suitable methodology for iterative correction of the interconnection weights, produces a final set of weights that minimizes the global error.

The number of hidden layers and the number of nodes in each

hidden layer are usually determined by a trial–and–error procedure. The

nodes within the neighbouring layers of the network are fully connected by

links. A synaptic weight is assigned to each link to represent the relative

connection strength of two nodes at both ends in predicting the input-output

relationship. Figure 5.6 represents the structure of a neuron.

Figure 5.6 Neuron

5.5 DATA PREPARATION FOR ANN

One of the major strengths of neural networks is their ability to deal with incomplete, noisy and non-stationary data (Oleg Karpash et al 2006).

However, with appropriate data preparation in advance, it is quite possible to


improve the performance of a neural network still further (Carling and Alison 1995). The various steps involved in data preparation are selection of suitable inputs and outputs, noise reduction, pre-processing such as standardizing / normalizing the data, and finally grouping the data into calibration and validation sets.

5.5.1 Selection of Suitable Inputs / Outputs

Selecting too many input variables and therefore too many free

parameters can lead to poor generalization performance. As a result, it is

crucial to reduce the dimension of the input vector either by constructing

more powerful variables in the preprocessing phase or by eliminating

variables with low information content.

5.5.2 Data Preprocessing

The input and output variables should be standardized to make sure

that they receive equal attention during the training process (Bond et al 1992).

Friedman and Kandal (1999) have emphasized the importance of correct

standardization factors. They have mentioned that the choice of

standardization ranges significantly influences the performance of the ANN

and also cautioned that ANNs should not be used for extrapolation. Without standardization in MLPs, input variables measured on different scales will dominate training to a greater or lesser extent. Data standardization plays a

vital role in improving the efficiency of the training algorithm.
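Two common ways of putting variables on a comparable scale are sketched below in Python. The target range of -1 to 1 and the two sample variables are illustrative assumptions; the standardization actually used in this study is not specified here.

```python
def min_max_scale(values, low=-1.0, high=1.0):
    """Linearly rescale values to the range [low, high]."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant column: map every value to the middle of the range.
        return [(low + high) / 2.0 for _ in values]
    return [low + (high - low) * (v - v_min) / span for v in values]

def z_score(values):
    """Rescale to zero mean and unit (population) standard deviation.

    Assumes the values are not all equal, otherwise the deviation is zero.
    """
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

amplitude_mv = [120.0, 450.0, 300.0, 900.0]   # hypothetical input on a large scale
rise_time_us = [0.2, 0.5, 0.1, 0.4]           # hypothetical input on a small scale
scaled_amplitude = min_max_scale(amplitude_mv)
scaled_rise_time = min_max_scale(rise_time_us)
```

After scaling, both variables span the same range, so neither dominates the weighted sums merely because of its units. The scaling constants should be taken from the training data and reused unchanged for the testing data.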

5.5.3 Model Training and Testing

Available records have been divided into two independent sets; the

training set and the testing set. The training set is used to minimize the error

and the testing set is used to avoid overfitting when implementing the neural


model (Opitz 1999, Ravanbod 2005). The method of splitting the data (systematic or random) can significantly affect the results.
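The two ways of splitting the records can be sketched as follows. The one-in-four testing share, the function names and the stand-in records are illustrative assumptions.

```python
import random

def systematic_split(records, test_every=4):
    """Send every `test_every`-th record to the testing set."""
    train = [r for i, r in enumerate(records) if (i + 1) % test_every != 0]
    test = [r for i, r in enumerate(records) if (i + 1) % test_every == 0]
    return train, test

def random_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records, then cut off the testing set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(20))            # stand-in for 20 measured patterns
sys_train, sys_test = systematic_split(records)
rnd_train, rnd_test = random_split(records)
```

Both functions keep the two sets disjoint. Fixing the seed makes a random split repeatable, which helps when comparing architectures on the same data.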

5.6 STRUCTURE OF BPN NETWORK

The command structure for creating, training and testing the Back

propagation network is given and clearly explained below.

Creating and Initializing BPN Network

To create a general feed forward neural network, the command is

net = newff(input range, [number1, number2], {transfer1, transfer2}, training algorithm);

Here, Input range is the maximum and minimum value of input data.

[number1, number2] is a list of the number of units in each layer.

transfer1, transfer2 is a list of transfer functions for each layer.

These are strings. The tan-sigmoid transfer function, called ‘tansig’ in Matlab, and the linear transfer function, called ‘purelin’, are typically used.

The tan-sigmoid transfer function shown in Figure 5.7 takes the input, which may have any value between plus and minus infinity, and squashes the output into the range -1 to 1. The purelin transfer function, shown in Figure 5.8, takes the input, which may have any value between plus and minus infinity, and returns it unchanged, so the network output can take any value.


Figure 5.7 Tan sigmoidal transfer function

Figure 5.8 Linear transfer function
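For readers without Matlab, the two transfer functions can be reproduced in a few lines of Python. The function names mirror the Matlab names only for readability; this is a sketch, not the toolbox code.

```python
import math

def tansig(n):
    # Tan-sigmoid: 2 / (1 + exp(-2n)) - 1, which is mathematically tanh(n).
    return 2.0 / (1.0 + math.exp(-2.0 * n)) - 1.0

def purelin(n):
    # Linear transfer function: the output equals the net input.
    return n

net_inputs = [-5.0, -1.0, 0.0, 1.0, 5.0]
tansig_out = [tansig(n) for n in net_inputs]
purelin_out = [purelin(n) for n in net_inputs]
```

Every value in `tansig_out` lies strictly between -1 and 1, while `purelin_out` simply repeats the net inputs.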

The training algorithm is specified by a string. Matlab includes many algorithms: gradient descent, gradient descent with adaptive learning rate, conjugate gradient descent, scaled conjugate gradient back propagation and others. Table 5.1 lists the fastest of the commonly used algorithms for classification.

Table 5.1 List of fastest algorithms

S. No. Algorithm

1. trainscg - Scaled Conjugate Gradient

2. trainlm - Levenberg-Marquardt

3. traingda, traingdx - Variable Learning Rate back propagation

4. traincgf - Fletcher Reeves Conjugate Gradient

Default Training Parameters

The training parameters for the network are initialized by the

following default training parameters.

net.trainparam.show

net.trainparam.epochs

net.trainparam.goal


The training parameters depend on the training algorithm selected. The number of hidden layers and the number of neurons in each hidden layer are determined by experimentation, i.e. a trial and error approach.

Training BPN network

The training of neural network is done by using the command,

net = train(net,p,t);

Here, net is the network object. This command trains the network using the training data p and target data t with the learning algorithm specified when net was created using newff.

Testing BPN Network

The response of the network to input data is tested using the sim function in Matlab.

a=sim (net, q);

where q is the new input or testing input.

5.7 PROBABILISTIC NEURAL NETWORK

Probabilistic neural network (PNN) is predominantly a classifier

which maps any input pattern into a number of classifications. It is an

implementation of a statistical algorithm called kernel discriminant analysis in

which the operations are organized into a network with four layers: the input layer, hidden layer, pattern layer / summation layer and output (decision) layer.

Figure 5.9 illustrates the schematic diagram of probabilistic neural network.

Their design is straightforward and does not depend on training.


Figure 5.9 Probabilistic neural network structure

Input layer: There is one neuron in the input layer for each predictor

variable. In the case of categorical variables, N-1 neurons are used where N is

the number of categories. The input neurons (or processing before the input

layer) standardize the range of the values by subtracting the median and

dividing by the interquartile range. The input neurons then feed the values to

each of the neurons in the hidden layer.

Hidden layer: This layer has one neuron for each case in the training data set.

The neuron stores the values of the predictor variables for the case along with

the target value. When presented with the x vector of input values from the

input layer, a hidden neuron computes the Euclidean distance of the test case

from the neuron’s center point and then applies the RBF kernel function using

the sigma value(s). The resulting value is passed to the neurons in the pattern

layer.

Pattern layer / Summation layer: There is one pattern neuron for each

category of the target variable. The actual target category of each training

case is stored with each hidden neuron; the weighted value coming out of a

hidden neuron is fed only to the pattern neuron that corresponds to the hidden


neuron’s category. The pattern neurons add the values for the class they

represent (hence, it is a weighted vote for that category).

Decision layer: The decision layer compares the weighted votes for each

target category accumulated in the pattern layer and uses the largest vote to

predict the target category.
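The four layers described above can be condensed into a short Python sketch. It assumes a single sigma shared by all hidden neurons and inputs that are already standardized; the class labels and training cases are hypothetical.

```python
import math
from collections import defaultdict

def pnn_classify(x, training_cases, sigma=0.5):
    """Classify x with a minimal probabilistic neural network."""
    votes = defaultdict(float)
    for center, category in training_cases:      # hidden layer: one neuron per case
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, center))
        # RBF (Gaussian) kernel of the Euclidean distance to the stored case.
        activation = math.exp(-squared_distance / (2.0 * sigma ** 2))
        votes[category] += activation            # pattern layer: sum per category
    # Decision layer: the category with the largest weighted vote wins.
    return max(votes, key=votes.get), dict(votes)

# Hypothetical standardized training cases with made-up class labels.
training_cases = [
    ([0.1, 0.2], "crack"), ([0.2, 0.1], "crack"),
    ([0.9, 0.8], "porosity"), ([0.8, 0.9], "porosity"),
]
label, votes = pnn_classify([0.15, 0.25], training_cases)
```

There is no iterative training: "training" amounts to storing the cases, which is why a PNN is set up almost instantly but must visit every stored case when classifying a new input.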

The Probabilistic Neural Network (PNN) (Specht 1988) is a

representative alternative as it has all the advantages of neural networks while

excluding the typical disadvantages of back-propagation neural networks

(Schemerr et al 2000 and Song et al 2002). Given the fact that the architecture

of a PNN can be directly determined by the provided flaw classification

problem, the training of a PNN classifier can be completed instantaneously

(Zaknich 1998). These researchers also suggested applying a Bayes decision strategy to make the classification performance of PNNs consistent. B. Sadoun also supported the PNN as a flaw classifier in his study and compared the PNN with various other ANN paradigms, including backpropagation

networks, radial basis function network, general regression neural network

and LVQ network (Sadoun 2001).

Even though the PNN has advantages as an alternative ANN paradigm to multi-layered neural networks, it has drawbacks that are commonly acknowledged (Ko and Byun 2002, Zaknich 1998). The first disadvantage is that it places higher demands on memory during execution, so applying the trained network to new test data becomes slow. The other drawback is that its efficiency depends strongly on its initial training data; the PNN needs to be trained with correct and representative data in order to achieve good efficiency. Neither of these two main disadvantages can be ignored when the PNN is considered as the classifier for ultrasonic flaw signals.


PNNs are derived from Bayes decision networks. They train quickly, since training is done in one pass over each training vector rather than several. Probabilistic neural networks estimate the probability density function for each class from the training samples using a Parzen window or a similar probability density estimator. This is calculated for each test vector. Usually a spherical Gaussian basis function is used, although many other functions work equally well.

Vectors must be normalized prior to input into the network. There

is an input unit for each dimension in the vector. The input layer is fully

connected to the hidden layer. The hidden layer has a node for each

classification. Each hidden node calculates the dot product of the input vector with a test vector, subtracts 1 from it and divides the result by the standard deviation squared. The output layer has a node for each pattern classification. The sum for each hidden node is sent to the output layer and the highest value wins.
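The dot-product description above and the Euclidean-distance description of the hidden layer agree when the vectors are normalized. The short derivation below assumes that both the input vector x and the stored vector w have unit length, and that the hidden node applies an exponential to the quantity described, as in the usual Gaussian-kernel formulation; the exponential is not stated explicitly in the text.

```latex
\|x - w\|^2 = \|x\|^2 + \|w\|^2 - 2\,x \cdot w = 2 - 2\,x \cdot w
\quad\Longrightarrow\quad
\exp\!\left(-\frac{\|x - w\|^2}{2\sigma^2}\right)
  = \exp\!\left(\frac{x \cdot w - 1}{\sigma^2}\right)
```

This is why the vectors must be normalized before they are presented to the network: without unit length, the dot-product form no longer equals the Gaussian of the distance.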

The Probabilistic neural network trains immediately but execution

time is slow and it requires a large amount of space in memory. It really

works only for classifying data. The training set must be a thorough

representation of the data. Probabilistic neural networks handle data that has

spikes and points outside the norm better than other neural nets.

PNNs have advantages and disadvantages compared with multilayer perceptron networks (BPN):

It is usually much faster to train a PNN than a multilayer perceptron network.

PNNs are often more accurate than multilayer perceptron networks.

PNNs generate accurate predicted target probability scores.

PNNs approach Bayes optimal classification.

PNNs are slower than multilayer perceptron networks at classifying new cases.

PNNs require more memory space to store the model.

5.8 ADVANTAGES OF NEURAL NETWORK

Neural Networks, with their remarkable ability to derive meaning

from complicated (or) imprecise data, can be used to extract patterns and

detect trends that are too complex to be noticed by either humans or other

computing techniques. A trained neural network can be thought of as an

‘expert’ in the category of information it has been given to analyse. This

expert can then be used to provide projections given new situations of interest

and answer ‘what if’ questions.

ANN models have many advantages. Some of them are as follows:

The application of a neural network does not require a priori knowledge of the underlying process (black box approach).

The complex relationships between the various aspects of the process under investigation need not all be known.

The approach is faster than its conventional counterparts, flexible in the range of problems it can solve and highly adaptive to new environments.

ANNs are data driven, whereas conventional approaches are model driven (Lee and Castro 2005).

The data used do not have to follow a Gaussian distribution.

The data used may possess irregular seasonal variation.

ANNs are non-linear models and perform well even when limited data are available.

They are very robust and are able to deal with outliers and noisy or incomplete data (Mekdeci and McLaughlin 1995).

Other advantages include:

Adaptive learning: The ability to learn how to do tasks based on the data given for training or initial experience.

Self-organisation: An ANN can create its own organisation or representation of the information it receives during the learning time.

Real time operation: ANN computations may be carried out in parallel, and special hardware devices are being designed and manufactured which take advantage of this capability.

Fault tolerance via redundant information coding: Partial destruction of a network leads to a corresponding degradation of performance. However, some network capabilities may be retained even with major network damage.

Neural networks allow more complex modeling than the regression procedure (Windsor et al 1993).

The combination of simplicity, interpolation, reasonably accurate prediction statistics, ability to provide conditional simulations and computational speed suggests that artificial neural networks can be a useful tool in water resources systems analysis (Pittner and Kamarthi 1999).

Due to these established advantages, the ANN currently has numerous real world applications, such as image processing, speech processing, robotics and stock market prediction. There has been extensive research on its implementation in system engineering related fields, such as time series prediction, rule based control and rainfall runoff modeling.

5.9 SUPPORT VECTOR MACHINES

5.9.1 Introduction to SVM

Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. In simple terms, given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other (Olivier Bousquet 2001). Intuitively, an SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on (Lee and Estivill-Castro 2004).

More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier (Carl Gold and Peter Sollich 2003, Osuna et al 1997).
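The idea of preferring the hyperplane with the largest distance to the nearest training points can be illustrated with a short sketch. It assumes a linear hyperplane written as w·x + b = 0 and compares two hand-picked candidates on hypothetical toy data; the function names and numbers are introduced here for illustration only. The quantity computed, the perpendicular distance to the closest point, is often called the geometric margin.

```python
import math


def decision_value(w, b, x):
    """Signed value of w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


def margin(w, b, points):
    """Distance from the hyperplane to the nearest training point."""
    return min(distance_to_hyperplane(w, b, p) for p in points)


# Hypothetical toy data: the first two points are one class, the last two the other.
points = [(1.0, 1.0), (2.0, 2.5), (4.0, 4.0), (5.0, 3.5)]

# Two candidate separating hyperplanes (both separate the two classes).
w_a, b_a = (1.0, 1.0), -5.0   # the line x1 + x2 = 5
w_b, b_b = (1.0, 0.0), -3.0   # the line x1 = 3

margin_a = margin(w_a, b_a, points)
margin_b = margin(w_b, b_b, points)
# The candidate with the larger margin is the better separator in the SVM sense.
better = "b" if margin_b > margin_a else "a"
```

An SVM training algorithm searches over all separating hyperplanes for the one that maximises this quantity, rather than comparing a fixed pair as done here.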

5.9.2 SVM for Classification

Classifying data is a common task in machine learning. Each of the given data points belongs to one of two groups, and the goal of classification is to predict to which of the two groups a new data point belongs. SVM classification is of two types, binary class and multiclass, which are discussed as follows.

5.9.2.1 Binary class SVM

Binary class SVM is one type of support vector machine. It is used for classifying data into two different classes. In binary class SVM, classification is done by constructing an N-dimensional hyperplane that optimally separates the data into two categories. Figure 5.10 shows the classification of two different classes by binary class SVM.

Figure 5.10 Classification of two different classes by binary class SVM

Figure 5.10 shows the classification of two different groups of data points. The dots belong to one group and the holes belong to the other group. The plane which separates the two classes is known as the hyperplane. The points which are used to create the hyperplane are called support vectors. The gap between the support vectors of the two classes, measured perpendicular to the hyperplane, is known as the margin.
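A binary linear SVM of this kind can be sketched as follows. This is a simplified illustration, not the training procedure used in this work: it fits the hyperplane by stochastic subgradient descent on the regularised hinge loss instead of solving the usual quadratic programme, and the class labels -1 and +1, the parameters lam, lr and epochs, and the toy data (centred on the origin so that a small fixed step size is enough) are all assumptions made here.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit w and b by subgradient descent on the L2-regularised hinge loss."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            # Regularisation step: shrinking w favours a wider margin.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if y * score < 1.0:
                # The point lies inside the margin or on the wrong side:
                # move the hyperplane away from it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane x falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical toy data: label -1 for one group, +1 for the other.
points = [(-2.0, -1.0), (-1.0, -2.0), (-1.5, -1.5),
          (2.0, 1.0), (1.0, 2.0), (1.5, 1.5)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(points, labels)
training_predictions = [predict(w, b, x) for x in points]
new_prediction = predict(w, b, (3.0, 3.0))
```

Only the points that violate the condition y * score >= 1 ever change the hyperplane, which mirrors the role of the support vectors described above.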

5.9.2.2 Multi class SVM

Multi class SVM is used to classify data into more than two classes. The data points of the different groups are classified by creating several hyperplanes between the groups.

The single multiclass problem is reduced to multiple binary problems. Each of these problems yields a binary classifier, which is assumed to produce an output function that gives relatively large values for examples from the positive class and relatively small values for examples belonging to the negative class (Sathiya and Keerthi 2002). For the one-versus-all case, new instances are classified by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class.
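The winner-takes-all rule for the one-versus-all case can be sketched as below. The four linear classifiers are hand-picked, hypothetical stand-ins for trained binary SVMs, and the class names, weights and function names are assumptions made for this illustration.

```python
def output_value(w, b, x):
    """Output function of one binary classifier: large for its own class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def one_versus_all_predict(classifiers, x):
    """Evaluate every 'one class versus the rest' classifier and let the
    one with the highest output function assign the class."""
    scores = {name: output_value(w, b, x) for name, (w, b) in classifiers.items()}
    return max(scores, key=scores.get), scores


# Hypothetical classifiers for four classes, one (w, b) pair per class.
classifiers = {
    "north": ((0.0, 1.0), 0.0),
    "south": ((0.0, -1.0), 0.0),
    "east": ((1.0, 0.0), 0.0),
    "west": ((-1.0, 0.0), 0.0),
}

label, scores = one_versus_all_predict(classifiers, (3.0, 1.0))
```

With K classes this scheme needs K binary classifiers, each trained with one class as positive and all remaining classes as negative.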

Figure 5.11 Classification of four different classes by Multi class SVM

For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of its two classes, the vote for the assigned class is increased by one, and finally the class with the most votes determines the classification of the instance (Tong and Chang 2001). Figure 5.11 shows the classification of four different classes using multi class SVM.
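The max-wins voting strategy can be sketched in the same spirit. Each pairwise classifier below is a hypothetical stand-in for a trained binary SVM: a hyperplane placed halfway between two assumed class centres. The class names, centres and function names are introduced here for illustration only.

```python
from collections import Counter
from itertools import combinations


def make_pairwise_classifier(name_a, centre_a, name_b, centre_b):
    """Stand-in for a trained binary SVM separating class a from class b."""
    w = [ca - cb for ca, cb in zip(centre_a, centre_b)]
    midpoint = [(ca + cb) / 2.0 for ca, cb in zip(centre_a, centre_b)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))

    def classify(x):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        return name_a if score >= 0 else name_b

    return classify


def one_versus_one_predict(pairwise_classifiers, x):
    """Each classifier casts one vote; the class with the most votes wins."""
    votes = Counter(classify(x) for classify in pairwise_classifiers)
    return votes.most_common(1)[0][0], votes


# Hypothetical centres of four classes; one classifier per pair of classes.
centres = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 4.0), "D": (4.0, 4.0)}
pairwise_classifiers = [
    make_pairwise_classifier(first, centres[first], second, centres[second])
    for first, second in combinations(centres, 2)
]

winner, votes = one_versus_one_predict(pairwise_classifiers, (3.5, 1.0))
```

With K classes this scheme needs K(K-1)/2 binary classifiers, six for the four classes used here, and a rule for breaking ties in the vote count is needed in practice.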

5.10 SUMMARY

In this chapter, an overview of artificial neural networks with respect to optimal network architecture, training of ANN and data preparation for ANN was discussed with reference to the extensive literature. The feed forward back propagation network, the structure of the BPN and the probabilistic neural network were also explained. In addition, a brief explanation of support vector machines was provided at the end.
