
Chapter 6 CLASSIFIERS


Classification is defined as “the act of forming into a class or classes; a distribution

into groups, as classes, orders, families, etc., according to some common relations or

affinities”. Therefore, a classifier will be “a subject that creates classifications”.

In recent times, the automatic categorisation of patterns has become of great interest in many research areas. Machine learning methods for classification learn from data that contains classified instances, called the training set (e.g., a collection of attribute values that are assigned to a certain class), and attempt to develop models that, given a set of attribute values, predict the class of a new instance. In the problem of supervised learning we are given a sample of input-output pairs (also called the training sample), and the task is to find a deterministic function that maps any input to an output such that disagreement with future input-output observations is minimised. A huge number of classification techniques exist in the literature, for instance neural networks, classification trees, variants of naive Bayes, k-nearest neighbours, classification through association rules, function decomposition, logistic regression, and support vector machines. The performance of different classification methods depends to some extent on the target task. For this reason, no classifier can be said to be better than another in general, and several alternatives are therefore usually tried when facing a single categorisation problem.


6.1 Classifiers used in emotional recognition

Several pattern recognition methods have been explored for automatic emotion recognition (see [Pet99, Bat00]). Dellaert [Del96], for instance, tried maximum likelihood Bayes classification, kernel regression, and k-nearest neighbour methods, whereas Roy and Pentland [Roy96] used the Fisher linear discriminant method. Many more studies have used different classifiers for emotion discrimination through the speech signal; this section provides an overview of the methods employed by a number of published studies.

[Lee01] reports on methods for automatic classification of spoken utterances based

on the emotional state of the speaker. Linear discriminant classification with Gaussian

class-conditional probability distribution and k-nearest neighbour methods are used

to classify utterances into two basic emotion states, negative and non-negative. In

addition, to improve classification performance, two feature selection methods are

employed: promising first selection and forward feature selection. Principal component

analysis is used to reduce the dimensionality of the features while maximizing

classification accuracy.

A study carried out by Amir [Ami01] also makes use of the K-nearest neighbours approach. The method estimates the local posterior probability of each class as the average class membership over the K nearest neighbours. The algorithm was run for K from 1 to 15, but the results were mostly poor compared with the performance of the neural network classifiers. [Che98] employs supervised classification of six basic emotions¹ with leave-one-out (LOO) cross-validation (CV). Two methods were applied to perform the classification:

- the nearest mean criterion, and

- modelling each class with a Gaussian distribution, normalising by the mean and variance of the class, and then finding the most probable class to which the test sample belongs.
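The K-nearest-neighbour posterior estimate and the nearest mean criterion mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in [Ami01] or [Che98]: the Euclidean metric, the function names and the toy two-dimensional feature vectors are all hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_posteriors(train, query, k):
    """Estimate P(class | query) as the fraction of the k nearest
    training samples that belong to each class."""
    neighbours = sorted(train, key=lambda s: euclidean(s[0], query))[:k]
    counts = Counter(label for _, label in neighbours)
    return {label: n / k for label, n in counts.items()}

def nearest_mean(train, query):
    """Assign the query to the class whose mean vector is closest."""
    sums, counts = {}, Counter()
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] += 1
    means = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}
    return min(means, key=lambda lab: euclidean(means[lab], query))

# Hypothetical toy data: (feature vector, emotion label).
toy = [
    ([0.0, 0.0], "neutral"),
    ([0.2, 0.1], "neutral"),
    ([1.0, 1.0], "anger"),
    ([0.9, 1.2], "anger"),
    ([1.1, 0.8], "anger"),
]
```

Calling `knn_posteriors(toy, [1.0, 0.9], 5)` averages class membership over all five samples, whereas a smaller `k` restricts the estimate to the local neighbourhood of the query.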

Although distance-based measurements have a longer tradition, new automatic classification tools, principally neural networks, have recently gained acceptance for this task. In [Ami01], Noam Amir compares the performance of two algorithms: a classification algorithm based on Euclidean distances, and a classification algorithm based on neural networks. Both classify four emotions² using an identical feature set, on a database of emotional speech that was validated through subjective listening tests. The distance measure method was previously discussed and outlined in detail in a study by the same author [Ami00], where it proved successful when the characterization of each emotion was unique to each subject being studied. This method obtains representative values for each emotion by averaging the feature vectors over the whole set of utterances, and then applies the Mahalanobis distance measure to compute the distance of each vector to the centroid. A small distance from a certain centroid indicates that the measurement is most likely to belong to that specific emotion. The drawback of distance-based methods is that they only model a standard way of expressing the emotion. For instance, if we attempt to classify an utterance whose level of anger is extremely intense compared with what the classifier is used to recognising, the distance to the centroid will be larger, even though this utterance could be considered “angrier” than many others. For the neural network classification, [Ami01] uses four feed-forward neural networks, one for each emotion (OCONN). Each network had twelve input neurons and one output neuron in the range [0,1]. The internal architecture varies for each network, i.e. for each emotion. The transfer function is log-sigmoid and the training method applied is Levenberg-Marquardt backpropagation.

¹ Happiness, sadness, fear, anger, surprise and disgust.
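The centroid-based distance classification described above can be sketched as follows. This is only an illustration: it assumes a diagonal covariance matrix (one variance per feature), which reduces the Mahalanobis distance to a variance-normalised Euclidean distance and avoids a matrix inversion. The function names and the toy data are hypothetical and are not taken from [Ami00] or [Ami01].

```python
import math

def centroid(vectors):
    """Average the feature vectors of one emotion."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def variances(vectors, mean):
    """Per-feature variance (diagonal covariance assumption)."""
    n = len(vectors)
    return [sum((x - m) ** 2 for x in col) / n
            for col, m in zip(zip(*vectors), mean)]

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance for a diagonal covariance matrix."""
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))

def classify(x, models):
    """Pick the emotion whose centroid is closest to x."""
    return min(models, key=lambda e: mahalanobis_diag(x, *models[e]))

# Hypothetical two-feature training utterances per emotion.
training = {
    "anger": [[4.0, 8.0], [6.0, 12.0]],
    "neutral": [[0.0, 1.0], [2.0, 3.0]],
}
models = {}
for emotion, vecs in training.items():
    mean = centroid(vecs)
    models[emotion] = (mean, variances(vecs, mean))
```

The sketch also makes the drawback visible: a vector such as `[9.0, 18.0]`, which exaggerates the "anger" features, lies further from the anger centroid than a typical anger utterance does.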

Neural networks were also used in [Pet99] in three different ways:

a) A two-layer backpropagation neural network architecture with an 8-, 10- or 14-element input vector, 10 or 20 nodes in the hidden sigmoid layer and five nodes in the output layer, to classify into five different emotions³.

b) Ensembles of neural network classifiers, i.e. an odd number of neural network classifiers that have been trained on different subsets of the training set using bootstrap aggregation [Bri96] or cross-validated committees [Prm96]. The ensemble makes its decision based on the majority voting principle. They employed ensemble sizes from 7 to 5.

c) A set of experts. Instead of training a neural network to recognize all emotions, a set of specialists is built. Each of these “experts” can recognize only one emotion, and their results are then combined to classify a given sample. For the expert networks they used a two-layer backpropagation neural network architecture with an 8-element input vector, 10 or 20 nodes in the hidden sigmoid layer and one node in the output linear layer.

² Anger, sadness, happiness and neutral.

³ Normal state, happiness, anger, sadness and fear.
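The ensemble idea in item b) can be illustrated with a short sketch of bootstrap sampling and majority voting. The ensemble members below are hypothetical threshold rules that merely stand in for trained neural networks; training each member on its own bootstrap sample is omitted. None of the names come from [Pet99].

```python
import random
from collections import Counter

def bootstrap_sample(training_set, rng):
    """Draw len(training_set) items with replacement, as in
    bootstrap aggregation."""
    return [rng.choice(training_set) for _ in training_set]

def majority_vote(predictions):
    """Return the label predicted by most ensemble members."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(classifiers, sample):
    """Classify a sample by majority vote over all members."""
    return majority_vote([clf(sample) for clf in classifiers])

# Hypothetical stand-ins for three trained networks: each maps a
# single feature value to one of two labels using its own threshold.
members = [
    (lambda x, th=th: "anger" if x > th else "neutral")
    for th in (0.3, 0.5, 0.7)
]
```

With two labels, an odd number of members guarantees that the vote cannot end in a tie.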

In [Hub98], multi-layer perceptrons (MLP) were trained for the discrimination between angry and neutral patterns. The PHYSTA project⁴ uses hybrid technology, i.e. a combination of classical artificial intelligence (AI) computing and neural networks. The classical component allows for the use of known procedures and logical operations, which are suited to language processing. The neural net component allows for learning at various levels, for instance the weights that should be attached to various inputs, adjacencies, and probabilities of particular events given certain information.

6.2 Classifiers tried in the present work

A neural network classifier has been the main tool employed in this work and, consequently, its operation is detailed in section 6.3. However, other classification methods, namely Gaussian mixture models (GMMs), linear regression and decision trees, were also tried.

6.2.1 Gaussian mixture models

If there is reason to believe that a data set is comprised of several distinct populations, a mixture model can be used. Mixture models are a type of density model comprising a number of component functions, usually Gaussian. These component functions are combined to provide a multimodal density.

A Gaussian mixture model represents each class of data as a linear combination of several Gaussian densities in the feature space.
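As an illustration of this definition, the sketch below evaluates a one-dimensional Gaussian mixture density and classifies a value by choosing the class whose mixture gives it the highest density. It is a simplified example under stated assumptions: the features are scalar, the mixture parameters are given rather than estimated, and all names and parameter values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gmm_density(x, components):
    """Linear combination of Gaussian densities.
    components: list of (weight, mean, variance); weights sum to 1."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)

def classify(x, class_models):
    """Choose the class whose mixture assigns x the highest density."""
    return max(class_models, key=lambda c: gmm_density(x, class_models[c]))

# Hypothetical class models: class_a is bimodal, class_b is a
# single narrow Gaussian (a mixture with one component).
models = {
    "class_a": [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)],
    "class_b": [(1.0, 0.0, 0.25)],
}
```

The bimodal `class_a` model shows why a mixture is useful: a single Gaussian centred at zero could not represent two separate populations at -2 and 2.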

Generally, the main motivations for using Gaussian mixtures are:

• a linear combination of Gaussian basis functions is capable of forming smooth approximations of arbitrarily shaped densities.

• in speaker recognition, for instance, the individual component densities may model some underlying acoustic classes, such as vowels, nasals or fricatives.

⁴ Principled Hybrid Systems: Theory and Applications (PHYSTA) is a collaboration of King's College London, University of Milan, Queen's University of Belfast and the National Technical University of Athens.

This method was employed in our experiments to discriminate between two classes based on the voice quality features, using 1 and 32 Gaussian functions. However, none of the experiments carried out with this method yielded better results than the neural network classifier and, consequently, this classification method was discarded.

6.2.2 Linear discriminant analysis

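Linear regression is the simplest form of regression and is usually used to predict a continuous class. Linear regression assumes that the class variable can be expressed as a linear function of one attribute:

\[ y = w_0 + w_1 x \qquad (6.1) \]

Where y denotes the class variable, x the attribute, and $w_0$ and $w_1$ the intercept and slope of the linear function.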

The linear discriminant analysis method consists of searching for linear combinations of selected variables that provide the best separation between the considered classes. These different combinations are called discriminant functions [Mja01].

6.2.3 Decision trees

A decision tree is a graphical representation of a procedure for classifying or evaluating an item of interest. It represents a function that maps each element of its domain to an element of its range, which is typically a class label or numerical value. In its simplest form, a decision tree takes as input an object or situation described by a set of properties, and outputs a yes/no decision; such trees therefore represent Boolean functions. Functions with a larger range of outputs can also be represented.

At each leaf of a decision tree, one finds an element of the range. At each internal

node of the tree, one finds a test that has a small number of possible outcomes. By

branching according to the outcome of each test, one arrives at a leaf that contains the

class label or numerical value that corresponds to the item in hand. Leaves are usually not

of one class, so one typically chooses the most frequently occurring class label.
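The branching procedure just described can be sketched with a small hand-built tree. The feature names (`pitch_mean`, `energy`), the thresholds and the labels are purely hypothetical; the sketch shows how a tree is traversed, not how C5.0 or any other system builds one.

```python
def classify(node, item):
    """Follow the outcome of each test from the root until a leaf
    is reached, then return the class label stored at that leaf."""
    while "label" not in node:
        outcome = node["test"](item)
        node = node["branches"][outcome]
    return node["label"]

# Internal nodes hold a test with a small number of outcomes;
# leaves hold the most frequent class label of their training items.
tree = {
    "test": lambda item: item["pitch_mean"] > 200.0,
    "branches": {
        True: {
            "test": lambda item: item["energy"] > 0.5,
            "branches": {
                True: {"label": "anger"},
                False: {"label": "happiness"},
            },
        },
        False: {"label": "neutral"},
    },
}
```

Each test compares one attribute with a fixed threshold, so a small change in a value close to a threshold can send the item down a different branch.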

The decision tree method was tried during the introductory experiments using C5.0, a state-of-the-art system that constructs classifiers in the form of decision trees and rulesets. Since many disadvantages were found, owing to the simplicity of the classifier for such a complex problem as emotion discrimination, it was discarded early. The fundamental problems with decision trees are at least fourfold:

• They look at very simple combinations of attributes within a table, and hence miss many patterns.

• By their nature, they need to break numeric fields into fixed ranges, hence missing even more patterns and providing less information. They are quite brittle on inexact data, and a small change in a value can have a large impact on the outcome.

• Decision trees can at best work on small samples of data and cannot easily approach large data sets, resulting in a significant loss of information.

• Since they ignore some attributes, they may make less accurate predictions, and if some values are missing from the new data item, they make no predictions at all. Furthermore, given the same data set, one can obtain several decision trees, each making a different prediction on new data items.

6.2.4 Neural Networks

Neural networks, described at length in section 6.3, are the method selected to build an emotion classifier in the framework of this thesis. The reasons that make neural networks more convenient for our purposes are described in this section. Further detailed information is found in section 6.3.

Neural Networks, with their remarkable ability to derive meaning from complicated

or imprecise data, can be used to extract patterns and detect trends that are too complex to

be noticed by either humans or other computer techniques. Emotions are a complex field

of investigation, which includes many discrepancies even in its theoretic domain. A

trained neural network can be thought of as an "expert" in the category of information it

has been given to analyse. This expert can then be used to provide projections given new

situations of interest and answer "what if" questions. Other advantages include:

1. Adaptive learning: An ability to learn how to do tasks based on the data given for

training or initial experience.

2. Self-Organisation: An ANN can create its own organisation or representation of the

information it receives during learning time.


3. Real Time Operation: ANN computations may be carried out in parallel, and

special hardware devices are being designed and manufactured which take advantage of

this capability.

4. Fault Tolerance via Redundant Information Coding: Partial destruction of a

network leads to the corresponding degradation of performance. However, some network

capabilities may be retained even with major network damage.

The multiple advantages of neural networks, in addition to the general acceptance and widespread use of this method in several former approaches to emotion recognition through the speech signal, lead us to employ this method as our main classification tool.

6.3 Neural Networks.

As established in section 6.1, neural networks are a frequently employed tool for emotion recognition. Such a complex classifier admits a huge number of possible configurations; the term neural network therefore denotes not a single classifier but a whole family of them, with countless different possibilities.

In the present work, diverse configurations have been tried, following some previous scientific approaches (see [Ami01, Hub98]). All the attempted methods and architectures are detailed in later sections, after a brief introduction to NNs in section 6.3.1. Since a great deal of information about neural networks can be found in the literature, section 6.3.1 only provides the reader with the basic concepts needed to understand the subsequent configuration details.

The software employed is the SNNS (Stuttgart Neural Network Simulator), a

simulator for neural networks on Unix workstations developed at the Institute for Parallel

and Distributed High Performance Systems (IPVR) at the University of Stuttgart. The

software allows two possibilities: batch programming or graphical interface operation.

For further information about the software see [Zel95]⁵.

⁵ http://www-lehre.informatik.uni-osnabrueck.de/~nn/html_info/UserManual/UserManual.html

6.3.1 Introduction to Neural Networks.

An Artificial Neural Network (ANN) is an information processing paradigm that is

inspired by the way biological nervous systems, such as the brain, process information.

The term Artificial is included to differentiate these networks from the biological neural

systems, on which they are based, but it is usually understood within the computational

environment and they can also be identified simply as Neural Networks (NNs).

The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurones) working in unison to solve specific problems. An input is presented to some (or all) of its input units; this input vector is propagated through the whole network and, finally, some kind of output is produced. So, essentially, neural networks are functions: the network gets an input as an argument and gives an output for that particular input. Because input and output can consist of many units or components, they are considered as vectors.

However, the real power of an ANN lies in its ability to learn, that is, the function is not constant but can be changed dynamically. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurones. This also happens in NN learning. Accordingly, neural networks are a form of multiprocessor computer system, with the following elements:

• simple processing elements (neurons or nodes),

• a high degree of interconnection (links between nodes),

• simple scalar messages,

• adaptive interaction between elements.


Figure 6.1. Artificial neuron model


The simple processing element, the artificial neuron or node (figure 6.1), is a device based on the biological neuron model, with many inputs and one output. Each input comes via a connection that has a strength (or weight); these weights correspond to synaptic efficacy in a biological neuron. Each neuron also has a single threshold value. The weighted sum of the inputs is formed, and the threshold subtracted, to compose the activation of the neuron. Then, the activation signal is passed through an activation function (also known as a transfer function) to produce the output of the neuron. Figure 6.2 shows this activation process. The activation function is not unique: it can be changed, and even self-programmed, to get a better performance in a specific task.
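The activation process of a single artificial neuron, as described above, can be written as a short sketch. The function names are hypothetical, and the logistic (log-sigmoid) and step functions are only two of the possible transfer functions.

```python
import math

def log_sigmoid(x):
    """Logistic transfer function with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def step(x):
    """Step transfer function: the neuron fires (1.0) or not (0.0)."""
    return 1.0 if x >= 0.0 else 0.0

def neuron_output(inputs, weights, threshold, transfer=log_sigmoid):
    """Weighted sum of the inputs minus the threshold, passed
    through the transfer function."""
    activation = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return transfer(activation)
```

For instance, `neuron_output([1.0, 0.0], [0.5, 0.3], 0.5)` has a weighted sum equal to the threshold, so the activation is zero and the logistic output sits at its midpoint.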

The artificial neuron has two modes of operation; the training mode and the using

(testing) mode. In the training mode, the neuron can be trained to fire (or not), for

particular input patterns. In the using mode, when a taught input pattern is detected at the

input, its associated output becomes the current output. If the input pattern does not

belong in the taught list of input patterns, the firing rule is used to determine whether to

fire or not.

Figure 6.2. Artificial neuron activation process.

Depending on their function in the net, one can distinguish three types of units, depicted in figure 6.3: the units whose activations are the problem input for the net are called input units; the units whose outputs represent the output of the net are called output units. The remaining units are called hidden units, because they are not visible from the outside. A neural network must have both input and output units, but it can have no hidden units (single-layer) or one or many layers of hidden units (multi-layer).

By combining these simple units and using links between them, many different network configurations can be found. A neural network is characterised by its particular:

• Architecture: its pattern of connections between the neurones.

• Learning algorithm: its method of determining the weights on the connections. Algorithms used during this thesis are detailed in section 6.3.3.

• Activation function: which determines its output. The most common activation functions are the step, ramp, sigmoid and Gaussian functions. Activation functions used during this thesis are detailed in section 6.3.2.

With regard to the architecture, and regardless of the number of layers (single-layer or multi-layer), there are two main kinds of ANN:

1. Feed-forward networks allow signals to travel one way only, from input to output. There is no feedback (loops), i.e. the output of any layer does not affect that same layer. Feed-forward ANNs tend to be straightforward networks that associate inputs with outputs. They are extensively used in pattern recognition. This type of organisation is also referred to as bottom-up or top-down.

2. Feedback networks can have signals travelling in both directions by introducing loops in the network. Feedback networks are very powerful and can get extremely complicated. Feedback networks are dynamic; their 'state' changes continuously until they reach an equilibrium point. They remain at the equilibrium point until the input changes and a new equilibrium needs to be found. Feedback architectures are also referred to as interactive or recurrent, although the latter term is often used to denote feedback connections in single-layer organizations.

Figure 6.3. Different types of units within the structure of an artificial neural network

In the framework of this thesis, only the feed-forward architecture is employed, owing to its more general use. However, with regard to the learning algorithm and the activation and analysis functions, diverse options are tried. These particular neural network characteristics are described in more detail in the next sections.

6.3.2 Initialisation of adaptive parameters in neural networks.

Before a neural network is trained, its weights must be initialised so that the iterative optimisation has a starting point. The initialisation of adaptive parameters in neural networks, far from being trivial, is pointed out by several studies (see [Duc97, Fer01]) as a key factor in creating robust neural networks. There is no definitive initialisation. Setting all the weights to zero will halt all gradient-dependent optimisation techniques. In [Duc97] it is concluded that neural network initialization, most frequently done by randomizing weights, can also be accomplished by prototypes based on initial clusterization, giving much better results and enabling solutions to complex, real-life problems. The introduction of such methods of parameter initialization should allow for the creation of neural systems requiring little optimization in further training stages. However, complex initialization techniques still require deeper investigation and further assessment.

Figure 6.4. Multilayer perceptron employing feed forward, fully connected topology

Usually it is good design to fit the weights so that the summation in the receiving unit (hidden or output unit) is in the range [-1, 1], that is, to adjust the weights according to the standard deviation of the transmitting units and the number of transmitting units (the fan-in). Therefore, the initialisation function used in this work is the Randomise Weights function of the SNNS toolkit, in the mentioned range [-1, 1]. With random initialisation, different parts of the weight space can be searched, reducing the influence that a local minimum has for the particular training set.
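A minimal sketch of uniform random weight initialisation is given below. It is not the SNNS routine: the function name, its parameters and the optional fan-in scaling are assumptions made for illustration. The scaling option divides each weight by the fan-in so that, for inputs in [-1, 1], the summation in the receiving unit also stays within [-1, 1].

```python
import random

def randomise_weights(fan_in, n_units, low=-1.0, high=1.0,
                      scale_by_fan_in=False, seed=None):
    """Return an n_units x fan_in weight matrix drawn uniformly
    from [low, high], optionally divided by the fan-in."""
    rng = random.Random(seed)
    scale = 1.0 / fan_in if scale_by_fan_in else 1.0
    return [[scale * rng.uniform(low, high) for _ in range(fan_in)]
            for _ in range(n_units)]

# Hypothetical layer: 12 transmitting units feeding 4 receiving units.
plain = randomise_weights(12, 4, seed=1)
scaled = randomise_weights(12, 4, scale_by_fan_in=True, seed=1)
```

Using different seeds gives different starting points in the weight space, which is what allows repeated training runs to end in different minima.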

6.3.3 Learning Algorithms.

One of the most important questions when using NN is how to adjust the weights of

the links to get the desired system behaviour. This modification is very often based on the

Hebbian rule, which states that a link between two units is strengthened if both units are

active at the same time. The Hebbian rule in its general form is:

\[ \Delta w_{ij} = g(a_j(t), t_j)\; h(o_i, w_{ij}) \qquad (6.2) \]

Where

$w_{ij}$ = weight of the link from unit i to unit j.

$a_j(t)$ = activation of unit j in step t.

$t_j$ = teaching input of unit j.

$o_i$ = output of the preceding unit i.

$g(\ldots)$ = function, depending on the activation of the unit and the teaching input.

$h(\ldots)$ = function, depending on the output of the preceding element and the current weight of the link.

Training a feed-forward neural network with supervised learning consists of the

following two phases:


1. An input pattern is presented to the network. The input is then propagated forward

in the net until activation reaches the output layer. This constitutes the so-called forward

propagation phase.

2. The output of the output layer is then compared with the teaching input. The error, i.e. the difference (delta) $\delta_j$ between the output and the teaching input of a target output unit j, is then used together with the output $o_i$ of the source unit i to compute the necessary changes of the link $w_{ij}$. To compute the deltas of inner units for which no teaching input is available (units of hidden layers), the deltas of the following layer, which are already computed, are used in the formula given below (6.6). In this way the errors (deltas) are propagated backward, so this phase is called backward propagation.

There are two kinds of training, according to when the weights are updated. In online learning, the weight changes are applied to the network after each training pattern, i.e. after each forward and backward pass. In offline learning, or batch learning, the weight changes are accumulated for all patterns in the training file and the sum of all changes is applied after one full cycle (epoch) through the training pattern file.

The methods and algorithms tried during this Diploma Thesis are described in the following subsections.

6.3.3.1 Backpropagation learning algorithm.

The basic idea of the backpropagation learning algorithm is the repeated application of the chain rule to compute the influence of each weight in the network with respect to an arbitrary error function E:

\[ \frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_i}\,\frac{\partial a_i}{\partial net_i}\,\frac{\partial net_i}{\partial w_{ij}} \qquad (6.3) \]

Where

$w_{ij}$ = weight from neuron j to neuron i.

$a_i$ = activation value.

$net_i$ = weighted sum of the inputs of neuron i.

Once the partial derivative for each weight is known, the aim of minimising the error function is achieved by performing a simple gradient descent:

\[ w_{ij}(t+1) = w_{ij}(t) - \eta\,\frac{\partial E}{\partial w_{ij}}(t) \qquad (6.4) \]

Where

$\eta$ = learning rate.

The learning rate parameter is selected by the user and, as can be deduced from equation 6.4, it plays an important role in the convergence of the network, in terms of both success and speed. For our experiments the most commonly used parameter values are selected. The exploration of more advanced neural network learning procedures constitutes a broad field of investigation and could therefore be a subject of further experimentation.

In the backpropagation learning algorithm, online training is usually significantly faster than batch training, especially in the case of large training sets with many similar training examples. On the other hand, the results of training with backpropagation and update after every pattern presentation depend heavily on a proper choice of the parameter η [Sci94].

The backpropagation weight update rule, also called the generalized delta rule, reads as follows for the SNNS software:

\[ \Delta w_{ij} = \eta\,\delta_j\,o_i \qquad (6.5) \]

\[ \delta_j = \begin{cases} f'_j(net_j)\,(t_j - o_j) & \text{if unit } j \text{ is an output unit} \\ f'_j(net_j) \sum_k \delta_k\, w_{jk} & \text{if unit } j \text{ is a hidden unit} \end{cases} \qquad (6.6) \]

Where

$\eta$ = learning factor (a constant).

$\delta_j$ = error (difference between the real output and the teaching input) of unit j.

$o_i$ = output of the preceding unit i.

$t_j$ = teaching input of unit j.

$f'_j(net_j)$ = derivative of the activation function of unit j at its net input.

i = index of a predecessor to the current unit j with link $w_{ij}$ from i to j.

j = index of the current unit.

k = index of a successor to the current unit j with link $w_{jk}$ from j to k.
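The two training phases and the delta rule can be illustrated with one online training step for a small network with a single hidden layer. The sketch assumes the standard form of the generalized delta rule: a weight change of η times the delta of the receiving unit times the output of the sending unit, with output deltas f'(net)(t - o) and hidden deltas f'(net) times the weighted sum of the successor deltas. It further assumes logistic units, for which f'(net) = o(1 - o), and omits thresholds for brevity. All names and numeric values are hypothetical; this is not the SNNS implementation.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Forward propagation: w_hidden[j][i] links input i to hidden
    unit j, w_out[k][j] links hidden unit j to output unit k."""
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = [logistic(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, output

def backprop_step(x, target, w_hidden, w_out, eta):
    """One online update; returns the squared error before the update."""
    hidden, output = forward(x, w_hidden, w_out)
    # Output deltas: f'(net) * (t - o), with f'(net) = o * (1 - o).
    delta_out = [o * (1.0 - o) * (t - o) for o, t in zip(output, target)]
    # Hidden deltas: f'(net) * sum over successors k of delta_k * w_jk,
    # computed before the output weights are changed.
    delta_hidden = [
        h * (1.0 - h) * sum(delta_out[k] * w_out[k][j] for k in range(len(w_out)))
        for j, h in enumerate(hidden)
    ]
    # Weight change: eta * delta of receiving unit * output of sender.
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] += eta * delta_out[k] * hidden[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hidden[j] * x[i]
    return sum((t - o) ** 2 for t, o in zip(target, output))

# Hypothetical 2-2-1 network trained repeatedly on a single pattern.
w_hidden = [[0.5, -0.4], [0.3, 0.8]]
w_out = [[0.2, -0.6]]
errors = [backprop_step([1.0, 0.0], [1.0], w_hidden, w_out, eta=0.5)
          for _ in range(50)]
```

The list `errors` records the squared error before each update, so its trend shows whether the chosen learning rate lets the network approach the teaching input.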

There are several backpropagation algorithms supplied with SNNS. In our research we made use of two of them:

Vanilla backpropagation / Standard Backpropagation.

Vanilla backpropagation corresponds to the standard backpropagation learning algorithm introduced by [Rum86] and described above. It is the most common learning algorithm. Its definition reads as equations 6.5 and 6.6.

In SNNS, one may either set the number of training cycles in advance or train the

network until it has reached a predefined error on the training set.

In order to execute this algorithm, the following learning parameters are required by

the learning function that is already built into SNNS:

- $\eta$: the learning rate specifies the step width of the gradient descent. Typical values of $\eta$ are 0.1 … 1. Some small examples actually train even faster with values above 1, like 2.0.

- $d_{max}$: the maximum difference $d_j = t_j - o_j$ between a teaching value $t_j$ and an output $o_j$ of an output unit which is tolerated, i.e. which is propagated back as $d_j = 0$. If values above 0.9 should be regarded as 1 and values below 0.1 as 0, then $d_{max}$ should be set to 0.1. This prevents overtraining of the network. Typical values of $d_{max}$ are 0, 0.1 or 0.2.

Backpropagation with chunkwise update.

There is a form of backpropagation that comes in between the online and batch

versions of the algorithm with regard to updating the weights. The online version is the

one described before (vanilla backpropagation). The batch version has a similar formula

as vanilla backpropagation but, while in Vanilla Backpropagation an update step is

performed after each single pattern, in Batch Backpropagation all weight changes are

summed over a full presentation of all training patterns (one epoch). Only then, an update

with the accumulated weight changes is performed.

Here, a chunk is defined as the number of patterns to be presented to the network before making any alterations to the weights. This version is very useful for training cases with very large training sets, where batch update would take too long to converge and online update would be too unstable.

Besides the parameters required in Vanilla Backpropagation, this algorithm requires the chunk size N to be fixed, defined as the number of patterns to be presented during training before an update of the weights with the accumulated error takes place. Based on this definition, Backpropagation with Chunkwise update can also be seen as a mixture between Standard Backpropagation (N = 1) and Batch Backpropagation (N = number of patterns in the file). For the experiments carried out in this thesis that make use of this learning algorithm, the chunk size is set to 50 patterns.
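The relation between online, chunkwise and batch updating can be made concrete with a small sketch. For brevity it trains a single linear unit with the plain delta rule instead of a full multi-layer network; the function name and the toy patterns are hypothetical. Only the moment at which the accumulated weight changes are applied depends on the chunk size.

```python
def train_chunkwise(patterns, weights, eta, chunk_size):
    """Run one epoch over (inputs, target) patterns for a single
    linear unit. Weight changes are accumulated and applied after
    every chunk_size patterns. Returns the number of updates made."""
    updates = 0
    accumulated = [0.0] * len(weights)
    for n, (x, t) in enumerate(patterns, start=1):
        o = sum(w * xi for w, xi in zip(weights, x))
        for i, xi in enumerate(x):
            accumulated[i] += eta * (t - o) * xi
        # Apply the summed changes at the end of each chunk, and at
        # the end of the epoch if the last chunk is incomplete.
        if n % chunk_size == 0 or n == len(patterns):
            for i in range(len(weights)):
                weights[i] += accumulated[i]
            accumulated = [0.0] * len(weights)
            updates += 1
    return updates

# Hypothetical training file with 100 identical-format patterns.
patterns = [([1.0, float(i % 2)], 0.5) for i in range(100)]
```

With 100 patterns, a chunk size of 1 reproduces online learning (100 updates per epoch), a chunk size of 100 reproduces batch learning (one update), and the value 50 used in this thesis gives two updates per epoch.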

6.3.3.2 RPROP learning algorithm.

Rprop stands for “Resilient back propagation” and is a local adaptive learning

scheme, performing supervised batch learning in multi-layer perceptrons.

The choice of the learning rate η for the Backpropagation algorithm in equation 6.4, which scales the derivative, has an important effect on the time needed until convergence is reached. If it is set too small, too many steps are needed to reach an acceptable solution; conversely, a large learning rate may lead to oscillation, preventing the error from falling below a certain value. Figure 6.5 shows both phenomena. In case (a), long convergence times are required, and in case (b), an oscillation can be seen in the proximity of local minima.

Figure 6.5. Error functions for the case of (a) a small learning rate and (b) a large learning rate.
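Both phenomena can be reproduced on a toy error function. The sketch below applies plain gradient descent to the hypothetical one-dimensional error E(w) = w², whose derivative is 2w; the function name and the chosen learning rates are illustrative only.

```python
def descend(w, eta, steps):
    """Gradient descent on E(w) = w**2; returns the visited weights."""
    path = [w]
    for _ in range(steps):
        w = w - eta * 2.0 * w  # w(t+1) = w(t) - eta * dE/dw
        path.append(w)
    return path

slow = descend(1.0, 0.01, 20)      # small learning rate
oscillating = descend(1.0, 0.95, 20)  # large learning rate
```

With the small rate the weight creeps towards the minimum at zero and is still far from it after 20 steps, whereas with the large rate every step jumps over the minimum, so the weight changes sign at each iteration.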

The basic principle of Rprop is to eliminate the harmful influence of the size of the partial derivative on the weight step. This algorithm considers the local topology of the error function to change its behaviour. As a consequence, only the sign of the derivative is considered to indicate the direction of the weight update. The size of the weight change is exclusively determined by a weight-specific, so-called 'update-value' $\Delta_{ij}(t)$:

\[ \Delta w_{ij}(t) = \begin{cases} -\Delta_{ij}(t) & \text{if } \frac{\partial E}{\partial w_{ij}}(t) > 0 \\ +\Delta_{ij}(t) & \text{if } \frac{\partial E}{\partial w_{ij}}(t) < 0 \\ 0 & \text{otherwise} \end{cases} \qquad (6.7) \]

Where

$\frac{\partial E}{\partial w_{ij}}(t)$ = summed gradient information over all patterns of the pattern set.

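The basic idea behind the improvement realised by the Rprop algorithm is to obtain more information about the topology of the error function so that the weight update can be done more appropriately. Each 'update-value' Δij evolves during the learning process according to its local view of the error function E. Therefore, the second step of Rprop learning is to determine the new update-values. This is based on a sign-dependent adaptation process:

\[
\Delta_{ij}^{(t)} =
\begin{cases}
\eta^{+} \cdot \Delta_{ij}^{(t-1)} & \text{if } \dfrac{\partial E}{\partial w_{ij}}^{(t-1)} \cdot \dfrac{\partial E}{\partial w_{ij}}^{(t)} > 0 \\
\eta^{-} \cdot \Delta_{ij}^{(t-1)} & \text{if } \dfrac{\partial E}{\partial w_{ij}}^{(t-1)} \cdot \dfrac{\partial E}{\partial w_{ij}}^{(t)} < 0 \\
\Delta_{ij}^{(t-1)} & \text{otherwise}
\end{cases}
\tag{6.8}
\]

With

0 < η- < 1 < η+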

Note that the update-value is not influenced by the magnitude of the derivatives, but only by the behaviour of the sign of two succeeding derivatives. Every time the partial derivative of the corresponding weight changes its sign, which indicates that the last update was too big and the algorithm has jumped over a local minimum (figure 6.5b), the update-value is decreased by the factor η-. If the derivative retains its sign, the update-value is slightly increased in order to accelerate convergence in shallow regions.

Rprop also avoids the problem encountered in the well-known SuperSAB⁶ algorithm [Toll90]. There, the weight-update is still strongly dependent on the magnitude of the partial derivative and the effects of this influence spread all over the entire network. Rprop prevents this influence by changing the value of the weight update directly, only depending on the sign of the partial derivative without reference to its magnitude.

⁶ Super self-adjusting back-propagation algorithm.

Since Rprop tries to adapt its learning process to the topology of the error function, it follows the principle of 'batch learning' or 'learning by epoch'. This means that weight-update and adaptation are performed after the gradient information of the whole pattern set has been computed.
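
The sign-based update described in this section could be sketched as follows. This is a minimal illustration under stated assumptions, not the SNNS code: the function names are hypothetical, weight decay and the weight-backtracking step of the original Rprop proposal are omitted, and the lower bound `delta_min` is an assumed safeguard. The factors 0.5 and 1.2 and the upper bound 50 are the values quoted in this chapter.

```python
def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rprop_step(weights, grads, prev_grads, deltas,
               eta_minus=0.5, eta_plus=1.2,
               delta_min=1e-6, delta_max=50.0):
    """One batch Rprop update; grads are summed over the whole pattern set."""
    new_weights, new_deltas = [], []
    for w, g, pg, d in zip(weights, grads, prev_grads, deltas):
        product = g * pg
        if product > 0:        # same sign: accelerate
            d = min(d * eta_plus, delta_max)
        elif product < 0:      # sign change: a minimum was jumped over
            d = max(d * eta_minus, delta_min)
        new_deltas.append(d)
        # Only the sign of the gradient decides the direction of the step.
        new_weights.append(w - sign(g) * d)
    return new_weights, new_deltas


# Toy usage: minimise E(w) = (w - 3)^2 starting from w = 0.
w, deltas, prev = [0.0], [0.1], [0.0]
for _ in range(200):
    grad = [2.0 * (w[0] - 3.0)]
    w, deltas = rprop_step(w, grad, prev, deltas)
    prev = grad
```

Note that the gradient magnitude never enters the weight change; it is used only through its sign and through the sign of its product with the previous gradient.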

In order to reduce the number of freely adjustable parameters, which often leads to a tedious search in parameter space, the increase and decrease factors in SNNS are set to fixed values (η- = 0.5, η+ = 1.2). Thus, the Rprop algorithm takes only three parameters:

- Δ0 = initial update value.

- Δmax = limit for the maximum step size.

- α = weight decay exponent.

When learning starts, all update-values are set to an initial value Δ0. Since Δ0 directly determines the size of the first weight step, it should be chosen according to the initial values of the weights themselves, for example Δ0 = 0.1 (default setting). The choice of this value is rather uncritical, because it is adapted as learning proceeds. In order to prevent the weights from becoming too large, the maximum weight step, determined by the size of the update-value, is limited. The upper bound is set by the second parameter of Rprop, Δmax. The default upper bound is set somewhat arbitrarily to Δmax = 50. Usually, convergence is rather insensitive to this parameter as well. Nevertheless, for some problems it can be advantageous to allow only very cautious (namely small) steps, in order to prevent the algorithm from getting stuck too quickly in suboptimal local minima. The minimum step size is fixed to Δmin = 1e-6. The remaining parameter α (weight decay exponent) determines the relationship between the output error and the reduction in the size of the weights. The third parameter is set to 4, which corresponds to a ratio of weight decay term to output error of 1:10000 (1:10^4).
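
As an illustration of this ratio, assume that the composite error minimised by Rprop with weight decay has the following form. The formula is not reproduced in this chapter, so its exact shape is an assumption based on the usual SNNS formulation, with t_i the teaching output and o_i the output of output unit i:

```latex
E = \sum_i (t_i - o_i)^2 + 10^{-\alpha} \sum_{i,j} w_{ij}^2
```

Under this assumption, α = 4 gives the factor 10^(-4), i.e. the weight decay term is weighted 1:10000 relative to the output error.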

6.3.3.3 Pruning algorithms.

Pruning algorithms try to make neural networks smaller by pruning unnecessary links or units, for different reasons:

- It is possible to find a fitting architecture this way.

- The cost of a net can be reduced (think of runtime, memory and cost for hardware implementation).

- The generalisation can (but need not) be improved.

- Unnecessary input units can be pruned in order to give evidence of the relevance of input values (a kind of feature selection, see chapter 5).

Pruning algorithms can be rated according to two criteria:

- What will be pruned? We distinguish weight pruning and node pruning. Special types of node pruning are input pruning and hidden unit pruning.

- How will it be pruned? The most common possibilities are penalty term algorithms (like Backpropagation with Weight Decay) and sensitivity algorithms.

Sensitivity algorithms, which are used in this thesis, perform training and pruning of a neural net alternately, according to the following algorithm:

1. Choose a reasonable network architecture.

2. Train the net with backpropagation or any similar learning function into a minimum of the network.

3. Compute the saliency (relevance for the performance of the network) of each element (link or unit respectively).

4. Prune the element with the smallest saliency.

5. Retrain the net (into a minimum again).

6. If the net error is not too big, repeat the procedure from step 3 on.

7. Recreate the last pruned element in order to achieve a small net error again.

Figure 6.6. Pruning general algorithm.

For the experiments carried out during this study, the Magnitude Based Pruning algorithm is employed. This is the simplest weight-pruning algorithm. After each training, the link with the smallest weight is removed. Thus the saliency of a link is just the absolute size of its weight. Though this method is very simple, it rarely yields worse results than the more sophisticated algorithms. The (subordinated) learning method employed in step 2 (figure 6.6) is set, for our purposes, to the Standard Backpropagation algorithm.

There are two criteria for stopping the pruning, both based on the error after retraining. It must not exceed:

- the error before the first pruning by more than a certain percentage, determined by the user in the SNNS field “Maximum error increase in %:” (default setting=10), and

- the absolute SSE value given in the field “Maximum accepted SSE” (default setting=5).
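
The magnitude-based pruning loop and its two stopping criteria could be sketched as follows. This is a hedged illustration, not the SNNS implementation: `retrain` is a hypothetical callback standing in for the subordinated learning function, links are represented as a plain dictionary, and the two criteria are combined literally as stated above (neither may be exceeded), which may differ from how SNNS combines them internally.

```python
def magnitude_based_pruning(weights, retrain,
                            max_increase_pct=10.0, max_sse=5.0):
    """Alternate pruning and retraining until an error criterion is violated.

    weights: dict mapping a link name to its weight.
    retrain: hypothetical callback, takes a weight dict and returns
             (retrained_weights, sse).
    """
    weights, first_sse = retrain(dict(weights))       # first training
    allowed = first_sse * (1.0 + max_increase_pct / 100.0)
    while len(weights) > 1:
        # The saliency of a link is the absolute size of its weight.
        link = min(weights, key=lambda k: abs(weights[k]))
        candidate = {k: v for k, v in weights.items() if k != link}
        candidate, sse = retrain(candidate)            # retraining
        if sse > allowed or sse > max_sse:
            break   # keep the previous net, i.e. the last link is recreated
        weights = candidate
    return weights


# Toy stand-in for training: the error grows with the removed weights.
ORIGINAL = {"a": 2.0, "b": -0.01, "c": 0.02, "d": -1.5}


def toy_retrain(weights):
    missing = [k for k in ORIGINAL if k not in weights]
    return weights, 1.0 + sum(ORIGINAL[k] ** 2 for k in missing)


pruned = magnitude_based_pruning(ORIGINAL, toy_retrain)
```

In this toy case the two near-zero links are removed, while removing the next smallest link would raise the error beyond the allowed increase, so it is kept.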

SNNS also allows the user to select the number of epochs of the subordinated learning function, separately for the first training and for each retraining (default settings = 1000 and 100 respectively). The training, however, stops when the absolute error falls below the “Minimum error to stop” (default setting=1). This prevents the net from overtraining.

For the experiments made during this thesis, all the parameters described above are set to their default values. However, since the subordinated function also has its own parameters (see section 6.3.3.1), variations over them are tried in different experiments.

6.3.3.4 Multiple step vs. One step procedure.

The multiple step method is not exactly a learning algorithm but a training procedure. When a neural network is trained with one of the previously explained algorithms, the user must fix a learning rate η. The selection of this parameter strongly influences the convergence of the network: small learning rates lead to long convergence times, while large learning rates can cause oscillation in the proximity of a local minimum (figure 6.5). As said in section 6.3.3.2, Rprop tries to solve this problem. However, another way of avoiding it is attempted in this work, based on a script written by Dr. Vicky Lam that allows the user to select between two types of training: one step and multiple step.

In the one step case, the network is trained with a fixed learning rate; it can be considered as “the ordinary case”. Training stops when the number of training epochs reaches 200 cycles or when the mean square error on the evaluation set in the previous epoch is lower than the same error in the current epoch. The latter means that the network has reached a local minimum and should stop before its results worsen. The script automatically tries learning rates from 0.1 to 1 in steps of 0.02 (0.1, 0.12, … 0.98, 1).
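
The one step procedure could be sketched as follows. This is not the script by Dr. Vicky Lam; it is an assumed reconstruction of the stopping rule and of the learning rate grid described above, with `train_epoch` and `eval_mse` as hypothetical callbacks. Restoring the weights of the previous epoch after stopping is not shown.

```python
def learning_rate_grid(start=0.1, stop=1.0, step=0.02):
    """Learning rates from start to stop (inclusive) in increments of step."""
    count = round((stop - start) / step)
    return [round(start + i * step, 2) for i in range(count + 1)]


def train_until_worse(train_epoch, eval_mse, max_epochs=200):
    """Train until the evaluation MSE increases or max_epochs is reached.

    train_epoch: hypothetical callback that runs one training epoch.
    eval_mse:    hypothetical callback returning the current MSE on the
                 evaluation set.
    Returns the number of epochs that were run.
    """
    previous = eval_mse()
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        current = eval_mse()
        if previous < current:   # the evaluation error started to rise
            return epoch
        previous = current
    return max_epochs


rates = learning_rate_grid()

# Toy usage with a fixed sequence of evaluation errors.
errors = iter([1.0, 0.8, 0.6, 0.7, 0.5])
stopped_at = train_until_worse(lambda: None, lambda: next(errors))
```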

On the other hand, instead of carrying out the training in one unique stage (one step) with a fixed learning rate, the multiple step procedure makes use of four different stages. For every step, the network is trained until the number of training epochs reaches 50 cycles or until the mean square error on the evaluation set in the previous epoch is lower than the same error in the current epoch. Once the training has stopped, the resulting network is retrained with the learning rate of the next step. Sometimes results are best after the last step, and sometimes a better performance is achieved after one of the previous training steps. In the original script by Dr. Vicky Lam, only two base learning algorithms were implemented: Vanilla Backpropagation and Backpropagation with Chunkwise Update. For this thesis, the script was extended so that a third learning algorithm, Rprop, can also be used.

The learning rates corresponding to each stage of the algorithm are as follows:

1. First step: η = 1

2. Second step: η = 0.5

3. Third step: η = 0.1

4. Fourth step: η = 0.05

The remaining parameters are fixed, for both the one step and the multiple step procedure, within each base learning algorithm:

- Standard Backpropagation: dmax = 0.1

- Chunkwise Backpropagation: dmax = 0.1 and N = 50

- Rprop: Δmax = 50 and α = 4
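
The four-stage schedule could be sketched as follows. Again, this is an assumed outline rather than the original script: `train_stage` is a hypothetical callback that continues training the same network with the given learning rate (for at most 50 epochs, or until the evaluation error rises) and returns the evaluation MSE reached at the end of that stage.

```python
STEP_RATES = [1.0, 0.5, 0.1, 0.05]   # learning rates of the four stages


def multiple_step_training(train_stage, rates=STEP_RATES, max_epochs=50):
    """Run the stages in order and report which stage performed best.

    train_stage: hypothetical callback (eta, max_epochs) -> evaluation MSE.
    Returns the list of (eta, mse) pairs and the best pair.
    """
    results = []
    for eta in rates:
        mse = train_stage(eta, max_epochs)   # network state carries over
        results.append((eta, mse))
    best = min(results, key=lambda pair: pair[1])
    return results, best


# Toy usage: pretend each stage ends with a known evaluation error.
fake_mse = {1.0: 0.30, 0.5: 0.22, 0.1: 0.25, 0.05: 0.24}
results, best = multiple_step_training(lambda eta, n: fake_mse[eta])
```

Keeping the result of every stage reflects the observation above that the best performance is not always obtained after the last step.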

6.3.4 Activation functions

Activation functions for the hidden units are needed to introduce non-linearity into the network. Without non-linearity, hidden units would not make nets more powerful than plain perceptrons (which do not have any hidden units, just input and output units), because a linear function of linear functions is again a linear function. It is the non-linearity (i.e., the capability to represent non-linear functions) that makes multilayer networks so powerful. There are two main classes of activation functions: sigmoid and threshold.
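
The statement that a linear function of linear functions is again linear can be checked numerically. The sketch below uses arbitrary example weights (all values are made up for illustration, and biases are left out) and shows that a network with a linear hidden layer gives the same output as a single layer whose weight matrix is the product of the two.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]   # 2 inputs -> 3 linear hidden units
W2 = [[1.0, -2.0, 0.5]]                      # 3 hidden units -> 1 output
x = [0.3, -0.7]

two_layer = matvec(W2, matvec(W1, x))        # network with a linear hidden layer
one_layer = matvec(matmul(W2, W1), x)        # equivalent net without hidden units
```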

The threshold or step function corresponds to figure 6.7 (a). There is a linear summation of the inputs and nothing happens until the threshold θ is reached, at which point the neuron becomes active (i.e., shows a certain level of activation). Such units are often called linear threshold units.

The sigmoid function is so called because it is shaped like one form of the Greek letter sigma, as illustrated in figure 6.7 (b). It is, in essence, a smooth version of a step function. It is close to zero for low input. At some point it starts rising rapidly and then, at even higher levels of input, it saturates. This saturation property can be observed in nature, where the firing rates of neurons are limited by biological factors. The slope β (also called gain) of the sigmoid function can be changed: the larger β, the steeper the slope and the more closely it approximates the threshold function. Its purpose within an artificial neurone is to generate a degree of non-linearity between the neurone's input and output. Sigmoidal functions such as the logistic and tanh (hyperbolic tangent) functions, and the Gaussian function, are the most common choices. For hidden units, sigmoid activation functions are usually preferable to threshold activation functions.
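
The effect of the gain β can be illustrated with a short sketch. The parameterisation below (gain multiplying the distance of the net input from the threshold) is one common convention and is an assumption here; very large arguments are not guarded against overflow.

```python
import math


def threshold(net, theta=0.0):
    """Step function: the unit is active once the threshold is reached."""
    return 1.0 if net >= theta else 0.0


def sigmoid(net, beta=1.0, theta=0.0):
    """Logistic sigmoid with gain beta, centred on theta."""
    return 1.0 / (1.0 + math.exp(-beta * (net - theta)))


gentle = sigmoid(0.5, beta=1.0)    # smooth transition
steep = sigmoid(0.5, beta=50.0)    # almost identical to threshold(0.5)
```

For β = 1 the output at net = 0.5 is still far from saturation, whereas for β = 50 it is practically indistinguishable from the step function.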

Networks with threshold units are difficult to train because the error function is stepwise constant, hence the gradient either does not exist or is zero, making it impossible to use backprop or more efficient gradient-based training methods. With sigmoid units, a very small change in the weights will usually produce a change in the outputs, which makes it possible to tell whether that change in the weights is good or bad. With threshold units, a small change in the weights will often produce no change in the outputs. In addition, DasGupta and Schnitger conducted a comparison study [Das93] of different activation functions in terms of efficiency and quality of approximation. They conclude that the standard sigmoid is actually more powerful than the binary threshold, even when computing boolean functions. Despite the agreement among experts that sigmoidal activation functions are optimal for neural network training, the selection of an adequate activation function remains a wide field of research (see [Duc01, Jan01]).

Figure 6.7. Threshold (a) and sigmoid (b) activation functions.

By means of the activation function, a new activation is computed from the outputs of the preceding units, usually multiplied by the weights connecting these predecessor units with the current unit, the old activation of the unit and its bias. The general formula is:

\[
a_j(t+1) = f_{act}\big(net_j(t),\; a_j(t),\; \theta_j\big)
\tag{6.9}
\]

Where

aj(t) = activation of unit j in step t.

netj(t) = net input in unit j in step t.

θj = threshold (bias) of unit j.

A considerable number of different activation functions can be found. In this Diploma Thesis, we mainly employed the logistic activation function, but a small number of experiments also tried the tanh function. Both functions are described in this section.

6.3.4.1 Logistic activation function.

This function computes the network input simply by summing over all weighted activations and then squashing the result with the logistic function f_act(x) = 1/(1 + e^(-x)). The new activation at time (t+1) lies in the range [0,1]. The variable θj is the threshold of unit j.

The net input is computed with:

\[
net_j(t) = \sum_i w_{ij}\, o_i(t)
\tag{6.10}
\]

This yields the well-known logistic activation function:

\[
a_j(t+1) = \frac{1}{1 + e^{-\left(\sum_i w_{ij}\, o_i(t) \,-\, \theta_j\right)}}
\tag{6.11}
\]

Where

aj(t) = activation of unit j in step t.

netj(t) = net input in unit j in step t.

oi(t) = output of unit i in step t.

j = index for some unit in the net.

i = index of a predecessor of the unit j.

wij = weight of the link from unit i to unit j.

θj = threshold (bias) of unit j.
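
The computation of a single logistic unit could be sketched as follows. The function names are illustrative, and subtracting the threshold from the net input is an assumption about the sign convention; some formulations add a bias term instead.

```python
import math


def net_input(weights, outputs):
    """Weighted sum of the predecessor outputs o_i with weights w_ij."""
    return sum(w * o for w, o in zip(weights, outputs))


def logistic_activation(weights, outputs, theta=0.0):
    """New activation of unit j, always inside the range [0, 1]."""
    net = net_input(weights, outputs)
    return 1.0 / (1.0 + math.exp(-(net - theta)))


balanced = logistic_activation([1.0, -1.0], [0.5, 0.5])            # net input 0
shifted = logistic_activation([1.0, -1.0], [0.5, 0.5], theta=1.0)  # raised threshold
```

A net input equal to the threshold gives an activation of exactly 0.5; raising the threshold lowers the activation for the same inputs.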

6.3.4.2 Hyperbolic tangent activation function.

This function has a sigmoid shape similar to the logistic function, but its values are spread over the interval [-1, 1] rather than [0, 1]. Its formula reads as follows:

\[
a_j(t+1) = \tanh\big(net_j(t)\big) = \frac{e^{net_j(t)} - e^{-net_j(t)}}{e^{net_j(t)} + e^{-net_j(t)}}
\tag{6.12}
\]

Where

aj(t) = activation of unit j in step t.

netj(t) = net input in unit j in step t.

j = index for some unit in the net.
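
The similarity between the two functions can be made precise: the hyperbolic tangent is a rescaled and shifted logistic function, tanh(x) = 2 · logistic(2x) - 1. The short check below verifies this identity numerically; the helper name `logistic` is only used for this illustration.

```python
import math


def logistic(x):
    """Logistic function with values in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_logistic(x):
    """Hyperbolic tangent expressed through the logistic function."""
    return 2.0 * logistic(2.0 * x) - 1.0


samples = [-2.0, -0.5, 0.0, 1.3]
differences = [abs(math.tanh(x) - tanh_from_logistic(x)) for x in samples]
```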

6.3.5 Analysing Functions.

Once the network has produced some outcomes, the way they are interpreted also has a big influence on the global results. Analysis functions are not related to the neural network training itself; they take the output of a fixed trained network and make decisions. The output of each node in a neural network is a real value in the range [0,1], and the aim of the analysing functions is to decide the meaning of the output vector.

SNNS has three different analysis criteria: 402040, WTA and Band. Each rule has two adjustable parameters, h and l, whose meaning is specific to the given method. The analysis rule will make a correct, wrong or unknown inference. Note that an unclassified output does not allow any conclusion about the input, and therefore no information can be extracted. For some applications, as we found during the preliminary experiments (section 8.1), the categorisation of a pattern into the class “unknown” provides no valuable information. Nevertheless, this class can easily be avoided by modifying the thresholds h and l. The decision rules for these methods are detailed in the following subsections.

6.3.5.1 402040 decision rule.

A pattern is classified correctly if:

- the output of exactly one output unit is ≥ h.

- the teaching output of this unit is the maximum teaching output (>0) of the pattern.

- the output of all other output units is ≤ l.

A pattern is classified incorrectly if:

- the output of exactly one output unit is ≥ h.

- the teaching output of this unit is NOT the maximum teaching output of the pattern or there is no teaching output > 0.

- the output of all other output units is ≤ l.

A pattern is unclassified in all other cases.
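
The 402040 rule could be sketched as follows. The function name and the return labels are illustrative, and the default values h = 0.6 and l = 0.4 are an assumption suggested by the name of the rule (40% low band, 20% undecided band, 40% high band).

```python
def classify_402040(outputs, teaching, h=0.6, l=0.4):
    """Return 'correct', 'wrong' or 'unknown' for one pattern."""
    high = [i for i, o in enumerate(outputs) if o >= h]
    if len(high) != 1:
        return "unknown"            # no unit or several units above h
    winner = high[0]
    if any(o > l for i, o in enumerate(outputs) if i != winner):
        return "unknown"            # another unit lies in the undecided band
    t_max = max(teaching)
    if t_max > 0 and teaching[winner] == t_max:
        return "correct"
    return "wrong"


r1 = classify_402040([0.9, 0.1, 0.2], [1, 0, 0])
r2 = classify_402040([0.1, 0.9, 0.2], [1, 0, 0])
r3 = classify_402040([0.9, 0.5, 0.2], [1, 0, 0])
r4 = classify_402040([0.5, 0.1, 0.2], [1, 0, 0])
```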

6.3.5.2 WTA (Winner Takes All)

A pattern is classified correctly if:

- there is an output unit with a value greater than the output value of all other output units (let this output value be a).

- a > h.

- the teaching output of this unit is the maximum teaching output of the pattern (>0).

- the output of all other units is < a - l.

A pattern is classified incorrectly if:

- there is an output unit with a value greater than the output value of all other output units (let this output value be a).

- a > h.

- the teaching output of this unit is NOT the maximum teaching output of the pattern (>0).

- the output of all other output units is < a - l.

A pattern is unclassified in all other cases.
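
A sketch of the WTA rule is given below. The function name, the return labels and the default values h = 0 and l = 0 are assumptions made for the example; with these defaults every pattern with a unique maximum output is classified.

```python
def classify_wta(outputs, teaching, h=0.0, l=0.0):
    """Winner-takes-all decision: 'correct', 'wrong' or 'unknown'."""
    a = max(outputs)
    winners = [i for i, o in enumerate(outputs) if o == a]
    if len(winners) != 1 or a <= h:
        return "unknown"            # no unique winner, or winner too weak
    winner = winners[0]
    if any(o >= a - l for i, o in enumerate(outputs) if i != winner):
        return "unknown"            # a competitor is too close to the winner
    t_max = max(teaching)
    if t_max > 0 and teaching[winner] == t_max:
        return "correct"
    return "wrong"


w1 = classify_wta([0.3, 0.2, 0.1], [1, 0, 0])
w2 = classify_wta([0.3, 0.25, 0.1], [1, 0, 0], h=0.0, l=0.1)
w3 = classify_wta([0.2, 0.6, 0.1], [1, 0, 0], h=0.5, l=0.1)
w4 = classify_wta([0.2, 0.4, 0.1], [1, 0, 0], h=0.5, l=0.1)
```

The second call shows how a larger l turns a narrow win into the class “unknown”, and the fourth how a larger h rejects weak winners.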

6.3.5.3 Band decision rule.

A pattern is classified correctly if for all output units:

- the output is ≥ the teaching output - l.

- the output is ≤ the teaching output + h.

A pattern is classified incorrectly if for all output units:

- the output is < the teaching output - l

or

- the output is > the teaching output + h.

This rule is especially useful when the network has one single output node and the decision, instead of resolving which node is the winner, has to be based on a division of the output range into bands of values, each band being assigned to a different class.
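
The Band rule could be sketched as follows. The function name and labels are illustrative. Treating a pattern whose output units are partly inside and partly outside the band as “unknown” is an assumption that follows from reading both conditions above literally.

```python
def classify_band(outputs, teaching, h, l):
    """Band decision: compare each output with a band around its target."""
    inside = [t - l <= o <= t + h for o, t in zip(outputs, teaching)]
    if all(inside):
        return "correct"
    if not any(inside):
        return "wrong"
    return "unknown"                # mixed case, assumed to be unclassified


b1 = classify_band([0.9], [1.0], h=0.2, l=0.2)
b2 = classify_band([0.5], [1.0], h=0.2, l=0.2)
b3 = classify_band([0.9, 0.5], [1.0, 0.0], h=0.2, l=0.2)
```

With a single output node, as in the first two calls, the mixed case cannot occur and every pattern is either correct or wrong.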

6.3.5.4 Post-analysis method based on thresholds.

This analysis procedure is applied to the neural network outputs in order to impose some restrictions on the winner chosen by the WTA rule. The analysis is performed using a C program created specifically for this work: confusion_th.

After choosing the winner candidate of the output through the WTA rule, a decision based on two different thresholds determines whether this value can actually be considered as the winner or not. These thresholds are defined as follows:

- Threshold 1: Minimum value of the output to be considered as the winner. When the winner candidate does not exceed this value, the pattern is classified as neutral. The underlying idea is that the pattern is not emotive enough to be classified into the winner class.

- Threshold 2: Maximum value of the opposite emotion or emotional groups. When an utterance is classified into one emotion, e.g. angry, the output values for the emotions situated on the opposite side of the axis, e.g. bored and sad for the arousal dimension, must not exceed this value. Otherwise, the winner candidate is classified as neutral. This is based on the experimental observation (see Chapter 8) that the mean output values of opposite emotions are well differentiated, for both the five-output and the three-output case.
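
The two-threshold decision could be sketched as follows. This is not the confusion_th program, which is written in C and not reproduced here; the function name, the dictionary representation of the outputs, the mapping of opposite emotions and the threshold values are all assumptions chosen for illustration.

```python
def post_analysis(outputs, opposites, threshold1, threshold2):
    """Accept the WTA winner only if both threshold conditions hold.

    outputs:   dict mapping an emotion to its network output.
    opposites: dict mapping an emotion to the emotions on the opposite
               side of the axis (hypothetical mapping).
    """
    winner = max(outputs, key=outputs.get)          # WTA candidate
    if outputs[winner] <= threshold1:
        return "neutral"                            # not emotive enough
    for emotion in opposites.get(winner, []):
        if outputs[emotion] > threshold2:
            return "neutral"                        # opposite emotion too strong
    return winner


OPPOSITES = {"angry": ["bored", "sad"]}

p1 = post_analysis({"angry": 0.8, "bored": 0.1, "sad": 0.05}, OPPOSITES, 0.5, 0.2)
p2 = post_analysis({"angry": 0.8, "bored": 0.3, "sad": 0.05}, OPPOSITES, 0.5, 0.2)
p3 = post_analysis({"angry": 0.4, "bored": 0.1, "sad": 0.05}, OPPOSITES, 0.5, 0.2)
```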

6.4 Leave-one-out cross validation

6.4.1 Leave-one-sentence out.

When training a classifier, the amount of data used for its training will influence the quality of the learned model. Intuitively, if more examples of a class are given, the classifier will tend to construct better generalisations. In order to increase the reliability of the results obtained in the speaker dependent experiments, for which the recorded database was not particularly large, the leave-one-sentence-out procedure is applied for the evaluation.

Suppose we have N patterns to train and test the model. If we divide the set into two subsets, i.e. a training and a testing set, the results depend on the division and, in addition, the amount of data used for each task is reduced. With the leave-one-out method these problems are to some extent solved. The method takes N-1 patterns to train the classifier and then tests it with the remaining pattern. This procedure is repeated for all the available patterns from 1 to N. This way, the classifier is trained with almost all of the data (N-1 patterns) and is tested, after the whole iteration, on the complete set.
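
The leave-one-out loop could be sketched as follows. The classifier is passed in through two hypothetical callbacks, `train` and `predict`; a toy nearest-neighbour classifier on one-dimensional data stands in for the real classifier.

```python
def leave_one_out(patterns, train, predict):
    """Train on N-1 patterns, test on the remaining one, for every pattern.

    patterns: list of (features, label) pairs.
    Returns the fraction of correctly classified held-out patterns.
    """
    correct = 0
    for i, (features, label) in enumerate(patterns):
        training_set = patterns[:i] + patterns[i + 1:]
        model = train(training_set)
        if predict(model, features) == label:
            correct += 1
    return correct / len(patterns)


# Toy stand-in classifier: 1-nearest-neighbour on a single feature.
def train_1nn(training_set):
    return list(training_set)


def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


data = [(0.0, "a"), (0.1, "a"), (1.0, "b"), (1.1, "b")]
accuracy = leave_one_out(data, train_1nn, predict_1nn)
```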

6.4.2 Leave-one-speaker out

In order to evaluate the speaker independence of the classifier, it should be tested on a completely unknown subject. Thus, from all the available speakers, some should be used for training while the remaining ones are used for testing. Problems similar to those found in section 6.4.1 arise. In order to make the most of the available data, the leave-one-speaker-out procedure is employed for the speaker independent experiments.

Suppose we have S speakers; then S-1 are used during the training step and the resulting classifier is tested on the remaining speaker. This is repeated for all the speakers and statistics are computed over the results of the whole set.
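
The corresponding data split could be sketched as follows. The function name and the tuple layout of the samples (speaker identifier, features, label) are assumptions made for the example.

```python
def leave_one_speaker_out(samples):
    """Yield (held_out_speaker, training_samples, test_samples) per speaker.

    samples: list of (speaker_id, features, label) tuples.
    """
    speakers = sorted({speaker for speaker, _, _ in samples})
    for held_out in speakers:
        training = [s for s in samples if s[0] != held_out]
        testing = [s for s in samples if s[0] == held_out]
        yield held_out, training, testing


samples = [
    ("sp1", [0.1], "angry"), ("sp1", [0.9], "sad"),
    ("sp2", [0.2], "angry"), ("sp2", [0.8], "sad"),
    ("sp3", [0.3], "angry"),
]
folds = list(leave_one_speaker_out(samples))
```

Each fold tests the classifier on a speaker that contributed no data to its training set, so S speakers give S folds.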
