The Study of the Sleep and Vigilance Electroencephalogram Using Neural Network Methods
Mayela E. Zamora
St Cross College
Supervisor: Prof. L. Tarassenko
Sponsor: Universidad Central de Venezuela
A thesis submitted to the
Department of Engineering Science, University of Oxford,
in fulfilment of the requirements for the degree of Doctor of Philosophy.
Hilary Term, 2001
Declaration
I declare that this thesis is entirely my own work and, except where otherwise stated, describes my own research.
M. E. Zamora, St Cross College
Mayela E. Zamora, St Cross College
Doctor of Philosophy, Hilary Term, 2001
The Study of the Sleep and Vigilance Electroencephalogram
Using Neural Network Methods
Abstract
This thesis describes the use of neural network methods for the analysis of the electroencephalogram
(EEG), primarily in subjects with a severe sleep disorder known as Obstructive Sleep Apnoea (OSA). This
is a condition in which breathing stops briefly and repeatedly during sleep, causing frequent awakening
as the subject gasps for breath. Day-time sleepiness is the main symptom of OSA, but current methods
of assessing the level of drowsiness are either time-consuming (e.g. scoring the EEG) or unreliable
(e.g. subjective measures of the person’s sense of sleepiness, performance in vigilance tasks, etc.). The work presented
in this thesis is two-fold. In the first part, a method for the automatic detection of micro-arousals from
features extracted from single-channel EEG is developed and tested. Autoregressive (AR) modelling is used
to extract the features from the EEG. A compromise was found between the stationarity requirements of
AR modelling and the variance of the AR estimates by using a 3-second analysis window with a 2-second
overlap. The EEG features are then used as the inputs to a multi-layer perceptron (MLP) neural network
trained to track the sleep-wake continuum. It was found that a micro-arousal may cause an increase in the
slow rhythms (δ band) of the EEG at the same time as it causes an increase in the amplitude of the higher
frequencies (α and/or β bands). The automated system shows high sensitivity Se (median 0.97) and
positive predictive accuracy PPA (median 0.94) when validated against a human expert’s scores. This
is the first time that AR modelling has been used in micro-arousal detection. Visualisation analysis of the
EEG features revealed that Alertness and Drowsiness in vigilance tests are not the same as Wakefulness
and REM/Light Sleep in a sleep-promoting environment. The second part of the thesis describes the
application of another MLP neural network, trained to track the alertness-drowsiness continuum from
single-channel EEG, on OSA patients performing a visual attentional task. It was found that OSA subjects
may present “drowsy” EEG while performing well during the visual vigilance test. Also, the MLP analysis
of the wake EEG with these subjects showed that the transition to drowsiness may occur progressively
as well as in sudden dips. Correlation of the MLP output with a measure of task performance and
visualisation of EEG patterns in feature space show that the alertness EEG patterns of OSA subjects may
be closely related to the drowsiness EEG patterns of normal sleep-deprived subjects.
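The feature-extraction scheme summarised above (an AR model fitted to a 3-second analysis window sliding with a 2-second overlap, i.e. one feature vector per second of EEG) can be sketched as follows. This is an illustrative sketch only: the thesis fits AR models with the Burg algorithm (see Chapter 4), whereas this sketch solves the Yule-Walker equations directly, and the sampling rate, 10th model order and synthetic test signal are assumptions made here for the example.

```python
import numpy as np

def ar_coefficients(x, order):
    """Estimate AR coefficients by solving the Yule-Walker equations
    with a biased autocorrelation estimate (illustrative, not Burg)."""
    x = x - x.mean()
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz autocorrelation matrix R and right-hand side r[1..order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def sliding_ar_features(eeg, fs, order=10, win_s=3.0, overlap_s=2.0):
    """One AR feature vector per step: 3-s window, 2-s overlap (1-s step)."""
    win = int(win_s * fs)
    step = int((win_s - overlap_s) * fs)
    return np.array([ar_coefficients(eeg[i:i + win], order)
                     for i in range(0, len(eeg) - win + 1, step)])

# Synthetic "EEG": a 10 Hz (alpha-band) rhythm plus broadband noise
fs = 128
rng = np.random.default_rng(0)
t = np.arange(30 * fs) / fs
eeg = np.sin(2 * np.pi * 10 * t) + 0.5 * rng.standard_normal(len(t))

features = sliding_ar_features(eeg, fs)
print(features.shape)  # → (28, 10): one 10-coefficient vector per 1-s step
```

Each row of `features` would then serve as one input pattern to a classifier such as the MLP described in the thesis; the short window keeps the quasi-stationarity assumption plausible while the overlap restores a 1-second output rate.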
Acknowledgments
I am most grateful to Prof Lionel Tarassenko for supervising this work. Many thanks to my collaborators
at the Osler Chest Unit, Churchill Hospital, Dr John Stradling, Dr Melissa Hack, Dr Robert Davies and Dr
Lesley Bennett for providing the test data and the valuable clinical support. I would also like to thank Dr
Chris Alford for his helpful comments on the clinical aspects of this work.
To the Consejo de Desarrollo Cientifico y Humanistico de la Universidad Central de Venezuela, I extend
my sincere gratitude for the financial support, and to the staff of its Departamento de Recursos Humanos
for the quality of service that they gave me during my stay in the UK.
I am also very appreciative of all my fellow labmates, especially Dr Mihaela Duta, Dr Ruth Ripley, David
Clifton, Gari Clifford, Dr Simukai Utete, Dileepan Joseph, Iain Strachan, Dr Steve Collins, Dr Taigang He
and Dr Neil Townsend for their friendship and help. Special thanks to Jan Minchington for the efficient
office support and natural kindness. To all my friends in Oxford, and in Caracas, a million thanks.
Finally, and most importantly, endless gratitude to my parents, to my sisters, and to Neal for their
continuous support, cheering and love.
Contents
1 Introduction 1
1.1 Overview of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Sleep and day-time sleepiness 3
2.1 Sleep, wakefulness, sleepiness and alertness . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 The process of falling asleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.3 Going on to a deeper sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Breathing and sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Normal sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 Obstructive Sleep Apnoea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Daytime sleepiness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Causes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Sleepiness/fatigue related accidents . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.3 Correlation between OSA and accidents . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Measuring the sleep-wake continuum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.1 Measuring sleepiness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.2 Measuring sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Previous work on EEG monitoring for micro-arousals and day-time vigilance 12
3.1 The EEG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Origin of the brain electrical activity . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.2 Description of the EEG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.3 Recording the EEG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.4 Extracerebral potentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Analysis of the EEG during sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.1 Changes in the EEG from alert wakefulness to deep sleep . . . . . . . . . . . . . . . 20
3.2.2 Visual scoring method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.3 Computerised analysis of the sleep EEG . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Analysis of the EEG for the detection of micro-arousals . . . . . . . . . . . . . . . . . . . . 29
3.3.1 Cortical arousals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.2 ASDA rules for cortical arousals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.3 Computerised micro-arousal scoring . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.4 Using physiological signals other than the EEG . . . . . . . . . . . . . . . . . . . . 33
3.3.5 Using the EEG in arousal detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Analysis of the EEG for vigilance monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.1 Changes in the EEG from alertness to drowsiness . . . . . . . . . . . . . . . . . . . 35
3.4.2 EEG analysis in vigilance studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.3 Vigilance monitoring algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4 Parametric modelling and linear prediction 43
4.1 Spectrum estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.1 Deterministic continuous-time signals . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.2 Stochastic signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Autoregressive Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 AR parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Asymptotic stationarity of an AR process . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Yule-Walker equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.3 Using an AR model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Linear Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.1 Wiener Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.2 Linear Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5 Maximum entropy method (MEM) for power spectrum density estimation . . . . . . . . . 66
4.6 Algorithms for AR modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.6.1 Levinson-Durbin recursion to solve the Yule-Walker equation . . . . . . . . . . . . . 67
4.6.2 Other algorithms for AR parameter estimation . . . . . . . . . . . . . . . . . . . . . 72
4.6.3 Sensitivity to additive noise of the AR model PSD estimator . . . . . . . . . . . . . 78
4.7 Modelling the EEG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5 Neural network methods 81
5.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1.1 The error function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.1.2 The decision-making stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.1.3 Multi-layer perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 Optimisation algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2.1 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2.2 Conjugate gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3 Model order selection and generalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.1 Regularisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3.2 Early stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3.3 Performance of the network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.4 Radial basis function neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4.1 Training an RBF network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.4.2 Comparison between an RBF and an MLP . . . . . . . . . . . . . . . . . . . . . . . 108
5.5 Data visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.5.1 Sammon map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.5.2 NeuroScale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6 Sleep Studies 115
6.1 Using neural networks with normal sleep data: benchmark experiments . . . . . . . . . . 115
6.1.1 Previous work on normal sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.1.2 Data Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.1.3 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1.4 Assembling a balanced database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.1.5 Data visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.1.6 Training a Multi-Layer Perceptron neural network . . . . . . . . . . . . . . . . . . . 123
6.1.7 Sleep analysis using the trained neural networks . . . . . . . . . . . . . . . . . . . 126
6.2 Using the neural networks with OSA sleep data . . . . . . . . . . . . . . . . . . . . . . . . 130
6.2.1 Data description, pre-processing and feature extraction . . . . . . . . . . . . . . . . 130
6.2.2 MLP analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2.3 Detection of μ-arousals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2.4 The choice of threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7 Visualisation of the alertness-drowsiness continuum 146
7.1 The vigilance database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.1.1 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.1.2 Visualising the vigilance database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.1.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.2 Visualising vigilance and sleep data together . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.2.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
7.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
8 Training a neural network to track the alertness-drowsiness continuum 164
8.1 Neural Network training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.1.1 The training database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.1.2 The neural network architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.1.3 Choosing training, validation and test sets . . . . . . . . . . . . . . . . . . . . . . . 165
8.1.4 Optimal (n − 1)-subject MLP per partition . . . . . . . . . . . . . . . . . . . . . . . 167
8.2 Testing on the nth subject . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.2.1 Qualitative correlation with expert labels . . . . . . . . . . . . . . . . . . . . . . . . 169
8.2.2 Quantitative correlation with expert labels . . . . . . . . . . . . . . . . . . . . . . . 170
8.3 Training an MLP with n subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.4 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9 Testing using the vigilance trained network 181
9.1 Vigilance test database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
9.2 Running the 7-subject vigilance MLP with test data . . . . . . . . . . . . . . . . . . . . . . 182
9.2.1 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
9.2.2 MLP analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
9.3 Visualisation analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
9.3.1 Projection onto the 7-subject vigilance NEUROSCALE map . . . . . . . . . . . . . . . 210
9.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
9.5 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
10 Conclusions and future work 220
10.1 Overview of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
10.2 Discussion of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.3 Main research results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
10.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
10.5 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
A Discrete-time stochastic processes 227
A.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
B Conjugate gradient optimisation algorithms 232
B.1 The conjugate gradient directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
B.1.1 The conjugate gradient algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
B.2 Scaled conjugate gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
B.2.1 The scaled conjugate gradient algorithm . . . . . . . . . . . . . . . . . . . . . . . . 237
C Vigilance Database 238
D LED Database 240
D.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
D.2 Demographic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
List of Figures
2.1 The human brain showing its main structures . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 A conventional all night sleep classification plot from one normal subject . . . . . . . . . . 11
3.1 A simplified neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 The 10-20 International System of Electrode Placement . . . . . . . . . . . . . . . . . . . . 17
3.3 Conventional electrode positions for monitoring sleep . . . . . . . . . . . . . . . . . . . . . 19
3.4 Sleep EEG stages (taken from [69]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Apnoeic event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1 Stochastic process model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Autoregressive filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Moving Average filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4 Moving Average Autoregressive filter (b0 = 1, q = p − 1) . . . . . . . . . . . . . . . . . . . 53
4.5 Time series of the synthesised AR process . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6 Autocorrelation function of the synthesised AR process . . . . . . . . . . . . . . . . . . . 59
4.7 Second order AR process generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.8 Second order AR process analyser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.9 AR coefficients estimates’ mean and variance . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.10 Filter problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.11 Prediction filter of order p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.12 Prediction-error filter of order p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.13 Prediction-error filter of order p rearranged to look like an AR analyser . . . . . . . . . . 65
4.14 Lattice filter of first order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.15 Lattice filter of first order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 The classification process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2 An artificial neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3 Hyperbolic tangent and Sigmoid functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 An I−J−K neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.5 Early stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6 A radial basis function network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.1 The neural network’s wakefulness P(W), REM/light sleep P(R) and deep sleep P(S) outputs; and measure of sleep depth P(W)-P(S) (from Pardey et al. [123]) . . . . . . . . . . 117
6.2 Mean error and covariance matrix trace for reflection coefficients computed with the Burg algorithm (wakefulness and Sleep stage 4) vs data length N . . . . . . . . . . . . . . . . 121
6.3 Sammon map for the balanced sleep dataset; classes W, R and S . . . . . . . . . . . . . . . 123
6.4 NEUROSCALE map for the balanced sleep dataset; classes W, R and S . . . . . . . . . . . . 124
6.5 Average performance of the MLPs vs number of hidden units . . . . . . . . . . . . . . . . . 125
6.6 Performance of the 10-6-3 MLP vs regularisation parameters . . . . . . . . . . . . . . . . . 127
6.7 MLP outputs, P(W), P(R) and P(S) for subject 9’s all-night record, showing a 12-minute segment in detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.8 Sleep database subject 9 P(W)-P(S), raw (a) and 31-pt median filtered (b), compared to the human expert scored hypnogram (c) . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.9 OSA sleep MLP outputs for subjects 3 and 8 . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.10 [P(W)-P(S)] output for OSA sleep subjects 3 (top) and 8 (middle), and for normal sleep subject 9 (bottom) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.11 μ-arousal detection procedure. Upper trace: [P(W)-P(S)] and a 0.5 threshold; middle trace: thresholding result; lower trace: μ-arousal automatic score with ASDA timing criteria 133
6.12 μ-arousal validation. Upper trace: automated score for 0.7 threshold; middle trace: automated score for 0.8 threshold; lower trace: visually scored signal . . . . . . . . . . . 134
6.13 Se, PPA and Corr vs threshold for OSA subjects . . . . . . . . . . . . . . . . . . . . . . . 137
6.14 [P(W)-P(S)] output for OSA sleep subject 2 (top) and amplitude histogram showing the two main clusters, surrounded by a circle of one standard deviation, and the EDM threshold (bottom) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.15 Se, PPA and Corr for the best threshold (blue), the EDM threshold (red), and a 0.5 fixed threshold (green) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.16 OSA subject 5 EEG and [P(W )-P(S)] output during a typical μ-arousal for this subject (24s) 141
6.17 Spectrogram of the EEG segment shown in Fig. 6.16 calculated with 1s resolution using 10th-order AR modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.18 OSA subject 5 EEG and [P(W)-P(S)] output during a μ-arousal missed by the automated scoring system (24s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.19 OSA subject 8 [P(W )-P(S)] output and human expert scores (2 minutes) . . . . . . . . . . 144
6.20 Sleep database subject 9 raw P(W)-P(S) using a 1-s analysis window (a) and using a 3-s analysis window (b), compared to the human expert scored hypnogram (c) . . . . . . . . 145
7.1 Vigilance Sammon map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.2 Vigilance NEUROSCALE map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.3 Vigilance Sammon map showing subject’s distribution (Alertness in red and Drowsiness in blue) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.4 Vigilance NEUROSCALE map projections for each subject (Alertness in magenta and Drowsiness in blue) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.5 Vigilance NEUROSCALE map trained with all subjects, including the α+ subject . . . . . . . 158
7.6 Vigilance NEUROSCALE trained with all subjects, including α+ subject (Alertness in magenta and Drowsiness in blue) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.7 Subject 8 reflection coefficient histogram (green) in relation to the rest of the subjects in the training set (magenta) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.8 Vigilance and sleep NEUROSCALE map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.9 Vigilance and sleep NEUROSCALE projections for all the patterns in each class (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue) . . . . . . . . . . . . . . . . . . . . 161
7.10 Vigilance and Sleep Sammon map (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.11 Vigilance and Sleep Sammon map (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.1 Average misclassification error for the validation set vs. number of hidden units J for the (n − 1)-subject MLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.2 Average misclassification error on the validation set with respect to regularisation parameters (νz, νy) for the (n − 1)-subject MLP with J = 3 (linear interpolation used between 12 values) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.3 Time course of the MLP output for vigilance subject 1 . . . . . . . . . . . . . . . . . . . . 173
8.4 Time course of the MLP output for vigilance subject 2 . . . . . . . . . . . . . . . . . . . . 174
8.5 Time course of the MLP output for vigilance subject 3 . . . . . . . . . . . . . . . . . . . . 175
8.6 Time course of the MLP output for vigilance subject 4 . . . . . . . . . . . . . . . . . . . . 176
8.7 Time course of the MLP output for vigilance subject 5 . . . . . . . . . . . . . . . . . . . . 177
8.8 Time course of the MLP output for vigilance subject 6 . . . . . . . . . . . . . . . . . . . . 178
8.9 Time course of the MLP output for vigilance subject 7 . . . . . . . . . . . . . . . . . . . . 179
8.10 Average misclassification error for the validation set vs. number of hidden units J for the7-subject MLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
9.1 LED subject 1 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 186
9.2 LED subject 1 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 187
9.3 LED subject 2 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 188
9.4 LED subject 2 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 189
9.5 LED subject 3 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 190
9.6 LED subject 3 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 191
9.7 LED subject 3 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 192
9.8 LED subject 4 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 193
9.9 LED subject 4 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 194
9.10 LED subject 5 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 195
9.11 LED subject 5 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 196
9.12 LED subject 6 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 197
9.13 LED subject 6 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 198
9.14 LED subject 6 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 199
9.15 LED subject 7 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 200
9.16 LED subject 7 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 201
9.17 LED subject 8 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 202
9.18 LED subject 8 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 203
9.19 LED subject 9 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 204
9.20 LED subject 9 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 205
9.21 LED subject 10 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . 206
9.22 LED subject 10 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . 207
9.23 LED subjects MLP output vs missed hits scatter plots . . . . . . . . . . . . . . . . . . . . . 208
9.24 LED subjects MLP output vs missed hits scatter plots . . . . . . . . . . . . . . . . . . . . . 209
9.25 LED subjects no-missed hits MLP output histogram . . . . . . . . . . . . . . . . . . . . . . 210
9.26 Patterns from LED subjects 1 and 2 projected onto the 7-subject vigilance NEUROSCALE map 211
9.27 Patterns from LED subjects 3 and 5 projected onto the 7-subject vigilance NEUROSCALE map 212
9.28 Patterns from LED subjects 7, 9 and 10 projected onto the 7-subject vigilance NEUROSCALE map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
9.29 Patterns from LED subjects 4 and 6 projected onto the 7-subject vigilance NEUROSCALE map 218
9.30 Patterns from LED subject 8 projected onto the 7-subject vigilance NEUROSCALE map . . . 219
A.1 Stochastic process ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
List of Tables
3.1 The Rechtschaffen and Kales standard for sleep scoring. . . . . . . . . . . . . . . . . . . . 22
3.2 The vigilance sub-categories and their definition . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1 AR coefficients estimates’ mean and variance . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Feedback coefficients in terms of the reflection coefficients . . . . . . . . . . . . . . . . . . 72
4.3 Reflection coefficients in terms of the feedback coefficients . . . . . . . . . . . . . . . . . . 72
6.1 Mean error and trace of covariance matrix for synthesised EEG reflection coefficients (wakefulness) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.2 Mean error and trace of covariance matrix for synthesised EEG reflection coefficients (stage 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3 Misclassification error (expressed as a percentage) for the best three MLPs . . . . . . . . . 126
6.4 Se, PPA and Corr per subject for various threshold values . . . . . . . . . . . . . . . . . . 136
6.5 Optimal threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.6 Equi-distance to means (EDM) threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.7 Fixed (0.5) threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.1 Alford et al. vigilance sub-categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.2 Number of patterns per subject per class in vigilance training database . . . . . . . . . . . 149
7.3 Number of patterns per subject per class in K-means training set . . . . . . . . . . . . . . 150
8.1 Partitions and distribution of patterns in training (Tr) and Validation (Va) sets . . . . . . . 166
8.2 Optimum MLP parameters per partition and percentage classification error for training (Tr) and validation (Va) sets . . . 167
8.3 Percentage correlation between 1-s segments of the 15-pt median filtered MLP output and 15s-based expert labels . . . 170
C.1 Bristol subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
D.1 Time of falling asleep (in mm:ss) measured by the clinician from the start of the MWT test. The letter used in this thesis to refer to a given test is shown in brackets . . . 241
D.2 Subject demographic details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
D.3 Overnight sleep study results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Chapter 1
Introduction
Obstructive Sleep Apnoea (OSA) is a condition in which breathing stops briefly and repeatedly during
sleep, causing frequent awakening as the subject gasps for breath. Day-time sleepiness is the main symp-
tom of OSA. Diagnosis of the disorder includes an over-night sleep study to count the number of arousals,
and a day-time sleepiness assessment. Changes from wakefulness to deep sleep and from alertness to
drowsiness are reflected in many physiological signals and behavioural measures. Among the physiologi-
cal variables, the electroencephalogram (EEG) is one of the most relevant, but traditional methods, based
on visual assessment of this signal (for example, counting the number of micro-arousals during sleep)
are time-consuming or not reliable. Most of the changes in the EEG associated with the transition from
alertness to drowsiness and to sleep are in the frequency domain, and many attempts to computerise the
EEG analysis are based on frequency-domain methods.
1.1 Overview of thesis
The focus of this thesis will be on sleep disturbance (micro-arousals in OSA patients) and its effect on
day-time performance, as assessed with vigilance monitoring. Definitions of terms used in this thesis
and a description of the OSA disorder and its implications in society can be found in chapter 2. Clinical
background and a literature review on computerised methods are presented in chapter 3. Section 3.2.3
of that chapter shows that little research has been done on the computerised analysis of disturbed sleep.
Furthermore, there is no prior work on the computerised analysis of both sleep disturbance and vigilance
from the EEG. This thesis will describe the research undertaken in order to develop such a framework
using AR modelling for frequency-domain analysis and neural network methods for clustering and for
classification.
AR modelling theory and algorithms are the subject of chapter 4. Chapter 5 is a review of neural net-
work methods. Experiments carried out on AR modelling to find a compromise between the stationarity
requirements and the variance of the AR estimates are described in chapter 6, which also presents the
use of neural networks with sleep data to track the sleep-wake continuum from single-channel EEG. An
automated system is developed to detect micro-arousals in OSA sleep EEG, based on the neural network
outputs. Results, compared with an expert’s scores, show high sensitivity with a low number of false pos-
itives, and a good similarity in starting time and duration. A case study shows that the EEG may present
a mixed-frequency pattern during a micro-arousal, instead of a shift in frequency as usually described in
the literature.
In chapter 7 we explain the reasons why a different network is needed to map the alertness-drowsiness
continuum. Visualisation analysis of the EEG features revealed differences between Alertness and Drowsi-
ness in vigilance tests, with respect to Wakefulness and REM/Light Sleep in a sleep-promoting environ-
ment. Chapter 8 deals with the training of neural networks with vigilance data to track the alertness-
drowsiness continuum using single-channel EEG only. Finally, chapter 9 presents the results of the trained
network with data from OSA patients performing a visual attentional task. This study shows that OSA
subjects may present “drowsy” EEG while performing well. Also, the MLP analysis of the wake EEG with
these subjects shows that the transition to Drowsiness may occur progressively as well as in sudden dips.
Correlation of the MLP output with a measure of task performance and visualisation of EEG patterns in
feature space show that the alertness EEG patterns of OSA subjects may be more closely related to the
drowsiness EEG patterns of normal sleep-deprived subjects than to the alertness patterns of those subjects.
Chapter 2
Sleep and day-time sleepiness
2.1 Sleep, wakefulness, sleepiness and alertness
2.1.1 Definitions
Although the above words are part of almost everyone’s daily conversations we will define them in the
sense that they are to be used in this thesis. Sleep is a natural and periodic state of rest during which
consciousness of the world is suspended while its counterpart, wakefulness, is a periodic state during
which one is conscious and aware of the world [124]. Between sleep and wakefulness is the transitional
state of sleepiness [130], which has been defined as a physiological drive towards sleep [4], usually
resulting from sleep deprivation, or as a subjective feeling or state of sleep need. Wrongly used as a
synonym for wakefulness, alertness is the process of paying close and continuous attention, a state of
readiness to respond [124], an optimal activated state of the brain [115]. Vigilance, another word of
similar meaning, was first introduced in the literature by Head in 1923, who differentiated the stages of
awareness [64].
2.1.2 The process of falling asleep
Two theories try to explain how we fall asleep. Oswald in 1962 suggested that the fall of cerebral vigilance
does not occur as a steady decline but occurs briefly over and over again, with frequent surges of cerebral
vigilance to, once more, a high level. This depicts sleep onset as a punctuated rather than gradual
process. However, more and more evidence has been found recently using recordings of the brain’s
electrical activity and respiration signals that show a gradual oscillatory descent into sleep [6][130].
Sleepiness has an ultradian (> once/day) modulation, with three times of day when this condition is most
common: just after awakening in the morning, in mid-afternoon (the so-called “post-lunch dip”, which is
nevertheless not related to the ingestion of food) and just prior to sleep. The post-lunch dip correlates
with the occurrence of siestas and an increase in the incidence of automobile accidents [166][41][130].
The drive to sleep can be overridden by motivation, especially in life-threatening situations, but it cannot
be suppressed indefinitely [42].
2.1.3 Going on to a deeper sleep
Once the sleep state is reached, physical signs of this condition are lack of movement, reduced postural
muscle tone, closed eyes, lack of response to limited stimuli, and more regular and relaxed breathing,
usually accompanied by an increase in upper airway noise. During sleep, the eyes can move repeat-
edly and rapidly. This condition is called rapid eye movement (REM) sleep and is usually associated
with the act of dreaming. A normal subject follows cycles or periodic patterns of REM and non-REM
(NREM) sleep during the night, going from the wakefulness stage to the deep sleep stage and then to
REM sleep, for a time longer than 20 minutes but usually no more than 1 hour, descending again into a
deep sleep stage, and repeating the REM-NREM sleep 90-minute cycle for about 4 or 5 times (see Fig 2.2
in section 2.4.2)[155].
2.2 Breathing and sleep
2.2.1 Normal sleep
When a normal subject is awake, ventilation is controlled by two pathways, one driven by the brain-
stem respiratory control centre and the other by the cortex (see Fig. 2.1). The one which is controlled
by the brain-stem is a vagal reflex and is more related to oxygen and carbon dioxide concentration
control. During sleep, this respiratory centre remains active but the cortex drive disappears causing
regular breathing as well as a fall in ventilation and a rise in the CO2 concentration. The reduction
in muscular tone causes a similar effect. The intercostal muscles stop their breathing motion and the
tubular pharynx, which relies on tonic and phasic muscle activity to stay open, narrows as it and related
muscles lose tone. This pharyngeal narrowing increases the upper airway resistance. At the same time,
the loss in tone of the intercostal muscles increases the chest wall compliance, allowing the diaphragm to
elevate the rib-cage more easily. The overall effect is that the breathing looks more relaxed and the ratio
of abdominal contribution to rib-cage contribution decreases, at least in NREM sleep [155].
Figure 2.1: The human brain showing its main structures
The further reduction in tone experienced by the intercostal muscles during tonic REM sleep brings
another fall in ventilation followed by a recovery in phasic REM sleep, when the randomly excited cortex
is able to drive the breathing again, making it less regular. The abdominal contribution increases to a
higher level than when the subject is awake [155].
2.2.2 Obstructive Sleep Apnoea
An obstructive apnoea occurs when the airflow in the ventilation system stops for more than 10 s due
to an obstruction in the upper airways. A hypopnoea occurs when the normal flow is
reduced by 50% or more for more than 10s [91]. The number of apnoea and hypopnoea events per hour,
called the respiratory disturbance index (RDI) or apnoea/hypopnoea index (AHI), is used to determine
whether breathing patterns are normal or abnormal. Usually, an AHI of 5 or more is considered abnormal
[119].
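The AHI calculation defined above is simple enough to state directly. The sketch below is illustrative (the function names are not from any standard library); it computes events per hour of sleep and applies the conventional abnormality threshold of 5.

```python
# Apnoea/hypopnoea index (AHI): apnoea + hypopnoea events per hour of sleep.
# An AHI of 5 or more is conventionally considered abnormal.

def apnoea_hypopnoea_index(n_apnoeas, n_hypopnoeas, sleep_hours):
    """Events per hour of sleep (function name is illustrative)."""
    if sleep_hours <= 0:
        raise ValueError("sleep duration must be positive")
    return (n_apnoeas + n_hypopnoeas) / sleep_hours

def is_abnormal(ahi, threshold=5.0):
    return ahi >= threshold

ahi = apnoea_hypopnoea_index(n_apnoeas=28, n_hypopnoeas=14, sleep_hours=7.0)
print(f"AHI = {ahi:.1f}, abnormal: {is_abnormal(ahi)}")  # → AHI = 6.0, abnormal: True
```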
Some subjects develop a sleep disorder called Obstructive Sleep Apnoea (OSA) in which apnoea or hypop-
noea events occur when the upper airway, usually crowded by obesity, enlarged glands, or other kinds of
obstruction, collapses under the negative pressure created by inspiration as the muscles lose their tone.
Then, the subject increases his respiratory efforts gradually until the intrathoracic pressure drops to a
subatmospheric value. Only when the carbon dioxide level rises and the oxygen level falls enough to
awake the cortex respiratory mechanism, does the returning tone unblock the upper airways and restore
ventilation. Recently, some studies [155] have pointed out the possibility that the increase in respiratory
effort is responsible for the cortex arousal. Whatever the cause, this arousal is short in duration, sometimes
referred to in the literature as a “micro-arousal”1, and the patient is rarely conscious of it [45].
If the apnoea/hypopnoea event is followed by an overshoot of hyper-ventilation, then the threshold of
the carbon dioxide level to provoke spontaneous ventilation can fall, and the next apnoea will have a
period when no respiratory effort is being made [155].
Micro-arousals
An arousal is a mechanism of the organism to increase the level of alertness in order to respond more
effectively to danger, whether it be external or internal and whether actual or perceived. In terms of
sleep, arousal not only refers to waking up but also to a series of physiological changes in autonomic
balance (i.e. heart rate, blood pressure, skin potential) and brain cortex activity [45].
Arousals caused by an apnoea/hypopnoea event are a short duration response caused by an internal
stimulus. Their length can range from just 3 to 5 seconds up to 20 seconds [11], and they can be barely
noticeable or can end in a choking sensation or panic [45]. Fifteen or more micro-arousals per hour are
enough to diagnose OSA with confidence, but the number of arousals can be greater than 400 during the
night [155], and some studies found up to 100 per hour [45]. This fragmentation decreases the quality of
the sleep by diminishing the effective sleep time. Progressive sleepiness during daytime is a consequence,
starting with some loss of vigilance when the subject is performing a boring task, but soon leading him or
1The term micro-arousal was first introduced by Halasz in 1979 [60]
her to fall asleep while doing other activities such as reading, watching TV, sitting as a passenger in a car
or train or taking a bath. In the worst case, the subject may fall asleep while driving a machine at work or
a car, causing shunting accidents, and more serious crashes [155]. The deterioration in daytime function
correlates with the frequency of the micro-arousals rather than the extent of the reduced arterial oxygen
saturation [44].
Arousals can have causes other than obstructive sleep apnoea (OSA), for instance, ageing, leg movements,
pain, some forms of insomnia, but the most common cause is OSA [30]. OSA has a prevalence of 1-4% in
the overall population, 85% of the sufferers being males, and is highest in the 40-59 year age group, the
percentage of those affected rising to 4-8% [107] [44]. The problem usually arises in middle age, when
the muscles become less rigid and decreased activity leads to weight gain [155].
2.3 Daytime sleepiness
2.3.1 Causes
Sleep deprivation is one of the most common causes of sleepiness in our society. Studies on sleep de-
privation have found that a reduction in nocturnal sleep of as little as 1.3 to 1.5 hours per night results
in a reduction of daytime alertness by as much as 32% as measured by the multiple sleep latency test
(see section 2.4.1 for a description of this test)[20]. Physiological and psychological functions deteriorate
progressively over accumulating hours of sleep loss as well as over periods of fragmented sleep [35][97].
A second cause of sleepiness is OSA, the most common sleep disorder to cause day-time sleepiness,
even though the subjects affected by this disorder often report sleeping quite well [70] [107]. The
sleepiness of OSA sufferers is reflected in neuro-physiological impairment in originality, logical order
in visual scanning, recent memory, word fluency, flattening of affect in speech, and spatial orientation.
They become easily distracted by irrelevant stimuli, and have difficulties in ordering temporally changing
principles (card sorting or digit symbol substitution) [70]. It has been recommended that diagnosis of
OSA should not only depend on the AHI but also on functional sleepiness [107].
2.3.2 Sleepiness/fatigue related accidents
Fatigue and sleepiness are often used as synonyms. The term fatigue is also used to indicate the effects
of working too long, or taking too little rest, and being unable to sustain a certain level of performance
on a task [41]. Fatigue as well as sleepiness is related to motivation; the capability of performing a given
task; and past, cumulative day-by-day arrangements and durations of sleep and work periods [121].
Loss of performance usually means decreased ability to maintain visual vigilance and to have quick re-
actions, as well as to respond to unique, emergency-type situations. The loss of performance brought
by fatigue and sleepiness can be fatal when driving, piloting, monitoring air traffic control or radar or
when operating dangerous machinery. It appears that the incidence of sleepiness-related fatal crashes
may be as high as 40% of all the accidents on long stretches of motorway [41]. 20-25% of drivers having
motorway accidents appear to do so as a result of falling asleep at the wheel [107]. Sleepiness influences
people’s perception of risk [41]. Drivers do not always recognise the signs of fatigue/drowsiness or may
choose to ignore them [121]. Evidence has been found that lorry drivers on 11-hour hauls show in their
physiological signals increased signs of marked drowsiness during the last three hours of their drive [83].
Long-distance driving, youth and sleep restriction are frequently associated with sleep-related accidents
[128].
Sleepiness is the major complaint of shift-workers. Displaced hours of work are in conflict with the
basic biological principles regulating the timing of rest and activity (i.e. the circadian and homeostatic
regulatory systems) [4]. Sleepiness may be the cause of more than 2% of all the serious accidents in
industry [41].
2.3.3 Correlation between OSA and accidents
As OSA is one of the most common causes of day-time sleepiness [161], the link between this sleep
disorder and motorway accidents is obvious. OSA patients show a high dispersion in reaction times [79],
and evidence has been found that OSA impairs driving [59]. Recent polls have revealed that 24% of
OSA patients reported falling asleep at least once per week while driving [107], so it is not a surprise to
find that OSA sufferers have a 5 to 7 fold greater risk of road accidents than normal subjects. Long-haul
lorry drivers belong to the highest-risk group [107]. Lorry drivers with OSA have twice as many crashes
per mile driven as the normal group [121]. However, more recent studies have noted that increased
automobile accidents in OSA sufferers may be restricted to cases with severe apnoea (AHI > 40) [56].
2.4 Measuring the sleep-wake continuum
2.4.1 Measuring sleepiness
Many attempts to measure sleepiness/alertness have been made, and several scales are currently in
use. Subjective measures like the Stanford Sleepiness Scale (SSS), with 7 statements of feelings of
sleepiness from “wide awake” to “cannot stay awake” [130]; the Visual Analogue Scale (VAS), that
uses 10cm-lines anchored between the extremes of the states or moods under study [5][130]; and the
Activation-Deactivation Adjective Check List (ADACL), which consists of a series of adjectives describing
feelings at the moment and a four point scale – definitely feel, feel slightly, cannot decide and definitely
do not feel – have been used in a wide range of vigilance studies [130], in parallel with more objective
measures that provide means of verifying the subjective feelings of loss of alertness [5].
Several tests have been developed to provide an objective, repeatable quantification of sleepiness like
the multiple sleep latency test (MSLT) [27][140], that places the subject in a sleep promoting situation
and measures the latency to onset of sleep. In the MSLT, subjects in a sleep-promoting environment are
instructed to try to fall asleep while other similar tests differ in the instructions, like the maintenance of
wakefulness test (MWT) [111][43], which instructs the subject to resist sleep.
Loss of alertness or sleepiness has been related to diminished response capability, in which a decrease in
performance will indicate the presence of this condition. Therefore, quantifiable behavioural responses
or performance measures have been used also as objective ways to measure vigilance. The most popular
ones are reaction time, tracking error and stimulus detection error. The use of vigilance tasks to measure
sleepiness has the problem that the tasks are intrusive with respect to the natural process of sleepiness
[130]. Task complexity and knowledge of results (feedback) can mitigate the effects of sleep loss [42].
Other non-task related factors that affect the process are motivation, distraction and comprehension of
instructions [130][35][42].
Physiological measures
As a result of the degree of isomorphism between physiological and behavioural systems, diminished
response capabilities associated with sleepiness will be reflected in distinctive variations in physiological
measures. Below is a list of some of the changes in physiological variables associated with sleepiness
[130][118]:
• slower, more periodic breathing,
• decrease in cardiovascular activity (heart rate, blood pressure),
• decreased eye blinks and increased slow eye movements,
• decreased but variable skin conductance responses,
• decreased body temperature,
• electroencephalogram (EEG) changes in amplitude, frequency and patterning.
2.4.2 Measuring sleep
Loomis and collaborators first showed in 1937 that sleep is not a uniform or steady state, and they
therefore classified sleep in stages [96]. Following this, sleep classification was further refined until, in
1968, a committee chaired by Rechtschaffen and Kales (R & K) compiled a set of rules that soon became
the standard in sleep staging [136]. From wakefulness or REM sleep to deep sleep, R & K analysis distinguishes
four intermediate stages for NREM sleep (see Fig. 2.2 for a typical all-night sleep classification plot which
is known as a hypnogram). Visual assessment of the subject is not enough for the characterisation of these
stages. Physiologically, the EEG, the electromyogram (EMG) and the electrooculogram (EOG) provide a
higher level of quantification in the description of the different sleep stages. Measures of sleepiness based
on the EEG and details of the sleep stages are given in chapter 3.
[Figure: hypnogram with sleep stage (Awake, REM, 1-4) on the vertical axis against hours of sleep (1-8) on the horizontal axis]
Figure 2.2: A conventional all night sleep classification plot from one normal subject
Chapter 3
Previous work on EEG monitoring for micro-arousals and day-time vigilance
As we have seen in chapter 2, many physiological processes change at the time of sleep onset. Monitoring
these changes provides means of detecting arousals during sleep for OSA diagnosis (see section 2.2.2)
and day-time sleepiness. However, the organ that shows the clearest changes during sleep and from
alertness to sleepiness is the brain. Not only is the brain the organ that contains the mechanisms for
sleeping and being awake, its electrical activity is relatively easy to monitor and reflects the changes in
the sleep/wake continuum [69].
3.1 The EEG
The electroencephalogram or EEG is a graphical record of the electrical activity of the brain which was
first measured non-invasively in humans and described in 1929 by Hans Berger1. It can be measured with
electrodes located near, on or within the cortex. Depending on the location of the recording electrodes
the EEG can be called scalp EEG, cortical EEG or depth EEG. The first one is recorded with electrodes
placed on the scalp while the last two refer to electrodes in contact with the brain cortex [145]. From
now on we will use EEG to mean scalp EEG.
1The first recording of the electrical activity of the brain was made by Caton in 1875 using rabbits, monkeys and other small animals [28]
3.1.1 Origin of the brain electrical activity
The human nervous system is responsible for taking the information from internal and external or en-
vironmental changes, analysing it and acting upon it in order to preserve the integrity, well-being, and
status quo of the organism. Its most prominent and important organ is the brain (see Fig. 2.1). The hu-
man brain contains approximately 10⁹ nerve cells or neurons interconnected in a very intricate network
within which the information is transmitted by electro-chemical impulses [51]. Most neurons consist of
a cell body, or soma, with several receiving processes, or dendrites; the soma prolongs into a nerve fibre,
or axon, that branches at the other end (see Fig. 3.1).
Figure 3.1: A simplified neuron
As in any other cell in the human body, there is an electrical potential difference between the inner and
the outer side of the neuron. This potential, called the resting potential, is due to differences in extracel-
lular and intracellular ion concentration, maintained by the cell membrane structure and ion pumping
mechanisms. Neurons can respond to stimuli strong enough to initiate a series of charge changes that
leads to membrane depolarisation and reverse polarisation, which reaches a peak and repolarises back
to the resting potential. This sudden activity resembles a spike in shape and is called an action potential.
Typically it has a peak-to-peak amplitude of 90 mV and a duration of 1 ms.
Neurons also interact with each other by chemical secretions in the dendrite-axon gaps (synapses) be-
tween them. The action potential in the pre-synaptic neuron (transmitting neuron) travels from the soma
along the axon. When it reaches the end it releases a chemical neurotransmitter at the axon terminals,
which are very close to the dendrites of other neurons. Then the post-synaptic neuron (receiving neuron)
receptors for this chemical release ions inside the cell that change the membrane polarisation, originating
a post-synaptic potential. Post-synaptic potentials are much lower in amplitude than the action potentials,
but they last much longer (15 - 200 ms or more) and the extracellular current flow associated with them
is much more widely distributed than that corresponding to action potentials. It has been estimated that
one neuron can influence up to 5000 of its neighbours. For these reasons it is believed that the EEG re-
flects the summation of post-synaptic potentials of the pyramidal cells rather than the spatial summation
of individual action potentials [135] [145] [125]. Pyramidal cells are neurons located very close and
perpendicularly to the cortex surface, so the ion current flow generates electrical potential changes that
are maximum in the plane parallel to the cortex [135].
If post-synaptic potentials coming from the dendrites of one neuron, summed in time and space, exceed
a certain threshold, the soma generates a new nerve impulse, an action potential, that is then transmit-
ted to the neurons at the end of its axon, passing in this way the stimulus response from one neuron to
another [51] [135]. Post-synaptic potentials can be of varied peak amplitude but, in general, a single one
is not enough to trigger the action potential [135]. Because of their chemical origin, the potentials gen-
erated in the brain are very limited in amplitude and the ionic currents travel slower (1ms per synapse)
than currents in metals. The axon membrane is not a perfect insulator, some extracellular current flows
and diffuses the information around the neuron, speeding up the signal transmission [135] [110]. The
cerebro-spinal fluid and the dura membrane act as strong attenuators for the EEG, with the scalp itself
having less effect. EEG waves seen at the scalp, therefore, represent a kind of a “spatial average” of
electrical activity from a limited area of the cortex [125].
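The summation-and-threshold mechanism described above can be illustrated with a toy sketch. This is not a physiological model from the thesis: the 15 mV threshold and the PSP amplitudes are made-up illustrative numbers, chosen only to show that a single post-synaptic potential is too small to fire the soma while several summed together can be.

```python
# Toy sketch of spatial/temporal summation at the soma: an action potential
# fires only if the summed post-synaptic potentials (PSPs) reach a threshold.

def soma_fires(psp_amplitudes_mv, threshold_mv=15.0):
    """Fire iff the summed PSPs reach threshold (all values illustrative)."""
    return sum(psp_amplitudes_mv) >= threshold_mv

print(soma_fires([3.0]))                  # → False: one PSP cannot trigger a spike
print(soma_fires([3.0, 4.0, 5.0, 4.5]))  # → True: summed inputs cross the threshold
```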
3.1.2 Description of the EEG
The EEG is a very complex quasi-rhythmical spatio-temporal signal within a time-frequency band of 0.1
- 100 Hz and an amplitude of the order of hundreds of microvolts at the scalp [135]. The effective
frequency range is 0.5 - 50 Hz and is divided for clinical purposes into the following main bands in which
the power of the signal is concentrated [87][69][24]:
1. Delta (δ) activity: [0.5 - 3.5] Hz2
2δ rhythm is limited to the [0.5 - 2) Hz range in sleep studies
2. Theta (θ) activity: [4 - 8) Hz
3. Alpha (α) activity: [8 -13] Hz
4. Beta (β) activity: [15 - 25] Hz
5. Gamma (γ) activity: [30-50] Hz
with “]” meaning “inclusive” and “)” meaning “exclusive”.
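The band convention above, including the inclusive/exclusive boundaries and the gaps between some bands, can be encoded directly. The sketch below is illustrative (names and structure are not from the thesis); frequencies falling in an inter-band gap return None.

```python
# Clinical EEG bands listed above: (name, low edge, high edge, high-inclusive).
# "]" = inclusive, ")" = exclusive; low edges are all inclusive.
BANDS = [
    ("delta", 0.5, 3.5, True),    # [0.5 - 3.5] Hz
    ("theta", 4.0, 8.0, False),   # [4 - 8) Hz
    ("alpha", 8.0, 13.0, True),   # [8 - 13] Hz
    ("beta", 15.0, 25.0, True),   # [15 - 25] Hz
    ("gamma", 30.0, 50.0, True),  # [30 - 50] Hz
]

def eeg_band(freq_hz):
    """Return the band name for a frequency, or None in a gap/outside the range."""
    for name, lo, hi, hi_inclusive in BANDS:
        if lo <= freq_hz and (freq_hz <= hi if hi_inclusive else freq_hz < hi):
            return name
    return None

print(eeg_band(10.0))  # → alpha
print(eeg_band(8.0))   # → alpha (8 Hz is excluded from theta but included in alpha)
print(eeg_band(14.0))  # → None  (gap between the alpha and beta bands)
```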
EEG records are sometimes described as just “slow” or “fast” if the dominant frequency is below or above
the α band. The amplitude of the waves tends to drop as the frequency increases. Although there are
indications of several sources of rhythmical activity in the brain, their role in the generation of the EEG
rhythms is not yet fully understood [145] [125]. Clear oscillatory behaviours in the nervous system occur
in various situations, like in rhythmic motor functions (chewing, swimming) as well as in pathological
conditions (clonic muscular jerking, rhythmic eye blinks), but most of them serve unknown functions.
Some may be related to biological clocks or establishing windows of time during which information flows
[125]. The bands described above correspond to the main frequencies of these physiological pacemakers.
These frequencies do not tend to overlap with the frequency content of the neighbouring bands, hence
the gaps between some of the bands.
It has been suggested that the distributed, but related, cortical γ activity in the forebrain provides the
physiological basis for focused attention that links input to output, i.e. relating voluntary effort and/or
sensory input to a calling up and operation of a sequence of movements or thoughts. This form of
attention occurs normally during wakefulness, but can also be present during disordered sleep, in patients
who talk or walk during sleep [24]. Activity over 50 Hz is not considered of clinical value in scalp EEG
because it is mostly masked by background noise. Apart from the background rhythmical activity, there
are other components in the EEG of transient nature, usually described in terms of their duration and
waveform. For instance, a monophasic wave of less than 80ms duration is called a spike, while one of
80-200ms is called a sharp wave. Other transient forms are the spindles and K-complexes (see Fig. 3.4 later
in this chapter). All EEG components fluctuate spontaneously, in response to stimuli, or as a consequence
of changes in the subject’s state of mind (i.e. sleep/wake control and psychoaffective status) and brain
metabolic status. They can also be changed by the use of drugs or by traumas or pathological conditions
[87] [85].
EEG patterns are different from one individual to another. Factors like gender, early stimuli, minor or
major brain damage, etc. can affect the development of the EEG. Once a subject reaches adulthood, their
EEG characteristics “stabilize” over time. This means that the EEG patterns for different conditions such
as eyes open, eyes closed, auditory stimulation and task performance remain remarkably similar for the
same individual as their age increases [85].
3.1.3 Recording the EEG
The EEG is recorded by amplifying the potential differences between two electrodes located on the scalp.
An electrode is a liquid-metal junction used to make the connection between the conducting fluid of the
tissue in which the electrical activity is generated and the input circuit of the amplifier [135]. The most
commonly used system for the placement of the electrodes is the so-called “10-20 International System
of Electrode Placement”, [76], represented in Figure 3.2. An orderly array of EEG channels constitutes
a montage. When all the channels are referenced to the same electrode (usually the mastoid processes, A1
for the left side of the scalp and A2 for the right side, or a common site located at the nose
or at the chin) the montage is called “referential”. If all the channels represent the difference potential
between two consecutive electrodes on the scalp, the montage is said to be “bipolar” [145].
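The arithmetic relating the two montage types can be sketched briefly. A bipolar channel is the difference between two neighbouring electrodes, so the common reference cancels: (F3 − ref) − (C3 − ref) = F3 − C3. The electrode chain and sample values below are illustrative, not data from the thesis.

```python
# Deriving a bipolar montage from referential recordings.
chain = ["Fp1", "F3", "C3", "P3", "O1"]  # illustrative left parasagittal chain

# Toy referential recordings: a few samples per electrode vs a common reference.
referential = {
    "Fp1": [10.0, 12.0, 11.0],
    "F3":  [8.0, 9.0, 7.5],
    "C3":  [5.0, 6.0, 5.5],
    "P3":  [3.0, 2.5, 3.5],
    "O1":  [1.0, 1.5, 0.5],
}

def bipolar_montage(referential, chain):
    """Difference between consecutive electrodes; the common reference cancels."""
    return {
        f"{a}-{b}": [x - y for x, y in zip(referential[a], referential[b])]
        for a, b in zip(chain, chain[1:])
    }

bipolar = bipolar_montage(referential, chain)
print(list(bipolar))     # → ['Fp1-F3', 'F3-C3', 'C3-P3', 'P3-O1']
print(bipolar["F3-C3"])  # → [3.0, 3.0, 2.0]
```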
The EEG signal is traditionally recorded on paper, or, more commonly now, electronically. It is subse-
quently analysed in order to extract useful information about the physiology or pathology of the brain.
This analysis is usually done by an expert by visual inspection of the signal.
3.1.4 Extracerebral potentials
In addition to the EEG, the scalp electrodes can also pick up other signals whose sources are not in the
brain, but are near or strong enough to interfere with its electrical activity. These signals can totally
[Figure 3.2 shows the electrode positions viewed from above, with the nasion at the front and the inion at the back. The labels denote: Fp1,2 pre-frontal; F3,4 frontal; F7,8 anterior temporal; Fz frontal mid-line; C3,4 central; Cz central vertex; T3,4 mid-temporal; T5,6 posterior temporal; P3,4 parietal; Pz parietal mid-line; O1,2 occipital; A1,2 mastoid.]
Figure 3.2: The 10-20 International System of Electrode Placement
obscure the EEG, making the recording uninterpretable. They can subtly mimic normal EEG activity or
distort normal activity, leading to misinterpretation [23]. Although called artefacts (or artifacts) they do
not always come from man-made devices. The main sources of artefacts are:
1. The recording instrument
2. The interface between the recording instruments and the scalp
3. Extraneous environmental sources
4. Other bio-electrical signals that do not originate from the brain and are not of interest in this
context, and can therefore be considered to be unwanted influences.
Muscle and heart activity as well as eye and tongue movements are among the bio-electrical signals
which, in this context, are considered artefacts because they obscure the EEG. They are classified as:
1. Electrocardiographic (ECG) signals and signals due to breathing
2. Electrooculographic (EOG) signals (signals due to eye movement)
3. Glossokinetic signals (signals from the movement of the tongue)
4. Electromyographic (EMG) signals (signals induced by muscle activity)
5. Electrodermal signals due to altered tissue impedance (see above).
Usually the above influences appear within the EEG frequency range and consequently cannot be elim-
inated by filtering. If the interference renders the EEG useless, then the affected sections of EEG are
ignored, unless the presence of the interfering signal gives important information about the brain status
as is sometimes the case in visual scoring to determine alertness or in visual sleep staging.
Artefacts during sleep
Blink artefacts occur only during wakefulness, or in combination with slow eye movements during
drowsiness. Rapid eye movements (REM) are seen during waking but are characteristic of the “dreaming”
sleep stage that was named after them. For EEG recorded with the reference electrode positioned on
the opposite side of the body, vertical eye movements affect mostly the frontopolar sites (Fp1 and Fp2
electrodes), with an exponential decrease of the effect towards the occipital sites, while, for horizontal
eye movements, the maximum effect is found at the frontotemporal sites. EMG artefacts are uniformly
distributed within REM sleep, but are concentrated at the beginning and the end of non-REM sleep
periods. As expected, the deeper the sleep stage the lower the EMG activity, although REM sleep is
marked by skeletal muscle atonia. ECG artefacts may or may not be present during sleep as they do not
depend on the non-REM sleep stage. Phasic electrodermal artefacts can occur upon sudden arousal from
light sleep stages. Chest movements due to respiration may induce head movements that compress some
of the electrodes against the pillow, resulting in slow potential shifts in them [9].
The best way of dealing with artefacts is avoiding or minimising their occurrence during the recording
[23] [9]. When this is not possible (e.g. after the recording) other alternatives like digital filtering may be
applied. However, digital low-pass filtering for reducing muscle and mains artefacts, or high-pass filtering
for reducing sweating and respiration artefacts may severely distort both the EEG and the artefact signal.
EMG artefacts may resemble cerebral activity after filtering (mostly β activity, but also epileptic spikes and
rhythmic α activity). The last alternative is to reject EEG segments contaminated with artefacts [9]. This
is performed in most sleep laboratories by visual inspection, but some automatic detection can also be
performed, like out-of-range checks, and lately using some more sophisticated methods of identification
based on artefact-free models.
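As an illustration of the out-of-range checks mentioned above, a minimal sketch follows; the 300 μV limit and 1 s epoch length are hypothetical choices, and a real system would tune both to the recording.

```python
import numpy as np

def reject_out_of_range(eeg, fs, epoch_s=1.0, limit_uv=300.0):
    """Flag epochs whose peak absolute amplitude exceeds a plausibility
    limit (in microvolts). Returns True for epochs to keep."""
    n = int(epoch_s * fs)
    n_epochs = len(eeg) // n
    keep = np.empty(n_epochs, dtype=bool)
    for i in range(n_epochs):
        keep[i] = np.max(np.abs(eeg[i * n:(i + 1) * n])) < limit_uv
    return keep

np.random.seed(0)
fs = 100
eeg = 50.0 * np.random.randn(3 * fs)   # toy background EEG, ~50 uV rms
eeg[150] = 1000.0                      # electrode-pop-like spike in epoch 1
keep = reject_out_of_range(eeg, fs)    # epoch 1 is rejected
```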
3.2 Analysis of the EEG during sleep
The R & K [136] technique for sleep scoring has become the gold standard throughout the world since
its publication in 1968. The scoring is based on the recording of several physiological signals, called the
polysomnograph (PSG). Typically a PSG record consists of 5 to 11 signals, including two EEG channels, one
mentalis-submentalis (chin) EMG channel, one or two EOG channels and one ECG channel (see Fig. 3.3). In
a clinical study to detect sleep related breathing disorders, special transducers are used to include nasal-
oral airflow, respiratory effort recorded both at the level of the chest and the abdomen, and oximetry
(oxygen saturation levels). When the number of channels is restricted to one, one of the EEG channels
C4 − A1, C3 − A2 or Cz − Oz is recommended for a single-channel EEG recording. Paper or magnetic
tape used to be the outputs of a PSG device, but are nowadays replaced by digital storage and display of
the digital PSGs [84].
Figure 3.3: Conventional electrode positions for monitoring sleep (EMG, EEG electrode C4, right and left EOG electrodes, with separate reference electrodes for the EEG and for the EOGs).
A description of the R & K rules for sleep scoring is given briefly in Table 3.1 and Fig. 3.4, and in more
detail in the following section.
3.2.1 Changes in the EEG from alert wakefulness to deep sleep
During wakefulness, the EEG waves of an adult show a low-amplitude, high-frequency and apparently random
characteristic, generally contaminated with muscular activity from the temporal or other skeletal muscles.
When the subject closes his eyes and relaxes or when he becomes drowsy, there is usually a reduction
in any muscle and eye movement potentials, plus an increase in the EEG α activity. The slow (< 1 Hz)
rolling of the eyes upwards and the shutting and opening of the eyelids a few times are also signs of
drowsiness [69].
As the subject becomes more drowsy, the α rhythm may be interrupted by periods of relatively low voltage
during which slow lateral eye movements often occur. The slightest stimulus during these periods of
low voltage EEG activity will cause immediate reappearance of the α rhythm. Note that this indicates
not drowsiness but an increase in alertness, for which reason this is called a paradoxical α response.
Alternating periods of low voltage activity and of higher voltage α activity occur for a few minutes,
with the duration of the former progressively increasing until the latter no longer appears, along with a
progressive increase in θ activity indicating that the subject is lightly asleep. Stimuli insufficient to cause
arousal may be strong enough to produce an electronegative sharp wave at the top of the head or vertex
(V-wave). This is defined by R & K as Sleep Stage 1 [87].
Stage 2 is characterised by the appearance of sleep spindles, short (0.5-3s) bursts of 12-14Hz activity con-
sisting of approximately 6-25 complete waves, as well as K-complexes. A K-complex is a large amplitude
biphasic wave of approximately one second duration, maximal at the vertex. K-complexes can have two
different origins. One is as a response to an external stimulus (e.g. a noise) and the other is as an early
manifestation of the slow waves typical of deeper sleep stages [155].
The other two stages, Sleep Stage 3 and Sleep Stage 4, are well distinguished from the rest by the appear-
ance of high amplitude (≥ 75μVpp) δ waves. This feature gives to these stages the name of slow wave
sleep (SWS) or δ sleep. The difference between the two stages lies in the percentage of this slow pattern
in the analysed segment: from 20% to 50% for stage 3 and greater than 50% for stage 4 [155].
Figure 3.4: Sleep EEG stages (taken from [69])
Figure 3.4 shows typical EEG segments for each sleep stage.
So-called REM sleep can be divided into three phases. The first phase is characterised by the decrease
or even total disappearance of the EMG activity, which has already experienced a decline in going from
wakefulness to deep NREM sleep. After a few minutes the slow waves, spindles and K-complexes in the
EEG are replaced by rapid, low-amplitude waves, as in wakefulness or in the first sleep stage, with the
exception that the α mode does not dominate the EEG. This is the second phase, which only lasts a few
minutes, giving way to the third phase that comes with a burst of rapid eye movements, spikes in EMG
and sometimes visible twitching of the limbs. When REM sleep has a high density of these bursts of eye
movements it is known as phasic REM, while a low density type of REM sleep has received the name of
tonic REM. Tonic REM typically occurs at the beginning of the night whilst phasic REM is usually found
late at night. REM and non-REM periods alternate on a 90-minute cycle through the night, although the
duration of REM increases across the night.
Sleep stage     Characteristics
Wakefulness     Low amplitude, high frequency EEG activity (β and α activity); EEG sometimes with EMG artefact
Sleep stage 1   Increased θ activity; slow eye movements (SEM); vertex sharp waves; transition stage that lasts only a few minutes
Sleep stage 2   EEG presents spindles (bursts of α activity) and K-complexes
Sleep stage 3   EEG with high amplitude, low frequency activity; δ activity appears
Sleep stage 4   EEG is dominated by δ activity
REM sleep       EEG presents high frequency, low amplitude waves; EMG generally inhibited; bursts of rapid eye movements (REM) also appear together with spikes in the EMG

Table 3.1: The Rechtschaffen and Kales standard for sleep scoring.
The hypnogram shown in Fig. 2.2, section 2.4.2, illustrates the transitions between sleep stages, the main
features of the NREM-REM cycle, and the proportion of each stage found in a young adult. Sleep stage 1,
being a transitional stage between wakefulness or drowsiness and true sleep (stage 2 or deeper), usually
occupies only 5% of the night. The bulk of human sleep, around 45% of it, is made up of stage 2. Stage
3, another transitional phase, constitutes only about 7% of the sleep, while stage 4 makes up about 13%.
The rest of the total sleep time (20%–30%) is taken up by REM sleep [69].
3.2.2 Visual scoring method
The transition from fully alert wakefulness to deep sleep is a gradual process and it would be very difficult
to determine what the level of sleep is at any moment without dividing the PSG record into epochs of
duration which may be anything from 10s to 2min. The standardised use of 15mm/s and 10mm/s as the
PSG paper speed made the use of 20s or 30s epochs quite convenient, as each epoch is then one page
long. The scorer uses the R & K set of rules to determine the sleep stage per epoch, regardless of the
level of sleep in the previous record or in subsequent records. Eight hours of sleep produce about 400m
of paper. If the record is segmented in 20-30s epochs, this gives approximately 1000-1400 epochs to be
scored visually, which takes an experienced technician over 2 hours, or more if the record has transient
pathological events [155].
Limitations of the R & K scoring rules
In spite of being widely used, the R & K rules have never been appropriately validated, and they were
never designed for scoring pathological sleep [66]. They suffer from major limitations such as:
• the 6-value discrete scale to represent a process that is essentially continuous,
• the 20-30s time scale offering a very poor resolution, so that transient events shorter than this are
missed,
• the bias introduced due to the failure to address non-sleep related individual variability of charac-
teristics such as α rhythm,
• the failure to address important sleep/wake related physiological processes such as respiratory and
cardiovascular processes and corresponding disorders.
The rules therefore have to be adapted and extended. In the 30 years since their publication, the methods
of analysis have changed with the advent of the personal computer. The task group on Signal Analysis of the European
Community Concerted Action “Methodology for the analysis of the sleep-wakefulness continuum” [12],
generated guidelines for a computer-based sleep analyser that would overcome the limitations of the
manual standard scoring and the R & K standard set of rules. They proposed a 1s time resolution, and
the tracking of the NREM sleep/wake process along a continuous scale with the 0% level corresponding to
wakefulness and the 100% level corresponding to the deepest SWS, as well as an on/off output indicating
REM sleep. They felt that quantification of REM/NREM should be based only on EEG, EOG and chin-
EMG in order to avoid bias between the inter-individual and intra-individual non-sleep-related differences
such as α rhythm, vertex and sawtooth3 waves and slow eye movements. They also considered additional
outputs to complement the REM/NREM sleep/wake process, such as a micro-arousal on/off output.
3.2.3 Computerised analysis of the sleep EEG
The need for automatic classification has been widely recognised [12]. Attempts to classify sleep EEG
automatically were made soon after the release of the R & K rules for manual scoring [150]. Different
approaches have been developed, most of them trying to emulate the R & K standard, with or without
overcoming its limitations. Many of them include the analysis of several PSG signals, like the EEG, EOG
and EMG [150] [151] [55] [133] [86] [152] [57] [132] [143] [68], and cardiorespiratory signals [54]
[92].
A computerised classification system is usually fed with artefact-free PSG signal segments of fixed or
adaptive length (typical range 1s-30s), or has an artefact marking/rejection procedure prior to the anal-
ysis block. Sometimes the EEG is the only input signal used [94] [72] [75] [78] [147] [160] [123] [16].
A number of features are then extracted using one or more of several methods (time domain, frequency
domain, non-linear dynamics, etc). A classification block combines the features and estimates the sleep
level using a set of rules (decision tree, linear discriminant, fuzzy logic), or an equivalent procedure
(neural networks).
Most of the approaches to classification in the past have used either period analysis (time domain) [22]
[94] [149] [133] [67] [13] [92] [48] [68] [167] or spectral analysis (frequency domain) [164] [57]
[132] [143] [78] [99] [16] [158]. Alternatively other techniques have been introduced, such as wavelet
analysis, autoregressive (AR) modelling [55] [75] [72] [86] [74] [123], principal component analysis
(PCA) [78] and more recently, nonlinear dynamic analysis [2] [52] [137]. The parameters, or features,
obtained have been combined in many different ways to yield classification. One of the most used is
the knowledge-based approach [150] [94] [55] [72] [132] [57] [92] [68]. A Markov-chain maximum
3 Sawtooth waves (or λ-like waves), random electropositive waves of 20μV amplitude or less, normally related to visual activity while scanning a picture, are sometimes seen while the subject is in REM/light sleep.
likelihood model was developed in 1987 by Kemp and collaborators [86], and cluster analysis has also
been investigated [75], with neural network techniques joining the list in the last decade [13] [143]
[147] [160] [123] [16] [158].
Most of these systems show a reasonable discrimination for sleep stage 2 and slow-wave sleep, but all of
them are poor at discriminating REM from wake and stage 1. Holzmann et al. found a high percentage
of disagreement in light sleep scoring when experts revisited their own scoring (intra-rater) [68]. Many authors
use EOG and/or EMG to help with the identification of REM [126]. The percentage of agreement with visual
scorers varies between 67% and 90% with a typical value of 83% in artefact-free segments. Some studies
aggregate sleep stages 3 and 4 together and that elevates the percentage of agreement to over 90%
[72] [143]. Another problem, probably inherited from the R & K set of rules that most of the systems
try to emulate, is that classification systems work almost perfectly in healthy subjects but do not work
sufficiently well in sleep-disturbed patients [152] [155] [126]. If an automated system is able to give the
same level of intra-rater and inter-rater agreements as the clinical experts manage (usually about 86% for
the inter-rater agreement, and 91% for intra-rater agreement) then it can be said to be of use for clinical
purposes. Although several commercially available systems can perform sleep staging, visual scoring is
the only reliable method available at the moment when scoring disrupted sleep EEG [112]. It is clear
that computerised analysis cannot fully replace expert opinion, therefore results of an automatic scoring
system require inspection by a trained polysomnographer [126].
Time-domain analysis
Visual analysis of the EEG is based on the identification of patterns and the assessment of mean amplitude
and dominant frequency. Therefore, time-domain measures of the EEG have been among numerous
features used in computerised analysis. Zero-crossing and maximum peak-to-peak amplitude are the
most popular of this kind of descriptors [94] [133]. The zero-crossing count is the number of times
that the signal crosses the base-line and is related to the EEG mean frequency. In 1973, Hjorth [67]
presented several EEG descriptors calculated directly from the time series as an alternative to frequency
analysis. The descriptors are calculated from the derivatives of the EEG signal and have a correspondence
to spectral descriptors. Hjorth descriptors, which measure the standard deviation of the signal (activity),
and ratios of the standard deviations of the signal and its first two derivatives (mobility and complexity)
have often been used in the analysis of the sleep EEG [92] [47]. The signal is band-pass filtered prior
to the calculation of time-domain descriptors which are then used to detect particular types of sleep
patterns [133]. Bankman et al. [13] added measures of slope to the already mentioned time-domain
features for K-complex detection. More recently, Uchida and co-workers [168] have investigated the use
of histogram methods of waveform recognition in sleep EEG. The method, which measures the period
and amplitude of a wave, has the advantage of detecting the frequency, amplitude and duration of single
and superimposed waves.
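Hjorth's descriptors can be computed directly from the sampled signal. A minimal sketch, using the common variance-based formulation with derivatives approximated by first differences:

```python
import numpy as np

def hjorth(x):
    """Hjorth time-domain descriptors of a 1-D signal.
    activity   : variance of the signal
    mobility   : std(dx)/std(x), an estimate of mean frequency
    complexity : mobility(dx)/mobility(x), a measure of bandwidth"""
    dx = np.diff(x)
    ddx = np.diff(dx)
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / np.var(x))
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity

fs = 128
t = np.arange(4 * fs) / fs
slow = np.sin(2 * np.pi * 2 * t)    # delta-range tone
fast = np.sin(2 * np.pi * 12 * t)   # spindle-range tone
# mobility tracks the dominant frequency: higher for the faster tone
```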
Uchida and collaborators used period-amplitude analysis of the sleep EEG and compared their analysis
with spectral methods. They found that for some frequency bands the time-domain method does not
detect the waves while the Fourier transform methods perform very well over the entire EEG frequency
range [167]. In contrast, Holzmann et al. found that a zero-crossing strategy gives better results for
slow-wave (< 2Hz and > 75μVpp) detection than Fourier transform methods [68].
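The zero-crossing strategy for slow-wave detection can be sketched as follows. This is a generic illustration of the idea, with half-wave duration and amplitude thresholds matching the < 2Hz, > 75μVpp criterion above, not Holzmann et al.'s actual algorithm; the peak-to-peak excursion is approximated as twice the half-wave peak.

```python
import numpy as np

def slow_waves_zero_crossing(eeg, fs, f_max=2.0, min_pp=75.0):
    """Detect slow waves from the intervals between zero crossings.
    A half-wave counts if it is long enough (frequency below f_max Hz)
    and its approximate peak-to-peak amplitude (twice the half-wave
    peak) exceeds min_pp microvolts. Returns (start, end) index pairs."""
    zc = np.where(np.diff(np.signbit(eeg)))[0]       # zero-crossing indices
    waves = []
    for a, b in zip(zc[:-1], zc[1:]):
        duration = (b - a) / fs                      # half-wave duration, s
        peak = np.max(np.abs(eeg[a:b + 1]))
        if duration >= 1.0 / (2 * f_max) and 2 * peak >= min_pp:
            waves.append((a, b))
    return waves

fs = 100
t = np.arange(2 * fs) / fs
delta = 60.0 * np.sin(2 * np.pi * 1 * t)   # 1 Hz, 120 uVpp: qualifies
waves = slow_waves_zero_crossing(delta, fs)
```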
Frequency-domain analysis (bank of filters, FFT, AR modelling)
Given that the sleep process involves gradual shifts in the EEG dominant frequency (see section 3.2.1),
the power spectrum of the signal conveys useful information. There are several approaches to estimate
the power spectrum of a stationary signal ([63] pp.147-8). Although the EEG is non-stationary, it may
be considered piece-wise stationary (for a detailed description of this issue, see chapter 4). The most
popular methods for estimating the power spectrum are the Fourier transform and AR modelling. A bank
of band-pass filters is another popular approach to obtain the power of the EEG frequency bands (see
section 3.1.2) [150].
The Fourier transform has been in use for nearly six decades in the spectral analysis of sleep EEG [89].
Its use increased dramatically in the 60’s with the development of the Fast Fourier Transform algorithm
(FFT) [34], which speeds the calculation up by a factor of N²/(N log N), with optimum performance when N is a
power of two4. It has a disadvantage when N is small, as the variance of the spectrum estimate is high
for low values of N , but this can be improved by the use of smoothing windows and averaging. Another
disadvantage is that the Fourier transform is only calculated for discrete values of frequency, multiples
of fs/N , where fs is the sampling frequency. Features can be taken directly from the power density
spectrum, as coefficients for frequencies with the highest variance in the sleep continuum [158], or as
peak-frequencies, but the practical norm is to calculate the power (absolute or relative) accumulated in
the EEG bands [57] [132] [143] [99] [16]. PCA has also been applied to the spectrum coefficients to
find out which are the most significant [78].
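The practical norm described above, accumulating power in the classical EEG bands, can be sketched with a simple periodogram. The band edges follow the conventional δ/θ/α/β ranges; exact limits vary between laboratories.

```python
import numpy as np

BANDS = {"delta": (0.5, 4.0), "theta": (4.0, 8.0),
         "alpha": (8.0, 13.0), "beta": (13.0, 30.0)}

def band_powers(eeg, fs):
    """Relative power per EEG band from a simple periodogram."""
    x = eeg - np.mean(eeg)
    psd = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    total = np.sum(psd[(freqs >= 0.5) & (freqs < 30.0)])
    return {name: np.sum(psd[(freqs >= lo) & (freqs < hi)]) / total
            for name, (lo, hi) in BANDS.items()}

fs = 128
t = np.arange(4 * fs) / fs
alpha_eeg = np.sin(2 * np.pi * 10 * t)   # dominant 10 Hz component
powers = band_powers(alpha_eeg, fs)
# powers["alpha"] is close to 1 for this pure 10 Hz tone
```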
AR modelling offers a more interesting alternative to FFT methods for power density spectrum estimation.
It yields a lower variance estimate if the model order is kept low, and is continuous in frequency. It
combines the versatility of picking up broad band signals and pure tones with relatively high accuracy,
which makes it suitable for the analysis of the EEG, a signal that may present bursts of waves as well
as background activity. Its relatively high computational complexity is no longer a problem with the
current state of the art in computing technology. Features can be extracted from the power spectrum estimate as
relative or absolute powers in EEG bands [55] [72], or directly from the model parameters [75] [152]
[147] [160] [123] [137]. Smoothing is usually applied to the coefficients to get a better estimate when
the number of samples has to be kept low (stationarity requirement). Chapter 4 will cover this method
in detail.
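A minimal sketch of AR feature extraction, solving the Yule-Walker equations by the Levinson-Durbin recursion. The order-2 model and toy data here are purely illustrative; the choice of model order and window length for the sleep EEG is the subject of chapter 4.

```python
import numpy as np

def yule_walker(x, order):
    """AR prediction coefficients a[1..p] and residual variance for the
    model x[n] = a1*x[n-1] + ... + ap*x[n-p] + e[n], via the biased
    autocorrelation estimate and the Levinson-Durbin recursion."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    r = np.array([x[:n - k] @ x[k:] for k in range(order + 1)]) / n
    a = np.zeros(0)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - a @ r[1:i][::-1]) / err   # reflection coefficient
        a = np.concatenate([a - k * a[::-1], [k]])
        err *= 1.0 - k * k                    # prediction error update
    return a, err

# Toy AR(2) process with known coefficients (1.2, -0.6)
np.random.seed(0)
N = 2100
x = np.zeros(N)
e = np.random.randn(N)
for i in range(2, N):
    x[i] = 1.2 * x[i - 1] - 0.6 * x[i - 2] + e[i]
a, err = yule_walker(x[100:], 2)   # a is close to [1.2, -0.6]
```

The estimated coefficients (or powers read off the AR spectrum they define) can then serve directly as the per-window features mentioned in the text.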
Non-linear analysis
Although some evidence has been found that mathematically the EEG signal resembles much more a
stochastic process with changing conditions than a non-linear deterministic process with a chaotic attrac-
tor [2], chaos theory offers ways to determine signal complexity.
Shaw et al. used an algorithmic complexity measure as an index of cortical function in rats [146]. Rezek
4 N is the number of signal samples.
and Roberts [137] compared four stochastic complexity measures for the EEG, namely AR model order,
spectral entropy, approximate entropy and fractional spectral radius, obtaining best results with the last
one when attempting to detect disturbed sleep with the central EEG channel.
Fell et al. found that non-linear measures discriminate better between sleep stages 1 and 2, while spectral
measures do so with sleep stage 2 and SWS. None of the investigated measures were able to discriminate
between REM sleep and sleep stage 1. The measures were relative δ power, spectral edge, spectral
entropy and first spectral moment (spectral measures), and correlation dimension D2, largest Lyapunov
exponent L2 and approximate Kolmogorov entropy K2 (non-linear methods).
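Of the complexity measures listed, spectral entropy is the most straightforward to sketch: the Shannon entropy of the normalised power spectrum, low for rhythmic narrow-band EEG and high for broadband activity. This is a generic formulation, not necessarily Rezek and Roberts' exact implementation.

```python
import numpy as np

def spectral_entropy(x):
    """Normalised Shannon entropy of the power spectrum: near 0 for a
    pure tone, near 1 for a flat (white-noise-like) spectrum."""
    x = x - np.mean(x)
    psd = np.abs(np.fft.rfft(x)) ** 2
    psd = psd[1:]                        # drop the DC bin
    p = psd / np.sum(psd)
    p = p[p > 0]                         # avoid log(0)
    return -np.sum(p * np.log(p)) / np.log(len(psd))

fs = 128
t = np.arange(4 * fs) / fs
np.random.seed(1)
tone = np.sin(2 * np.pi * 10 * t)        # rhythmic, alpha-like segment
noise = np.random.randn(len(t))          # broadband segment
# spectral_entropy(tone) is much smaller than spectral_entropy(noise)
```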
Classification techniques
The set of features extracted from either frequency-domain or time-domain analysis or a mixture of
both are usually combined in a deterministic way to determine the R & K sleep stage. However, other
approaches involving self-learning classifiers have also been investigated. In 1981, Jansen and co-workers
used AR features and cluster analysis for sleep staging [75]. Later, Kemp et al. developed a model based
maximum likelihood classifier [86]. Kubat and collaborators presented in 1994 an artificial intelligence
approach with automatic induction of decision trees [92]. At the same time, several investigators have
used probability-based approaches such as Bayesian classifiers [72] and neural network classifiers, with
the introduction of a new approach in tracking the sleep continuum in the work of Pardey et al. [123].
Knowledge-based methods We have already pointed out that many of the attempts to perform automatic
sleep scoring emulate the visual scoring process of the R & K rules. As a result, numerous
knowledge-based classification systems have been developed. The implementation varies from hybrid
analog-digital logic arrays of “ANDs” and “ORs” [22] [150] and algorithmic “IF-ELSE” rules [132] to
fuzzy-logic systems [94] [55] [68]. Most of the systems extract additional information from the experts,
but heuristic approaches like the one developed by Smith and co-workers, who tried different adjustments
to increase the agreement with visual scoring [151], can be found in the literature.
Neural network methods Neural networks have been used in the detection of characteristic sleep
waves (i.e. spindles, K-complexes, etc) which can then be used to help automatic sleep staging. Shimada
et al. have trained a 3-layer neural network for the detection of these waves using a time-frequency 2D
array, consisting of 11 sets of 12 FFT coefficients in a 3.84s window [158]. Wu and co-workers developed
EEG artefact rejection by training a neural network to recognise the typical artefact patterns [73].
Neural networks have also been used to classify sleep according to the R & K scale. Baumgart-Schmitt
and collaborators [16] [15] used a mixture of experts to classify sleep using 31 power spectral features
and nine 3-layer neural networks, each one trained with data from a different healthy subject. They
obtained good discrimination of REM with respect to Wakefulness and Sleep Stage 1.
Previous work in the group in which the research described in this thesis was carried out [123]
used a neural network to track the dynamic development of sleep on a continuous scale from deep
sleep (stage 4) to wakefulness on a second-by-second basis. The neural network output has the ability
to pinpoint short-time events and more cyclic events like Cheyne-Stokes respiration using only one EEG
channel and AR features.
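The continuous-scale tracking approach can be illustrated in miniature: a small MLP with a sigmoid output, trained by gradient descent to map feature vectors to a value between 0 (deep sleep) and 1 (wakefulness). Everything below (architecture, data, learning rate) is a toy illustration, not the network used in this thesis.

```python
import numpy as np

np.random.seed(0)

# Synthetic dataset: 2-D "AR-like" features -> continuous vigilance level
X = np.random.randn(200, 2)
y = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - X[:, 1])))   # smooth target in (0, 1)

# One-hidden-layer MLP, sigmoid output, trained by batch gradient descent
W1 = 0.5 * np.random.randn(2, 8); b1 = np.zeros(8)
W2 = 0.5 * np.random.randn(8);    b2 = 0.0

def forward(X):
    h = np.tanh(X @ W1 + b1)                        # hidden layer
    out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))      # continuous output
    return h, out

mse_init = np.mean((forward(X)[1] - y) ** 2)
lr = 1.0
for _ in range(3000):
    h, out = forward(X)
    d_out = (out - y) * out * (1 - out) / len(X)    # MSE grad wrt pre-sigmoid (up to a factor of 2)
    d_h = np.outer(d_out, W2) * (1 - h ** 2)        # backprop through tanh
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum()
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)

mse_final = np.mean((forward(X)[1] - y) ** 2)       # well below mse_init
```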
3.3 Analysis of the EEG for the detection of micro-arousals
3.3.1 Cortical arousals
There are several types of arousals. Some sleep disturbances do not reach the brain cortex; these are
called “sub-cortical” or “autonomic” arousals, and they can be detected by monitoring the heart rate and
the beat-to-beat blood pressure, looking for an increase in the pulse rate and blood pressure along with
an increase in the respiratory effort, following an apnoea/hypopnoea event. There are also those which
arise in the brain cortex, the so-called “cortical” or EEG arousals, which as their name suggests, can be
detected by monitoring the EEG. Cortical arousals can have several causes, not all related to OSA, for
instance external noise, changes in light, snoring, leg movements, bowel disturbance, bladder distension
and gastroesophageal reflux to mention a few of them. Pain and some forms of insomnia can also be
causes of arousals, but the most common cause is OSA. Ageing is another strong factor in the tendency
to arousal [155].
During a PSG all the external variables like light and noise can be reasonably controlled, and monitoring
leg movements (by transducers located on the legs, or video recording) and snoring (by a microphone
taped on the neck) may help to discard those arousals produced by causes other than OSA [112].
Bennet and colleagues [18] found that detection of autonomic activation is as good as detecting cortical
arousal for predicting daytime sleepiness in OSA patients, but it does not convey any extra information.
As a rule patients with sleep disorders or excessive daytime sleepiness have normal electrophysiological
EEG characteristics both in frequency and amplitude [61]. The OSA sleep disorder does not alter the
physiology of sleep but has a pronounced effect on the sequence of states. Recent studies [123] [48]
have claimed that the EEG provides sufficient information to identify most micro-arousals. A cortical
arousal caused by an apnoeic/hypopnoeic event usually looks like an increase in the frequency in the
EEG (see Fig. 3.5). Note that we will discuss this in more detail in section 6.2.5.
Figure 3.5: Apnoeic event
R & K criteria for sleep scoring allow the scoring of arousals longer than 10s as well as the so-called
“movement arousals”, but the set of rules was not designed for the scoring of transient events (shorter
than 10s). If a 30s epoch has more than 15s of slow waves plus a short arousal, the epoch will be scored
as stage 3 or 4, ignoring the presence of the arousal. In this way, a night sleep record may look “normal”,
whereas the fact is that the subject has experienced hundreds of micro-arousals [155].
The American Sleep Disorders Association (ASDA) attempted to overcome the R & K deficiency in the
scoring of micro-arousals when it published a set of rules for EEG arousal scoring. The rules are indepen-
dent of the R & K criteria, and are summarised in the next section.
3.3.2 ASDA rules for cortical arousals
Technically, ASDA defined an arousal as an abrupt shift in EEG frequency, which may include θ, α and/or
frequencies greater than 16Hz but not spindles, subject to the following summary of rules and conditions
[11]:
1. A minimum of 10 continuous seconds of sleep in any stage must occur prior to an EEG arousal for
it to be scored as an arousal. That is a consequence of the first and second rules of the EEG arousal
scoring set of rules. The first one establishes that the subject must be asleep. The second one is
such as to prevent the scoring of two related arousals as independent arousals.
2. The minimum duration is 3 seconds. There is both a physiological basis and a methodological
reason for this choice: reliable scoring of events shorter than this is difficult to achieve visually.
3. To score an arousal in REM sleep there must be a concurrent increase in submental EMG.
4. Artefacts (including pen blocking or saturations), K-complexes or δ waves are not scored as arousals
unless accompanied by a frequency shift in another EEG channel. If they precede the frequency
shift, they are not included in the three seconds criterion. Indeed, δ wave bursts are not necessarily
related to arousals and as a result, more evidence should be used, for example, respiratory tracing.
5. To score 3 seconds of α sleep as an arousal, it must be preceded by 10 or more seconds of α-free
sleep.
6. Transitions from one stage to another must meet the criteria indicated above to be scored as an
arousal.
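Rules 1 and 2 above lend themselves to a simple run-length check. Given a per-second boolean sequence marking whether an abrupt EEG frequency shift is present, candidate arousals are runs of at least 3s preceded by at least 10s without a shift. The sketch below covers only these two timing rules; the remaining rules require additional channels (e.g. submental EMG for REM).

```python
def score_arousals(shift, min_len=3, min_gap=10):
    """Return (start, length) of candidate arousals from a per-second
    boolean list `shift` (True = abrupt EEG frequency shift present).
    Implements only the >=3 s duration and >=10 s preceding-sleep rules."""
    arousals = []
    i, n = 0, len(shift)
    while i < n:
        if shift[i]:
            j = i
            while j < n and shift[j]:
                j += 1
            run = j - i
            # require 10 s of shift-free sleep immediately before the run
            if run >= min_len and i >= min_gap and not any(shift[i - min_gap:i]):
                arousals.append((i, run))
            i = j
        else:
            i += 1
    return arousals

# 60 s of "sleep" with a 4 s shift at t=30 s and a 2 s shift at t=50 s
shift = [False] * 60
for s in range(30, 34):
    shift[s] = True
for s in range(50, 52):
    shift[s] = True
events = score_arousals(shift)   # only the 4 s event qualifies
```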
The time scale for arousal scoring is much shorter (changes of 3s or more) than the 20-30s for visual
sleep scoring; visual arousal scoring is therefore more time-consuming than visual sleep staging, and
also less accurate. In spite of the efforts of the ASDA, the scoring of micro-arousal events is still difficult
as inter-rater variability is very high, especially when the μ-arousal occurs during REM or light sleep
[46]. Townsend and Tarassenko [165] evaluated the agreement between three scorers of EEG micro-
arousals on an 11-patient database and found very little agreement (0-10%) over a mean of 70 arousals
per patient when counting the number of arousals scored. Indeed the figure got worse if the starting time
and duration of the arousals were also considered, as for some recordings none of the experts scored the
same event as an arousal.
3.3.3 Computerised micro-arousal scoring
As can be deduced from the above rules, the detection of arousals is not easy. Arousals may occur during
any sleep stage, and are particularly difficult to detect in REM sleep when the EEG is the only signal
used in the analysis. There is also some controversy concerning the comprehensiveness of the ASDA
rules. Townsend and Tarassenko [165] as well as Drinnan and co-workers [48] have questioned the
absence of a “gold-standard” definition for arousals. While the ASDA definition is in widespread use,
EEG changes which do not meet the criteria have been associated with daytime sleepiness [29]. Other
signals have been suggested as indices of arousals, like blood pressure [129] [39], but these indicators
correlate well with EEG arousal. Hypoxemia (reduced arterial oxygen saturation) has been found to
play a role in the capacity to stay awake, rather than in the propensity to fall asleep, while indices of
sleep disruption correlate with both [17]. Guilleminault et al. [58] evaluated the role of respiratory
disturbance, oxygen saturation, body mass and nocturnal sleep time in daytime sleepiness, but did not
find significant correlation between them. They concluded that the best predictor of the excessive daytime
sleepiness frequently found in OSA patients is the nocturnal PSG and the sleep structure abnormalities
found in the brain activity recording.
Stradling and collaborators [156] found that the relationship between the severity of the OSA measured
by a sleep study and the daytime sleepiness of the subject is poor. They suggested that the importance of
a micro-arousal is related to both its duration and the depth of sleep prior to the arousal. Accordingly, it
would be desirable to extract this information automatically from a computerised arousal scoring system.
3.3.4 Using physiological signals other than the EEG
Recently, Aguirre and co-workers [3] modelled blood oxygen saturation, heart rate and respiration signals
from a patient with OSA, using a nonlinear AR moving average model with exogenous inputs in which
the blood oxygen saturation is the output of the model and the other two signals are the inputs. They
successfully reconstructed the respiration signal from the other two, suggesting that the dynamics underlying
these signals are nonlinear and deterministic. However, while these signals are very well correlated with
each other, they do not bear a unique relationship to the changes in the EEG following an apnoeic event,
as was found by Townsend and Tarassenko [165], who investigated pulse transit time (a measure of
beat-to-beat blood pressure) and heart rate for micro-arousal detection. They found that increases in
heart rate and decreases in pulse transit time occur many times during the night at fairly regular intervals,
independently of the occurrence of micro-arousals.
Drinnan et al. [47] investigated the relation between movement or respiration signals (wrist movement,
ankle movement, left and right tibial electromyogram and phase change in ribcage-abdominal move-
ment) and cortical arousals. Their conclusions were that arousal was accompanied by movement only
on a minority of occasions; in some subjects, the number of movement events exceeded the number of
arousals, and some arousals were accompanied by more than one movement. This may explain the poor
relationship that they found between movement signals and arousals. Ribcage-abdominal phase was the
only index which showed a significant relation with cortical arousals, but despite the high correlation, in
some obese subjects the sensitivity and the positive predictive accuracy for phase were as poor as for the
other investigated signals, due to the loose coupling between the sensors used and the diaphragmatic motion.
Other subjects showed phase changes opposite to those expected. Macey and collaborators [100][101]
found similar results when using time-domain features and neural network methods to detect
apnoea events from the abdominal breathing signal in infants with central apnoeas.
3.3.5 Using the EEG in arousal detection
Drinnan and collaborators [48] investigated 10 possible indices of arousal using the EEG derivation
Cz/Oz. Two of the indices were related to amplitude (Hjorth’s activity and the CFM, the cerebral function
monitor) and eight to frequency (α power, zero crossing rate, δ crossing rate, i.e. the zero crossing rate
of the EEG’s first derivative, Hjorth’s mobility, frequency peak, frequency mean, frequency mean of the
CFM-filtered EEG and Hjorth’s complexity). Of all these, they found that three offered good discrimination
in terms of identifying arousals: the zero crossing rate, Hjorth’s mobility and the frequency mean.
Huupponen et al. [71] used a single channel of EEG and a neural network to detect arousals. Second-
by-second changes in the power of the EEG bands relative to the average power in a 30s segment were
chosen as input features for the neural network. Amongst the detected arousals, there were only 41%
true positives and a very high number of false positives. A recent study, performed by Di Carli and co-
workers [40] used two EEG channels and one EMG channel for the automatic detection of arousals in
a set of 11 patients with various pathologies, including OSA. They used wavelets to analyse the EEG
in the time-frequency domain, then measured the relative powers in the EEG bands and calculated the
ratios between these powers, computed for both short-term and long-term averages. These indices were
used along with the other measures from the EMG as the inputs to a linear discriminant function, whose
free parameters were set to maximise the sensitivity and selectivity for detecting arousals previously
scored by two experts. They made a distinction between “definite” arousals and “possible” arousals in
the visual scoring, and also post-analysed their results correlating the starting time and duration of the
micro-arousals detected by the computer with the ones detected by the experts. The automatic detection
yielded an average 57% agreement with the experts, while the agreement between experts reached 69%.
The percentage of man-machine agreement increased to approximately 75% when only definite arousals
were considered. Both Huupponen’s and Di Carli’s detectors took into account the context, mimicking
the visual scoring according to the ASDA rules.
3.4 Analysis of the EEG for vigilance monitoring
The study of the wake EEG is much more difficult than that of the sleep EEG as the signals are more
prone to artefacts and show subtle changes as the level of alertness of the individual varies. It is also
very difficult to validate the alertness measures derived from the EEG with others, like task performance,
because training and motivation play an important role in the ability of the subject to perform well while
drowsy.
3.4.1 Changes in the EEG from alertness to drowsiness
The fully awake, responsive state is associated in the EEG with the absence of any rhythmic activity.
The EEG has low amplitude and a random pattern. Also, multiple EMG artefacts are present. The
physiological explanation for this is that the responses to alerting stimuli are mediated by the ascending
reticular activating system of the brain stem [65] which also desynchronises the cortical activity [134].
As the individual relaxes, rhythmical activity appears, most commonly as α wave activity, the amplitude
of the EEG increases and the muscle activity diminishes. The α rhythm is almost always found in the
EEG of healthy, awake, unanaesthetised subjects. Its amplitude, however, is usually very low and it is
only picked up by recorders when it becomes strong as the person becomes drowsy or closes their eyes.
The relationship between the occurrence of an α wave and the brain status is intricate; most often, the α
rhythm appears in individuals relaxed and prone to sleepiness, i.e. drowsy. With further advance towards
drowsiness there is an α activity drop, the α sequences becoming less and less continuous, eventually
giving way to θ activity at the onset of sleep. θ activity is most commonly found in the 6-7 Hz band and
is stronger at the onset of drowsiness [85].
Spatially, the most important changes are in the amplitude of the α activity which occur predominantly
at the occipital sites, while an increase in the slow, mainly θ, activity is more diffuse [142]. EEG changes
do not appear until the subjective symptoms of sleepiness become manifest [5].
Slow eye movement (SEM) is probably the most sensitive variable for differentiating between
sleepiness and alertness [88][118][142][170]. However, in practice it is very difficult to score SEM
since blinks and rapid eye movements interfere [5]. An increase in motor activity (EMG) is shown in
subjects struggling against imminent drops in alertness [32].
The appearance of α rhythm does not necessarily indicate complete eye closure or blurred vision; sometimes
it may be associated with the perception of “being sleepy with open eyes” [88]. α rhythm is
particularly problematic for reasons that are not yet fully understood. Conradt and co-workers found
differences in “fast” α and “low” α activities and in reaction times, but these differences are very difficult
to detect [33]. The changes in α activity on a small time scale are somewhat different depending on
whether the eyes were initially open or closed [32].
Individual EEG differences
A particular problem for vigilance studies is the difference between individuals. Almost all the vigilance
studies using EEG report problems with a proportion of the subjects exhibiting abnormal EEG. Some
individuals are unable to maintain α activity for more than 30s with closed eyes, while others show much
α activity with eyes open even when at maximum alertness. Moreover, these “α-plus” subjects do not
experience the normal increase in α activity when losing alertness; instead, their α waves decrease with
sleepiness. Sometimes their α activity spreads into the θ band. These observations suggest the
need for individual calibration of sleepiness effects on the EEG [5]. This will be discussed in more detail
in section 10.5.
3.4.2 EEG analysis in vigilance studies
The central referential electrode montage C3-A2 is widely used to record the EEG in vigilance studies
[35] [61] [166] [106] as it is recommended by the standard manual for sleep stage scoring [136]. The
manual also recommends an epoch length of 15-30s, and this has also been adopted for alertness scoring
[38] [8] [157]. However, episodes of stage 1 or “micro-sleep” periods as brief as 1-10s have been
identified [142] [130] [126].
Alford et al. developed a sleepiness scale based entirely on PSG measures using 15s-epochs. The scale
has 6 waking categories and one sleep category (see table 3.2) [8]. This scale will be considered in more
detail in chapter 7 of this thesis. Given that vigilance stages may change within seconds, the EEG in
a 30s window is not stationary in terms of vigilance [127], and a correct statement can no longer be
made with respect to any information averaged over 30s [93]. Penzel and Petzold [127] scored the EEG
in variable length segments according to the patterning or rhythmicity and Varri et al. used an adaptive
segmentation algorithm for the EEG prior to visual scoring, resulting in segments of 0.5s to 2s [170].
With their technique, a 90 min vigilance test may consume an entire day of work for a technician (2-3
hours from preparation to the removal of the electrodes, plus 5 hours scoring the PSG) [126]. As wake
EEG is more complex than sleep EEG, the inter-rater agreement for vigilance EEG scoring is usually lower
(≈72% [61]) than in the sleep case (86% according to [126]).
Vigilance sub-category: Description
Active Wakefulness (Active): active/alert pattern; more than 2 eye movements per epoch; increased/definite body movement
Quiet Wakefulness Plus (QWP): active/alert pattern; more than 2 eye movements per epoch; average/possible/no body movements
Quiet Wakefulness (QW): alert pattern; less than 2 eye movements per epoch; average/reduced/definitely no body movements
Wakefulness with Intermittent α (WIα): definite burst of α rhythm for less than half of an epoch
Wakefulness with Continuous α (WCα): definite burst of α rhythm for more than half of an epoch
Wakefulness with Intermittent θ (WIθ): definite burst of θ rhythm for less than half of an epoch (plus α rhythm, if present)
Wakefulness with Continuous θ (WCθ): definite burst of θ rhythm for more than half of an epoch (stage 1 of sleep)
Table 3.2: The vigilance sub-categories and their definition
Spectral methods
Changes associated with sleepiness in the EEG are mainly in the patterns and rhythms of the signal.
Therefore it seems that the signal is better analysed in the frequency domain, either by power spectrum
estimation or by band-pass filtering, using the standard EEG frequency bands to define the filter bound-
aries. The rhythms most affected by drowsiness are θ, δ and α in that order. However, they do not change
in the same way, nor are all of the changes linear with respect to the decrease in performance. In the late
1960s, Daniel found that θ waves dropped significantly prior to failures in a detection task, and that the
occurrence of α waves was not necessarily correlated with errors [38]. Later on, Lorenzo et al., using central
electrodes, found a linear increase in θ power as a result of sleep deprivation which was also linked to
deterioration in performance [97]. Da Rosa et al. modelled the awake and sleep EEG with sufficient accu-
racy using the linearisation and simplification of a nonlinear distributed parameter physiological model
[36]. Studies on a minute scale showed that α power declines with drowsiness, while θ power increases
linearly with the loss in performance.
Flight simulations and in-cockpit studies have found correlation between EEG power-spectrum and pilot
performance, except for the α band [153]. Makeig and Jung [104] found that the second eigenvector of
the normalised EEG log spectrum is highly correlated with variations in drowsiness and sleep onset.
3.4.3 Vigilance monitoring algorithms
Attempts to implement an alertness monitor follow two major trends in pattern classification, the rule-
based type and the neural network approach. The signals most commonly used in these algorithms are
the EEG, the EOG and the EMG, but one of the prototypes for driver performance monitoring uses a non-
physiological signal, a measurement of the vehicle’s steering (see sub-section Neural Network methods
below). Some of the prototypes have been used only on simulators, while others have also been tested
in real conditions. Ambiguous data and inter-subject variability seem to be a common problem in all of
them.
As in sleep, the existing systems for automatic vigilance scoring are not yet suitable for clinical work,
requiring supervision from a skilled technician. Results in patients with EEG alterations are not reliable
unless their abnormality has been taken into account when developing the system.
Rule-based algorithms
In 1989 Penzel and Petzold [127] developed a sub-vigil state rule-based classifier based on frequency
domain features extracted from 2s segments of EEG. They achieved 84.4% agreement with consensus-labelled
data and noted that the inter-rater variability defines the limit of what can be achieved for man-machine
agreement. The inter-rater variability was 76% and the intra-rater variability was 81.6% on
their data set. The algorithm was used on OSA data and yielded “good results” in detecting arousals.
Varri et al.’s [170] rule-based computerised system for alertness scoring used more inputs: two EEG
channels, two EOG channels and one EMG channel. The system applied adaptive signal segmentation based on mean amplitude
and mean frequency measures, and a bank of filters provided the means of calculating the power within
each EEG band. A similar sub-system detected eye movements and EMG power. The effect of inter-subject
variability was reduced by recording 3 minutes of EEG with the eyes open in an alert condition and 3
minutes with the eyes closed in a quiet condition to provide reference values for the power in each EEG
band. They found that eye movement can play a very important role in alertness monitoring. The system
gave a 61.6% man-machine agreement. Hasan et al. [61] used the system with new data, having to
perform “prior minor adjustments” to compensate for the differences with the training data. They divided
the group into low/high α activity. They also found a value of 61.8% for the man-machine agreement, for
an inter-rater agreement of 71.9%, and noted that visual scorers had difficulties in correctly identifying
all the bursts of brain waves, especially θ.
Neural Network methods
As in many other classification problems, neural network methods have been applied to the problem of
alertness/drowsiness estimation. In 1992 Venturini et al. [171] attempted to perform real-time estima-
tion of alertness on a minute scale using one EEG channel and a neural network. The power at 5 significant
frequencies was used as the input features. The neural network had difficulties in achieving good generalisation
due to the small size of the available data set, and therefore the jack-knife method was used for training
(see chapter 8 for a description of this methodology). Results were “good”, reported as being better than
a linear discriminator on subjects who missed more than 40% of the target sounds on an auditory vigi-
lance task. They also tried to develop a similar system based on event related potentials (ERP) getting
an accuracy of 96% on data averaged over 28 minutes and of 90% on data averaged over 2 minutes.
However, ERP has two great disadvantages: firstly, it requires the introduction of a distracting sound,
and secondly it cannot be performed on a second-by-second basis because an ERP requires averaging
of a series of repetitive stimuli over at least a 2-min long window to be extracted from the background
EEG. Jung and Makeig [80] refined the system by using 2 EEG channels and a neural network using the
power spectrum as features and PCA to reduce the dimensionality of the feature space. They obtained a
reasonable match with respect to the predictions made by an a priori model and using linear regression.
More recently, Roberts et al. [139] attempted to predict the level of vigilance using multivariate AR
modelling of 2 symmetric channels of EEG (T3 and T4) and the blink rate from 2 channels of EOG
as input features to a committee of neural networks known as Radial Basis Function (RBF) networks
using thin-plate splines as basis functions. They made a comparative study, training the neural networks
for regression and for classification, the latter using only extreme-value labels. They trained the neural
networks in a Bayesian framework that allows integration over the unknown parameters (see [102] and
[103] for more detail) and which provides error bars for the results of the neural network analysis. They
obtained “reasonable” correlation with the smoothed human-expert assessment.
Trutschel et al. [166] combined the neural network approach with fuzzy logic when developing a neuro-
fuzzy hybrid system to detect micro-sleep events. The device consisted of 4 neural networks, one for each
of four EEG channels, and a fuzzy-logic combiner. They used the system to monitor alertness in a driving
simulation study, obtaining “high” correlation between the number of micro-sleeps detected per hour and
the accident statistics per hour during the night.
Physiological signals are not the only sources of information which can provide measures of alertness.
Performance measures give an indirect way of monitoring alertness. A vehicle based signal, the steering
measure, has been used to track driver performance and alertness [157]. Power spectrum, mean and
variance were chosen as input features in a neural network. θ-plus individuals5 were excluded from the
study. The system only worked with 75% of the drivers. Poor results may, however, have been due to
contradictory data. For instance, the experts who labelled the data using EEG, EOG and EMG channels,
scored one subject as being asleep for nearly two hours of driving. Results indicate that steering measure
and alertness are not 100% correlated.
Shortcomings
As mentioned above, EEG and SEM are the most significant physiological signals in alertness assessment.
However, SEM is very difficult to measure, and the EEG presents two disadvantages: firstly, the inherent
complexity of the wake EEG, which is affected by many factors such as task characteristics, motivation and mood;
and secondly, the inter-subject EEG variability.
EEG characteristics are widely distributed among the population, even within groups with the same
gender and age range. Matsuura et al. [106] found a large inter-individual variability, especially with
respect to age. The percentage of α time and α continuity were greater in males than in females after
adolescence, the percentage of θ time was greater in females than in males during childhood, and the
percentage of β time was higher in females than in males at all ages.
As we said in section 3.4.1, in about 10% of the population, visual inspection of the EEG shows α rhythm
during wakefulness, while for the other 90% the EEG only shows α rhythms when the subjects are in
eyes-shut wakefulness or in the first sleep stages. Another 10% of the population shows very low or no α
activity with eyes closed. The first group is known as α-plus (α+) or P-type while the other is the M-type,
P being used for persistent and M for minimal [87]. One of the vigilance studies found one α-plus subject
whose α activity decreased when becoming drowsy instead of the normal increase experienced by the
rest of the subjects [5].
A study of short-term EEG variability using the FFT suggests that interpretation of relative measures of
δ, θ and β in individual spectra may be dependent on absolute α power [120]. Varri et al. [170] divide
5Their EEG displays θ waves while they are awake.
the data into low or high α to adapt their algorithm to the “normal” differences in α activity. As already
mentioned, Hasan et al. [61] had to perform “prior minor adjustments” to compensate for the differences
with the training data. They also found that subjects with poorly defined occipital α activity constitute a
special problem in the detection of drowsiness [61].
A third problem in alertness/drowsiness scoring using the EEG comes from the standard procedures
followed to score the sleep EEG. The standard set of rules for sleep scoring [136] recommends a length
of 15-30s for the EEG epochs. However, Kubicki et al. argue that it is often difficult to make a distinction
between an “α-sleep type” and pre-arousals (micro-arousals) on this time scale [93].
Portable devices
A few commercial alertness monitoring devices, based on one or several measures such as eye-tracking,
pupillometry, eyelid closures, head motion detectors, electrophysiological and skin measures and performance
deterioration, are currently available [31][116]. A specialized company [31] advertises a micro-sleep/fatigue
detection algorithm that uses an advanced neural network and fuzzy logic hybrid system for
detecting and predicting the occurrence of micro-sleeps, a description that coincides with the system
developed by Trutschel et al. [166]. The same company offers integrated systems combining alertness
monitoring with alertness stimulation/micro-sleep suppression technologies, i.e. vibration, aroma, lighting,
sound and interactive performance systems combined with automatic micro-sleep/fatigue detection.
Chapter 4
Parametric modelling and linear prediction
This chapter reviews the theories of auto-regressive (AR) modelling and linear prediction, after an intro-
ductory section on spectrum estimation. A more detailed review of AR modelling can be found in [63].
Noise classification can be found in [62], and filter structures in [122].
4.1 Spectrum estimation
4.1.1 Deterministic continuous in time signals
Let x(t) be a deterministic continuous signal with finite energy. Its Fourier transform Xc(f) is given by
Eq. 4.1:
X_c(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt \qquad (4.1)
where the subscript c is used to distinguish it from its counterpart in the discrete-time domain.
Given the Fourier transform Xc(f), the signal x(t) can be recovered using the inverse Fourier transform:
x(t) = \int_{-\infty}^{\infty} X_c(f)\, e^{j 2\pi f t}\, df \qquad (4.2)
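For sampled data, the corresponding discrete transform pair can be checked numerically. The sketch below is illustrative only: it uses direct O(N²) sums rather than an FFT, and verifies that the inverse transform recovers the original sequence.

```python
import cmath

def dft(x):
    """Discrete analogue of Eq. 4.1: X(k) = sum_n x(n) e^{-j 2 pi k n / N}."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Discrete analogue of Eq. 4.2 (with the 1/N normalisation of the DFT pair)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

x = [1.0, 2.0, 0.5, -1.0]
x_back = idft(dft(x))
print([round(v.real, 10) for v in x_back])  # [1.0, 2.0, 0.5, -1.0]
```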
4.1.2 Stochastic signals
Many physical phenomena occur in such a complicated way that, even if they are governed by deterministic
laws, the almost infinite number of interactions and the noise present in the sensors make the use
of a probabilistic model more sensible. Stochastic signals1 carry an infinite amount of energy, and the
Fourier transform integral as defined in Eq 4.1 normally does not exist. They are not periodic, so the
Fourier series expansion does not apply either. Instead of the energy content, we may be interested in
the power (time average of energy) distribution with frequency. If the generating process is stationary1,
second order averages like the autocorrelation and the autocovariance offer an alternative to performing
the time-frequency transform. Normally, the autocovariance tends to zero as the lag increases, but if the
process is zero-mean, the autocorrelation equals the autocovariance and therefore shows the same trend.
This is a sufficient condition for the existence of the Fourier transform of the autocorrelation, given by
Eq 4.3:
R(f) = \int_{-\infty}^{\infty} r(\tau)\, e^{-j 2\pi f \tau}\, d\tau, \qquad R(\omega) = \int_{-\infty}^{\infty} r(\tau)\, e^{-j \omega \tau}\, d\tau \qquad (4.3)
The autocorrelation at lag zero, which is equal to the average power of the signal, is related to the Fourier
transform R(f) by the Wiener-Khinchin theorem:
r(0) = E[x(t)^2] = \int_{-\infty}^{\infty} R(f)\, df \qquad (4.4)
Therefore, the function R(f) represents the distribution of the power in the frequency domain, as a result
of which it has been named power spectral density (PSD) or power spectrum of the signal, often denoted
as S(f):
S(f) = R(f), \qquad S(\omega) = R(\omega) \qquad (4.5)
The PSD has several properties which are reviewed in [62, pp.254-56].
Estimating the power spectrum
Autocorrelation function estimators The autocorrelation function is an average over the ensemble
x(t, ξ)1. Usually only a single realisation x(t) (i.e. fixed ξ) of a given process x(t, ξ) is available leaving
1for a definition and a review of stochastic processes see Appendix A
us unable to estimate r(τ) unless ergodicity is assumed. If the process is ergodic, the autocorrelation
function of the process equals the time average over a single realisation, given by the right-hand side of
Eq. 4.6:
r(\tau) = E[x(t)\, x(t+\tau)] = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} x(t)\, x(t+\tau)\, dt \qquad (4.6)
However, in most of the cases, the signal x(t) is only available during a limited interval of time. Then
the autocorrelation function can only be estimated. Denoting x′(t) as the signal x(t) truncated by a
rectangular window of length 2T , we can estimate r(τ) as:
r(\tau) = \frac{1}{2T - |\tau|} \int_{-T+|\tau|/2}^{T-|\tau|/2} x\left(t + \frac{|\tau|}{2}\right) x\left(t - \frac{|\tau|}{2}\right) dt \qquad (4.7)
Eq. 4.7 is valid for |τ| < 2T; for |τ| ≥ 2T the estimate r(τ) is set to zero. This is an unbiased estimator
(i.e. its mean value is the real value of r(τ)), but its variance increases as |τ| increases, because of the
factor 2T − |τ| in the denominator. Instead, the estimator r′(τ):
r'(\tau) = \frac{2T - |\tau|}{2T}\, r(\tau) \qquad (4.8)
has smaller variance, and although it is a biased estimator, it is more commonly used because its Fourier
transform is related to the energy density spectrum of the truncated signal x′(t). Indeed, r′(τ) is equal
to:
r'(\tau) = \frac{1}{2T}\, x'(\tau) * x'(-\tau) \qquad (4.9)
where the symbol ∗ represents convolution in τ .
The periodogram: Fourier estimate of the PSD Invoking the Fourier transform property of convolution
in time, and noting that the transform of x′(−τ) is X′(−f), the Fourier transform of r′(τ) is:
R'(f) = \frac{1}{2T}\, X'(f)\, X'(-f) = \frac{1}{2T} |X'(f)|^2 \qquad (4.10)
Then the PSD estimate using the estimator r′(τ) for the autocorrelation is:
S'(f) = \frac{1}{2T} |X'(f)|^2 = \frac{1}{2T} \left| \int_{-T}^{T} x(t)\, e^{-j 2\pi f t}\, dt \right|^2 = \int_{-2T}^{2T} r'(\tau)\, e^{-j 2\pi f \tau}\, d\tau \qquad (4.11)
The function S′(f) is called the periodogram. It is an asymptotically unbiased estimator but its variance
increases with T . This surprising result is due to the integral in τ of the estimator r′(τ) with increasing
variance as |τ | approaches 2T . In the limit T → ∞ the periodogram tends to be a white-noise process
with mean S(f). Smoothing windows have been widely used as palliatives to overcome this behaviour,
either applied to the autocorrelation estimate r′(τ) to deemphasize the unreliable values at the borders,
or convolved with the periodogram to reduce the variance directly.
Discrete-in-time stationary stochastic processes If the signal is sampled in time, the equations above
change accordingly. The autocorrelation function r(m) is now a function of an integer lag m. Its Fourier
transform R(ω) is periodic, as a result of the sampling in time, and the total power can be found simply
by integrating over a period of R(ω):
P = r(0) = \frac{1}{2\pi} \int_{-\pi}^{\pi} R(\omega)\, d\omega \qquad (4.12)
where R(ω) is:
R(e^{j\omega}) = \sum_{m=-\infty}^{\infty} r(m)\, e^{-j \omega m} \qquad (4.13)
If only N samples of the time series x(n) have been taken, the discrete version of the autocorrelation
estimator in Eq. 4.8 can be calculated as:
r′(m) =1N
N−|m|−1∑n=0
x(n)x(n + |m|) (4.14)
This estimator presents the same characteristics as its continuous-time version. The expected value of
r′(m) is (N − |m|)/N times r(m), but it is asymptotically unbiased, as the bias tends to zero as N increases. Also,
its variance increases as m approaches N . A full expression for this variance is very difficult to find
for non-Gaussian processes [122]. However, Jenkins and Watts [77] conjecture that, in many cases, the
mean-square error of r′(m) is less than for the unbiased estimator.
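The estimator of Eq. 4.14 is straightforward to compute, and the bias factor can be exhibited directly: for a constant unit sequence, r(m) = 1 at every lag, so the estimate evaluates exactly to (N − |m|)/N. A minimal sketch, with an illustrative test sequence:

```python
def autocorr_biased(x, max_lag):
    """Biased autocorrelation estimate of Eq. 4.14: r'(m) = (1/N) sum x(n) x(n+|m|)."""
    n = len(x)
    return [sum(x[t] * x[t + m] for t in range(n - m)) / n for m in range(max_lag + 1)]

# For a constant unit sequence, r(m) = 1 for every lag, so the estimate
# equals the bias factor (N - |m|)/N exactly.
r = autocorr_biased([1.0] * 10, 3)
print(r)  # [1.0, 0.9, 0.8, 0.7]
```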
Discrete-in-time periodogram Based on the Jenkins and Watts conjecture, Eq. 4.14 is used in Eq. 4.13:
R'(e^{j\omega}) = \frac{1}{N} \sum_{m=-N+1}^{N-1} \sum_{n=0}^{N-|m|-1} x(n)\, x(n+|m|)\, e^{-j \omega m} \qquad (4.15)
After some mathematical manipulation [122, pp.542-3]:
R'(e^{j\omega}) = \frac{1}{N} \sum_{k=0}^{N-1} \sum_{n=0}^{N-1} x(n)\, x(k)\, e^{-j \omega (k-n)} = \frac{1}{N} \sum_{k=0}^{N-1} x(k)\, e^{-j \omega k} \sum_{n=0}^{N-1} x(n)\, e^{j \omega n} = \frac{1}{N}\, X(e^{j\omega})\, X(e^{-j\omega}) = \frac{1}{N} |X(e^{j\omega})|^2 \qquad (4.16)
where X(ejω) is the Fourier transform of the finite-length time series x(n). Note that R′(ejω) is the PSD
estimator S′(ejω) known as the periodogram.
S'(e^{j\omega}) = \frac{1}{N} |X(e^{j\omega})|^2 \qquad (4.17)
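Eq. 4.17 can be checked numerically against the power interpretation of the PSD: by Parseval's theorem, the average of the periodogram ordinates over the N DFT bins equals the signal's average power, i.e. r′(0). A minimal sketch with illustrative data and a direct DFT:

```python
import cmath

def periodogram(x):
    """S'(k) = |X(k)|^2 / N at the N DFT frequencies (Eq. 4.17, direct DFT)."""
    n = len(x)
    X = [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
         for k in range(n)]
    return [abs(Xk) ** 2 / n for Xk in X]

x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
S = periodogram(x)
avg_power = sum(v * v for v in x) / len(x)      # r'(0), the average power
print(abs(sum(S) / len(S) - avg_power) < 1e-9)  # True: Parseval's relation
```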
It can be proved ([122] pp.542-3) that, using either the unbiased estimator of the autocorrelation or the
biased estimator proposed by Jenkins and Watts, the periodogram for a discrete-in-time stationary process
is a biased estimator of the PSD. As in the continuous-in-time case, its variance does not tend to zero
as N increases. Again, this result can be improved by smoothing techniques, one of which divides the
time series into smaller, overlapping segments to perform an average over the periodograms, but this
compromises the resolution in frequency.
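The segment-averaging idea can be sketched as follows. For simplicity this sketch uses non-overlapping, rectangular-windowed segments; practical implementations usually add overlap and tapering, which are omitted here:

```python
import cmath

def segment_periodogram(x):
    """|X(k)|^2 / N at the N DFT frequencies of one segment (direct DFT)."""
    n = len(x)
    X = [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
         for k in range(n)]
    return [abs(Xk) ** 2 / n for Xk in X]

def averaged_periodogram(x, seg_len):
    """Average the periodograms of consecutive non-overlapping segments.
    The variance of each spectral ordinate drops roughly as 1/K for K segments,
    at the cost of frequency resolution (seg_len bins instead of len(x))."""
    segs = [x[i:i + seg_len] for i in range(0, len(x) - seg_len + 1, seg_len)]
    per_seg = [segment_periodogram(s) for s in segs]
    k = len(per_seg)
    return [sum(p[b] for p in per_seg) / k for b in range(seg_len)]
```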
Parametric modelling methods A model is any attempt to describe the laws which yield a given phe-
nomenon. Once a model is selected and its parameters estimated from the data, it can be used to generate
as many realisations as are needed to calculate the averages over the ensemble, or even better, it can be
used to calculate the PSD directly without having to use the Fourier transform. The choices for a model
are infinite, but using a priori knowledge about the data, the range can be reduced considerably. Assumptions
like zero values outside the observation window can be avoided. However, some assumptions
always have to be made for the characterisation of the model.
Yule [174] proposed in 1927 the use of a deterministic linear filter to represent a stochastic process. The
filter is driven by a sequence of statistically independent random variables with a zero-mean, constant-
variance Gaussian distribution2. This purely random series is known as white Gaussian noise because
its autocorrelation function is zero for all lag except for the origin, where it equals the variance of the
Gaussian process. The corresponding power spectral density is therefore a constant for all frequencies,
like the optical spectrum of white light. The filter performs a linear transformation on this uncorrelated
sequence to generate a highly correlated series x̂ that statistically matches the data x from the process
under analysis, as is shown in Fig. 4.1. The modelling procedure consists of the calculation of the filter
parameters.
[Figure 4.1: Stochastic process model — white Gaussian noise v(n) drives a discrete-time linear filter whose output is y(n) = x̂(n)]
The input-output relation of the filter has the general form:

(present value of model output) + (linear combination of past values of model output) =
(present value of model input) + (linear combination of past values of model input)
This can be written as a linear difference equation relating the input driving sequence v(n) to the output y(n):

y(n) = Σ_{m=0}^{q} b_m v(n-m) - Σ_{k=1}^{p} a_k y(n-k)    (4.18)
² Being the most common distribution found in physical phenomena, and given that the output of a linear filter driven by a Gaussian random process is another Gaussian process, it is the most convenient distribution at the filter input for a vast range of applications.
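The difference equation (Eq. 4.18) can be run directly to turn white noise into a correlated series. The sketch below is only an illustration of that recursion; the name `arma_output` and the zero-initial-condition convention are my assumptions, not the thesis's code.

```python
import numpy as np

def arma_output(b, a, v):
    """Run the difference equation of Eq. 4.18:
    y(n) = sum_m b[m] v(n-m) - sum_k a[k] y(n-k-1),
    with b = [b_0..b_q], a = [a_1..a_p], and zero initial conditions."""
    y = np.zeros(len(v))
    for n in range(len(v)):
        acc = sum(bm * v[n - m] for m, bm in enumerate(b) if n - m >= 0)
        acc -= sum(ak * y[n - k - 1] for k, ak in enumerate(a) if n - k - 1 >= 0)
        y[n] = acc
    return y
```

For example, with b = [1, 0.5] and no feedback terms, unit-variance white noise is turned into an MA(1) series with output variance 1 + 0.5² = 1.25 and lag-1 autocorrelation 0.5.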
Given that the proposed filter is linear and time-invariant, linear filter theory applies. Taking the z-transform³ of both sides of Eq. 4.18:

Y(z) = Σ_{m=0}^{q} b_m V(z) z^{-m} - Σ_{k=1}^{p} a_k Y(z) z^{-k}    (4.19)
Rearranging Eq. 4.19 to leave only Y(z) on the left-hand side:

Y(z) = [Σ_{m=0}^{q} b_m z^{-m}] / [1 + Σ_{k=1}^{p} a_k z^{-k}] V(z)    (4.20)
The z-transform of the unit-sample response of the filter, h(n), can be found by setting v(n) = δ(n), whose z-transform V(z) equals 1:

H(z) = Y(z)|_{V(z)=1} = [Σ_{m=0}^{q} b_m z^{-m}] / [1 + Σ_{k=1}^{p} a_k z^{-k}]    (4.21)
Using the substitution z = e^{jω}, the Fourier transform of the filter unit-sample response follows from Eq. 4.21:

H(e^{jω}) = [Σ_{m=0}^{q} b_m e^{-jωm}] / [1 + Σ_{k=1}^{p} a_k e^{-jωk}]    (4.22)
Let us now consider the input sequence as white Gaussian noise with variance σ_v^2. Its autocorrelation function equals σ_v^2 δ(n) and its Fourier transform equals σ_v^2 for all frequencies. Using the relation between the input and output autocorrelation functions given in Eq. A.13:

r_y(m) = Σ_{i=-∞}^{∞} Σ_{k=-∞}^{∞} h(i) h(k) σ_v^2 δ(k - i + m)    (4.23)

and taking the Fourier transform of both sides:

S_y(e^{jω}) = S_x(e^{jω}) = σ_v^2 |H(e^{jω})|^2    (4.24)

it can be seen that the PSD of the output of the filter can be obtained from the filter parameters {b_i, a_i} and the input noise variance.
³ Defined as Z[g(n)] = G(z) = Σ_{n=-∞}^{∞} g(n) z^{-n}. See [122] for properties.
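Eq. 4.24 can be evaluated numerically on a frequency grid: the PSD is just σ_v² times the squared magnitude of Eq. 4.22. The sketch below is a minimal illustration; the name `filter_psd` is mine, with {b_m, a_k} following the sign convention of Eq. 4.18.

```python
import numpy as np

def filter_psd(b, a, sigma2_v, n_freq=512):
    """PSD of the filter output (Eq. 4.24): S(w) = sigma_v^2 |H(e^{jw})|^2,
    with H as in Eq. 4.22. b = [b_0..b_q], a = [a_1..a_p]; w spans [0, pi)."""
    w = np.linspace(0.0, np.pi, n_freq, endpoint=False)
    num = sum(bm * np.exp(-1j * w * m) for m, bm in enumerate(b))
    den = 1.0 + sum(ak * np.exp(-1j * w * (k + 1)) for k, ak in enumerate(a))
    return w, sigma2_v * np.abs(num / den) ** 2
```

Two sanity checks: with b = [1] and no feedback the spectrum is flat at σ_v² (white noise), while an all-pole filter with a_1 = -0.9 concentrates power near ω = 0, where |H|² = 1/(1-0.9)² = 100.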
Whether the input-output relation uses a linear combination of past output values, a linear combination of past input values, or both, defines the following types of filter:
1. Autoregressive (AR) model: no linear combination of past values of the input is used.
2. Moving average (MA) model: no linear combination of past values of the output is used.
3. Autoregressive-moving average (ARMA) model: includes all the terms shown in Eq. 4.18.
The choice of model depends on the nature of the process. A description of the models follows in the next section.
4.2 Autoregressive Models
Let the time series y(n), y(n-1), ..., y(n-p) represent a realisation of an autoregressive process of order p. Then it satisfies the following difference equation:

y(n) + a_1 y(n-1) + a_2 y(n-2) + ... + a_p y(n-p) = v(n)    (4.25)

where the constants a_1, a_2, ..., a_p are the parameters of the model (the AR coefficients), and [v(n)] is a white Gaussian noise process.
The term “autoregressive” comes from the similarity between the AR model equation and the regression model equation:

y = Σ_{k=1}^{p} w_k u_k + v    (4.26)
The regression equation relates a dependent variable y to a set of independent variables u_1, u_2, ..., u_p plus an error term v; y is said to be regressed on u_1, u_2, ..., u_p. In a similar way the current sample of the AR process, y(n), is regressed on previous values of itself (hence “auto”), as can be seen by rewriting the AR equation as:

y(n) = Σ_{k=1}^{p} w_k y(n-k) + v(n)    (4.27)

where w_k = -a_k.
Transforming Eq. 4.25 to the z-domain we get:

Y(z)[1 + a_1 z^{-1} + a_2 z^{-2} + ... + a_p z^{-p}] = V(z)    (4.28)

Therefore the transfer function of an AR filter is:

H(z) = Y(z)/V(z) = 1 / [1 + a_1 z^{-1} + a_2 z^{-2} + ... + a_p z^{-p}]    (4.29)
The use of previous samples of the filter output is depicted with feedback paths, as shown in Fig. 4.2. The AR filter is an infinite impulse response (IIR), or all-pole, filter of order p. It can be stable or unstable depending on the location of its poles: if one or more poles lie outside the unit circle the filter is unstable. The p poles may be calculated from the characteristic equation of the filter:

1 + a_1 z^{-1} + a_2 z^{-2} + ... + a_p z^{-p} = 0    (4.30)
[Figure 4.2: Autoregressive filter — white noise v(n) drives a feedback structure of delay elements weighted by a_1, ..., a_p to produce the AR process y(n)]
Moving Average Models
Moving average filters are described by:

y(n) = v(n) + b_1 v(n-1) + b_2 v(n-2) + ... + b_q v(n-q)    (4.31)

where the constants b_1, b_2, ..., b_q are the MA parameters of the model and [v(n)] is a white Gaussian noise process.
This type of filter is an all-zero filter, inherently stable, with a finite impulse response (FIR). For this kind of discrete filter the order equals q, since q is the minimum number of delay units needed to implement it (see Fig. 4.3). The term “moving average” refers to the weighted average of the input time series v(n).
[Figure 4.3: Moving average filter — a tapped delay line on the white noise input v(n), with weights b_1, ..., b_q, produces the MA process y(n)]
Autoregressive-Moving Average Models
This model combines the features of the AR and MA filters. The difference equation which describes it is:

y(n) + a_1 y(n-1) + ... + a_p y(n-p) = v(n) + b_1 v(n-1) + ... + b_q v(n-q)    (4.32)

where the constants a_1, ..., a_p, b_1, ..., b_q are the ARMA parameters of the model and [v(n)] is a white Gaussian noise process. For this kind of IIR filter with direct transmission from the input, the order is said to be the pair (p, q). AR and MA models are special cases of an ARMA model.
[Figure 4.4: ARMA filter (b_0 = 1, q = p - 1)]
Wold decomposition
Wold’s decomposition theorem states that any stationary discrete-time stochastic process [u(n)] may be decomposed into the sum of a general linear process and a predictable process, the two being uncorrelated with each other. Accordingly, the process [u(n)] may be expressed as:

u(n) = y(n) + s(n)    (4.33)

The term s(n) is the predictable process, i.e. the sample s(n) can be predicted from its own past values with zero prediction variance. The term y(n) is the general linear process, which may be represented by the MA model:

y(n) = v(n) + Σ_{k=1}^{∞} b_k v(n-k)    (4.34)

where Σ_{k=1}^{∞} |b_k|^2 < ∞.
The white noise term v(n) which drives the general linear process y(n) is uncorrelated with the predictable process s(n), i.e. E[v(n)s(k)] = 0 for every pair (n, k). The general linear process may equally well be an AR process; all we have to do is ensure that the impulse response of the AR filter equals the impulse response of the MA filter. That is:
h(n) = Σ_{k=0}^{∞} b_k δ(n-k)    (4.35)

where b_0 = 1.
AR models have gained more popularity than the MA and the ARMA models. The reason lies in the
computation of the filter parameters, which leads to a system of equations that is linear for AR filters and
nonlinear for MA and ARMA filters [81][105].
4.3 AR parameter estimation
4.3.1 Asymptotic stationarity of an AR process
The classical solution to the AR difference equation (Eq. 4.25) separates the homogeneous solution from the particular solution. The particular solution is the AR model difference equation shown in Eq. 4.27, but the homogeneous solution y_h(n) is of the form:

y_h(n) = B_1 z_1^n + B_2 z_2^n + ... + B_p z_p^n    (4.36)

where z_1, z_2, ..., z_p are the roots of the characteristic equation of the filter (Eq. 4.30). The constants B_1, B_2, ..., B_p may be determined from the set of p initial conditions y(0), y(-1), ..., y(-p+1). For arbitrary values of the constants B_k, it is clear from Eq. 4.36 that the homogeneous solution decays to zero as n approaches infinity if and only if:

|z_k| < 1, for all k    (4.37)
In other words, all the poles of the AR filter lie inside the unit circle in the z-plane. A system which is able to “forget” its initial values in this way is said to be asymptotically stationary.
The autocorrelation function of such a system satisfies the homogeneous difference equation of the model. This may be found by rewriting Eq. 4.27 as:

Σ_{k=0}^{p} a_k y(n-k) = v(n)    (4.38)
where a_0 = 1. Multiplying both sides by y(n-m) and taking the expectation we get:

E[Σ_{k=0}^{p} a_k y(n-k) y(n-m)] = E[v(n) y(n-m)]    (4.39)

This may be simplified by noting that the expectation E[y(n-k)y(n-m)] equals the autocorrelation function at lag (m-k), and that the expectation of v(n)y(n-m) is zero for m > 0, since the sample y(n-m) depends only on input samples up to time (n-m). Hence:

Σ_{k=0}^{p} a_k r(m-k) = 0, for m > 0    (4.40)
Expanding the last equation gives the desired result:

r(m) = w_1 r(m-1) + w_2 r(m-2) + ... + w_p r(m-p), for m > 0    (4.41)

where w_k = -a_k. We may express the general solution of this equation as:

r(m) = Σ_{k=1}^{p} C_k z_k^m    (4.42)

where the C_k are constants and the z_k are the roots of the characteristic equation (Eq. 4.30). It follows that the autocorrelation function of an asymptotically stationary AR process approaches zero as the lag tends to infinity. This autocorrelation function decays exponentially if the dominant root is real and positive, alternates in sign as it decays if the dominant root is real and negative, and is a damped sine wave if the dominant roots are a complex conjugate pair.
4.3.2 Yule-Walker equations
Writing Eq. 4.41 for m = 1, 2, ..., p yields a set of p simultaneous equations for the unknowns a_1, a_2, ..., a_p, assuming that the autocorrelation function r(m) is known at least for lags 1 to p:

[ r(0)    r(1)    ...  r(p-1) ] [ w_1 ]   [ r(1) ]
[ r(1)    r(0)    ...  r(p-2) ] [ w_2 ]   [ r(2) ]
[  ...     ...    ...   ...   ] [ ... ] = [  ... ]
[ r(p-1)  r(p-2)  ...  r(0)   ] [ w_p ]   [ r(p) ]    (4.43)

where w_k = -a_k. This set of equations is known as the Yule-Walker equations. In matrix form:

R w = r    (4.44)
where R is the p × p autocorrelation matrix, w = [w_1, w_2, ..., w_p]^T and r = [r(1), r(2), ..., r(p)]^T. Its solution is:

w = R^{-1} r    (4.45)
It can be seen from Eq. 4.45 that the set of AR coefficients may be uniquely determined from the first p+1 samples of the autocorrelation function of the process x(n) being modelled. If we evaluate Eq. 4.39 for m = 0 and y(n) equal to the data time series x(n), we get:

E[Σ_{k=0}^{p} a_k x(n-k) x(n)] = E[v(n) x(n)]    (4.46)
The right-hand side of Eq. 4.46 is:

E[v(n) x(n)] = E[v(n) (Σ_{m=1}^{p} w_m x(n-m) + v(n))]
             = Σ_{m=1}^{p} w_m E[v(n) x(n-m)] + E[v(n) v(n)]
             = E[v(n) v(n)]    (4.47)
The right-hand side of this equation is the variance of the input noise, σ_v^2. This variance may be determined from the set of AR coefficients and the first p+1 samples of the autocorrelation function:

σ_v^2 = Σ_{k=0}^{p} a_k r(k)    (4.48)
Eq. 4.44 can be solved by Gaussian elimination. However, the Toeplitz structure of the matrix R can be exploited to find the parameters a_k more efficiently; a recursive algorithm to solve the Yule-Walker equations will be presented in section 4.6.1.
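The direct (non-recursive) solution of Eqs. 4.44 and 4.48 can be sketched in a few lines. This is an illustration only, with `yule_walker` as an assumed name; a Toeplitz-aware solver would be more efficient for large p.

```python
import numpy as np

def yule_walker(r):
    """Solve R w = r (Eq. 4.44) given autocorrelation samples
    r = [r(0), r(1), ..., r(p)]. Returns (a, sigma2_v), where
    a = [a_1..a_p] = -w and sigma2_v follows Eq. 4.48."""
    r = np.asarray(r, dtype=float)
    p = len(r) - 1
    # Toeplitz autocorrelation matrix R, built explicitly for clarity
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    w = np.linalg.solve(R, r[1:])
    a = -w
    sigma2_v = r[0] + a @ r[1:]     # Eq. 4.48 with a_0 = 1
    return a, sigma2_v
```

Fed with r(0) = 1, r(1) = 0.5, r(2) = 0.85, this reproduces the second-order worked example of section 4.3.3: a = [-0.1, -0.8] and σ_v² = 0.27.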
4.3.3 Using an AR model
An AR model may be used for synthesis or for analysis. In synthesis, a stationary stochastic process y(n), characterised by its variance σ_y^2 and the parameters of its AR model (i.e. the AR filter coefficients), is given, and we want to generate a time series of the process. In analysis, we want to model a stochastic process given a time series x(n), by estimating the set of AR parameters for a model order p and the input noise variance, assuming that p is the optimum model order⁴. Next, we present a second-order example of a synthesis problem and of an analysis problem.
Second order AR process synthesis
Assume that we want to synthesise a real-valued, second-order stationary AR process y(n) with unit variance. The difference equation of the AR model is:

y(n) + a_1 y(n-1) + a_2 y(n-2) = v(n)    (4.49)

As a condition for asymptotic stationarity, we need to ensure that the roots of the characteristic equation of the model lie inside the unit circle in the z-plane:

1 + a_1 z^{-1} + a_2 z^{-2} = 0    (4.50)

⇒ z_{1,2} = [-a_1 ± sqrt(a_1^2 - 4 a_2)] / 2    (4.51)
where z_1 and z_2 are the roots of Eq. 4.50. Satisfying the asymptotic stationarity condition:

|z_1| < 1 and |z_2| < 1    (4.52)

requires the following restrictions on the AR parameters:

-1 ≤ a_2 + a_1,  -1 ≤ a_2 - a_1,  -1 ≤ a_2 ≤ 1    (4.53)

which define a triangular region in the (a_1, a_2) plane with corners at (-2, 1), (0, -1) and (2, 1). Let us arbitrarily choose the following values for a_1 and a_2 from this region:

a_1 = -0.1,  a_2 = -0.8    (4.54)

This gives roots at:

z_1 = 0.9458,  z_2 = -0.8458    (4.55)
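The pole locations and the stationarity check can be sketched numerically; multiplying Eq. 4.50 through by z^p gives an ordinary polynomial whose roots are the poles. The name `ar_poles` is illustrative.

```python
import numpy as np

def ar_poles(a):
    """Poles of the AR filter: roots of z^p + a_1 z^{p-1} + ... + a_p,
    i.e. Eq. 4.50 multiplied through by z^p."""
    return np.roots([1.0] + list(a))

# Check the example values a_1 = -0.1, a_2 = -0.8 for stationarity
poles = ar_poles([-0.1, -0.8])
stationary = bool(np.all(np.abs(poles) < 1))
```

For a = [-0.1, -0.8] the poles come out at approximately 0.9458 and -0.8458, both inside the unit circle, so the chosen process is asymptotically stationary.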
⁴ We will not cover in this section the problem of finding the optimum model order.
The positive root z_1 dominates the autocorrelation function. In order to calculate the input noise variance we need to find the first three samples of the autocorrelation function r(m):

σ_v^2 = r(0) + a_1 r(1) + a_2 r(2)    (4.56)
From the Yule-Walker equations:

[ r(0)  r(1) ] [ w_1 ]   [ r(1) ]
[ r(1)  r(0) ] [ w_2 ] = [ r(2) ]    (4.57)

where w_1 = -a_1 and w_2 = -a_2. We know that r(0) = σ_y^2 = 1, and hence we can find the other two samples of r(m). Substituting in Eq. 4.57:

[ 1     r(1) ] [ 0.1 ]   [ r(1) ]
[ r(1)  1    ] [ 0.8 ] = [ r(2) ]    (4.58)

⇒ [  0.2  0.0 ] [ r(1) ]   [ 0.1 ]
  [ -0.1  1.0 ] [ r(2) ] = [ 0.8 ]    (4.59)

⇒ r(1) = 0.5,  r(2) = 0.85,  σ_v^2 = 0.27    (4.60)
To generate the time series, we substitute the values of the AR parameters into Eq. 4.49 and run the difference equation with v(n) drawn from N(0, 0.27) and zero initial values for y(n):

y(n) = 0.1 y(n-1) + 0.8 y(n-2) + v(n)    (4.61)

A time series generated in this way is shown in Fig. 4.5. The autocorrelation function plotted in Fig. 4.6 has been calculated by applying Eq. 4.41 with the initial values r(0), r(1) and r(2) found above:

r(m) = 0.1 r(m-1) + 0.8 r(m-2), for m > 2    (4.62)
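The synthesis step can be sketched directly from Eq. 4.61. This is a minimal sketch, not the thesis's generator: the name `synthesise_ar2` and the use of a seeded random generator are my assumptions.

```python
import numpy as np

def synthesise_ar2(n_samples, a1=-0.1, a2=-0.8, sigma2_v=0.27, seed=0):
    """Run the difference equation y(n) = -a1*y(n-1) - a2*y(n-2) + v(n)
    (Eq. 4.61 with the example values) with v(n) ~ N(0, sigma2_v)
    and zero initial conditions."""
    rng = np.random.default_rng(seed)
    v = rng.normal(0.0, np.sqrt(sigma2_v), n_samples)
    y = np.zeros(n_samples)
    for n in range(n_samples):
        y_1 = y[n - 1] if n >= 1 else 0.0
        y_2 = y[n - 2] if n >= 2 else 0.0
        y[n] = -a1 * y_1 - a2 * y_2 + v[n]
    return y
```

With the default parameters the sample variance of a long series should approach the design value σ_y² = 1, and the lag-1 autocorrelation should approach r(1) = 0.5.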
[Figure 4.5: Time series of the synthesised AR process]

[Figure 4.6: Autocorrelation function of the synthesised AR process]

[Figure 4.7: Second-order AR process generator]

Second order AR process analysis

Assume that we have 128 samples of a time series from a stationary stochastic process [y(n)]. We will see whether this process can be modelled as an AR process of order 2. The Yule-Walker equations may be used to estimate the AR parameters:

w = R^{-1} r    (4.63)

but first we need to estimate the first three samples of the autocorrelation function from the available data. Using the sample autocorrelation estimator (Eq. 4.14) with N = 128:

r'(m) = (1/128) Σ_{n=0}^{127-|m|} y(n) y(n-m)    (4.64)
we may estimate the first three samples of r(m) and form the matrix R:

R' = [ r'(0)  r'(1) ]
     [ r'(1)  r'(0) ]    (4.65)
If the matrix R' is nonsingular, we may find a_1 and a_2 from the Yule-Walker matrix equation. The input noise variance may be estimated from the r'(m) sequence using Eq. 4.48. To test the model we may use the inverse filter to see whether it is capable of “whitening” the given time series: if the model fits the data well, the output of this whitening filter will be white Gaussian noise with zero mean and variance σ_v^2. The direct filter has the transfer function:

H(z) = 1 / [1 + a_1 z^{-1} + a_2 z^{-2}]    (4.66)

so the whitening filter transfer function is:

H_W(z) = H^{-1}(z) = 1 + a_1 z^{-1} + a_2 z^{-2}    (4.67)

The whitening filter is also called the AR process analyser, and its impulse response has finite duration (FIR). If the process is not truly autoregressive, if the model order is not p, or if the error in the estimation of the autocorrelation is high, the output of the inverse filter will be coloured noise.
As an example, we may use the AR process generator of the previous section to produce a 128-sample time series to feed into the AR analyser. For one such time series the first three estimated values of r(m) were 0.7045, 0.1963 and 0.5612. The matrix R' is therefore:

R' = [ 0.7045  0.1963 ]
     [ 0.1963  0.7045 ]
[Figure 4.8: Second-order AR process analyser]
and the vector r' = [0.1963, 0.5612]^T. Applying Eq. 4.63 we obtain the estimate w = [0.0614, 0.7795]^T. A better approximation to the true value w = [0.1, 0.8]^T can be obtained by increasing the number of samples, or by running the generator several times to collect several time series of the same process (i.e. an ensemble), analysing each and averaging the results. Table 4.1 and Fig. 4.9 show the mean and variance of the AR coefficients estimated using this procedure for an ensemble of 500 time series, with the number of samples per time series ranging from 16 to 1024.
N             16       32       64       128      256      384      512      1024
a_1 mean      -0.1622  -0.1311  -0.1126  -0.1045  -0.1028  -0.1039  -0.1022  -0.1002
a_1 variance   0.0700   0.0258   0.0094   0.0038   0.0018   0.0012   0.0009   0.0004
a_2 mean      -0.5065  -0.6448  -0.7181  -0.7588  -0.7777  -0.7827  -0.7881  -0.7949
a_2 variance   0.0408   0.0180   0.0084   0.0034   0.0016   0.0011   0.0008   0.0004

Table 4.1: Mean and variance of the AR coefficient estimates
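The analysis procedure just described can be sketched compactly. The name `estimate_ar2` is illustrative; the sketch uses the sample autocorrelation estimator (Eq. 4.64) followed by the Yule-Walker solution (Eq. 4.63).

```python
import numpy as np

def estimate_ar2(y):
    """Estimate [a_1, a_2] of an AR(2) model from data: sample
    autocorrelation (Eq. 4.64 generalised to length N), then w = R^{-1} r
    (Eq. 4.63), then a_k = -w_k."""
    N = len(y)
    r = np.array([np.dot(y[m:], y[:N - m]) / N for m in range(3)])
    R = np.array([[r[0], r[1]], [r[1], r[0]]])
    w = np.linalg.solve(R, r[1:])
    return -w
```

Fed with a long realisation of the example process y(n) = 0.1 y(n-1) + 0.8 y(n-2) + v(n), the estimate converges towards [-0.1, -0.8], mirroring the trend in Table 4.1.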
4.4 Linear Prediction
4.4.1 Wiener Filters
A typical statistical linear filtering problem consists of an input time series x(n), a linear filter characterised by its impulse response b_0, b_1, b_2, ..., and an output sequence y(n). This output is an estimate of a desired response d(n) (Fig. 4.10). Defining the estimation error as:

e(n) = d(n) - y(n)    (4.68)
[Figure 4.9: Mean and variance of the AR coefficient estimates — ensemble means and variances of the a_1 and a_2 estimates versus the number of samples N in the time series]
[Figure 4.10: Filter problem — a linear discrete-time filter with coefficients b_0, b_1, b_2, ... produces the output y(n) from the input x(n); the estimation error is e(n) = d(n) - y(n)]
the filter can be optimised by minimising the cost function J:

J = E[e(n) e(n)] = E[|e(n)|^2]    (4.69)

by setting its gradient, in the space constituted by the filter coefficients, equal to zero:

∇J = 0    (4.70)

Solving Eq. 4.70 yields the following result:

E[x(n-k) e_o(n)] = 0, for k = 0, 1, 2, ...    (4.71)

where e_o(n) denotes the estimation error of the filter operating in its optimum condition. Substituting Eq. 4.68 into Eq. 4.71 gives the following set of equations, known as the Wiener-Hopf equations:

R b_o = c    (4.72)

where the p × p correlation matrix R has been defined in Eq. A.15, and:

b_o = [b_{o0}, b_{o1}, ..., b_{o,p-1}]^T,  c = [c(0), c(-1), ..., c(1-p)]^T    (4.73)

where c(-k) = E[x(n-k) d(n)].
4.4.2 Linear Prediction
One of the most common uses of Wiener filters is to predict a future sample of a stationary stochastic process, given a set of past samples of the process. The Wiener-Hopf equations may be used to optimise the predictor in the mean-square sense. Assume that a time series of the process x(n-1), x(n-2), ..., x(n-p) is available. The estimate of the sample at time n, x̂(n), is a linear function of the previous samples:

x̂(n) = Σ_{k=1}^{p} b_k x(n-k)    (4.74)
The desired response is the true value of the sample x(n):

d(n) = x(n)    (4.75)

Then the prediction error e(n) for this filter is:

e(n) = x(n) - x̂(n)    (4.76)
The vector b_o in the Wiener-Hopf equations becomes b_o = [b_{o1}, b_{o2}, ..., b_{op}]^T. Note the shift by one in the indices of the coefficients b_{ok} with respect to Eq. 4.73, because the input sequence starts at sample n-1 instead of n. The input sequence provides the data for the estimation of the first p+1 samples of the autocorrelation function r(m), which may be used to find the p × p correlation matrix R and the vector c. The latter is possible because the desired response is a sample of the input time series:

    [ E[x(n-1) x(n)] ]   [ r(1) ]
c = [ E[x(n-2) x(n)] ] = [ r(2) ]
    [      ...       ]   [  ... ]
    [ E[x(n-p) x(n)] ]   [ r(p) ]    (4.77)
If the matrix R is nonsingular, the solution of the Wiener-Hopf set of equations gives the optimum linear predictor, characterised by the set of parameters b_{oi} for i = 1, ..., p. Fig. 4.11 shows a linear predictor of order p.
[Figure 4.11: Prediction filter of order p]
Note that the number of delay units is (p-1) while the number of filter parameters remains p. The apparent incongruity between the number of delay units, the number of parameters and the model order disappears when the linear predictor is related to the Wiener filter, as shown in Fig. 4.12.
[Figure 4.12: Prediction-error filter of order p]
Relationship between AR models and Linear Prediction
The filter of Fig. 4.12 may be rearranged to have the same structure as the AR analyser filter of Fig. 4.8. The resulting filter is shown in Fig. 4.13.
[Figure 4.13: Prediction-error filter of order p rearranged to look like an AR analyser]
The filters in Fig. 4.13 and Fig. 4.8 show the equivalence of the linear prediction-error filter and the AR analyser. Both filters are fed with a time series from a stochastic process, and both are expected to produce an uncorrelated random sequence at the output, e(n) or v(n) respectively. For the linear predictor, this output has been minimised in the mean-square prediction-error sense by solving the Wiener-Hopf equations:

R b_o = c    (4.78)
with c = [r(1), r(2), ..., r(p)]^T in Eq. 4.78. The coefficients b_i of the linear predictor are related to the parameters a_i of the AR model through:

a_i = -b_i, for i = 1, 2, ..., p    (4.79)

The set of AR parameters may be calculated using the Yule-Walker equations:

R w = r    (4.80)

with w = [-a_1, -a_2, ..., -a_p]^T and r = [r(1), r(2), ..., r(p)]^T. Therefore, the set of AR parameters found by solving the Yule-Walker equations is optimum in the mean-square prediction-error sense.
4.5 Maximum entropy method (MEM) for power spectrum density estimation
The Yule-Walker equations for AR modelling (or linear prediction) can be used to find the parameters of the filter that models the stochastic process x(n), and hence to estimate its PSD using Eq. 4.24. But the goodness of the estimator S still depends on the statistical characteristics of the estimator of the autocorrelation function r(m). The periodogram in Eqs. 4.11 and 4.17 assumes that the unknown values of the autocorrelation (for lags greater in modulus than the data length) are zero, which leads to smearing of the PSD estimate. Burg [25] applied the principle of maximum entropy to the estimation of the unknown autocorrelation lags of a Gaussian stochastic process. In this sense, the maximum entropy autocorrelation estimate is the one with the most random autocorrelation series, i.e. the maximum entropy estimator adds no information to the estimate. The solution for a set of 2p + 1 known autocorrelation lags is:
r_MEM(m) = r(m), for |m| ≤ p
r_MEM(m) = Σ_{k=1}^{p} b_{p,k} r_MEM(m-k), for |m| > p    (4.81)

where the coefficients b_{p,k} are none other than the parameters of the p-order linear predictor, and therefore equal to minus the a_{p,k} parameters of a p-order AR filter for the known autocorrelation lags. The
MEM PSD estimate, obtained as the Fourier transform of r_MEM, is:

S_MEM(ω) = P_{e_p} / |1 - Σ_{k=1}^{p} b_{p,k} e^{-jωk}|^2    (4.82)

where P_{e_p} denotes the average prediction error power E[|e_p(n)|^2] of the p-order linear predictor, which is equivalent to the input noise variance σ_{v,p}^2 of the p-order AR model. In terms of the AR model parameters, Eq. 4.82 is:

S_MEM(ω) = σ_{v,p}^2 / |1 + Σ_{k=1}^{p} a_{p,k} e^{-jωk}|^2    (4.83)
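The maximum-entropy extension of Eq. 4.81 can be sketched directly: the known lags are kept, and later lags follow the predictor recursion (with b_{p,k} = -a_{p,k} when the AR parameters are given). The name `extend_autocorrelation` is my assumption.

```python
import numpy as np

def extend_autocorrelation(r_known, a, n_lags):
    """Maximum-entropy extension of the autocorrelation (Eq. 4.81):
    keep the known lags r(0)..r(p), then extrapolate with
    r(m) = -sum_k a_k r(m-k), since b_{p,k} = -a_{p,k}."""
    r = list(r_known)
    p = len(a)
    for m in range(len(r_known), n_lags):
        r.append(-sum(a[k - 1] * r[m - k] for k in range(1, p + 1)))
    return np.array(r)
```

For the second-order example of section 4.3.3 (r = [1, 0.5, 0.85], a = [-0.1, -0.8]) this reproduces the recursion of Eq. 4.62: r(3) = 0.485, r(4) = 0.7285.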
4.6 Algorithms for AR modelling
4.6.1 Levinson-Durbin recursion to solve the Yule-Walker equation
The Levinson-Durbin algorithm [95][49] uses the symmetric Toeplitz structure of the autocorrelation matrix R to provide an efficient solution of Eq. 4.44, requiring only of the order of p² operations for a model order p, instead of the p³ computations required by Gaussian elimination. The algorithm also reveals fundamental properties of AR processes. It recursively computes the filter parameters and input variance {a_{m,k}, σ_m^2} for model orders m = 1, 2, ..., p.
The algorithm proceeds as follows:

1. Initialisation:

a_{1,1} = -r(1)/r(0)    (4.84)

σ_1^2 = (1 - |a_{1,1}|^2) r(0)    (4.85)

2. Recursion, for m = 2, 3, ..., p:

a_{m,m} = -[r(m) + Σ_{k=1}^{m-1} a_{m-1,k} r(m-k)] / σ_{m-1}^2    (4.86)

a_{m,k} = a_{m-1,k} + a_{m,m} a_{m-1,m-k}, for k = 1, 2, ..., m-1    (4.87)

σ_m^2 = (1 - |a_{m,m}|^2) σ_{m-1}^2    (4.88)
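The recursion of Eqs. 4.84-4.88 translates almost line-for-line into code. This is a minimal sketch for real-valued data; the name `levinson_durbin` is illustrative.

```python
import numpy as np

def levinson_durbin(r):
    """Levinson-Durbin recursion (Eqs. 4.84-4.88) given autocorrelation
    samples r = [r(0), ..., r(p)]. Returns (a, sigma2) with
    a = [a_{p,1}, ..., a_{p,p}] and sigma2 the input noise variance."""
    r = np.asarray(r, dtype=float)
    p = len(r) - 1
    a = np.zeros(p)
    a[0] = -r[1] / r[0]                          # Eq. 4.84
    sigma2 = (1.0 - a[0] ** 2) * r[0]            # Eq. 4.85
    for m in range(2, p + 1):
        # sum_{k=1}^{m-1} a_{m-1,k} r(m-k): pair a[0..m-2] with r[m-1..1]
        acc = r[m] + np.dot(a[:m - 1], r[m - 1:0:-1])
        a_mm = -acc / sigma2                     # Eq. 4.86
        a[:m - 1] = a[:m - 1] + a_mm * a[m - 2::-1]  # Eq. 4.87
        a[m - 1] = a_mm
        sigma2 = (1.0 - a_mm ** 2) * sigma2      # Eq. 4.88
    return a, sigma2
```

Applied to the second-order example (r = [1, 0.5, 0.85]) it returns a = [-0.1, -0.8] and σ² = 0.27, the same solution as the direct inversion of Eq. 4.44 but using only order-p² operations.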
The solution {a_{p,k}, σ_p^2} is the same as would be obtained using Eq. 4.44, and the solution sets for the lower model orders provide useful information. If the values r(m) used in the recursion represent a valid autocorrelation sequence, then it can be shown [10] that the last parameter for each model order satisfies⁵:

|a_{m,m}| ≤ 1    (4.89)

Consequently, the input variance obeys:

σ_m^2 ≤ σ_{m-1}^2    (4.90)
Using the analogy with linear predictors, Eq. 4.90 means that the prediction error decreases, or at least remains steady, as the model order increases. This is an advantage if the model order is not known a priori. If the stochastic process x(n) is actually an AR process of order p with known autocorrelation function, then the Levinson-Durbin recursion will reproduce the set {a_{p,k}, σ_p^2} for model orders greater than p. Under real conditions, where either the autocorrelation is unknown or the process is not truly AR, the input variance decreases monotonically as a function of the model order; however, it usually shows a “knee” or turning point, beyond which further increases in the model order do not significantly improve the prediction error.
Lattice form of a linear predictor
The parameters a_{m,m} play an important role in the theory of linear prediction. To see how they are related to the linear predictor of order p, let us define two types of prediction error:

Forward-prediction error: The prediction error shown in Eq. 4.76 for a p-order linear predictor, denoted by e_p(n), is:

e_p(n) = x(n) - Σ_{k=1}^{p} b_{p,k} x(n-k) = x(n) + Σ_{k=1}^{p} a_{p,k} x(n-k)    (4.91)
⁵ In fact, the condition |a_{m,m}| ≤ 1 is necessary and sufficient for the values of r(m) to represent a valid autocorrelation function.
where the b_{p,k} are the parameters of the p-order linear predictor, and the a_{p,k} the p-order AR model parameters. We will continue using the second form to keep consistency with the notation used in the Levinson-Durbin algorithm.

Backward-prediction error: If the data time series were reversed and fed to the p-order linear predictor for the original sequence x(n), the filter would sequentially predict the “past” samples of the original time series. Thus the backward prediction error for the sample x(n-p), denoted by b_p(n)⁶, is:

b_p(n) = x(n-p) + Σ_{k=1}^{p} a_{p,k} x(n-p+k)
       = x(n-p) + a_{p,1} x(n-p+1) + a_{p,2} x(n-p+2) + ... + a_{p,p} x(n)    (4.92)
Using Eq. 4.87, a relationship between the forward and backward prediction errors can be found:

e_p(n) = x(n) + Σ_{k=1}^{p-1} a_{p,k} x(n-k) + a_{p,p} x(n-p)
       = x(n) + Σ_{k=1}^{p-1} (a_{p-1,k} + a_{p,p} a_{p-1,p-k}) x(n-k) + a_{p,p} x(n-p)
       = (x(n) + Σ_{k=1}^{p-1} a_{p-1,k} x(n-k)) + (Σ_{k=1}^{p-1} a_{p,p} a_{p-1,p-k} x(n-k) + a_{p,p} x(n-p))    (4.93)
Noting that the terms within the brackets in Eq. 4.93 are related to the (p-1)-order linear predictor by:

e_{p-1}(n) = x(n) + Σ_{k=1}^{p-1} a_{p-1,k} x(n-k)

b_{p-1}(n-1) = x(n-1-(p-1)) + Σ_{k=1}^{p-1} a_{p-1,k} x(n-1-(p-1)+k)
             = x(n-p) + a_{p-1,1} x(n-p+1) + a_{p-1,2} x(n-p+2) + ... + a_{p-1,p-1} x(n-1)
             = x(n-p) + Σ_{k=1}^{p-1} a_{p-1,p-k} x(n-k)    (4.94)
Eq. 4.93 can thus be written as:

e_p(n) = e_{p-1}(n) + a_{p,p} b_{p-1}(n-1)    (4.95)

Similarly, it can be shown that:

b_p(n) = b_{p-1}(n-1) + a_{p,p} e_{p-1}(n)    (4.96)
⁶ Note that from this point the symbol b denotes an error and not a filter coefficient.
Eqs. 4.95 and 4.96 therefore relate the forward and backward prediction errors of a given model order p to the errors of a linear predictor of model order p-1. They can be used recursively to derive the lattice form of a p-order linear predictor. Renaming the parameters a_{m,m} as:

a_{m,m} = κ_m    (4.97)

and setting the initial values of the forward and backward prediction errors:

e_0(n) = b_0(n) = x(n)    (4.98)

we get:

e_1(n) = e_0(n) + κ_1 b_0(n-1),  b_1(n) = b_0(n-1) + κ_1 e_0(n)    (4.99)
Fig. 4.14 condenses the relationships shown by Eqs. 4.98 and 4.99. This structure resembles the basic
pattern of a lattice.
[Figure 4.14: Lattice filter of first order]
To continue the lattice, Eqs. 4.95 and 4.96 can be generalised by substituting m for p:

e_m(n) = e_{m-1}(n) + κ_m b_{m-1}(n-1),  b_m(n) = b_{m-1}(n-1) + κ_m e_{m-1}(n)    (4.100)

and evaluating them for m = 2, 3, ..., p. The complete filter adopts the structure shown in Fig. 4.15.
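The stage-by-stage recursion of Eqs. 4.98-4.100 can be sketched as follows. This is an illustrative sketch (the name `lattice_errors` and the zero-initial-condition handling of the delayed path are my assumptions).

```python
import numpy as np

def lattice_errors(x, kappa):
    """Propagate forward/backward prediction errors through a lattice
    (Eqs. 4.98-4.100) with reflection coefficients kappa = [k_1..k_p].
    Returns the final-stage error sequences (e_p, b_p)."""
    e = np.array(x, dtype=float)                 # e_0(n) = x(n)
    b = e.copy()                                  # b_0(n) = x(n)
    for k in kappa:
        b_delayed = np.concatenate(([0.0], b[:-1]))   # b_{m-1}(n-1)
        e_new = e + k * b_delayed                 # forward update
        b = b_delayed + k * e                     # backward update
        e = e_new
    return e, b
```

As a consistency check, for κ = [κ_1, κ_2] the final forward error should equal the direct FIR prediction-error filter with feedback coefficients a_{2,1} = κ_1 + κ_1κ_2 and a_{2,2} = κ_2 (Table 4.2), which is what makes the lattice a valid alternative representation of the predictor.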
Note that the transfer function of the lattice linear predictor filter is:

H_LP(z) = 1 + Σ_{k=1}^{p} a_{p,k} z^{-k}    (4.101)
[Figure 4.15: Lattice filter of order p]
which is the inverse of the transfer function of the corresponding AR filter. By analogy with transmission line theory, the parameters κ_m of the lattice filter are called reflection coefficients⁷, while the parameters a_{p,k} may be referred to as feedback coefficients, a term that is self-explanatory from the AR filter structure in Fig. 4.2.
The lattice structure has several advantages over the transversal filter structure shown in Fig. 4.13, or the feedback form shown in Fig. 4.2. Not only does it generate both the forward and backward prediction sequences, it is also modular. The first “step”, with coefficient κ_1 (see Fig. 4.14), represents the first-order linear predictor. Adding a second step κ_2 increments the order by one, yielding a second-order linear predictor without modifying the first step, and so on up to the desired model order. Moreover, each step or module is “decoupled” from the others, since it can be shown that the forward and backward prediction errors are orthogonal, i.e. uncorrelated with each other for stationary input data.
Furthermore, the condition |κ_m| ≤ 1 for m = 1, 2, ..., p is necessary and sufficient to guarantee that all the poles of the AR filter lie within or on the unit circle, which is a condition for stability. If any of the reflection coefficients equals ±1, the Levinson-Durbin recursion terminates with σ_i^2 = 0, where κ_i is the first reflection coefficient with unit modulus. In this case the process is purely harmonic, consisting only of sinusoids.
It is important to note that the set of reflection coefficients κ_1, κ_2, ..., κ_p represents the p-order linear predictor just as the set of feedback coefficients a_{p,1}, a_{p,2}, ..., a_{p,p} does. Using the Levinson-Durbin recursion, it is possible to calculate the a_{p,i} from the κ_i, as shown in Table 4.2 for the first three sets of feedback coefficients.
⁷ The parameters κ_m are also known as PARCOR (partial correlation) coefficients in the statistics literature.
Model order m | a_{m,1}                  | a_{m,2}                      | a_{m,3}
1             | κ_1                      |                              |
2             | κ_1 + κ_1 κ_2            | κ_2                          |
3             | κ_1 + κ_1 κ_2 + κ_2 κ_3  | κ_2 + κ_1 κ_3 + κ_1 κ_2 κ_3  | κ_3

Table 4.2: Feedback coefficients in terms of the reflection coefficients
The inverse Levinson-Durbin recursion in Eq. 4.102 provides the means to calculate the κ_i's as a function
of the a_{p,i}'s. Again, the results for the first three model orders are shown in Table 4.3.

a_{m−1,i} = (a_{m,i} − a_{m,m} a_{m,m−i}) / (1 − |a_{m,m}|²)    (4.102)

Model order m    κ_1                                                                  κ_2                                         κ_3
1                a_{1,1}
2                a_{2,1}/(1 + a_{2,2})                                                a_{2,2}
3                (a_{3,1} − a_{3,2}a_{3,3})/(1 + a_{3,2} − a_{3,1}a_{3,3} − a²_{3,3})  (a_{3,2} − a_{3,1}a_{3,3})/(1 − a²_{3,3})   a_{3,3}

Table 4.3: Reflection coefficients in terms of the feedback coefficients
It is apparent from Tables 4.2 and 4.3 that the reflection coefficients are less correlated with each other
than the feedback coefficients. This is another reason to prefer the set of reflection coefficients over the
set of feedback coefficients as a representation of an AR process.
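The two conversions can be written down directly from the recursions above. The following is an illustrative sketch (not from the thesis; the function names are invented) of the Levinson-Durbin "step-up" recursion and its inverse "step-down" form (Eq. 4.102):

```python
def reflection_to_feedback(k):
    """Step-up recursion a_{m,i} = a_{m-1,i} + k_m a_{m-1,m-i}."""
    a = []  # holds a_{m,1}..a_{m,m} for the current order m
    for m, km in enumerate(k, start=1):
        prev = a[:]                                        # order m-1 coefficients
        a = [prev[i - 1] + km * prev[m - 1 - i] for i in range(1, m)]
        a.append(km)                                       # a_{m,m} = k_m
    return a

def feedback_to_reflection(a):
    """Step-down: the inverse Levinson-Durbin recursion of Eq. 4.102."""
    a = a[:]
    k = []
    for m in range(len(a), 0, -1):
        amm = a[m - 1]
        k.insert(0, amm)                                   # k_m = a_{m,m}
        denom = 1.0 - amm * amm
        a = [(a[i - 1] - amm * a[m - 1 - i]) / denom for i in range(1, m)]
    return k
```

For example, `reflection_to_feedback([0.5, 0.2])` reproduces the order-2 row of Table 4.2, i.e. [κ_1 + κ_1κ_2, κ_2] = [0.6, 0.2], and the two functions are inverses of each other.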
4.6.2 Other algorithms for AR parameter estimation
For short data sets, the estimation of the first p lags of the autocorrelation sequence r(m) limits the
accuracy of the AR parameter estimates obtained with the Levinson-Durbin recursion. Other approaches
use standard statistical estimation directly from the data. The commonly used method of Maximum
Likelihood Estimation (MLE) is too difficult to apply [21], as it leads to a set of nonlinear equations [108].
Approximations to the exact MLE have been sought [26][82][172]; McWhorter and Scharf summarise the
work done on approximate MLE. Unfortunately, hardly any improvement is achieved by using approximate
MLE methods despite their high computational cost. Returning to the least-squares (LS) approach used in
section 4.4.1, we will present three more methods for AR parameter estimation directly from the data.
LS of the forward prediction error
Given a data time series x(n) of length N , the forward prediction error ep(n) shown in Eq. 4.91 can be
written as:
e_p(n) = Σ_{k=0}^{p} a_{p,k} x(n − k),   where a_{p,0} = 1    (4.103)
Computing ep(n) for all the data available, i.e. for n = 0, 1, . . . , N + p − 1, and assuming that the values
of x(n) for n < 0 and for n ≥ N are zero, we get:
e_p(0) = x(0)
e_p(1) = x(1) + a_{p,1}x(0)
e_p(2) = x(2) + a_{p,1}x(1) + a_{p,2}x(0)
    ⋮
e_p(p) = x(p) + a_{p,1}x(p−1) + a_{p,2}x(p−2) + … + a_{p,p}x(0)
    ⋮
e_p(N−1) = x(N−1) + a_{p,1}x(N−2) + a_{p,2}x(N−3) + … + a_{p,p}x(N−p−1)
e_p(N) = a_{p,1}x(N−1) + a_{p,2}x(N−2) + … + a_{p,p}x(N−p)
    ⋮
e_p(N+p−1) = a_{p,p}x(N−1)
    (4.104)
In matrix form, this can be written as:

⎡ e_p(0)      ⎤     ⎡ x(0)                     ⎤ ⎡ 1       ⎤
⎢    ⋮        ⎥     ⎢   ⋮       ⋱              ⎥ ⎢ a_{p,1} ⎥
⎢ e_p(p)      ⎥  =  ⎢ x(p)    ⋯  x(0)          ⎥ ⎢ a_{p,2} ⎥
⎢    ⋮        ⎥     ⎢   ⋮            ⋮         ⎥ ⎢   ⋮     ⎥
⎢ e_p(N−1)    ⎥     ⎢ x(N−1)  ⋯  x(N−p−1)      ⎥ ⎣ a_{p,p} ⎦
⎢    ⋮        ⎥     ⎢           ⋱    ⋮         ⎥
⎣ e_p(N+p−1)  ⎦     ⎣              x(N−1)      ⎦
    (4.105)

e_p = Xa    (4.106)
The forward prediction error energy is the summation of |e_p(n)|² over the whole range of n:

E_p = Σ_n |e_p(n)|² = Σ_n ( Σ_{k=0}^{p} a_{p,k} x(n−k) )²    (4.107)
Minimising E_p with respect to the a_{p,k} results in a set of p equations:

∂E_p/∂a_{p,i} = 0   ⇒   Σ_{k=0}^{p} a_{p,k} ( Σ_n x(n−k) x(n−i) ) = 0,   for 1 ≤ i ≤ p    (4.108)
The minimum error energy, denoted by E_{p,min}, is obtained by expanding Eq. 4.107 and substituting
Eq. 4.108. The result can be shown to be [105]:

E_{p,min} = Σ_{k=0}^{p} a_{p,k} ( Σ_n x(n−k) x(n) )    (4.109)
Using matrix notation for Eqs. 4.108 and 4.109:

X^T X a = [E_{p,min}  0  0  …  0]^T    (4.110)
The matrix X^T X has a Toeplitz structure. In fact, multiplying both sides of Eq. 4.110 by 1/N makes
the equation equivalent to the Yule-Walker equations with the biased autocorrelation estimator given in
Eq. 4.14; hence the name Yule-Walker estimator. The Levinson-Durbin recursion can be used to solve
Eq. 4.110.
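As a concrete illustration (a sketch, not the thesis software; real-valued data and the biased autocorrelation estimator are assumed, and the function names are invented), the Yule-Walker estimator can be implemented as the Levinson-Durbin recursion applied to the estimated autocorrelation sequence:

```python
import random

def autocorr(x, p):
    """Biased autocorrelation estimate r(0)..r(p)."""
    N = len(x)
    return [sum(x[n] * x[n + m] for n in range(N - m)) / N for m in range(p + 1)]

def levinson_durbin(r, p):
    """Solve the Yule-Walker equations; return a_{p,1..p} and the
    prediction error power."""
    a, err = [], r[0]
    for m in range(1, p + 1):
        acc = r[m] + sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = -acc / err                                   # reflection coefficient
        a = [a[i] + k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= (1.0 - k * k)
    return a, err

# sanity check on a simulated AR(2) process
# x(n) = 1.0 x(n-1) - 0.5 x(n-2) + v(n), i.e. a_1 = -1.0, a_2 = 0.5
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(5000):
    x.append(1.0 * x[-1] - 0.5 * x[-2] + rng.gauss(0.0, 1.0))
a_hat, sigma2 = levinson_durbin(autocorr(x[2:], 2), 2)
```

On 5000 simulated samples the estimated feedback coefficients come out close to the true values [−1.0, 0.5].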
However, if we avoid the assumption of zeros for the unknown values of x(n) and restrict the calculation
of e_p(n) to n = p, …, N−1, the matrix X in Eq. 4.110 is replaced by X_cov, defined as one of the
partitions of X:
        ⎡ x(0)                     ⎤       ⎡ X_pre  ⎤
        ⎢   ⋮       ⋱              ⎥       ⎢ ······ ⎥
        ⎢ ························ ⎥       ⎢        ⎥
        ⎢ x(p)    ⋯  x(0)          ⎥       ⎢ X_cov  ⎥
  X  =  ⎢   ⋮            ⋮         ⎥   =   ⎢        ⎥
        ⎢ x(N−1)  ⋯  x(N−p−1)      ⎥       ⎢ ······ ⎥
        ⎢ ························ ⎥       ⎢ X_post ⎥
        ⎢           ⋱    ⋮         ⎥       ⎣        ⎦
        ⎣              x(N−1)      ⎦
    (4.111)
Using the matrix X_cov is essentially equivalent to using the same number of products, N − p, for each
lag in the estimation of the autocorrelation function. Unless the signal is periodic with a period which
is a multiple of N − p, the matrix S = X_{cov}^T X_{cov} will not be Toeplitz. Instead, it is the so-called
sample covariance matrix; hence, the resultant set of equations obtained using Eq. 4.111 is called the
covariance equations. The sample covariance matrix is not always positive definite, and instabilities of
the AR filter may occur. In the case of a periodic signal whose period is a multiple of N − p, the sample
covariance matrix and the true autocorrelation matrix are identical. The covariance estimator is
statistically closer to the MLE estimator than the Yule-Walker estimator, although the latter has a lower
variance, especially if data windowing is applied. It is interesting to note that X_cov can be linked to
non-linear dynamical systems theory if it is seen as a sequence of N − p delay-reconstructed vectors in a
(p + 1)-dimensional embedding space with a one-sample delay [1].
The use of either X_pre or X_post with X_cov leads to the pre-windowed or post-windowed equations
respectively. Algorithms have been developed to solve the covariance equations and the pre-windowed
and post-windowed equations, but none of these perform significantly better than the Yule-Walker
estimator. All the LS methods based on the forward prediction error exhibit line spectral splitting in the
PSD estimator. This phenomenon consists of two peaks appearing in the PSD, very close to each other,
where the real PSD has only one peak. The method that uses the full matrix X of Eq. 4.111 gives the
poorest spectral resolution. Displacement of the peaks and great sensitivity to noise are other common
problems of the LS forward prediction approach. The next sections describe methods based on the
backward as well as the forward prediction error, in an attempt to improve on the results described up to
this point.
Constrained LS of the forward and backward prediction errors
Provided that the process x(n) is stationary the forward and the backward prediction errors are equal in
a statistical sense. Based on this property, Burg [25] proposed the use of both errors in the cost function
for the parameters ap,k, without making any other assumption about the data:
E_p = Σ_{n=p}^{N−1} ( |e_p(n)|² + |b_p(n)|² )

    = Σ_{n=p}^{N−1} ( | Σ_{k=0}^{p} a_{p,k} x(n−k) |²  +  | Σ_{k=0}^{p} a_{p,k} x(n−p+k) |² )    (4.112)
Burg further proposed to minimise the error E_p subject to the constraint that the a_{p,k} parameters
satisfy the Levinson-Durbin recursion:

a_{m,k} = a_{m−1,k} + a_{m,m} a_{m−1,m−k}    (4.113)

for all orders m from 1 to p.
This constraint ensures a stable AR filter. Substituting the recursive expressions for ep(n) and bp(n) shown
in Eq. 4.100, the cost function Ep becomes a function of the reflection coefficient ap,p, defined in §4.6.1,
and the forward and backward prediction errors for the order immediately below, p− 1. Minimising with
respect to ap,p yields:
a_{p,p} = κ_p = −2 Σ_{k=p}^{N−1} b_{p−1}(k−1) e_{p−1}(k)  /  Σ_{k=p}^{N−1} ( |b_{p−1}(k−1)|² + |e_{p−1}(k)|² )    (4.114)
The denominator on the right-hand side of Eq. 4.114 can be found recursively using the relations given
in Eq. 4.100 as:
DEN_p = Σ_{k=p}^{N−1} ( |b_{p−1}(k−1)|² + |e_{p−1}(k)|² )

      = DEN_{p−1}(1 − |κ_{p−1}|²) − |b_{p−1}(N−p)|² − |e_{p−1}(p)|²    (4.115)
The Burg algorithm is then implemented as follows:
1. Initialisation:

   e_0(n) = b_0(n) = x(n)
   DEN_0 = Σ_{k=0}^{N−1} ( |x(k−1)|² + |x(k)|² )

2. Recursion, for m = 1, 2, 3, …, p:

   DEN_m = DEN_{m−1}(1 − |κ_{m−1}|²) − |b_{m−1}(N−m)|² − |e_{m−1}(m)|²
   a_{m,m} = κ_m = ( −2 Σ_{k=m}^{N−1} b_{m−1}(k−1) e_{m−1}(k) ) / DEN_m
   a_{m,k} = a_{m−1,k} + a_{m,m} a_{m−1,m−k},   for k = 1, 2, …, m−1
   e_m(n) = e_{m−1}(n) + κ_m b_{m−1}(n−1)
   b_m(n) = b_{m−1}(n−1) + κ_m e_{m−1}(n)
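A direct transcription of these steps can be sketched as follows (illustrative only, not the thesis software; real-valued data is assumed, and the denominator is computed directly rather than through the DEN recursion):

```python
import random

def burg(x, p):
    """Burg estimate of the feedback coefficients a_{p,1..p}."""
    N = len(x)
    e, b = x[:], x[:]                 # forward/backward errors, order 0
    a = []
    for m in range(1, p + 1):
        num = -2.0 * sum(b[k - 1] * e[k] for k in range(m, N))
        den = sum(b[k - 1] ** 2 + e[k] ** 2 for k in range(m, N))
        k_m = num / den               # reflection coefficient, |k_m| <= 1
        a = [a[i] + k_m * a[m - 2 - i] for i in range(m - 1)] + [k_m]
        e_new, b_new = e[:], b[:]
        for n in range(N - 1, 0, -1): # update error sequences (valid for n >= m)
            e_new[n] = e[n] + k_m * b[n - 1]
            b_new[n] = b[n - 1] + k_m * e[n]
        e, b = e_new, b_new
    return a

# sanity check on a simulated AR(2) process with a_1 = -1.0, a_2 = 0.5
rng = random.Random(1)
x = [0.0, 0.0]
for _ in range(4000):
    x.append(1.0 * x[-1] - 0.5 * x[-2] + rng.gauss(0.0, 1.0))
a_hat = burg(x[2:], 2)
```

Because each κ_m satisfies |κ_m| ≤ 1, the resulting AR filter is guaranteed to be stable.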
As may be noted, the Burg method does not minimise the cost function with respect to all the reflection
coefficients at the same time, but rather minimises the error with respect to the last reflection coefficient
for each model order. This has been pointed out as the cause of the line spectral splitting also observed in
PSD estimators obtained from Burg AR estimates [81]. Several researchers have tried to overcome this
effect by minimising the error function with respect to either all the reflection coefficients or all the
feedback coefficients of the p-order AR filter at the same time (see [81] for a review). The latter approach
removes the constraint (Eq. 4.113) imposed by Burg, and is usually referred to as forward-backward LS.
It requires about 20% more computations than the Burg algorithm, and although it appears to remove the
spectral line splitting [81], the stability of the AR filter is not guaranteed.
Modified Burg algorithm
Narayan and Burg [117] also proposed a solution to the spectrum line splitting problem for quasi-periodic
time series, in a method called covariance Burg. The method requires a priori knowledge of the period of
the signal to estimate the autocorrelation matrix from the sample covariance matrix.
The algorithm is very similar to the Burg method except for a trapezoidal weighting function applied in
the computation of the reflection coefficient:
κ_m = −2 Σ_{k=m}^{N−1} w_m(k) b_{m−1}(k−1) e_{m−1}(k)  /  Σ_{k=m}^{N−1} w_m(k) ( |b_{m−1}(k−1)|² + |e_{m−1}(k)|² )    (4.116)
where w_m(n) is the minimum of n − m + 1, p − m + 1 and N − n:

w_m(n) = min(n − m + 1, p − m + 1, N − n)    (4.117)
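The weighting rises linearly, plateaus at p − m + 1, and falls off again near the end of the record, as this small sketch (illustrative only; the helper is not from the thesis) shows:

```python
# Trapezoidal weighting of Eq. 4.117
def w(n, m, p, N):
    return min(n - m + 1, p - m + 1, N - n)

# for m = 1, p = 3, N = 10 the weights over n = 1..9 form a trapezium
weights = [w(n, 1, 3, 10) for n in range(1, 10)]
# weights == [1, 2, 3, 3, 3, 3, 3, 2, 1]
```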
For high signal-to-noise ratio and periodic data, this estimator will be much closer to the optimum in the
maximum likelihood sense⁸ than the one obtained by the covariance method. However, the algorithm also
works quite well with non-periodic data, although the results are not as close to the MLE optimum as
with periodic data. This surprising feature may be explained by the fact that the method performs a kind
of average over the sample covariance matrices (of dimension (p + 1) × (p + 1)) along the N-point data
segment.
⁸Burg et al. [26] developed an algorithm to find the closest Toeplitz matrix to the sample covariance matrix in the maximum likelihood sense, using the normalised distance measure D_n(S, R) = Trace(R⁻¹S) − ln|R⁻¹S| − (p + 1). The algorithm, called "structured covariance matrices", is an approximation to the MLE and is computationally very expensive.
4.6.3 Sensitivity to additive noise of the AR model PSD estimator
One of the major problems in the use of AR parametric modelling for estimating PSD is the presence of
additive noise in the signal. Although noise has been included in the model to represent the unpredictable
nature of a stochastic signal, the addition of white or coloured noise to the signal will obscure its spectrum
in the PSD estimate of signal plus noise, as shown in Eqs. 4.118 and 4.119 below. Let us
suppose that x(n) is a p-order AR stochastic process contaminated with noise:

x_n(n) = x(n) + ν(n)    (4.118)

Assuming that ν(n) is white uncorrelated noise with variance σ²_ν, the PSD of x_n(n) is:
S_n(ω) = S_x(ω) + σ²_ν

       = σ²_{v,p} / | 1 + Σ_{k=1}^{p} a_{p,k} e^{−jωk} |²  +  σ²_ν

       = ( σ²_{v,p} + σ²_ν | 1 + Σ_{k=1}^{p} a_{p,k} e^{−jωk} |² ) / | 1 + Σ_{k=1}^{p} a_{p,k} e^{−jωk} |²    (4.119)
As can be seen in Eq. 4.119, the process which includes the additive noise is an ARMA process instead of
the original AR process. This distorts the spectrum, with a loss of resolution in the detection of the peaks
in the original process. Additive noise is very common in signal processing as it is the most common
model for sensor noise.
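The loss of contrast can be illustrated numerically (a sketch, not from the thesis; the AR(2) coefficients and noise power are invented). The noise adds a constant σ²_ν to the spectrum, so the ratio between the spectral peak and the valley shrinks:

```python
import cmath

def ar_psd(a, sigma2, omega):
    """PSD of an AR model: sigma2 / |1 + sum_k a_k e^{-j omega k}|^2."""
    h = 1.0 + sum(ak * cmath.exp(-1j * omega * (k + 1)) for k, ak in enumerate(a))
    return sigma2 / abs(h) ** 2

a, sigma2, sigma2_nu = [-1.0, 0.5], 1.0, 2.0   # invented example values
grid = [w / 100.0 for w in range(314)]         # omega in [0, pi)
peak = max(ar_psd(a, sigma2, w) for w in grid)
floor = min(ar_psd(a, sigma2, w) for w in grid)
contrast_clean = peak / floor
contrast_noisy = (peak + sigma2_nu) / (floor + sigma2_nu)
# the additive noise reduces the peak-to-valley contrast of the spectrum
```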
4.7 Modelling the EEG
AR modelling, just like FFT methods, assumes that the signal under analysis is stationary. The EEG can
be considered to be a stochastic process resulting from the summation of numerous sources of electrical
activity. These activities depend on external and internal variables, and on large numbers of inhibitory
and excitatory interactions that can be considered to be random. The central limit theorem states that
the sum of many independent random variables tends to a random variable with a Gaussian distribution.
The EEG sources are not independent, but we may assume them to be statistically independent in origin,
modelling the numerous interactions between neurons as a shaping filter which transforms these sources
into a correlated Gaussian process. Under stable external and internal conditions, we may consider the
EEG process to be stationary and, further, take it to be ergodic.
The question now is: for how long, during continuous recordings, can we assume that the external/internal
conditions remain stable? The answer depends on many factors, but for the kind of data analysed in
this thesis the main factors are the subject's activity and the presence or absence of a pathological
condition. If the subject is healthy and asleep, and if the external conditions are favourable to sleep, then
we can assume that the changes in the statistical properties of the EEG will occur slowly. However, we
know from section 3.1.2 that the sleep EEG may spontaneously present transient waves such as spindles
and vertex waves, which affect the stationarity of the signal even if the brain stays in the same sleep
stage. These events last for about a second. In fact, several authors [14][12] have recommended that the
EEG should be analysed in segments no longer than 1 second to ensure stationarity. However, segments
of 1-s duration may be too short to obtain an accurate autocorrelation estimate. If the subject is awake,
many more factors can influence the patterns in the EEG, some of which are difficult to control under
experimental conditions. For instance, eye closing/opening affects the alpha rhythm in the EEG.
Transitions from alertness to drowsiness bring an important series of changes in the EEG, some of which
may happen very quickly. Varri et al. [170] use EEG segments of variable length, from 0.5 s to 2 s, in
vigilance studies.
Barlow [14] gives a review of methods suitable for detecting EEG non-stationarities. The methods can be
based on fixed or on adaptive intervals. EEG features from a fixed reference window are compared with
the features from the EEG in a moving "test" window, looking for significant changes in the features.
Once a change is detected, a boundary is set at the point where the change occurs, in order to segment
the signal into stationary segments. A new reference window is then placed at the start of the new
segment. The features are usually based on time-domain descriptors, FFT coefficients or AR modelling.
Another method for the analysis of non-stationary EEG is time-varying AR modelling, better known as
Kalman filtering [148]. With this method, the AR parameters are estimated for a short segment at the
beginning of the signal by any conventional algorithm (Yule-Walker, Burg, etc.) and then updated every
sample. The non-stationarities can be tracked from the rate of change of the AR parameters. This method
has been reported to yield better results than FFT or conventional AR modelling in recognising rapid
changes in the frequency of oscillations [148], but the computational cost, and the expansion of the data
instead of its compression, make the method unattractive [14]. Some researchers average the Kalman AR
coefficients over segments of 1 s or more, for smoothing and data compression, but this has the same
effect as using a conventional AR method over the segment. Also, brief disturbances such as artefacts and
transient waves have a lasting effect with Kalman filtering (depending on the gain of the filter), given that
the technique employs information from the recent past to update the current estimate of the model
coefficients. In contrast, in conventional AR modelling, a brief disturbance will only affect the segment
during which it occurs [14].
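The fixed-interval approach eventually adopted in this work (a 3-second analysis window with a 2-second overlap, i.e. a 1-s advance) can be sketched as follows; the sampling rate below is an assumed example value, not one taken from the thesis:

```python
def window_starts(n_samples, fs, win_s=3.0, overlap_s=2.0):
    """Start indices of overlapping fixed-length analysis windows."""
    win = int(win_s * fs)
    step = int((win_s - overlap_s) * fs)   # 1-s advance for 3 s / 2 s overlap
    return list(range(0, n_samples - win + 1, step))

fs = 128                             # assumed sampling rate (Hz)
starts = window_starts(10 * fs, fs)  # 10 s of signal -> 8 full 3-s windows
```

Each window would then be passed to an AR estimator (e.g. Burg) to produce one feature vector per second of EEG.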
Chapter 5
Neural network methods
We have already seen that the EEG signal can be described in terms of its power spectrum or as a set
of filter parameters. These values extracted from the PSD or the AR model are generally called features.
The next step is to find the relationship between the EEG features and the mental state, i.e. deep sleep,
wakefulness, drowsiness, etc. This constitutes a classification task which seeks to partition the input
space into 1-of-K classes. The input space is a mathematical abstraction such that a given set of features
x^o_1, x^o_2, …, x^o_d is assigned to a point x^o in a d-dimensional space with coordinates x_1, x_2, …, x_d.
Classification involves dividing the input space into regions such that points taken from the same region
belong to the same class. The dividing line between two regions is known as a decision boundary.
Classifiers can be divided according to the type of mapping generated. In linear classifiers, the decision
boundaries are hyper-planes (for dimensions greater than three). However, it is well known that real-
world problem datasets show considerable overlap between classes and hence require non-linear decision
boundaries. For the same reason, it is helpful to adopt a probabilistic framework, placing the decision
boundary at the loci for which the probabilities of belonging to either class are equal. There are several
methods for evaluating the posterior probability of belonging to a given class, using either parametric
or non-parametric techniques. With the non-parametric methods, no assumption is made regarding the
probability distribution of the data belonging to each class.
Neural networks are non-linear, non-parametric function approximators which can be used in regression
problems as well as in classification problems. Compared with other types of classifiers, neural networks
offer advantages in problems for which the classification rules are complex and difficult to specify.
Provided that a sufficiently large set of input data is labelled by human experts (called the training set),
a neural network can "learn" the underlying generator of the data and, for a given input, produce an
output in terms of the posterior probabilities of the classes. With new data (or test data), drawn from
the same distribution as the training set, the trained neural network should produce accurate posterior
probabilities of the data belonging to the classes, a property known as generalisation. The set of posterior
probabilities can be fed to a decision-making stage to assign the input to one of the classes. Training a
neural network is a time-consuming task, but once it has been performed, classification is fast and
requires very little computational resource.
[Figure: EEG_n → Feature Extractor → features x_0, …, x_{d−1} → MLP → posterior probabilities P(C_1 | x_n), …, P(C_K | x_n) → Decision Making → classification C_k]

Figure 5.1: The classification process
5.1 Neural Networks
A neural network consists of arrays of interconnected "artificial neurons". The structure of an artificial
neuron is shown in Fig. 5.2. The artificial neuron adds the weighted values of its d inputs x_i, and applies
a non-linear function to this summation in order to produce an output y, whose value is in the range from
0 to 1:
y = g( Σ_{i=0}^{d} w_i x_i )    (5.1)
[Figure: an artificial neuron with inputs x_1, x_2, …, x_d, bias input x_0 = 1, weights w_0, w_1, …, w_d, summation a = Σ w_i x_i and activation function g producing the output y]

Figure 5.2: An artificial neuron
The nonlinear function g(a) is known as the activation function. Several types of activation function can
be used. One such function is the so-called hard-limiter g_h(a), which produces a binary output, 1 or 0,
depending on whether or not the summation exceeds a given threshold, and is defined as:

g_h(a) = { 0  for a < 0,   1  for a ≥ 0 }    (5.2)

where a = Σ_{i=0}^{d} w_i x_i.
The use of the hard-limiter has a physiological background, as it simulates the "all-or-nothing" rule of
real neurons. The weight w_0, associated with the bias input x_0, represents the threshold when g_h(a)
is used, because it sets the minimum value of a for the neuron "to fire".
Other activation functions have continuous outputs between 0 and 1, allowing the output to be
interpreted as a probability. Examples of such functions are the sigmoid function g_σ and the softmax
function g_softmax. The sigmoid function is the hyperbolic tangent function tanh, rescaled to lie
between saturation levels of 0 and 1:
g_σ(a) = 1 / (1 + e^{−a})    (5.3)
This mathematically simple function is widely used in two-class problems [19, pp.231]. The softmax
function, a more generalised form of the sigmoid function also known as the normalised exponential, is
better suited to multiple class problems and will be explained later.
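The relation between the two functions stated above can be checked directly: g_σ(a) = (tanh(a/2) + 1)/2. A small sketch (illustrative only; the function names are invented):

```python
import math

def g_sigma(a):
    """Sigmoid of Eq. 5.3."""
    return 1.0 / (1.0 + math.exp(-a))

def scaled_tanh(a):
    """tanh rescaled from (-1, 1) to saturation levels (0, 1)."""
    return (math.tanh(a / 2.0) + 1.0) / 2.0
```

The two expressions agree to machine precision for any argument a.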
[Figure: the hyperbolic tangent tanh(a), saturating at −1 and 1, and the sigmoid g_σ(a), saturating at 0 and 1 with g_σ(0) = 0.5]

Figure 5.3: Hyperbolic tangent and Sigmoid functions.
The non-linear mapping performed by a neural network can be written as:
y = y(x;w) = G(x;w) (5.4)
where y represents the vector of outputs [y1, . . . , yK ]T , generally representing the probabilities of belong-
ing to class Ck in a classification problem; and w represents the vector of connection weights between the
input nodes and the neurons, between neurons, and between the neurons and the outputs. The process
of finding the weights to perform the mapping correctly is called learning or training the network. During
supervised learning the input patterns or vectors are presented repeatedly to the network, along with
the desired value for the outputs (the target value tk for the kth output). The weights are successively
adjusted in order to minimise a cost function, generally associated with the mean squared error. The
performance of the trained network is then tested on the test set, i.e. a set of patterns not included in
the training set. An over-trained network will fit the noise rather than the data and hence will generalise
poorly.
5.1.1 The error function
The general goal of the neural network is to make the best possible prediction of the target vector t =
[t1, . . . , tK ]T when a new input vector value x is presented. The most general and complete description
of the data is in terms of the joint probability density p(x, t) given by:
p(x, t) = p(t |x)p(x) (5.5)
where p(t | x) represents the probability of t given a particular value of x, and p(x) is the unconditional
probability of x, given by:

p(x) = ∫ p(x, t) dt    (5.6)
The cost function to minimise during training can be arbitrarily defined. A good cost function can be
derived from the likelihood of the training set {x^n, t^n}, which can be written as:

L = Π_n p(x^n, t^n) = Π_n p(t^n | x^n) p(x^n)    (5.7)
For optimisation purposes, it is simpler to take the negative logarithm of the likelihood:

E = −ln L = −Σ_n ln p(t^n | x^n) − Σ_n ln p(x^n)    (5.8)

where E defines a cost function, usually called the error function. The second term on the right-hand side
of Eq. 5.8 can be omitted as it does not depend on the parameters of the neural network, so:

E = −Σ_n ln p(t^n | x^n)    (5.9)
Sum of squares function
If we assume the target variables t_k to be continuous, with independent zero-mean Gaussian
distributions, we can write:

p(t | x) = Π_{k=1}^{K} p(t_k | x)    (5.10)

Furthermore, let us assume that the t_k's are given by some deterministic function of x with added
Gaussian noise ε:

t_k = h_k(x) + ε_k    (5.11)

where the noise distribution is given by:

p(ε_k) = (1/√(2πσ²)) exp( −ε²_k / 2σ² )    (5.12)
As the training process seeks to model the functions h_k(x) with y_k(x; w), we can use the latter in
Eq. 5.11 and substitute ε_k in Eq. 5.12 to give:

p(t_k | x) = (1/√(2πσ²)) exp( −{y_k(x; w) − t_k}² / 2σ² )    (5.13)
Combining Eq. 5.13 and Eq. 5.10 in the expression for the error in Eq. 5.9:

E = −Σ_n ln Π_{k=1}^{K} p(t_k | x)

  = −Σ_{n=1}^{N} Σ_{k=1}^{K} ln [ (1/√(2πσ²)) exp( −{y_k(x; w) − t_k}² / 2σ² ) ]

  = (1/2σ²) Σ_{n=1}^{N} Σ_{k=1}^{K} {y_k(x; w) − t_k}²  +  NK ln σ  +  (NK/2) ln(2π)    (5.14)
where N is the number of input patterns used during the training process. Note that the last two terms
of Eq. 5.14 do not depend on the weights w, so they can be omitted, as can the factor 1/σ² in the first
term. Thus the error function E ends up as:

E = (1/2) Σ_n Σ_k {y_k(x^n; w) − t^n_k}²  =  (1/2) Σ_n ‖ y(x^n; w) − t^n ‖²    (5.15)
The error function in Eq. 5.15 is called the sum-of-squares function. It reduces the optimisation process
to a least-squares procedure. Its use is not restricted to Gaussian-distributed target data, and the sum of
the outputs equals unity (very convenient if we want to interpret the outputs as probabilities); however,
the results cannot distinguish between the true distribution and any other distribution having the same
mean and variance.
Cross-Entropy function
In a classification problem, the target data represent discrete class labels; therefore a more convenient
code for the target data is the "1-of-K" scheme:

t^n_k = δ_{kℓ},   for x^n ∈ C_ℓ    (5.16)

where δ_{kℓ} is the Kronecker delta, which is 1 for k = ℓ and 0 otherwise.
The output is meant to represent the posterior probability of class membership:

y_ℓ = P(C_ℓ | x)    (5.17)

therefore we can write p(t_ℓ | x) = (y_ℓ)^{t_ℓ} and, more generally, assuming that the distributions
p(t^n | x^n) are statistically independent:

p(t^n | x^n) = Π_{k=1}^{K} (y^n_k)^{t^n_k}    (5.18)
Substituting Eq. 5.18 into Eq. 5.9 for the log-likelihood error function:

E = −Σ_n Σ_{k=1}^{K} t^n_k ln y^n_k    (5.19)
This error function has an absolute minimum with respect to the y_k's when y^n_k = t^n_k for all k and n.
At the minimum, E is:

E_min = −Σ_n Σ_{k=1}^{K} t^n_k ln t^n_k    (5.20)
If t_k takes only the values 0 or 1, this minimum is equal to zero, but if t_k is a continuous variable in the
range (0, 1) this minimum does not necessarily reach zero. In fact, it represents the cross entropy [19,
pp.244] between the distributions of the target and the output. Hence, this error function, derived from
the maximum likelihood criterion for a 1-of-K target coding, is called the cross-entropy error function.
To ensure a zero value at the minimum, the value E_min is subtracted from the error function in Eq. 5.19,
giving this modified error:

E = −Σ_n Σ_{k=1}^{K} t^n_k ln( y^n_k / t^n_k )    (5.21)

which is non-negative and equals zero when y^n_k = t^n_k for all k and n.
The cross-entropy error function has some advantages over the sum-of-squares error function. Firstly, it
can be proved [19, pp.235-6] that for an infinitely large data set the outputs y_k are exactly the posterior
probabilities P(C_k | x), and are therefore limited to the (0, 1) range. Secondly, it performs better at
estimating small probabilities. Indeed, if we denote the error at the output y^n_k as ε^n_k, then the
cross-entropy error is:
E = −Σ_n Σ_{k=1}^{K} t^n_k ln( (t^n_k + ε^n_k) / t^n_k )  =  −Σ_n Σ_{k=1}^{K} t^n_k ln( 1 + ε^n_k / t^n_k )    (5.22)
It is clear from Eq. 5.22 that the cross-entropy error function depends on the relative errors of the neural
network outputs, in contrast with the sum-of-squares function, which depends on the squares of the
absolute errors. Therefore, minimisation of the cross-entropy error will tend to give similar relative
errors on both small and large probabilities, while the sum-of-squares error tends to give similar absolute
errors for each pattern, resulting in large relative errors for small output values.
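This difference can be made concrete with a small numerical sketch (the probability values are invented): the same absolute error of 0.01, placed on a small target probability or on a large one, yields identical sum-of-squares errors but a larger cross-entropy error in the first case:

```python
import math

def cross_entropy(y, t):
    """Modified cross-entropy of Eq. 5.21, for a single pattern."""
    return -sum(tk * math.log(yk / tk) for yk, tk in zip(y, t))

def sum_of_squares(y, t):
    """Sum-of-squares error of Eq. 5.15, for a single pattern."""
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))

t = [0.05, 0.95]
y_small = [0.04, 0.95]   # absolute error 0.01 on the small probability
y_large = [0.05, 0.94]   # the same absolute error on the large probability
```

The cross-entropy of a perfect output is exactly zero, and the error placed on the small probability (a 20% relative error) is penalised more than the same absolute error on the large probability (about a 1% relative error).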
Cross-entropy error for a two-class problem For K > 2 it is desirable to have one output per class, so
that each output represents the posterior probability of belonging to one of the classes, but for a 2-class
problem, only one output representing one class is necessary as the probability for the other class can
be found by subtracting the output value from 1. This causes a few changes in the cross-entropy error
function, which will be reviewed here briefly.
Assigning y = P(C_1 | x), the conditional probability p(t | x) can be written as:

p(t | x) = y^t (1 − y)^{1−t}    (5.23)
where t takes the value 1 if x ∈ C_1 and 0 if x ∈ C_2. The cross-entropy error takes the form:

E = −Σ_n { t^n ln y^n + (1 − t^n) ln(1 − y^n) }    (5.24)

Differentiating with respect to y^n:

∂E/∂y^n = (y^n − t^n) / ( y^n (1 − y^n) )    (5.25)
It is easy to see from Eq. 5.25 that the cross-entropy function for a 2-class problem has an absolute
minimum when y^n = t^n for all n. Again, it is convenient to subtract the minimum from the expression
in Eq. 5.24:

E = −Σ_n { t^n ln( y^n / t^n ) + (1 − t^n) ln( (1 − y^n) / (1 − t^n) ) }    (5.26)
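The derivative in Eq. 5.25 is easy to verify numerically for a single pattern; the following sketch (the values are invented for illustration) compares a central finite difference of Eq. 5.24 with the closed form:

```python
import math

def e_binary(y, t):
    """Two-class cross-entropy of Eq. 5.24, for a single pattern."""
    return -(t * math.log(y) + (1.0 - t) * math.log(1.0 - y))

def grad_binary(y, t):
    """Closed-form derivative of Eq. 5.25."""
    return (y - t) / (y * (1.0 - y))

y, t, h = 0.3, 1.0, 1e-6
numeric = (e_binary(y + h, t) - e_binary(y - h, t)) / (2.0 * h)
# numeric agrees with grad_binary(y, t) to high precision
```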
5.1.2 The decision-making stage
To arrive at a classification from the posterior probabilities evaluated at the outputs of the neural network,
the minimum error-rate criterion is usually adopted. To minimise the probability of misclassification, a
new input should be assigned to the class having the largest posterior probability. Several aspects should
be taken into account when training in real-world problems. Firstly, the neural network is trained to
estimate the posterior probabilities of class membership based on the assumption that the input data
has been drawn from the same data distribution as the training set. Hence the output k represents
P(C_k | x, D), where D = {x^n, t^n} is the training set data. Secondly, the proportion of data from each
class in the training set reflects the prior probabilities of the classes. From Bayes' theorem:

P(C_k | x) = p(x | C_k) P(C_k) / p(x)    (5.27)
⇒ p(x | C_k) P(C_k) = P(C_k | x) p(x)    (5.28)

Integrating both sides of Eq. 5.28 yields:

∫ p(x | C_k) P(C_k) dx = ∫ P(C_k | x) p(x) dx    (5.29)

⇒ P(C_k) = ∫ P(C_k | x) p(x) dx    (5.30)
Approximating the integral on the right-hand side of Eq. 5.30 by an average over the training patterns,
which are drawn from p(x):

P(C_k) = ∫ P(C_k | x) p(x) dx ≈ (1/N) Σ_{n=1}^{N} P(C_k | x^n)    (5.31)
Thus, the prior probabilities are approximated as the average of each neural network output over all the
patterns in the training set. Hence, the prior probability P(C_k) should determine the proportion of the
patterns belonging to class C_k in the training set. In some cases this can be problematic, for instance if
the class with the maximum risk of misclassification is very scarce (e.g. when diagnosing a fault or a
disease). In such cases, it is desirable to include as many patterns from the high-risk class as from the
other classes in the training set. Compensation for the different prior probabilities can then easily be
performed by multiplying each output by the ratio of the "true" prior probability to the prior in the
training set, and normalising the corrected outputs so that they sum to unity.
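The correction can be sketched in a few lines (the priors and output values below are invented for illustration): each output is multiplied by the ratio of the "true" prior to the training-set prior, and the results are renormalised to sum to unity:

```python
def compensate_priors(outputs, train_priors, true_priors):
    """Rescale network outputs for a change of prior probabilities."""
    scaled = [y * pt / ptr
              for y, ptr, pt in zip(outputs, train_priors, true_priors)]
    total = sum(scaled)
    return [s / total for s in scaled]

# trained on balanced classes, but the "disease" class is rare in reality
posterior = compensate_priors([0.6, 0.4], [0.5, 0.5], [0.05, 0.95])
```

Even though the raw network output favours the first class, the corrected posteriors reflect its low prior probability.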
5.1.3 Multi-layer perceptrons
A neural network with its neurons arranged in layers is called a perceptron. A single layer of neurons is
therefore called a single-layer perceptron. It has inputs whose values x_1, x_2, …, x_d are written as a
feature vector x. Given that there are no connections from the outputs back to the inputs, a single-layer
perceptron is a feed-forward neural network. It is also a linear classifier, as it partitions the input space
with hyper-planes. The perceptron learning rule [159, pp.11] is guaranteed to find a solution with a
single layer if the input feature vectors are linearly separable.
If more complex decision boundaries are required, two or more layers of neurons should be used. Such
a network is known as a multi-layer perceptron (MLP). Fig. 5.4 shows an I−J−K (2-layer) MLP, with
I-dimensional input patterns, J hidden units z_j, and K outputs y_k. The neurons in the output layer are
simply called "outputs", while the neurons in the intermediate layer are called "hidden units". When
necessary, the superscripts z or y will be used to distinguish the weights w, or the activation-function
inputs a, of the hidden layer from those of the output layer.
It can be shown that a 2-layer perceptron with smooth nonlinearities is able to approximate any arbitrary
function [159, pp.16]. However, the decision boundaries will not be abrupt, as with the hard-limiter
perceptron, but smooth and continuous instead. The approximation accuracy will then depend on the
number of units in the hidden layer. A low number of hidden units will give an insufficiently complex
model for the given problem, while a large number of hidden units will result in an over-fitted model. An
over-trained or over-fitted network is a disadvantage in real-world problems, since most real-world data
is very noisy.

[Figure: a 2-layer feed-forward network with inputs x_1, …, x_I (bias x_0 = 1), hidden units z_1, …, z_J (bias z_0 = 1), weights w_{ij} between inputs and hidden units and w_{jk} between hidden units and the outputs y_1, …, y_K]

Figure 5.4: An I−J−K neural network.
Training an MLP: the error backpropagation algorithm
The training algorithm that underpins the use of multi-layer perceptrons is the so-called error backpropagation algorithm. It uses error gradient information to seek a minimum of the error function. In order to apply this algorithm to an MLP it is necessary to use continuous and differentiable activation functions.
The activation function for the hidden units does not necessarily have to be the same as for the outputs.
Hyperbolic tangent or sigmoid functions are usually chosen as the non-linearity for the hidden units.
If probabilities are to be represented at the outputs, then these units have to be restricted to the [0,1]
range. Hence, the sigmoid function for a 2-class problem, or its generalisation, the softmax function for
a K-class problem (K > 2) are recommended for the outputs. The softmax function is defined as:
\[ g_{\mathrm{softmax}}(a_k) = \frac{e^{a_k}}{\sum_{k'} e^{a_{k'}}} \qquad (5.32) \]
where a_k = \sum_{j=1}^{J} w_{jk} z_j is the weighted sum of the inputs to the kth output neuron and k' is the index of the summation over all the K neurons in the output layer.
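A minimal sketch of Eq. 5.32; subtracting the maximum activation before exponentiating is a standard numerical safeguard (not discussed in the text) that leaves the result unchanged:

```python
import numpy as np

def softmax(a):
    """Softmax activation (Eq. 5.32), computed in a numerically stable way."""
    e = np.exp(a - np.max(a))   # shifting by max(a) avoids overflow
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])
y = softmax(a)
```

The outputs are positive, sum to unity, and are invariant to adding a constant to all activations.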
Let us assume that a neural network is to be trained to solve a classification problem for K > 2 mutually exclusive classes, with a training set of input patterns x^n, n = 1, . . . , N, each represented by I feature values, x^n = [x^n_1, . . . , x^n_I]^T, and a class membership target vector t^n = [t^n_1, . . . , t^n_K]^T coded with a 1-of-K scheme. Assume that the network is a 2-layer MLP with J hidden units z_j with a sigmoidal activation function, and one output per class y_k with a softmax activation function. We would like to find the values of all the weights in the neural network, i.e. the vector w that minimises the error function E(x^n; t^n; w).
The gradient of the error, given by the vector ∇_w E, points in the opposite direction to that of the steepest descent of the error function in weight space. It can, therefore, be used in the search for the minimum of the error function in weight space, by recursive updating of the weights:
\[ w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)} = w^{(\tau)} - \eta \nabla_w E^{(\tau)} \qquad (5.33) \]
where η is called the learning rate and τ denotes the iteration number. Expressing Eq. 5.33 for each weight gives:
\[ w_i^{(\tau+1)} = w_i^{(\tau)} + \Delta w_i^{(\tau)} = w_i^{(\tau)} - \eta \frac{\partial E^{(\tau)}}{\partial w_i} \qquad (5.34) \]
For the reasons given in section 5.1.1 the cross-entropy error function is chosen to optimise the network
parameters. This error function is now written as a function of the weight vector:
\[ E(w) = \sum_n E^n(w) = -\sum_n \sum_{k=1}^{K} t_k^n \ln \frac{y_k^n(w)}{t_k^n} \]
The derivatives of the cross-entropy error function with respect to the weights of the neural network can
easily be found by propagating back the error at the outputs towards the hidden and input layers as will
be shown below.
Derivatives of E with respect to the hidden-to-output weights
The output units' activation function (the softmax g(·) of Eq. 5.32) includes in its denominator the inputs a^y_k for all the outputs y_k, so the weight w^y_{jk} affects all the outputs. Therefore, all of the outputs should be considered when differentiating the error for pattern n with respect to the output weights w^y_{jk}:
\[ \frac{\partial E^n}{\partial w^y_{jk}} = \sum_{k'=1}^{K} \frac{\partial E^n}{\partial y^n_{k'}} \frac{\partial y^n_{k'}}{\partial a^{y,n}_k} \frac{\partial a^{y,n}_k}{\partial w^y_{jk}} \qquad (5.35) \]
The first partial derivative on the right-hand side of Eq. 5.35 is:
\[ \frac{\partial E^n}{\partial y^n_{k'}} = -\frac{t^n_{k'}}{y^n_{k'}} \qquad (5.36) \]
The second partial derivative can be found from Eq. 5.32:
\[ \frac{\partial y^n_{k'}}{\partial a^{y,n}_k} = y^n_{k'}\delta_{kk'} - y^n_{k'} y^n_k \qquad (5.37) \]
and the last derivative in the chain:
\[ \frac{\partial a^{y,n}_k}{\partial w^y_{jk}} = z^n_j \qquad (5.38) \]
which does not depend on k'. Combining the first two partial derivatives and summing over k':
\[ \frac{\partial E^n}{\partial a^{y,n}_k} = \sum_{k'=1}^{K} \frac{\partial E^n}{\partial y^n_{k'}} \frac{\partial y^n_{k'}}{\partial a^{y,n}_k} = y^n_k - t^n_k \qquad (5.39) \]
Then, substituting Eqs. 5.39 and 5.38 into Eq. 5.35 gives:
\[ \frac{\partial E^n}{\partial w^y_{jk}} = \delta^{y,n}_k z^n_j \qquad (5.40) \]
where δ^y_k = y_k − t_k.
Derivatives of E with respect to the input-to-hidden weights
We follow a similar procedure to find the derivatives of the error function with respect to the hidden-layer weights w^z_{ij}, this time noting that the sigmoid activation function of the unit z_j only depends on the inputs to this unit:
\[ \frac{\partial E^n}{\partial w^z_{ij}} = \sum_{k'=1}^{K} \frac{\partial E^n}{\partial y^n_{k'}} \sum_{k=1}^{K} \frac{\partial y^n_{k'}}{\partial a^{y,n}_k} \frac{\partial a^{y,n}_k}{\partial z^n_j} \frac{\partial z^n_j}{\partial a^{z,n}_j} \frac{\partial a^{z,n}_j}{\partial w^z_{ij}} \qquad (5.41) \]
The first two derivatives in Eq. 5.41 have already been found above, and their combination is denoted by δ^{y,n}_k. The third derivative is:
\[ \frac{\partial a^{y,n}_k}{\partial z^n_j} = w^y_{jk} \qquad (5.42) \]
Using Eq. 5.3:
\[ \frac{\partial z^n_j}{\partial a^{z,n}_j} = z^n_j (1 - z^n_j) \qquad (5.43) \]
The last derivative of the chain is:
\[ \frac{\partial a^{z,n}_j}{\partial w^z_{ij}} = x^n_i \qquad (5.44) \]
Combining all of these derivatives according to Eq. 5.41 gives:
\[ \frac{\partial E^n}{\partial w^z_{ij}} = z^n_j (1 - z^n_j)\, x^n_i \sum_{k=1}^{K} \delta^{y,n}_k w^y_{jk} \qquad (5.45) \]
\[ \frac{\partial E^n}{\partial w^z_{ij}} = \delta^{z,n}_j x^n_i \qquad (5.46) \]
where δ^{z,n}_j = z^n_j (1 − z^n_j) Σ_k δ^{y,n}_k w^y_{jk}. As can be seen in the above equation, the weight update for the input-to-hidden weights depends on the weight update for the hidden-to-output weights. The weight errors (the δ's) are propagated backwards to the preceding layer, hence the name given to the algorithm.
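The derivation above can be sketched numerically for one pattern. This is an illustrative implementation, not the thesis software; the network sizes and random weights are arbitrary, and the backpropagated derivative is checked against a central-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 4, 3, 2                                   # inputs, hidden units, outputs

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, Wz, Wy):
    """Wz is J x (I+1), Wy is K x (J+1); the last column of each holds the bias weights."""
    z = sigmoid(Wz @ np.append(x, 1.0))
    y = softmax(Wy @ np.append(z, 1.0))
    return z, y

def gradients(x, t, Wz, Wy):
    z, y = forward(x, Wz, Wy)
    delta_y = y - t                                  # output "errors" (Eq. 5.40)
    dWy = np.outer(delta_y, np.append(z, 1.0))
    delta_z = z * (1 - z) * (Wy[:, :J].T @ delta_y)  # hidden "errors" (Eqs. 5.45-5.46)
    dWz = np.outer(delta_z, np.append(x, 1.0))
    return dWz, dWy

def cross_entropy(x, t, Wz, Wy):
    _, y = forward(x, Wz, Wy)
    return -np.sum(t * np.log(y))

Wz = rng.normal(scale=0.5, size=(J, I + 1))
Wy = rng.normal(scale=0.5, size=(K, J + 1))
x = rng.normal(size=I)
t = np.array([1.0, 0.0])                             # 1-of-K target

dWz, dWy = gradients(x, t, Wz, Wy)

# Sanity check: compare one backpropagated derivative with a central difference.
eps = 1e-6
Wy_p, Wy_m = Wy.copy(), Wy.copy()
Wy_p[0, 0] += eps
Wy_m[0, 0] -= eps
numeric = (cross_entropy(x, t, Wz, Wy_p) - cross_entropy(x, t, Wz, Wy_m)) / (2 * eps)
```

The backpropagated derivative and the finite-difference estimate agree to numerical precision, which is a standard way of verifying a backpropagation implementation.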
Backpropagation for the two-class problem
As we have seen in section 5.1.1, for a 2-class problem only one output is needed, in which case the cross-
entropy error function is slightly different and gives rise to a modified expression for the backpropagation
“errors”. Also, as there is only one output, the sigmoid activation function is used for all units in the
network.
The part of the error function that depends on the weights is:
\[ E(w) = -\sum_n \left\{ t^n \ln y^n(w) + (1 - t^n) \ln(1 - y^n(w)) \right\} \]
Differentiating with respect to the hidden-to-output weights w_j:
\[ \frac{\partial E^n}{\partial w_j} = \frac{\partial E^n}{\partial y^n} \frac{\partial y^n}{\partial a^{y,n}} \frac{\partial a^{y,n}}{\partial w_j} \qquad (5.47) \]
we find that:
\[ \frac{\partial E^n}{\partial y^n} = \frac{y^n - t^n}{y^n (1 - y^n)} \qquad (5.48) \]
\[ \frac{\partial y^n}{\partial a^{y,n}} = y^n (1 - y^n) \qquad (5.49) \]
\[ \frac{\partial a^{y,n}}{\partial w_j} = z^n_j \qquad (5.50) \]
Then we get:
\[ \frac{\partial E^n}{\partial w_j} = \delta^{y,n} z^n_j \qquad (5.51) \]
where δ^y = y − t. This result is exactly the same as for the multiple-class problem.
To find the derivatives of E with respect to the input-to-hidden weights:
\[ \frac{\partial E^n}{\partial w_{ij}} = \frac{\partial E^n}{\partial y^n} \frac{\partial y^n}{\partial a^{y,n}} \frac{\partial a^{y,n}}{\partial z^n_j} \frac{\partial z^n_j}{\partial a^{z,n}_j} \frac{\partial a^{z,n}_j}{\partial w_{ij}} \qquad (5.52) \]
we find that the partial derivatives in the chain are:
\[ \frac{\partial E^n}{\partial y^n} \frac{\partial y^n}{\partial a^{y,n}} = y^n - t^n \qquad (5.53) \]
\[ \frac{\partial a^{y,n}}{\partial z^n_j} = w_j \qquad (5.54) \]
\[ \frac{\partial z^n_j}{\partial a^{z,n}_j} = z^n_j (1 - z^n_j) \qquad (5.55) \]
\[ \frac{\partial a^{z,n}_j}{\partial w_{ij}} = x^n_i \qquad (5.56) \]
Substituting all of them into Eq. 5.52 gives:
\[ \frac{\partial E^n}{\partial w_{ij}} = \delta^{y,n} w_j z^n_j (1 - z^n_j)\, x^n_i \qquad (5.57) \]
\[ \frac{\partial E^n}{\partial w_{ij}} = \delta^{z,n}_j x^n_i \qquad (5.58) \]
where δ^{z,n}_j = δ^{y,n} w_j z^n_j (1 − z^n_j). This result is also exactly the same as for the multiple-class problem with K = 1. This is very convenient because it makes the backpropagation of errors independent of the number of classes in the problem.
5.2 Optimisation algorithms
5.2.1 Gradient descent
As mentioned above, the training of an MLP is performed by minimising the error function E(w) in the weight space W, formed by all the weights in the network, using the steepest descent method, also called gradient descent. This algorithm can be applied in batch fashion or sequentially. The batch version averages the Δw^n over all the patterns and then updates the weights. The sequential version updates the weights after each pattern is presented to the network. In either version a suitable value for the learning rate η needs to be selected. A range for η can be found by using a quadratic approximation of the error function around the minimum at w*:
\[ E(w) \approx E(w^*) + \tfrac{1}{2}(w - w^*)^T H (w - w^*) \qquad (5.59) \]
where H is the Hessian matrix of the error function, with elements H_{ij} = ∂²E/∂w_i∂w_j evaluated at w*, for i, j = 1, 2, . . . , W, and W is the total number of weights.
The gradient of this approximation is:
\[ \nabla_w E = H(w - w^*) \qquad (5.60) \]
The eigenvalue equation for the Hessian matrix is:
\[ H u_i = \lambda_i u_i \qquad (5.61) \]
where the eigenvectors u_i can be used as a basis in W, so we can write:
\[ w - w^* = \sum_i \alpha_i u_i \qquad (5.62) \]
where α_i can be interpreted as the distance from the minimum in the u_i direction. Then, the gradient approximation can be written in terms of the eigenvectors of H:
\[ \nabla_w E = \sum_i \alpha_i \lambda_i u_i \qquad (5.63) \]
and so can the difference between the weights at two consecutive iterations of the algorithm:
\[ w^{(\tau+1)} - w^{(\tau)} = \sum_i \left( \alpha_i^{(\tau+1)} - \alpha_i^{(\tau)} \right) u_i = \sum_i \Delta\alpha_i u_i \qquad (5.64) \]
But since Δw = −η∇_w E^{(τ)}, then:
\[ \sum_i \Delta\alpha_i u_i = -\eta \sum_i \alpha_i^{(\tau)} \lambda_i u_i \;\Rightarrow\; \alpha_i^{(\tau+1)} = (1 - \eta\lambda_i)\, \alpha_i^{(\tau)} \qquad (5.65) \]
After τ_f steps from a starting point w_0, with coefficients α_i^{(0)}:
\[ \alpha_i^{(\tau_f)} = (1 - \eta\lambda_i)^{\tau_f} \alpha_i^{(0)} \qquad (5.66) \]
To reach the minimum, α_i should tend to zero as τ_f increases. The condition on η and the λ_i's is therefore:
\[ |1 - \eta\lambda_i| < 1 \;\Rightarrow\; 0 < \eta\lambda_i < 2 \qquad (5.67) \]
for i = 1, 2, . . . , W.
It can be proved that if λ_i > 0 for all i, the minimum at w* is a global minimum. This is true for a positive definite Hessian matrix. In this case, the condition in Eq. 5.67 gives the following range for η:
\[ 0 < \eta < \frac{2}{\lambda_{\max}} \qquad (5.68) \]
Note that in Eq. 5.66 the contraction factor is constant around the minimum, imposing linear convergence towards the minimum. The convergence speed will be dominated by the minimum eigenvalue, so taking the maximum value allowed for η, we find that the slowest contraction factor is:
\[ 1 - \frac{2\lambda_{\min}}{\lambda_{\max}} \qquad (5.69) \]
The ratio λ_min/λ_max is the reciprocal of the condition number of the Hessian matrix. If this ratio is very small (i.e. the curvature of the error function around the minimum differs greatly between directions), convergence will be extremely slow. This problem can be alleviated by increasing the effective step size, which is achieved by adding an extra term to the weight-update equation.
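The stability condition of Eq. 5.67 can be illustrated on a toy quadratic error with a diagonal Hessian (the matrix and learning rates below are made up for illustration):

```python
import numpy as np

H = np.array([[3.0, 0.0],
              [0.0, 0.5]])              # diagonal Hessian: eigenvalues 3.0 and 0.5
lam_max = 3.0

def gradient_descent(eta, steps=200):
    w = np.array([1.0, 1.0])            # start away from the minimum at w* = 0
    for _ in range(steps):
        w = w - eta * (H @ w)           # gradient of E(w) = 0.5 w^T H w is H w
    return w

w_stable = gradient_descent(eta=0.9 * 2 / lam_max)    # inside 0 < eta < 2/lam_max
w_unstable = gradient_descent(eta=1.1 * 2 / lam_max)  # violates Eq. 5.67
```

With η inside the allowed range every coefficient contracts by |1 − ηλ_i| < 1 per step; just outside it, the component along the largest eigenvalue grows without bound.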
Momentum
Adding a term proportional to the previous change in the weight vector to the weight-update equation may speed up the convergence of w and smooth the oscillations:
\[ \Delta w^{(\tau)} = -\eta \nabla_w E\big|_{w^{(\tau)}} + \mu \Delta w^{(\tau-1)} \qquad (5.70) \]
where μ is called the momentum parameter. If the momentum rate is in the open interval (0, 1), the effect of adding momentum to the weight update in low-curvature error surfaces is an increase in the effective learning rate by the factor:
\[ \frac{1}{1 - \mu} \qquad (5.71) \]
However, in regions of large curvature the momentum term loses its effectiveness, and oscillations around the minimum generally occur. In fact, the gradient descent rule used in the backpropagation algorithm searches for the minimum very inefficiently because, in practice, the error gradient does not point towards the minimum most of the time, causing oscillations. Another disadvantage of this method is the introduction of two parameters, η and μ, for which there are no formal selection criteria.
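A small sketch of Eq. 5.70 on an ill-conditioned quadratic (all values chosen for illustration) shows the speed-up that momentum gives along low-curvature directions:

```python
import numpy as np

H = np.array([[3.0, 0.0],
              [0.0, 0.1]])              # ill-conditioned quadratic E(w) = 0.5 w^T H w

def descend(eta, mu, steps=300):
    w = np.array([1.0, 1.0])
    dw = np.zeros(2)
    for _ in range(steps):
        dw = -eta * (H @ w) + mu * dw   # Eq. 5.70
        w = w + dw
    return w

w_plain = descend(eta=0.3, mu=0.0)
w_momentum = descend(eta=0.3, mu=0.9)
```

After the same number of steps, the run with momentum is much closer to the minimum at the origin, consistent with the effective learning-rate factor 1/(1 − μ) in the slowly converging direction.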
5.2.2 Conjugate gradient
If, instead of moving w a fixed distance along the negative gradient, we search in this direction until we find the minimum of E(w) and take that point as the new weight vector, the step size becomes optimal in the search direction. At the new point, the component of the error gradient along the search direction vanishes. If, additionally, we choose each new search direction so that it does not "spoil" the minimisation achieved in the previous direction, i.e. so that the projection of ∇_w E onto the previous direction remains null, minimise again in this new direction, and repeat the procedure successively, we will, after W steps, reach the minimum w*. The set of non-interfering (or conjugate) directions can be found without any need for extra parameters, as is shown in Appendix B, where this method is described in detail. This represents a definite improvement over the gradient descent method, even if in practice convergence is achieved in more than W steps for general non-linear error functions.
As already stated above, this algorithm has no unspecified parameters and in general converges much faster than gradient descent, but it requires the computation of the first- and second-order partial derivatives of the error function with respect to the weights. The backpropagation formulae for the first-order derivatives still apply. To avoid the use of the Hessian in the computation of α_j, a numerical line-search procedure, or central differences, can be used as an approximation. The latter is used in a modification of this algorithm called the scaled conjugate gradient algorithm, which is described next.
Scaled conjugate gradient
Apart from using an approximation to avoid the calculation of the Hessian, this algorithm overcomes the other two major drawbacks of the conjugate gradient algorithm that arise when the error is far from being quadratic. A technique called the model trust region, based on a quadratic approximation of E, can be applied to make sure that every step leads to a lower error even when the Hessian matrix is not positive definite (otherwise the error may increase with the step). Also, the quality of the quadratic approximation is tested at every step to adjust the parameter of the model trust method. The scaled conjugate gradient method is described in Appendix B.
5.3 Model order selection and generalisation
As stated in section 5.1.3, the number of hidden units J in an MLP determines the accuracy and degree of generalisation that the neural network can achieve. As with any regression problem, too few free parameters will fail to fit the function properly, while too many parameters will over-fit the noisy data. A compromise between accuracy and generalisation has to be found. This can be compared to the trade-off between the bias and the variance of the network. A neural network with zero bias produces zero error on average over all the possible sets of patterns drawn from the same distribution as the training data. Even in this case, if the neural network has a marked sensitivity to a particular set of patterns, then the variance of the network is high. Unfortunately there is no formal means of relating the number of hidden units to the bias and variance of the network. Roughly, the number of hidden units should be close to the geometric mean of the number of inputs I and the number of outputs K:
\[ J = \sqrt{IK} \qquad (5.72) \]
One of the simplest ways to find the optimum J, although very demanding in computational resources, is to train a set of neural networks with a range of numbers of hidden units around the estimate given in Eq. 5.72. Because the error-optimisation algorithm can get stuck in a local minimum, several random weight initialisations should be tried in order to increase the probability of finding a good minimum. The optimum is then selected based on the performance on an independent set of labelled input patterns, known as the validation set.
Another way to “optimise” the number of free parameters in the network is to set a sensibly high value of
J and then penalise the least relevant weights in the network with an appropriate cost function during
training. This method is called regularisation and it will be explained next.
5.3.1 Regularisation
Regularisation is a common technique in regression theory which aims to encourage a smoother fit through the inclusion of a penalty term Ω in the error function:
\[ \tilde{E} = E + \nu \Omega \qquad (5.73) \]
where ν is a control parameter for the penalty term Ω. The penalty function should be such that a good fit produces a small error E, while a smooth fit produces a small value of Ω.
It is well known heuristically that an over-fitted mapping produces large weight values, whereas small weight values drive the activation units mostly into the linear region of the activation function, producing an approximately linear mapping, which is the smoothest possible. Therefore, a good function for Ω is one that increases as the magnitude of the weights increases. The simplest of these is the sum-of-squares, commonly called weight decay:
\[ \Omega = \tfrac{1}{2} \sum_i w_i^2 \qquad (5.74) \]
If a gradient descent procedure is applied to optimise the modified error function Ẽ, whose gradient acquires a term proportional to the weights:
\[ \nabla_w \tilde{E} = \nabla_w E + \nu w \qquad (5.75) \]
then the variation of the weights in "time" due to the penalty term alone can be seen as:
\[ \frac{dw}{d\tau} = -\eta\nu w \;\Rightarrow\; w(\tau) = w(0)\, e^{-\eta\nu\tau} \qquad (5.76) \]
which shows how, as a result solely of the influence of the penalty term, all the weights "decay" exponentially towards zero during training. It can easily be shown, using a second-order approximation of the error function, that the components of the weight vector along the directions of lowest curvature of E in weight space are the most penalised by the regularisation term. This can be expressed as:
\[ \tilde{w}_j = \frac{\lambda_j}{\lambda_j + \nu}\, w^*_j \qquad (5.77) \]
where w̃ is the minimum of the error function with weight decay Ẽ, w* is the minimum of the original error function E, and λ_j is an eigenvalue of the Hessian matrix H evaluated at w*, in a weight space aligned with the eigenvectors of H. Therefore, weight decay will tend to reduce the value of the weights with less influence on the error function. The final result will be a smoother fit than the one achieved with w*.
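Eq. 5.76 can be checked numerically; for small η the discrete gradient steps on the penalty term alone closely track the continuous exponential decay (the parameter values below are illustrative only):

```python
import numpy as np

eta, nu, tau = 0.01, 0.5, 100
w0 = np.array([2.0, -1.0, 0.5])

w = w0.copy()
for _ in range(tau):
    w = w - eta * nu * w                    # gradient step on the penalty term alone

predicted = w0 * np.exp(-eta * nu * tau)    # continuous-time solution, Eq. 5.76
```

After 100 steps the iterated weights agree with the exponential prediction to a fraction of a percent, and every weight has shrunk towards zero.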
The weight decay function in Eq. 5.74 is not consistent with a linear transformation performed on the input data, as it treats weights and biases on an equal footing. A bias unit is an additional unit in the input and hidden layers of an MLP, with a permanent input value of 1, included to compensate for the difference between the mean of the y_k's and the mean of the t_k's. Considering weights from different layers separately and excluding the bias-unit weights from the regularising term solves this consistency problem, as in:
\[ \Omega = \frac{\nu_z}{2} \sum_{w \in W_z} w^2 + \frac{\nu_y}{2} \sum_{w \in W_y} w^2 \qquad (5.78) \]
where W_z are the input-to-hidden weights except for the bias weights w_{0j}, and W_y are the hidden-to-output weights except for the bias weights w_{0k}.
5.3.2 Early stopping
Another way to prevent an MLP with a relatively high number of hidden units from over-fitting the training data is to stop the training process at an early stage. This method, called early stopping, makes use of a validation set: training is stopped when the error on the validation set reaches a minimum, as shown in Fig. 5.5.

Figure 5.5: Early stopping. The training error decreases monotonically with the iteration number τ, while the validation error reaches a minimum at τ_v, where training is halted.
5.3.3 Performance of the network
To evaluate the performance of the network, the error function, or its gradient, can be used. However, given that the goal of training is to learn to discriminate between the classes of the training set in order to interpolate the class membership probabilities on test data, a more suitable measure of performance is the percentage of correctly classified patterns (accuracy) in a given dataset. For a K-class problem and a dataset with N_k patterns from each class, the accuracy of a classifier is defined as:
\[ A = 100 \times \frac{\sum_{k=1}^{K} N^c_k}{\sum_{k=1}^{K} N_k} \qquad (5.79) \]
where N^c_k is the number of correctly classified patterns from class k. A more common measure in pattern recognition is the classification error rate, defined as the percentage of misclassified patterns:
\[ E_{\mathrm{rate}} = 100 - A \qquad (5.80) \]
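Eqs. 5.79 and 5.80 amount to a few lines of code (the labels below are invented for illustration):

```python
import numpy as np

def accuracy(targets, predictions):
    """Percentage of correctly classified patterns (Eq. 5.79)."""
    targets = np.asarray(targets)
    predictions = np.asarray(predictions)
    return 100.0 * np.mean(targets == predictions)

t = [0, 0, 1, 1, 2, 2, 2, 2]     # true class labels
p = [0, 1, 1, 1, 2, 2, 0, 2]     # classifier decisions (e.g. argmax over outputs)
A = accuracy(t, p)
Erate = 100.0 - A                 # Eq. 5.80
```

Here 6 of the 8 patterns are correct, so the accuracy is 75% and the error rate 25%.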
If several networks are being evaluated, the optimal network should be chosen according to the final
validation error, while the performance of the selected network should be measured on data never seen
before, that being the purpose of the test set.
If several networks trained on the same data are very close to each other in terms of validation error, a
committee of networks can be formed, the output of this association being the average of the individual
outputs. It can be proved that a committee of networks statistically performs better, or at least not worse,
than the individual networks [19, pp. 366].
The training, validation and test sets should ideally be of equal size. The number of input patterns in the training set should be at least 10 times greater than the number of free parameters in the network [159, pp.70], a requirement sometimes difficult to meet, all the more so if an equal number of patterns has to be reserved for the validation and test sets. In this case, cross-validation, a technique usually applied in statistics as part of "jack-knife" estimation [109], can be used: the data is split into S subsets, or partitions, of equal size. Each of these subsets is in turn the test set for the neural network, while the remaining S − 1 subsets are used to form the training and validation sets. The S neural networks obtained in this way can then be combined in a committee of networks. If the data is not plentiful enough for division into S subsets, then the leave-one-out method, a variant of the jack-knife method whereby every subset consists of only one sample, can be used.
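The S-fold splitting scheme described above can be sketched as follows (the helper name and sizes are illustrative, not from the thesis software):

```python
import numpy as np

def cross_validation_folds(n_patterns, S, seed=0):
    """Split pattern indices into S partitions; each partition serves once as
    the test set while the remaining S-1 partitions form the training and
    validation data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_patterns)
    folds = np.array_split(idx, S)
    for s in range(S):
        rest = np.concatenate([folds[r] for r in range(S) if r != s])
        yield rest, folds[s]

splits = list(cross_validation_folds(10, S=5))
```

Every pattern appears in exactly one test partition, so the S networks between them are tested on the whole dataset.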
5.4 Radial basis function neural networks
Another kind of neural network which can be used to estimate posterior probabilities is the so-called Radial Basis Function (RBF) network. Its architecture, shown in Fig. 5.6, is very similar in appearance to the MLP, but its operation is very different. Firstly, the activation of the hidden units is not a weighted summation followed by a non-linearity; instead, it is a radially symmetric function φ_j (usually a Gaussian) with a different mean vector μ_j for each unit. In addition, the output units only perform a linear combination of the hidden-unit outputs, without applying any non-linear function to the result. Also, RBF training differs from that of an MLP, since it is performed in two phases instead of one, and a non-linear optimisation process is not required, the equations for minimising the quadratic output error over the second-layer weights being linear.
Figure 5.6: A radial basis function network, with inputs x_i, hidden units Φ_j, second-layer weights w_{jk} and outputs y_k.
Originally developed to perform exact function interpolation, early RBF networks mapped N input vectors in an I-dimensional space onto N target points in a one-dimensional space, through N radial basis functions φ_j(·). In order to obtain better generalisation when fitting noisy data, the number of basis functions was reduced to a number significantly lower than the number of input vectors. The resulting RBF has been widely used not only in noisy interpolation, but also in optimal classification theory.
The mapped points, or RBF outputs, for a K-class problem are given by:
\[ y_k = \sum_{j=1}^{J} w_{jk} \phi_j + w_{0k} \qquad \text{for } k = 1, 2, \ldots, K \qquad (5.81) \]
For the case of a Gaussian basis function, φ_j is defined as:
\[ \phi_j(x) = \exp\left\{ -\tfrac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \right\} \qquad (5.82) \]
where μ_j and Σ_j represent the mean and covariance matrix respectively. In an RBF network, the covariance matrix of the Gaussian basis functions can be considered to be of the form σ²I (hyper-spherical Gaussians) without significant loss of generality. In this case, Eq. 5.82 takes the form:
\[ \phi_j(x) = \exp\left( -\frac{\| x - \mu_j \|^2}{2\sigma_j^2} \right) \qquad (5.83) \]
The Gaussian functions in an RBF are un-normalised, since any multiplicative factors are absorbed in the weights w_{jk} of Eq. 5.81.
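Eqs. 5.81 and 5.83 together give a very short forward pass; the centres, widths and weights below are arbitrary illustrative values (in a real network they would come from the two-phase training described next):

```python
import numpy as np

def rbf_forward(x, mu, sigma, W, w0):
    """Hidden units evaluate Gaussian basis functions (Eq. 5.83); the output
    layer is a purely linear combination (Eq. 5.81)."""
    phi = np.exp(-np.sum((x - mu) ** 2, axis=1) / (2.0 * sigma ** 2))  # (J,)
    return W @ phi + w0                                                # (K,)

mu = np.array([[0.0, 0.0],
               [1.0, 1.0],
               [-1.0, 1.0]])          # centres mu_j (J = 3, I = 2)
sigma = np.array([0.5, 0.5, 0.5])     # widths sigma_j
W = 0.5 * np.ones((2, 3))             # second-layer weights w_jk (K = 2)
w0 = np.zeros(2)                      # bias weights w_0k

y = rbf_forward(np.array([0.0, 0.0]), mu, sigma, W, w0)
```

At the first centre the corresponding basis function is fully activated (φ_1 = 1) while the other two are nearly zero, so both outputs are slightly above 0.5.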
5.4.1 Training an RBF network
The first training phase is used to estimate the parameters of the basis functions φj(·) and no class
information is required for it, hence this phase is unsupervised. Once this phase is completed, the kernel
function activation is determined only by the distance between the input vector x and the mean vector
μj , and the kernel width σj . It can be shown [19] that, after this training phase, the summation of all the
radial basis function outputs is an estimate of the unconditional probability of the data p(x). Posterior
probabilities for each class P (Ck) are estimated at the outputs of the RBF network after the second phase
of training, which adjusts the second-layer weights wjk, this time using the target values tk of each input
vector x in the training set. For this reason, the second phase is called supervised.
Unsupervised phase: cluster analysis
Unsupervised training can be viewed as a clustering problem. Each Gaussian kernel represents a group
of similar vectors in the I-dimensional input space. Since the objective of the initial phase of learning
is to model the unconditional probability density function, the clusters do not necessarily separate data
from different classes. To summarise, the aim of this phase is to find the location of the cluster centres
and the distribution of data within them, in order to determine the mean and variance of each Gaussian (hidden unit of the RBF network). One of the most common clustering algorithms, the so-called K-means algorithm [159], can be used for this purpose. The number of means K is chosen to be equal to the number of hidden units J.
The K-means Algorithm This algorithm seeks to find a partition of the input data set into K regions or
clusters. Usually, the similarity criterion that defines a cluster is the distance between data (Euclidean in
most cases). The algorithm determines, for each of the K clusters, Ck, the location of its centre mk, and
identifies the patterns xi that belong to this cluster. In an iterative optimisation process, the partition is
modified so that the distances between the patterns belonging to a cluster and its centre are minimised.
This can be expressed as the optimisation of a quadratic error function defined by:
\[ E_K^2 = \sum_{k=1}^{K} \sum_{x \in C_k} \| x - m_k \|^2 \qquad (5.84) \]
Random values are initially assigned to the centres mk, and each data point is assigned to the cluster
with the centre nearest to it. Then, each centre mk is changed to be the mean of the data belonging to
the cluster Ck, reducing in this way the value of the error function defined in Eq. 5.84. These two last
steps are repeated until no significant change in centre positions is detected.
The procedure described above is known as the “batch” version of the K-means algorithm, since, at
every step, the centres are modified once all the patterns have been assigned to the clusters. There is an
“adaptive” version whereby the nearest centre is modified each time a pattern is considered, so that the
distance between them is reduced:
\[ m_k^{(\tau+1)} = m_k^{(\tau)} + \eta \left( x - m_k^{(\tau)} \right) \qquad (5.85) \]
where η is a learning-rate parameter. The adaptive version is a stochastic procedure because
the patterns are chosen from the data set randomly, and the algorithm is more prone to becoming trapped
in a local minimum.
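The batch version described above can be sketched directly (a generic illustration with synthetic two-cluster data, not the thesis implementation):

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    """Batch K-means: alternately assign each pattern to its nearest centre,
    then move each centre to the mean of the patterns assigned to it."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=K, replace=False)]   # initialise from data
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)                            # nearest-centre assignment
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centres[k] for k in range(K)])
        if np.allclose(new, centres):                        # no significant change
            break
        centres = new
    return centres, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, (30, 2)),     # cluster around (0, 0)
               rng.normal(3.0, 0.2, (30, 2))])    # cluster around (3, 3)
centres, labels = kmeans(X, K=2)
```

With two well-separated synthetic clusters, the centres converge to the two cluster means.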
The value for K can be found by running the algorithm for k = 1, 2, 3, . . . until a knee in the curve of E_k² versus k is obtained. Typically this curve decreases monotonically (reaching zero for k = N), but the "knee" indicates a substantial change in the rate of decrease of the error function, which is large for small values of k and much smaller for k above the "knee" value. The value of k at the knee can be taken as the optimum value [159, pp.23].
Normalisation Since Euclidean distance is used to set the location of the Gaussian kernels, differences
in dynamic range between features will cause the smallest ones to be ignored by the clustering algorithm.
To avoid this, zero-mean, unit-variance normalisation is applied to the entire data set of N patterns before
the unsupervised phase:
\[ x^n_i \leftarrow \frac{x^n_i - \mu_i}{\sigma_i} \qquad (5.86) \]
where:
\[ \mu_i = \frac{1}{N} \sum_{n=1}^{N} x^n_i \qquad (5.87) \]
\[ \sigma_i^2 = \frac{1}{N-1} \sum_{n=1}^{N} \left( x^n_i - \mu_i \right)^2 \qquad (5.88) \]
Then, a clustering procedure like the one described in the previous section is performed to find a set of J centres. Once the unsupervised phase of the RBF training is complete, the cluster variance σ_j² is found for each cluster C_j:
\[ \sigma_j^2 = \frac{1}{N_j} \sum_{x \in C_j} (x - \mu_j)^T (x - \mu_j) \qquad (5.89) \]
where N_j is the number of patterns that belong to cluster C_j.
Second phase: linear optimisation of weights
The second training phase uses the labelled patterns in supervised learning mode. The output layer receives the information from the hidden-unit outputs, and it can be trained with a data set smaller than the one used for the first training stage. The LMS algorithm is used to minimise the error function:
\[ E(w) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{j=0}^{J} w_{jk} \phi^n_j - t^n_k \right)^2 \qquad (5.90) \]
Differentiating the error with respect to the weights w_{jk} and setting the result to zero to find the minimum gives:
\[ \sum_{n=1}^{N} \left( \sum_{j'=0}^{J} w_{j'k} \phi^n_{j'} - t^n_k \right) \phi^n_j = 0, \qquad \text{for } j = 0, 1, \ldots, J \text{ and } k = 1, \ldots, K \qquad (5.91) \]
These equations are known as the normal equations, and have an explicit solution. Using matrix notation:
\[ \Phi^T \Phi W^T = \Phi^T T \qquad (5.92) \]
with the elements of the matrices defined as (T)_{nk} = t^n_k, (W)_{kj} = w_{jk} and (Φ)_{nj} = φ_j(x^n). The solution is:
\[ W^T = \Phi^{\dagger} T \qquad (5.93) \]
where Φ† represents the pseudo-inverse of Φ, given by:
\[ \Phi^{\dagger} \equiv (\Phi^T \Phi)^{-1} \Phi^T \qquad (5.94) \]
for which Φ†Φ = I always, although ΦΦ† ≠ I in general. When the data is noisy, it is very common to find that the matrix Φ^TΦ is nearly singular. In this case, the singular value decomposition (SVD) algorithm can be used to avoid large values for the weights w_{jk}, since it avoids the accumulation of round-off errors and chooses, from the set of possible solutions, the one with the smallest norm [131].
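The linear phase can be sketched with synthetic, noise-free data (all sizes and weights below are invented for illustration); `np.linalg.lstsq` solves the least-squares problem via the SVD, which copes with a nearly singular Φ^TΦ:

```python
import numpy as np

rng = np.random.default_rng(2)
N, J, K = 50, 4, 3

Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, J))])  # (Phi)_nj, with bias column phi_0 = 1
W_true = rng.normal(size=(K, J + 1))
T = Phi @ W_true.T                                           # noise-free targets (N, K)

# Solve the normal equations (Eq. 5.92) for the second-layer weights.
W_T, *_ = np.linalg.lstsq(Phi, T, rcond=None)
W_hat = W_T.T                                                # recovered (K, J+1) weight matrix
```

With noise-free targets and a full-rank design matrix, the generating weights are recovered to machine precision.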
5.4.2 Comparison between an RBF and an MLP
In general, the performance of an MLP is slightly better than that of an RBF. This is because the MLP's fully supervised non-linear optimisation is in general better than the RBF's unsupervised non-linear clustering process followed by a linear optimisation [159]. Advantages of an RBF are its shorter training time, since training does not require non-linear optimisation, and the absence of the need for a validation set. The hidden-layer representation of an RBF is also more accessible. Since it represents the unconditional probability of the training set, it can be used as a novelty detector on new data: when all the hidden units show very low activation, the RBF network is extrapolating, and therefore no confidence should be given to the result [159].
5.5 Data visualisation
One of the first stages in the solution of a classification problem usually consists of gaining more insight into the structure of the data. If the features extracted do not reveal enough separation between the classes in the feature space, a search for new features should be considered. It is also desirable to obtain details such as inter-subject variability and the incidence of outliers. The visualisation of the data distribution for a number of features L less than or equal to 3 is straightforward; otherwise more sophisticated procedures are required.

The relations of proximity and the organisation of the data in a feature space of dimensionality higher than 3 can be visualised through a non-linear projection from R^L to R^M, with M typically 2 or 3. This is the basis of the Sammon map, which is described next.
5.5.1 Sammon map
Sammon's algorithm seeks to create a mapping such that the distances between the image points in the projection plane are as close as possible to the corresponding distances between the original data points in feature space. The following error function at iteration number τ is defined:
\[ E^{(\tau)} = \frac{1}{\sum_{i}^{N} \sum_{j=i+1}^{N} d_{ij}} \sum_{i}^{N} \sum_{j=i+1}^{N} \frac{\left[ d_{ij} - \delta_{ij}^{(\tau)} \right]^2}{d_{ij}} \qquad (5.95) \]
where N is the number of vectors to be mapped, d_{ij} the Euclidean distance between the vectors x_i and x_j in L-space, and δ_{ij}^{(τ)} the Euclidean distance between the corresponding vectors (images or projections) y_i^{(τ)} and y_j^{(τ)} in M-space:
\[ d_{ij} = \| x_i - x_j \| \qquad (5.96) \]
\[ \delta_{ij} = \| y_i - y_j \| \qquad (5.97) \]
Minimisation of this error function can be achieved, starting from random locations for the image points, by adjusting them in the direction which gives the maximum change in the error function (gradient descent method), as shown in Eq. 5.98:
\[ y_{im}^{(\tau+1)} = y_{im}^{(\tau)} - \alpha \Delta_{im}^{(\tau)} \qquad \text{for } m = 1, \ldots, M \qquad (5.98) \]
where:
\[ \Delta_{im} = \frac{\partial E}{\partial y_{im}} \bigg/ \left| \frac{\partial^2 E}{\partial y_{im}^2} \right| \qquad \text{for } m = 1, \ldots, M \qquad (5.99) \]
and the gradient proportionality factor α is determined empirically to be between 0.3 and 0.4 [141]. The
partial derivatives are given by:
\[ \frac{\partial E}{\partial y_{im}} = \frac{-2}{\sum_{k=1}^{N} \sum_{j=k+1}^{N} d_{kj}} \sum_{\substack{j=1 \\ j \neq i}}^{N} \left[ \frac{d_{ij} - \delta_{ij}}{d_{ij}\, \delta_{ij}} \right] (y_{im} - y_{jm}) \qquad (5.100) \]
\[ \frac{\partial^2 E}{\partial y_{im}^2} = \frac{-2}{\sum_{k=1}^{N} \sum_{j=k+1}^{N} d_{kj}} \sum_{\substack{j=1 \\ j \neq i}}^{N} \frac{1}{d_{ij}\, \delta_{ij}} \left[ (d_{ij} - \delta_{ij}) - \frac{(y_{im} - y_{jm})^2}{\delta_{ij}} \left( 1 + \frac{d_{ij} - \delta_{ij}}{\delta_{ij}} \right) \right] \qquad (5.101) \]
A small number of representative vectors can be extracted (using K-means clustering, for example) to
reduce the number of computations, O(N²), required to complete the mapping.
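The update rule above can be sketched in a few lines of NumPy. This is an illustrative implementation only: the function name is hypothetical, and for simplicity it takes a plain gradient step, omitting the second-derivative scaling of Eq. 5.99.

```python
import numpy as np

def sammon(X, M=2, alpha=0.35, iters=200, seed=0):
    """Project N points in R^L down to R^M by minimising the Sammon stress.

    Image points start at random locations and are moved by gradient
    descent (cf. Eqs. 5.98-5.100); this sketch uses a plain gradient step
    and omits the |d2E/dy2| scaling of Eq. 5.99.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # d_ij (Eq. 5.96)
    iu = np.triu_indices(N, k=1)
    c = D[iu].sum()                        # normalising constant over all pairs
    Y = 0.1 * rng.standard_normal((N, M))  # random initial image points
    for _ in range(iters):
        d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)  # delta_ij (Eq. 5.97)
        np.fill_diagonal(d, 1.0)           # dummy value, masked out below
        Dm = D.copy()
        np.fill_diagonal(Dm, 1.0)
        ratio = (Dm - d) / (Dm * d)        # bracketed term of Eq. 5.100
        np.fill_diagonal(ratio, 0.0)
        diff = Y[:, None, :] - Y[None, :, :]                       # y_im - y_jm
        grad = (-2.0 / c) * (ratio[:, :, None] * diff).sum(axis=1)
        Y = Y - alpha * grad
    d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    stress = ((D[iu] - d[iu]) ** 2 / D[iu]).sum() / c  # classic Sammon stress
    return Y, stress
```

Note that the per-pair terms are bounded, so the fixed-step variant remains stable; the full method of Eq. 5.99 simply converges faster.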
5.5.2 NeuroScale
The Sammon map's main drawback is that it acts as a look-up table: previously unseen data cannot be
located in the projection map without re-running the optimisation procedure. However, a parameterised
transformation yi = G(xi; w), where w is the parameter vector, would allow the desired interpolation.
This parametric transformation can be performed by a neural network. During training, this neural
network has no fixed targets, the outputs and weights being adjusted to minimise an error, or “stress”
measure, related to the Sammon map error function and given by:
E = \sum_{i=1}^{N}\sum_{j=i+1}^{N} \left[d_{ij} - \delta_{ij}\right]^2    (5.102)
where the terms d_{ij} and \delta_{ij} are given by Eqs. 5.96 and 5.97. The training of such a neural network
is said to be relatively supervised as there is no specific output target, but a relative measure of target
separation between each pair {yi,yj}.
For an RBF with H basis functions, the square of the projected distance \delta_{ij} can be expressed as:

\delta_{ij}^2 = \sum_{m=1}^{M} \left(\sum_{h=1}^{H} w_{hm}\left[\phi_h(\| x_i - \mu_h \|) - \phi_h(\| x_j - \mu_h \|)\right]\right)^2    (5.103)
Then, the derivatives of the stress function with respect to the weights for each data point xi are given
by:
\frac{\partial E_i}{\partial w_{hm}} = \frac{\partial E_i}{\partial y_i}\,\frac{\partial y_i}{\partial w_{hm}}    (5.104)

where:

\frac{\partial E_i}{\partial y_i} = -2 \sum_{\substack{j=1 \\ j \neq i}}^{N} \frac{d_{ij} - \delta_{ij}}{\delta_{ij}}\,(y_i - y_j)    (5.105)
Note the difference between the derivative in Eq. 5.105 and the corresponding term for a supervised
problem with sum-of-squares error (see Eq. 5.90), the latter being given by:
\frac{\partial E_i}{\partial y_i} = y_i - t_i    (5.106)
Thus the relatively supervised training procedure has an estimated target vector ti given by:
t_i = y_i - \frac{\partial E_i}{\partial y_i} = y_i + 2 \sum_{\substack{j=1 \\ j \neq i}}^{N} \frac{d_{ij} - \delta_{ij}}{\delta_{ij}}\,(y_i - y_j)    (5.107)
However, the minimisation of the stress measure cannot be performed in one step as in the linear phase
of an RBF training (Eq. 5.93) because the estimated targets are not fixed, but depend upon the current
outputs yi and weights. Instead, the minimum can be sought in an iterative approach with an EM1-like
procedure, which is more efficient than backpropagation in an MLP [163].
1Expectation-Maximisation [19, pp.65] is a two-step procedure to solve the highly non-linear, coupled equations of maximum likelihood optimisation problems
To prevent an increase in the stress during the early stages of the algorithm, when the estimate of the
targets is poor, a learning rate control \eta^{(\tau)} is introduced into Eq. 5.107:

t_i^{(\tau)} = y_i^{(\tau)} - \eta^{(\tau)} \frac{\partial E_i^{(\tau)}}{\partial y_i^{(\tau)}}    (5.108)
where η(τ) is initially set to have a small value and is progressively increased as the stress decreases
during training.
The training algorithm then becomes:
1. Initialise the weights to small random values
2. Initialise η1 to some small value
3. Calculate the pseudo-inverse matrix Φ†
4. Initialise τ = 1
5. Calculate the target vectors t_i^{(\tau)}
6. Solve W^{(\tau)} = \Phi^{\dagger} T^{(\tau)}
7. Calculate the stress
8. • If the stress has increased, decrease η
• If the stress has decreased, increase η
9. If the stopping criterion is not satisfied, return to step 5
The increase and decrease in η in step 8 are arbitrarily set to the range 10–20% [163]. If the stress
measure is comparable to the final stress calculated in a standard Sammon mapping procedure, then the
algorithm can be stopped. This optimisation procedure is called the shadow targets algorithm, and the RBF
described in this section has been referred to as NEUROSCALE [163].
A caveat for the use of the techniques described above is that the lower dimensional projection generated
may show data overlap, which may not be present (or at least not in the same proportion) in the high
dimensional feature space.
5.6 Discussion
So far we have introduced the neural network approach to classification as a non-parametric method
for the estimation of the posterior probabilities of class membership. Non-parametric methods are more
flexible than parametric approaches and are easier to apply than semi-parametric methods, such as Gaus-
sian Mixture Models [19, pp.60]. The probabilistic nature of the neural network outputs gives them an
advantage over other classifiers, like linear discriminants and support vector machines to mention a few
[169].
We presented two kinds of neural networks, the MLP and the RBF network, and stated that the first tends
to outperform the second one for the reasons given in §5.4.2 (see [175] for a comparison between an
MLP and an RBF performance in disturbed sleep analysis). A balanced dataset should be used to assign
the same relevance to all the classes. For MLP training, a validation and a test set should be reserved
from the balanced dataset in order to avoid over-fitting the training data. If the amount of data is not
sufficient to allow this partitioning, the leave-one-out method should be used.
The aim of the work presented in this thesis is to estimate the state of the brain in the sleep context
(for μ-arousal detection) and within the alertness-drowsiness continuum in the vigilance context. Although the data
is labelled according to six or seven discrete classes (see sections 3.2.1 and 3.4.2), the neural network is
capable of performing interpolation between classes. The 1-of-K code is recommended when the targets
are discrete, as is the case in both the sleep and vigilance problems. The cost function associated with
this coding scheme is the cross-entropy error function. Minimisation of the cost function can be achieved
efficiently by using the scaled conjugate gradients algorithm. The performance of the network may be
evaluated using the misclassification error as the criterion.
The trade-off between bias and variance of the network suggests that the search for an optimum network
architecture can be carried out by training several networks with different initial values for the network
parameters. Regularisation techniques can be applied to achieve better generalisation, and although the
values of the regularisation parameters cannot be found analytically, they can be included in an extensive
search for the best generalisation, the latter being evaluated as the classification performance on the
validation set.
The techniques known as Sammon map and NEUROSCALE, introduced in this chapter, can be used to
visualise the relations of proximity between patterns from different classes in the feature space, providing
hints as to which classes should be used for neural network training, and helping to establish what might
be expected from the neural network performance. These visualisation techniques can also help to rule
out outliers, i.e. data from a distribution different from that of the training data, the NEUROSCALE map
being particularly useful when analysing new data.
Chapter 6
Sleep Studies
Prior to the analysis of the sleep of patients with OSA, a deeper understanding of normal sleep should
be acquired. Previous work has shown that neural network methods for data visualisation and classifi-
cation provide useful information on data structure and clustering [123], and these methods have been
successfully applied to sleep staging and tracking [143][123][16][158]. The EEG is the most significant
and reliable physiological measure of sleep, and is relatively easy to acquire as a signal. As we have
seen in Chapter 3, the sleep EEG of OSA sufferers has the same characteristics as that of normal subjects,
the difference being in the higher number of rapid transitions from sleep to wakefulness in OSA patients.
Therefore, it should be possible to use a neural network trained with normal sleep EEG data with the EEG
recorded from OSA patients. In this chapter we report on the training of MLP networks using a database
of normal sleep EEG records and investigate their subsequent performance on OSA sleep EEG (test data).
6.1 Using neural networks with normal sleep data: benchmark experiments
6.1.1 Previous work on normal sleep
In previous work [123], 10th-order AR modelling of 1s EEG segments and a visualisation technique
known as the Kohonen map [90] were applied to give an overall view in 2-D of the AR coefficients for
normal sleep EEG. Kohonen’s map is a self-organising algorithm which projects an entire data set or
input vectors from an L-dimensional space into a relatively few cluster centres or “code-vectors” laid
out on a mesh in a lower, M -dimensional space (usually M = 2), in such a way that the relations of
proximity (topology) between the input vectors are preserved1. This work showed that there were three
well differentiated groups or clusters of data in the sleep EEG database, corresponding to the stages of
wakefulness, REM/light sleep (stage 1) and deep sleep (stage 4). Intermediate stages 2 and 3 did not
form separate clusters, but transient events such as K-complexes and spindles were mapped onto different
regions of the map. This phase of learning is unsupervised since no labels are taken into account when
constructing the Kohonen map, although labels are used later to identify the clusters. Based on the results
obtained with the Kohonen map, a neural network was trained with the same sleep EEG database, the
aim being to classify the sleep EEG into the 3 categories identified in the Kohonen map, by estimation of
the posterior probabilities of class membership. Results on test data showed that the plot of the neural
network outputs over time “tracks” the sleep-wake continuum with a better resolution than the R&K
discrete stages (as the neural network outputs can take any value between 0 and 1) and with a better
resolution in time since 1-s epochs are used to segment the EEG rather than the 30 seconds of the R&K
hypnograms. Fig. 6.1 shows the time course of the three neural network outputs (P(W ) for wakefulness,
P(R) for REM/light sleep, and P(S) for stage 4) for a 7-hour sleep recording. The main features of
the normal sleep-wake cycle can be seen in these plots. The P(W ) output takes a value close to 1 at
the beginning of the night, followed by a rapid descent to zero and remains at this level for about 40
minutes, while the P(R) output rises from zero to a value higher than 0.5 at the same time as the P(W )
output decreases, indicating a transition from fully awake to the first stage of sleep (sleep onset). The
P(S) output, which starts at zero, rises steadily as the P(W ) and P(S) outputs decrease, and stays high for
the remaining 40 minutes of the first hour of the night. For the rest of the night, the P(R) and the P(S)
outputs wax and wane alternately, an indication of the 90-minute REM and non-REM sleep cycle, with a
progressive lightening of sleep as the night advances. When P(W ) is high (subject awake), P(S) is low,
since it is not physiologically possible that these two probabilities can exhibit a similar value, except when
both are near zero. In such a case the value P(R) must be high since the sum of the three probabilities
1The Kohonen map's main disadvantage relative to the Sammon map is that the image points are constrained to lie on a rectangular grid
must be equal to one, indicating that the subject is in REM/light sleep. Hence, the Wakefulness output
P(W ) and the deep sleep output P(S) were combined in a measure of “sleep depth” P(W )-P(S), in which
the values of 1, 0 and -1 indicate wakefulness, REM/light sleep and deep sleep respectively. The trace
P(W )-P(S), shown at the bottom of Fig. 6.1, is similar to the R & K hypnogram, but with a continuous
resolution in amplitude and a ×30 time resolution.
Figure 6.1: The neural network's wakefulness P(W ), REM/light sleep P(R) and deep sleep P(S) outputs, and the measure of sleep depth P(W )-P(S) (from Pardey et al. [123])
The above work established the feasibility of using AR coefficients at the inputs to a neural network with
3 outputs to describe the sleep-wake continuum. A previous investigation [176] compared the use of
5th-order AR parameters with the power in five EEG bands when these were used as inputs to a neural
network, and showed that they contribute the same information to the analysis of normal and disturbed
sleep. In this chapter, we build on these results in order to detect μ-arousals in the sleep of OSA subjects.
A model order of 10 is used, as Pardey et al. [123] found that some EEG segments corresponding to
wakefulness may be under-fitted with a lower model order. Since the effect of OSA on the EEG is only
to change the sleep structure rather than the EEG itself, we can use a neural network trained on normal
subjects to analyse the sleep of OSA patients. We re-visit the choice of algorithm for extracting coefficients
and aim to minimise variance whilst ensuring stationarity. We go beyond the work described in [123]
by carrying out a thorough investigation of network architecture and free parameters (including weight
decay coefficients) in order to identify the optimal network. This involves training more than 2,000
networks. Finally, we use the optimal network in order to detect μ-arousals in sleep EEG recordings
acquired from seven subjects with severe OSA.
6.1.2 Data Extraction
The normal sleep EEG from nine healthy female adults, with no history of sleep disorders, aged between
21 and 36 years (average 27.4), was recorded with electrode pair C4/A1, and digitised with an 8-bit
A/D converter and a sampling rate of 128 Hz. Prior to digitisation, the analogue EEG was filtered with a
bandpass filter (0.5–40 Hz with a −40 dB/dec slope in the transition band). Two EOG channels and the
submental EMG were also recorded for the purpose of generating R & K hypnograms.
The length of every record was approximately 8 hours. Each record was divided into 30s-segments, and
classified separately by three human experts, trained in the same laboratory, according to the R&K rules.
The number of 30s-segments for which the three experts were in agreement in their classification varied
from 137 for stage 1, which is a transitional stage that lasts only a few minutes (see section 3.2.1), to 2,665
for stage 2, the most abundant and easiest to score of all sleep stages. These segments are referred to as
being consensus-scored.
6.1.3 Feature extraction
Pre-processing
The sampled EEG signal was also digitally filtered with a low-pass linear phase filter, with a cutoff fre-
quency at 30 Hz, a bandpass gain of 1.00 ± 0.01 and −50 dB attenuation at 50 Hz, using a zero-phase-
distortion filtering technique. The mean of each EEG recording (calculated over the whole record) was
removed.
Autoregressive Analysis
To apply AR modelling to the EEG segments, an investigation of the algorithms described in chapter 4
was undertaken. The relationship between segment data length and the bias and variance of the AR
coefficients was also studied. 911 and 5,214 consensus-scored 4s-segments of Wakefulness and Sleep
Stage 4 respectively were selected from the database, and their reflection coefficients (for an AR model
order 10) were estimated using the Burg algorithm. The means of these estimates, μW and μS , were
then used to synthesise “typical” Wakefulness and Sleep Stage 4 EEG signals. The mean reflection coeffi-
cients were transformed to 10th-order mean feedback coefficients (for definition see §4.6.1) by using the
inverse Levinson-Durbin recursion (Eq. 4.102). Following the procedure described in section 4.3.3 for
AR synthesis, an ensemble of 500 time series with length N was generated using a white noise generator
with unit variance and the AR feedback coefficients, and four algorithms, namely Burg, Covariance (Cov),
Modified Burg (ModBurg) and Structure Covariance Matrices (SCM), were used in turn to estimate the
AR reflection coefficients for each time series. The Euclidean distance between the mean of the estimates
for the ensemble μ and the value used to generate the ensemble μ was calculated, as well as the trace of
the covariance matrix of the estimates, Tr S. The results for values of N from 16 to 512 are shown
in Tables 6.1 and 6.2, and show that there is little difference between the various algorithms, at least for
data lengths N ≥ 128.
The Burg algorithm has a lower computational cost than the others and so was chosen to estimate the
reflection coefficients of a 10th-order AR model of the EEG data.

Mean error (wakefulness)
N        16       32       64       128      256      384      512
Burg     0.3814   0.1714   0.0952   0.0517   0.0272   0.0204   0.0155
Cov      -(a)     0.6402   0.1260   0.0521   0.0270   0.0204   0.0154
ModBurg  3.9057   0.7073   0.0947   0.0520   0.0276   0.0204   0.0155
SCM      0.9083   0.1783   0.0990   0.0529   0.0277   0.0204   0.0156

Covariance matrix trace (wakefulness)
N        16       32       64       128      256      384      512
Burg     0.6267   0.2000   0.0929   0.0449   0.0207   0.0143   0.0105
Cov      -(a)     395      4.4268   0.0500   0.0219   0.0148   0.0107
ModBurg  7897     225      0.0958   0.0451   0.0209   0.0144   0.0106
SCM      1.5395   0.2175   0.0955   0.0455   0.0210   0.0144   0.0106

(a) The Covariance algorithm requires the data length to be at least twice the model order.

Table 6.1: Mean error and trace of covariance matrix for synthesised EEG reflection coefficients (wakefulness)

Mean error (stage 4)
N        16       32       64       128      256      384      512
Burg     0.3410   0.1407   0.0719   0.0405   0.0189   0.0128   0.0094
Cov      -(a)     7.2100   0.5122   0.0406   0.0187   0.0128   0.0093
ModBurg  2.1033   1.2879   0.0707   0.0399   0.0188   0.0128   0.0094
SCM      0.8055   0.1410   0.0729   0.0405   0.0189   0.0127   0.0094

Covariance matrix trace (stage 4)
N        16       32       64       128      256      384      512
Burg     0.5482   0.1719   0.0809   0.0385   0.0189   0.0127   0.0096
Cov      -(a)     27595    68.6     0.0428   0.0195   0.0129   0.0097
ModBurg  2620     839      0.0837   0.0390   0.0190   0.0127   0.0096
SCM      1.5370   0.1861   0.0829   0.0391   0.0190   0.0127   0.0096

(a) The Covariance algorithm requires the data length to be at least twice the model order.

Table 6.2: Mean error and trace of covariance matrix for synthesised EEG reflection coefficients (stage 4)

Figure 6.2 illustrates the results of the Burg algorithm on the synthesised EEG with N going from 16 to
1,024. It can be seen that the accuracy and the variance of the estimates improve significantly once N
reaches 256.
As was discussed in §4.7, stationarity concerns suggest that segments should not be greater than 1 second
(128 samples) when analysing the EEG. But the results here show that the variance of the estimates can
be reduced by increasing the length of the time series segment to 256 or greater. A compromise was found
by using a 384-sample sliding window (corresponding to 3 seconds), which is advanced in one-second
steps (128 samples). Each set of reflection coefficients is then taken to represent the middle second of
the 3-second window.
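A minimal sketch of this feature-extraction step, the Burg recursion for the reflection coefficients together with the 3-second sliding window, might look as follows. The function names and the plain-NumPy formulation are illustrative, not the code used in the thesis.

```python
import numpy as np

def burg_reflection(x, order):
    """Reflection coefficients of an AR model, estimated with the Burg
    algorithm (minimising the summed forward and backward prediction
    error power at each stage)."""
    f = np.asarray(x, float)[1:].copy()   # forward prediction errors
    b = np.asarray(x, float)[:-1].copy()  # backward prediction errors
    k = np.empty(order)
    for m in range(order):
        k[m] = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))
        # update both error sequences, then drop one sample at each end
        f, b = (f + k[m] * b)[1:], (b + k[m] * f)[:-1]
    return k

def sliding_reflection_coeffs(eeg, fs=128, win_s=3, step_s=1, order=10):
    """384-sample (3 s) window advanced in 128-sample (1 s) steps; each
    coefficient set is taken to represent the middle second of its window."""
    win, step = win_s * fs, step_s * fs
    return np.array([burg_reflection(eeg[s:s + win], order)
                     for s in range((0), len(eeg) - win + 1, step)])
```

Burg reflection coefficients are guaranteed to have magnitude at most one, which also makes them well-behaved inputs for the neural network.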
Figure 6.2: Mean error and covariance matrix trace for reflection coefficients computed with the Burg algorithm (wakefulness and sleep stage 4) vs data length N
6.1.4 Assembling a balanced database
The same number of segments for each of the categories (wakefulness (W), REM (R) and sleep stage 4
(S)) was randomly taken from the overall database to build a balanced data set using only consensus-
scored segments, the overall number being determined by the minimum number available in any one
class. In sleep studies, it is obviously the wakefulness set which is the smallest, with only 164 30-s
segments, yielding 4,920 one-second segments; hence, 4,920 sets of 10 reflection coefficients were
assembled for each class. From now on, this dataset will be referred to as the balanced sleep dataset.
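The balancing step can be sketched as follows, assuming a hypothetical helper that subsamples every class down to the size of the smallest one.

```python
import numpy as np

def balance_classes(features, labels, seed=0):
    """Randomly subsample every class down to the size of the smallest one,
    so that all classes carry the same relevance during training."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    n_min = min(int((labels == c).sum()) for c in classes)
    idx = np.concatenate([rng.choice(np.flatnonzero(labels == c), n_min,
                                     replace=False) for c in classes])
    rng.shuffle(idx)  # interleave the classes
    return features[idx], labels[idx]
```

Applied to the consensus-scored segments here, the smallest class (wakefulness, 4,920 one-second segments) fixes the per-class count.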
6.1.5 Data visualisation
The Sammon map and NEUROSCALE visualisation techniques (see section 5.5) were applied in order to
gain insight into the clustering present in the data. The reflection coefficients were normalised to give
a zero mean and unity standard deviation in each axis of the feature space. This gives each coefficient
equal importance a priori.
K-means algorithm
Given that the amount of data (14,760 data points) is too large to be handled by the visualisation algo-
rithm, per-class clustering using the K-means algorithm was applied with 60 means per class and an η
factor (see Eq. 5.85) of 0.02.
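The per-class clustering can be sketched as a competitive (online) K-means update in which the winning centre moves a fraction η towards each presented pattern. The epoch count and the initialisation from random data points are assumptions of this sketch.

```python
import numpy as np

def online_kmeans(X, K=60, eta=0.02, epochs=5, seed=0):
    """Competitive (online) K-means: for each presented pattern, the
    nearest centre moves a fraction eta towards it (an Eq. 5.85-style
    update)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):                     # shuffled presentation
            w = np.argmin(np.linalg.norm(mu - X[i], axis=1))  # winning centre
            mu[w] += eta * (X[i] - mu[w])
    return mu
```

Running this once per class with K=60 and eta=0.02 would reduce the 14,760 points to the 180 representative vectors used by the visualisation algorithms.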
Sammon Map
A 2D Sammon map algorithm was applied to the 180 mean vectors generated by the K-means algorithm.
The gradient proportionality factor α (see Eq. 5.98) was adjusted to a value of 0.06. The Sammon map
for the three classes and for each class separately is shown in Fig. 6.3, with a circle around each centre
whose radius indicates the relative size of the cluster represented by the centre.
It can be seen from the map that the classes form well defined clusters with some overlap between them.
The Wakefulness cluster is the most sparse, whilst the REM/light sleep cluster lies between wakefulness
and deep sleep, as expected.
(a) classes W, R and S (b) W class
(c) R class (d) S class
Figure 6.3: Sammon map for the balanced sleep dataset; classes W, R and S
NeuroScale
A NEUROSCALE neural network with 50 basis functions was trained with the same reduced data set for
comparison and also to explore the overall distribution of the data points, as the advantage introduced
by this visualisation technique is that data not seen before, but belonging to the training data distribu-
tion, can be mapped onto the trained visualisation map (see §5.5.2). The map of the centres and the
subsequent projection of all the data points in the balanced feature set are shown in Fig. 6.4.
6.1.6 Training a Multi-Layer Perceptron neural network
A multi-layer perceptron was chosen over a radial basis function neural network because MLPs tend to
perform slightly better than RBF networks, when the latter are trained using the two-phase process de-
scribed in §5.4. The balanced dataset was divided into 3 balanced subsets of 4,920 data points each
(1,640 per class), namely the training set, validation set and test set (see introductory section and sec-
tion 5.3 in chapter 5). Although all the inputs xi have an absolute value equal to or lower than 1, they have
Figure 6.4: NEUROSCALE map for the balanced sleep dataset, classes W, R and S (60 means per class, 50 basis functions, 500 iterations): (a) means only; (b) all patterns
different dynamic ranges. In theory, this does not affect the MLP training, since the weights are capable of
correcting the differences in dynamic ranges, but certain optimisation procedures, such as regularisation
(see Eq. 5.78), require equal range of variations at the inputs to the neural network. Hence, zero-mean
and unit-variance normalisation was performed on the three subsets, using the training set statistics (μ,
σ). Normalisation also helps to reduce neural network training time [159, pp.84]. The 3 outputs of the
MLPs represent the classes W, R or S (1-of-K coding with softmax activation post-processing). Cross-
entropy error was selected as the cost function for the scaled conjugate gradients optimisation algorithm
for network training.
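The pre-processing and cost function described above can be sketched in NumPy. The function names and the small numerical-stability constants are choices of this sketch.

```python
import numpy as np

def zscore_fit_apply(train, *others):
    """Zero-mean, unit-variance normalisation using training-set statistics;
    the same (mu, sigma) are then applied to the validation and test sets."""
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    return tuple((s - mu) / sigma for s in (train,) + others)

def softmax(a):
    """1-of-K post-processing: the K outputs are positive and sum to one,
    so they can be read as posterior probabilities of class membership."""
    e = np.exp(a - a.max(axis=1, keepdims=True))  # stabilised exponentials
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(p, t):
    """Cross-entropy error for 1-of-K coded targets t; this is the cost
    minimised by the network optimiser (scaled conjugate gradients here)."""
    return -(t * np.log(p + 1e-12)).sum()
```

Note that normalising the validation and test sets with the training-set statistics, rather than their own, is what keeps the three subsets consistent.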
Optimising the network architecture
As was explained in chapter 5 there is no analytical means of determining the optimal value of MLP
parameters such as the number of hidden units J or the weight decay factors νz and νy. Although we
know that the regularising parameters νz and νy penalise an “excessive” number of hidden units, the
limits for this number are unknown. Therefore, we evaluate the performance on the validation set of a
number of MLP’s trained with values of these parameters varying over a given range in order to find the
“optimal” MLP architecture.
Equation 5.72 suggests that the number of hidden units J should be approximately the geometric mean
of the number of inputs times the number of outputs, i.e. √(10 × 3) ≈ 5.5 here. J was therefore varied from
4 to 10. No guideline is available for the regularisation parameters, and so these were varied between
10−6 to 10−2 in powers of ten. To avoid being trapped in local minima, a stochastic optimum search
was performed by shuffling the patterns using five random seeds when allocating them to the training,
validation and test sets. In addition, three different random weight initialisations were employed. This
yields the following total number of networks:

5 shuffling seeds (training/validation/test partitions) × 3 weight initialisations × 7 values of J × 5 values of νz × 5 values of νy = 2,625 networks
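The enumeration of this search space can be sketched directly (the function name is hypothetical; only the counts come from the text):

```python
import itertools

def architecture_search_grid():
    """Enumerate every (shuffling seed, weight seed, J, nu_z, nu_y)
    combination trained during the architecture search."""
    shuffle_seeds = range(5)                     # data partitioning seeds
    weight_seeds = range(3)                      # weight initialisations
    hidden_units = range(4, 11)                  # J = 4, ..., 10
    decays = [10.0 ** -e for e in range(2, 7)]   # 1e-2 ... 1e-6
    return list(itertools.product(shuffle_seeds, weight_seeds,
                                  hidden_units, decays, decays))
```

The length of the returned list confirms the count: 5 × 3 × 7 × 5 × 5 = 2,625.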
The results show little performance variation with respect to weight initialisation, no more than 0.9% in
the difference between the best and the worst classification error. The variation in the classification error
for the data shuffling into the three datasets is not greater than 1.5% for the training and test sets, and
less than 2% for the validation set. Figure 6.5 shows the relationship between the number of hidden units
and the average performance of the networks (averaged over all data partitioning seeds, weight seeds and
regularisation terms).
Figure 6.5: Average performance of the MLPs vs number of hidden units (mean classification error for the training, validation and test sets)
It can be seen from the plot that the performance for the training set improves monotonically with an
increase in the number of hidden units. But the percentage of misclassifications in the validation set
(and also the test set) reaches a minimum for J = 6 and then shows a slight increasing trend for larger
numbers of hidden units. The three neural networks which produce the best classification performance
on the validation set were all generated using the same shuffling seed, but have different initial values of
weights. The smallest of the three has a 10-6-3 architecture (see Table 6.3). Therefore, the 10-6-3 MLP
with νz = 10−4 and νy = 10−5 was chosen as the optimal network. Incidentally, this 10-6-3 MLP has the
best performance on the test set of the three optimal MLPs.
J    νz       νy       training   validation   test
8    10^-3    10^-6    6.63%      5.75%        6.83%
7    10^-3    10^-6    6.54%      5.75%        6.79%
6    10^-4    10^-5    6.63%      5.75%        6.28%
Table 6.3: Misclassification error (expressed as a percentage) for the best three MLPs
Figure 6.6 shows the performance of the 10-6-3 MLP for the whole range of (νz , νy) parameters. The
training set performance increases as the values of (νz, νy) decrease. However, the validation set per-
formance (and also that of the test set) shows an increase in the percentage of misclassifications as the
values of (νz , νy) are simultaneously decreased, the minimum being located at the (10−4, 10−5) point in
the (νz, νy) plane. A similar trend is found for the rest of the trained MLPs, with the exception of three
10-8-3 MLPs (sharing the same set-shuffling seed and weight initialisation seed); out of all 2,625 MLPs,
these were the only ones to get stuck in a local minimum, with a misclassification percentage of 66.3%.
It is clear that MLP performance on the training set tends to improve as parameters are moved towards
their extreme values. But the validation set performance also reveals that the MLP is being over-trained
as the number of hidden units is increased, or as the amount of regularisation is decreased. These trends
are all related, as the regularisation parameters penalise the non-relevant weights, compensating for an
excessive number of hidden units.
6.1.7 Sleep analysis using the trained neural networks
The misclassification error on the test set, in Table 6.3, only shows how well an MLP trained using
“well-defined”, consensus-scored EEG segments from the three main stages of the sleep-wakefulness con-
Figure 6.6: Performance of the 10-6-3 MLP (% misclassification on the training, validation and test sets) vs regularisation parameters log10(νz) and log10(νy)
tinuum, performs on data with the same “well-defined” characteristics. In order to test the performance
of the MLP with more general and “noisy” data (still drawn from the same distribution), the optimal
MLP was used to process an overnight record from one of the subjects in the sleep database (subject ID
9). The 10 reflection coefficients extracted from the EEG were presented to the MLP consecutively, on a
second-by-second basis (using a 3-second window with a 2-second overlap). The results for the 3 outputs,
the probability estimates P(W ), P(R) and P(S) are shown in Fig. 6.7. As expected, the night starts with a
high value for P(W ), and then this value decreases progressively, while the P(S) value increases. When
the P(R) output rises, the P(S) value decreases, suggesting that the subject has a REM or light sleep
period2.
Figure 6.7: MLP outputs P(W ), P(R) and P(S) for subject 9's all-night record: (a) all-night time courses; (b) zoom-in on a 12-minute segment
Using the representation of the sleep-wake continuum described in [123] and in section 6.1.1, we com-
pare the “depth of sleep” [P(W )-P(S)] with the hypnogram generated by a human expert in Fig. 6.8(a)
and (c). The extreme values (-1,+1) indicate the deep sleep and fully awake states respectively, and the
middle value (0) indicates REM/light sleep.
The spikes in the [P(W )-P(S)] output have two different causes. In the first instance, the MLP output
is generated on a second-by-second basis, while experts score sleep on a 30-s basis, and so some of the
spikes in the MLP output show the short-time variations of the sleep-wake process. The second cause of
the spikes is the variability of the AR estimates and the possible overlap between the classes as shown by
the 2D projections in §6.1.5. To minimise the first of these effects when comparing with the 30-s epoch
2It is not possible to distinguish between REM and light sleep on the basis of the EEG alone
6.1 Using neural networks with normal sleep data: benchmark experiments 129
hypnogram, a 31-point median filter is applied to [P(W )-P(S)] for comparison with the hypnogram (see
Fig. 6.8(b)3).
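The median-filtering step can be sketched as a running median; the reflective edge handling is an assumption of this sketch, not stated in the text.

```python
import numpy as np

def median_filter(x, width=31):
    """Running median of odd width applied sample-by-sample; edges are
    handled by reflecting the signal (an assumption of this sketch)."""
    half = width // 2
    xp = np.pad(x, half, mode="reflect")
    return np.array([np.median(xp[i:i + width]) for i in range(len(x))])
```

Applied to the second-by-second [P(W )-P(S)] trace, a 31-point width spans roughly one 30-s scoring epoch, so isolated one-second spikes are suppressed while sustained stage changes survive.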
Figure 6.8: Sleep database subject 9: P(W )-P(S), raw (a) and 31-pt median filtered (b), compared to the hypnogram scored by a human expert (c)
The correlation between these two plots is excellent. The [P(W)-P(S)] output shows an initial value of 1 for the first 20-minute interval, in agreement with the human expert, who scored it as wakefulness. Then, the [P(W)-P(S)] output shows three slow oscillations between −1 and 0, which match the hypnogram's transitions from deep sleep (stage 4) to REM/light sleep (stage 1) and back. During the intervals in which the [P(W)-P(S)] output has a well-defined mean at a value of −1, the hypnogram indicates sleep stage 4. Also, the intervals scored by the human expert as REM/sleep stage 1 (or light sleep) correspond closely to those in which the [P(W)-P(S)] output has a near-zero mean. It is interesting to note that some of the remaining spikes in the filtered [P(W)-P(S)] correspond to periods of movement. Movement generally induces high frequencies in the EEG, which can be indistinguishable from β rhythm once the EEG has been low-pass filtered (see §3.1.4), and such segments are hence categorised by the MLP as wakefulness. The intervals corresponding to intermediate stages 2 and 3 in the hypnogram are not very stable, nor is the [P(W)-P(S)] output, which shows its most pronounced local oscillations during these intervals.

³Label "M" in the hypnogram stands for movement.
6.2 Using the neural networks with OSA sleep data
6.2.1 Data description, pre-processing and feature extraction
Sleep EEG recordings from seven subjects with severe OSA (provided by the Osler Chest Unit, Churchill
Hospital, Oxford), with apnoea/hypopnoea index (AHI) higher than 30/h, were analysed in order to
detect the occurrence and length of the micro-arousals. The Fp1/A2 or Fp2/A1 electrode pair was used
instead of the C4/A1 montage to facilitate the recognition of the arousals by the human experts [156].
Other electrophysiological measures, such as the EOG, chin EMG, nose and mouth airflow, ribcage and abdominal movements, and oxygen saturation, were also taken to aid the experts in the identification of the breathing events. The length of the records varies from 32 to 61 minutes, but in each of them only 20 consecutive minutes were scored according to the standard American Sleep Disorders Association (ASDA) rules [11] (see §3.3.2).
The OSA sleep EEG was sampled and pre-processed in the same way as the normal sleep data (see §6.1.3).
Autoregressive analysis with model order 10 was applied to the EEG recordings using the Burg algorithm
and a sliding window as described in §6.1.3. The patterns consisting of 10 reflection coefficients for each
second were stored as an OSA test set, with each recording being processed as a continuous sequence of
patterns.
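The feature-extraction step can be sketched as follows. This is a generic textbook implementation of the Burg recursion plus a sliding 3-s window with 2-s overlap, not the thesis code; the function names and the explicit 128 Hz sampling rate (consistent with the down-sampling described in §7.1.1) are our own choices.

```python
import numpy as np

def burg_reflection(x, order=10):
    """Burg-algorithm reflection coefficients for one analysis window.
    The `order` coefficients are the per-window EEG features."""
    x = np.asarray(x, dtype=float)
    f = x.copy()                         # forward prediction errors
    b = x.copy()                         # backward prediction errors
    k = np.zeros(order)
    for m in range(order):
        ff = f[m + 1:]                   # forward errors over the valid range
        bb = b[m:-1]                     # backward errors, one-sample lag
        den = np.dot(ff, ff) + np.dot(bb, bb)
        k[m] = -2.0 * np.dot(ff, bb) / den
        f_new = ff + k[m] * bb           # lattice update: compute both
        b_new = bb + k[m] * ff           # before writing back
        f[m + 1:] = f_new
        b[m + 1:] = b_new
    return k

def eeg_features(eeg, fs=128, win_s=3, hop_s=1, order=10):
    """One 10-coefficient feature vector per second: a 3-s analysis
    window advanced by 1 s, i.e. a 2-s overlap between windows."""
    win, hop = win_s * fs, hop_s * fs
    return np.array([burg_reflection(eeg[i:i + win], order)
                     for i in range(0, len(eeg) - win + 1, hop)])
```

By construction the Burg reflection coefficients are bounded by 1 in magnitude, which is one reason they make well-behaved neural-network inputs.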
6.2.2 MLP analysis
Normalisation was carried out on the OSA patterns using the normal sleep training set statistics. The normalised OSA test set of patterns was then presented to the 10-6-3 MLP selected in §6.1.6, which had been trained with the normal sleep data. Figure 6.9 shows the MLP outputs for 20 minutes of processed EEG from two representative subjects in the OSA database, with ID numbers 3 and 8. The [P(W)-P(S)] output is shown in Fig. 6.10 for each subject (upper and middle traces). Twenty minutes of [P(W)-P(S)] for subject 9 from the normal sleep database, chosen from her second hour of sleep during the transition from deep sleep to REM sleep, are shown at the bottom of Fig. 6.10 for reference. None of the outputs shown in Figs. 6.9, 6.10 or in the subsequent figures in this chapter has been median-filtered.
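The normalisation step is a z-scoring of the test patterns with statistics stored from the training set; a minimal sketch (the function name is our own):

```python
import numpy as np

def normalise_with_training_stats(test_patterns, train_patterns):
    """Zero-mean, unit-variance scaling of test patterns using the
    per-feature mean and standard deviation of the *training* set,
    so the test data are scaled exactly as the MLP saw in training."""
    mu = train_patterns.mean(axis=0)
    sigma = train_patterns.std(axis=0)
    return (test_patterns - mu) / sigma
```

The key design point is that the OSA patterns are never normalised with their own statistics; reusing the normal-sleep statistics keeps the trained network's input distribution fixed.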
Figure 6.9: OSA sleep MLP outputs, P(W), P(R) and P(S), for subjects 3 and 8 (20 minutes).
The oscillating nature of the [P(W )-P(S)] output shown in Fig. 6.10 compared with its counterpart in
normal sleep suggests that the sleep cycle in the OSA database is severely disrupted, with frequent (more
than 1 per minute) transitions from deep sleep to wakefulness for brief periods of time.
6.2.3 Detection of μ-arousals
According to the ASDA rules, a non-REM⁴ sleep μ-arousal is defined as an EEG shift in frequency lasting 3 seconds or more [11]. Given that the MLP has been trained to detect the changes in the EEG frequency associated with sleep, μ-arousals can be detected from the [P(W)-P(S)] output by applying a threshold and discarding transitions which last for less than 3 s. The ASDA rules also treat two consecutive μ-arousals separated by less than 10 seconds as a single event. Therefore, we can automate the μ-arousal scoring process according to the ASDA rules by removing pulses (in the thresholded [P(W)-P(S)] output) whose duration is less than 3 s and by merging two pulses which are separated by less than 10 s.

⁴Submental (chin) EMG is necessary to score a μ-arousal in REM sleep. Given that we are only using the EEG, this study is restricted to non-REM sleep events.

Figure 6.10: [P(W)-P(S)] output for OSA sleep subjects 3 (top) and 8 (middle), and for normal sleep subject 9 (bottom).

The automated μ-arousal detection procedure applied to 3 minutes of [P(W)-P(S)] output from OSA subject 3 is shown
in Fig. 6.11, with a threshold of 0.5.
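The post-processing can be sketched as follows. This is an illustrative reconstruction, not the thesis code; the function names are our own, and the ordering chosen (merge pulses separated by short gaps first, then discard short pulses) is one plausible reading of the ASDA criteria.

```python
import numpy as np

def runs(binary):
    """Return (start, end) pairs (end exclusive) for each run of 1s."""
    edges = np.diff(np.concatenate(([0], np.asarray(binary, dtype=int), [0])))
    return list(zip(np.flatnonzero(edges == 1), np.flatnonzero(edges == -1)))

def asda_postprocess(depth, threshold=0.5, min_len=3, merge_gap=10):
    """Threshold the per-second [P(W)-P(S)] output, merge pulses whose
    separating gap is shorter than merge_gap seconds, then discard
    pulses shorter than min_len seconds (the ASDA timing criteria)."""
    binary = (np.asarray(depth, dtype=float) > threshold).astype(int)
    merged = []
    for s, e in runs(binary):
        if merged and s - merged[-1][1] < merge_gap:
            merged[-1][1] = e          # merge with the previous pulse
        else:
            merged.append([s, e])
    out = np.zeros_like(binary)
    for s, e in merged:
        if e - s >= min_len:           # keep only pulses of >= 3 s
            out[s:e] = 1
    return out
```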
Events marked as “A” in Fig. 6.11 (middle trace) are pulses shorter than 3 s, while those marked “B” are negative-going transitions also shorter than 3 s. These two types of events have been removed from the final output (lower trace). An event denoted by the letter “C” corresponds to two pulses separated by less than 10 s; these are considered to be the same event according to the ASDA rules, and they therefore appear merged in the final μ-arousal output on the lower trace.

Figure 6.11: μ-arousal detection procedure. Upper trace: [P(W)-P(S)] and a 0.5 threshold; middle trace: thresholding result; lower trace: automatic μ-arousal score with the ASDA timing criteria.
To evaluate the performance of this μ-arousal detector, the final output was compared with the μ-arousals
scored by the human expert (visual scoring). A true positive is found when both the visual and the
automatic scores agree on the occurrence of an event (logical AND equal to 1) as is shown in Fig. 6.12
for OSA subject 2. A false positive is an event only scored by the automated system (post-processed
[P(W )-P(S)] output). False negatives are the events missed by the automated system, scored only by the
expert using the visual method.
In the case of multiple detections of a single event, only one true positive is counted, as can be seen in the middle trace of Fig. 6.12 for the 3rd and 4th pulses. These two automatically scored events match the second visually scored μ-arousal but are considered as a single true positive. The dip between the two pulses is not counted as a false negative. This is an arbitrary decision, introducing a bias in favour of the automated system, but it is taken to facilitate comparison between different thresholds (see section 6.2.4).

Figure 6.12: μ-arousal validation. Upper trace: automated score for a 0.7 threshold; middle trace: automated score for a 0.8 threshold; lower trace: visually scored signal.
The performance of the automated μ-arousal detector was assessed by estimating the ratios known as sensitivity (Se) and positive predictive accuracy (PPA) [138], given by:

Se = P( an event has been detected | an event has occurred ) ≈ TP / (TP + FN)    (6.1)

PPA = P( an event has occurred | an event has been detected ) ≈ TP / (TP + FP)    (6.2)
where TP is the number of true positives, FP the number of false positives and FN the number of false negatives.
Se indicates the ability of the method under test to detect events, while PPA represents the selectivity of
the method, i.e. the ability to pin-point only the true events. A low value for the PPA indicates a large
number of false detections. The ideal detector would have Se and PPA values equal to 1.0, since neither
false negatives nor false positives would occur.
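Counting TP, FP and FN with the single-true-positive convention described above can be sketched as follows; this is our own illustrative implementation, with events represented as (start, end) intervals in seconds.

```python
def overlaps(a, b):
    """True if half-open intervals a = (s, e) and b = (s, e) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def se_ppa(auto_events, expert_events):
    """Sensitivity Se = TP/(TP+FN) and positive predictive accuracy
    PPA = TP/(TP+FP); several automated pulses matching one expert
    event count as a single true positive."""
    tp = sum(any(overlaps(x, a) for a in auto_events) for x in expert_events)
    fn = len(expert_events) - tp
    fp = sum(not any(overlaps(a, x) for x in expert_events) for a in auto_events)
    se = tp / (tp + fn) if expert_events else 1.0
    ppa = tp / (tp + fp) if auto_events else 1.0
    return se, ppa
```

Note that TP is counted over expert events, so two automated pulses covering one visually scored μ-arousal contribute one TP and no FP, matching the convention in Fig. 6.12.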
Although performance measures such as Se and PPA can give an idea of how many events are identified
by the automated system, they do not provide any indication of the relative timing between the events
scored by the automated system and those scored by the human expert. This is illustrated in Fig. 6.12,
where two sets of scores generated from the [P(W )-P(S)] output (thresholds 0.7 and 0.8) with the same
number of true positives (TP ), false positives (FP ) and false negatives (FN), and hence the same Se
and PPA, are compared with the human expert scores. The gray thick lines under each signal indicate
the segments for which there is an exact match between the automated and the human expert scores.
The first true positive found by the automated system with a threshold of 0.7 (upper trace on Fig. 6.12)
has a similar duration and starting time as the visually scored event (lower trace). This is no longer
true when the threshold is given a value of 0.8 (middle trace). Other examples can be found later: see
the second, fourth and fifth events. For this reason, the correlation measure given below is used as an
additional indicator of the performance of the automated μ-arousal detector.
Corr = 1 − (1/N) Σ_{i=1}^{N} ( y_nn(i) ⊕ y_hs(i) )    (6.3)

where y_nn(i) represents the [P(W)-P(S)] output at time i seconds, thresholded and with pulses shorter than 3 s filtered out, y_hs(i) represents the human scores, the ⊕ sign denotes the binary “exclusive OR” operation and N is the duration in seconds of the two sequences.
For the two sequences shown in Fig. 6.12 (thresholds of 0.7 and 0.8) the correlation indices have values
of 0.83 and 0.75 respectively.
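The XOR-based correlation index is a one-liner over two binary second-by-second sequences; a sketch (the function name is ours):

```python
import numpy as np

def corr_index(auto_score, expert_score):
    """Corr = 1 - (1/N) * sum(auto XOR expert) for two binary
    scoring sequences of equal length N (one sample per second)."""
    a = np.asarray(auto_score, dtype=bool)
    h = np.asarray(expert_score, dtype=bool)
    return 1.0 - np.mean(a ^ h)
```

Identical sequences give Corr = 1, complementary sequences give Corr = 0, and each second of disagreement (in either direction) costs 1/N, which is why this index is sensitive to event timing and duration as well as event counts.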
6.2.4 The choice of threshold
The shift in frequency that defines a μ-arousal can occur from any sleep stage to a lighter stage (sleep or wake). This poses a problem in the setting of the threshold, illustrated in Fig. 6.10, which shows subject 3's sleep-wake continuum moving from a value near 0 (REM or light sleep) to a value near 1 (wakefulness), while subject 8's sleep is disrupted at a deeper level, going from near −1 (deep sleep or sleep stage 4) to near 1 (wakefulness). Several threshold values in the range [0, 0.9] were investigated, and the values of Se, PPA and Corr were calculated for each of these. The results are shown in Table 6.4 and Fig. 6.13.
                                      Threshold
Subject        0     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
2   Se       1.00  1.00  1.00  1.00  0.97  0.97  0.97  0.94  0.77  0.39
    PPA      1.00  1.00  1.00  1.00  0.97  0.94  0.94  0.91  0.89  0.92
    Corr     0.44  0.67  0.72  0.79  0.82  0.83  0.81  0.81  0.74  0.65
3   Se       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
    PPA      1.00  1.00  1.00  0.96  0.96  0.90  0.90  0.87  0.93  1.00
    Corr     0.37  0.45  0.57  0.68  0.74  0.81  0.85  0.89  0.93  0.90
4   Se       1.00  1.00  1.00  1.00  1.00  0.96  0.96  0.96  0.85  0.50
    PPA      1.00  1.00  0.96  1.00  1.00  1.00  1.00  1.00  1.00  1.00
    Corr     0.40  0.76  0.90  0.92  0.90  0.87  0.85  0.83  0.77  0.68
5   Se       0.88  0.62  0.47  0.41  0.32  0.26  0.26  0.18  0.15  0.06
    PPA      0.86  0.75  0.76  0.88  0.92  0.90  0.90  0.86  1.00  1.00
    Corr     0.40  0.65  0.73  0.75  0.74  0.73  0.72  0.71  0.70  0.69
6   Se       1.00  0.93  0.83  0.76  0.72  0.72  0.66  0.66  0.55  0.45
    PPA      1.00  0.96  0.92  0.92  0.95  0.95  0.95  0.95  0.94  1.00
    Corr     0.46  0.76  0.84  0.83  0.82  0.82  0.80  0.76  0.72  0.64
7   Se       1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.95  0.68  0.50
    PPA      1.00  1.00  0.96  0.96  0.92  0.88  0.88  0.91  1.00  1.00
    Corr     0.38  0.60  0.71  0.77  0.79  0.83  0.83  0.83  0.74  0.71
8   Se       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.91  0.56
    PPA      1.00  1.00  1.00  0.97  0.97  1.00  0.97  0.97  0.88  0.95
    Corr     0.50  0.51  0.52  0.54  0.55  0.56  0.59  0.66  0.72  0.70

Table 6.4: Se, PPA and Corr per subject for various threshold values
In the light of our discussion of Fig. 6.12, we take correlation to be the most relevant index for assessing performance. The plots in Fig. 6.13 show that the Se and PPA indices are greater than 0.83 for all subjects (except subject 5) at the point of maximum correlation. Table 6.5 shows the optimal threshold from the results in Table 6.4, using the degree of correlation Corr as the criterion.
Figure 6.13: Se, PPA and Corr vs threshold for OSA subjects 2 to 8.
Equi-distance to means (EDM) threshold
Two methods of finding the optimal threshold are considered, although this can only ever be done retrospectively. The first is to find the centres of the two main clusters of data points by running the K-means algorithm with K = 2 on the [P(W)-P(S)] output, and to set the threshold at the point x where the distances to the two centres, m1 and m2, become equal. The distance to each mean, d1 and d2, is normalised with respect to the standard deviation, s1 and s2, of the corresponding cluster, to allow for the possibility of different data densities around each mean or cluster centre, as shown below:

d1 = (1/s1) ‖x − m1‖    (6.4)

d2 = (1/s2) ‖x − m2‖    (6.5)
To find the threshold (x in Eq. 6.6 below) these two distances are made equal. Fig. 6.14 illustrates the
procedure to find the EDM threshold for OSA subject 2.
d1 = d2

⇒ (1/s1) ‖x − m1‖ = (1/s2) ‖x − m2‖

⇒ (1/s1) √((x − m1)²) = (1/s2) √((x − m2)²)

⇒ (1/s1²)(x − m1)² = (1/s2²)(x − m2)²    (6.6)
Expanding the squared binomials on both sides of Eq. 6.6 and collecting terms gives a quadratic in x:

(s2² − s1²)x² − 2(m1s2² − m2s1²)x + (m1²s2² − m2²s1²) = 0    (6.7)
Solving the quadratic for x yields two possible solutions:

x1 = (m1s2 − m2s1) / (s2 − s1)    (6.8)

x2 = (m1s2 + m2s1) / (s2 + s1)    (6.9)

one of which lies outside the range [m1, m2] (taking m1 ≤ m2) and is therefore discarded, while the other sets the EDM threshold.
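The whole EDM procedure can be sketched as below. This is our own illustrative implementation, not the thesis code: a simple 1-D two-means iteration followed by the root of Eq. 6.9, which, being a convex combination of the two means, always lies inside [m1, m2].

```python
import numpy as np

def edm_threshold(depth, iters=100):
    """Equi-distance-to-means threshold: run 2-means on the 1-D
    [P(W)-P(S)] values, then take the point whose distances to the
    two centres, normalised by each cluster's standard deviation,
    are equal (the admissible root of the EDM quadratic)."""
    x = np.asarray(depth, dtype=float)
    m1, m2 = x.min(), x.max()                    # initial centres
    for _ in range(iters):                       # 1-D K-means, K = 2
        in2 = np.abs(x - m2) < np.abs(x - m1)
        m1, m2 = x[~in2].mean(), x[in2].mean()
    s1, s2 = x[~in2].std(), x[in2].std()
    if m1 > m2:
        m1, m2, s1, s2 = m2, m1, s2, s1
    # (m1*s2 + m2*s1)/(s2 + s1) always lies between m1 and m2
    return (m1 * s2 + m2 * s1) / (s2 + s1)
```

For two clusters of equal spread this reduces to the midpoint of the two means, which is consistent with most subjects' EDM thresholds falling close to 0.5 in Table 6.6.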
The new results for the automated system using the equi-distance to means threshold are presented in
Table 6.6.
Figure 6.14: [P(W)-P(S)] output for OSA sleep subject 2 (top) and the amplitude histogram of [P(W)-P(S)] showing the two main clusters, each surrounded by a circle of one standard deviation, and the EDM threshold (bottom).
Subject   EDM threshold    Se     PPA    Corr
2             0.47        0.97   0.94   0.83
3             0.55        1.00   0.90   0.83
4             0.47        1.00   1.00   0.88
5            -0.27        0.97   1.00   0.34
6             0.46        0.72   0.95   0.82
7             0.47        1.00   0.88   0.82
8             0.49        1.00   1.00   0.56

Table 6.6: Equi-distance to means (EDM) threshold
Table 6.6 shows that, for the majority of the subjects, the EDM threshold lies within 0.5 ± 0.05. Therefore, the simple approach of setting the threshold half way between REM/light sleep and wakefulness (i.e. [P(W)-P(S)] = 0.5) was also tested; the results are shown in Table 6.7. Fig. 6.15 compares the results obtained with the two methods for setting the threshold against those obtained with the optimal threshold. Except for subject 5, the two methods can be seen to give very similar results.
Subject   Threshold    Se     PPA    Corr
2            0.5      0.97   0.94   0.83
3            0.5      1.00   0.90   0.81
4            0.5      0.96   1.00   0.87
5            0.5      0.26   0.90   0.73
6            0.5      0.72   0.95   0.82
7            0.5      1.00   0.88   0.83
8            0.5      1.00   1.00   0.56

Table 6.7: Fixed (0.5) threshold
Figure 6.15: Se, PPA and Corr per subject for the best threshold (blue), the EDM threshold (red), and a 0.5 fixed threshold (green).
6.2.5 Discussion
From the results obtained with the automated scoring system (shown in Fig. 6.15), two OSA subjects
stand out, subject 5 and subject 8, because of their low correlation values in relation to the rest of the
subjects. In order to investigate this, we examined the EEG and its power spectral density (PSD) for
these two subjects during the intervals which were scored as a μ-arousal by the human expert. The EEG
revealed that OSA subject 5 falls into a much deeper sleep than the other subjects before the onset of
a μ-arousal. Some of the deep sleep EEG is usually scored by the human expert as being part of the μ-
arousal. Thus, this subject’s μ-arousals are characterised by an increase in magnitude both for the lower
frequencies (which is unusual) and the higher frequencies (which is the expected EEG change during a
μ-arousal) during the first few seconds of the event. For the rest of the μ-arousal, the EEG is generally
dominated by α activity (see §3.4.3), which is often interpreted as light sleep by the neural network. This is illustrated in Fig. 6.16, which shows 24 seconds of EEG and the corresponding [P(W)-P(S)] output
during a μ-arousal event for subject 5. The start and end of the event, as determined by the expert scorer,
are shown by the broken vertical lines. Fig. 6.17 shows the 1s resolution spectrogram (PSD vs time)
of the EEG segment shown in Fig. 6.16. Note the increase in magnitude of both the δ and α rhythms
during the first few seconds of the μ-arousal and also the prevalence of the peak at 10Hz, indicating the
presence of α rhythm throughout the event. The relatively high power in the lower frequency bands of subject 5's EEG may be the reason why some events are missed entirely by the automated system, as is shown in Fig. 6.18.
Figure 6.16: OSA subject 5: EEG and [P(W)-P(S)] output during a typical μ-arousal for this subject (24 s).
The other subject with a low correlation value (Corr = 0.56 for both the EDM threshold and the 0.5 threshold) is subject 8, whose EEG has a high frequency content and shows a reduction in the higher frequencies prior to the onset of the μ-arousals. The [P(W)-P(S)] output is near 1 (wakefulness) most of the time, falling to low negative levels in the few seconds prior to the start of the μ-arousal, with the result that the μ-arousals identified by the automated system are longer than those scored by the expert. Fig. 6.19 shows a 2-minute section of the [P(W)-P(S)] output and the corresponding scores from the human expert.
Figure 6.17: Spectrogram of the EEG segment shown in Fig. 6.16, calculated with 1 s resolution using 10th-order AR modelling (δ, θ, α and β bands marked).
Comparing results with those using a 1-second analysis window
Previous work in the Neural Networks Research Group [123][175][176] used a 1-second window with no overlap for the EEG feature extraction, but we found that, with such a window length, the misclassification error of the MLP on the validation set for normal sleep is greater than 10%, compared with the 5.75% obtained with the 3-s window. Fig. 6.20 shows the [P(W)-P(S)] output using the 1-s window for normal subject 9, compared with the [P(W)-P(S)] output using the 3-s window, together with the corresponding expert scores. The “noisier” appearance of the output in relation to the 3-s case is likely to be due to the higher variance of the AR estimates.
Also, the sensitivity (median 0.77) and correlation (median 0.76) of the μ-arousal detection are lower using a 1-s window than using a 3-s window (median Se 0.97 and median Corr 0.82).
6.3 Summary
In this chapter two databases have been presented, corresponding to normal sleep and OSA sleep. The
normal sleep database consists of nine all-night EEG recordings using the central electrode montage. The
EEG is labelled independently, according to the R&K rules, by three human experts on a 30-second basis.
Figure 6.18: OSA subject 5: EEG and [P(W)-P(S)] output during a μ-arousal missed by the automated scoring system (24 s).
The OSA sleep database has seven 20-minute frontal EEG records corresponding to seven subjects with
severe OSA. The records have been scored for μ-arousals by a human expert using the ASDA rules.
An investigation was made to select the algorithm for the estimation of the reflection coefficients, used
to represent the frequency content of the EEG, and also to select the number of samples in the analysis
window. The Burg algorithm was selected for its low computational cost and competitive performance. A
3-second window with 2-second overlap was chosen as a compromise between minimising the variance
of the AR coefficient estimates and the requirement to ensure stationarity of the EEG.
Based on previous work [123], three classes were chosen to describe normal sleep, namely Wakefulness,
REM/light sleep and Sleep stage 4, and a balanced feature set was formed to train a neural network to
estimate the posterior probabilities of class membership.
A 2-layer MLP with the softmax function for the output units was used. The backpropagation algorithm
for multiple classes and the scaled conjugate gradient optimisation algorithm were used to train the
network (cross-entropy error function). Optimisation of the MLP parameters, number of hidden units
and weight decay terms, was achieved using cross-validation. The optimal network (performance on the
validation set) is a 10-6-3 MLP with weight decay parameters (νz, νy) values at 10−4 and 10−5 respectively.
Figure 6.19: OSA subject 8: [P(W)-P(S)] output and human expert scores (2 minutes).
The percentage of misclassification on the test set achieved with this network is 6.28%.
The optimal MLP was used to analyse the all-night EEG record of a subject from the normal sleep
database. The time courses of two of the three MLP outputs were combined to give a measure of sleep
depth [P(W )-P(S)], which shows a high correlation with the hypnogram generated by a human expert,
suggesting that the MLP is able to interpolate between classes for intermediate sleep stages 2 and 3.
The sleep EEG of OSA subjects was analysed using the optimal MLP. The time courses of the [P(W)-P(S)] output show severe disruption of the sleep. A method for automated μ-arousal detection, using thresholding of the [P(W)-P(S)] output, was introduced. The output of the automated scores was post-processed to follow the ASDA rules for μ-arousal scoring as closely as possible. Sensitivity, positive predictive accuracy and correlation were used to evaluate the performance of the automated detection system with respect to the human expert scores. The correlation measure was used to choose the optimal threshold value
per subject, and two methods for setting the threshold, one of them subject-adaptive, were applied retrospectively. The results for five of the seven subjects show a high correlation value (greater than 0.8), with values of Se and PPA mostly over 0.9. The lower correlation values (0.56 and 0.34-0.73) obtained for the other two subjects may be explained by these subjects having different types of μ-arousal.
Figure 6.20: Sleep database subject 9: raw P(W)-P(S) using a 1-s analysis window (a) and using a 3-s analysis window (b), compared to the human-expert-scored hypnogram (c).
6.4 Conclusions
The neural network, trained with normal sleep data, is capable of following the abrupt transitions in
the sleep EEG of OSA patients. The methods introduced for automated μ-arousal detection were able
to identify a high percentage of the events scored by the human expert, giving the beginning and the
end times for the μ-arousal with relatively high accuracy (as measured by a simple correlation index)
for most of the OSA subjects in the database. The study of the subjects with low correlation levels in
the automated μ-arousal detection showed different changes in the EEG frequency content prior to and
during the μ-arousal.
The 3-second analysis window with a 2-second overlap for the AR modelling has yielded better results in
terms of MLP performance and in the sensitivity and correlation of the μ-arousal detection.
Chapter 7
Visualisation of the alertness-drowsiness continuum
Daytime drowsiness or sleepiness is a common complaint in patients with OSA. A full assessment of an
OSA case may include a vigilance test after a night-time sleep recording has been performed. In any case,
it would be very useful for clinicians to have a method of assessing the day-time performance of OSA
patients in relation to the severity of their sleep disorder.
Drowsiness is a state in which a person will easily fall asleep in the absence of external stimuli. It is quite
different from exhaustion as a result of physical activity. While drowsiness is a mental state which occurs
prior to sleep, its opposite, alertness, is a physiologically activated state of the human brain, characterised
by consciousness and awareness. Human beings experience fluctuations in their levels of alertness during
the day because of the circadian rhythm. These fluctuations can be affected by sleep deprivation or low
quality of sleep as is the case with OSA.
In this chapter we investigate changes in the level of alertness which may be gradual rather than abrupt,
like the short events (arousals during sleep) of the previous chapter. Two databases are considered:
1. The “sleep database”, previously used for training neural networks to track the sleep-wake continuum and hence detect arousals in test data. This has the three previously defined categories of wakefulness, REM/light sleep and deep sleep.
2. The “vigilance database”, described below, in which eight sleep-deprived subjects perform vigilance tasks while having their EEG monitored. For reasons which are explained below, there are two broad categories in this database: alertness and drowsiness.
One important question is the inter-relationship and overlap between these five categories. For example,
wakefulness in the sleep database corresponds to a mental state in which the subjects lie in bed with their
eyes shut in a darkened room. On the other hand alertness in the vigilance database represents a state in
which the subjects are awake, with their eyes open, in a well-lit room in front of a computer screen. In
both instances, the subjects are awake but their EEG activity may be different.
In both the analysis of the sleep EEG and of the vigilance EEG [153][127][170][171][104][50], it is the frequency content of the signal which is used to characterise it. Although a 5th-order AR model has been used previously [50] in the analysis of vigilance EEG, we decided that, in order to be able to compare EEG signals from the two databases, the same parameterisation should be used in both cases, namely reflection coefficients from a 10th-order model. The inter-relationship between these coefficients for the different classes will be visualised in 2-D using both the Sammon map and the NEUROSCALE algorithm.
The rest of this chapter is organised as follows. Firstly, the vigilance database used both in previous work [50] and in subsequent chapters is introduced. Secondly, the Sammon map and the NEUROSCALE algorithm for visualising the high-dimensional data are applied to the vigilance database to investigate the separation (or overlap) between the two classes, alertness and drowsiness. Finally, the visualisation tools are used to study the inter-relationships between the EEG patterns of the five categories present in the two databases together.
7.1 The vigilance database
The Department of Psychology at the University of West England conducted a study in which eight healthy
young subjects performed various vigilance tests for approximately 2 hours (see Appendix C), after a
night of sleep deprivation and no stimulant consumption for 24 hours before or during the test. The
EEG was recorded from a number of sites on the scalp, but only the central (C4) site recordings, as in the sleep EEG studies, were used in the work described in this thesis. Expert scoring based on visual assessment of the EEG, EMG and EOG was undertaken on a 15-second basis, according to the Alford et al. sub-categories of Table 3.2 [8]. A brief summary of the database is given in Appendix C, and the table is reproduced in simpler format below:
Vigilance sub-category                      Description
Active Wakefulness (Active)                 active/alert, > 2 eye movements/epoch, definite body movement
Quiet Wakefulness Plus (QWP)                active/alert, > 2 eye movements/epoch, possibly body movement
Quiet Wakefulness (QW)                      alert, < 2 eye movements/epoch, no body movement
Wakefulness with Intermittent α (WIα)       bursts of α for less than half of an epoch
Wakefulness with Continuous α (WCα)         bursts of α for more than half of an epoch
Wakefulness with Intermittent θ (WIθ)       bursts of θ for less than half of an epoch
Wakefulness with Continuous θ (WCθ)         bursts of θ for more than half of an epoch

Table 7.1: Alford et al. vigilance sub-categories
In previous work in the Neural Networks Research Group, Duta [50] investigated the tracking of fluctuations in vigilance using both the central and mastoid (behind the ears) EEG sites. In that work the EEG was divided into one-second segments. Since the expert scoring of the (central) EEG was undertaken on a 15-second timescale, a large number of 1-s segments are wrongly labelled: for instance, a 1-s segment from a 15-s epoch of vigilance category WIα may consist predominantly of α-wave activity, whereas another segment in the same epoch may correspond to Quiet Wakefulness (QW). Duta re-labelled the data using a combination of the expert scoring and Kohonen feature maps to visualise the cluster to which each one-second segment belonged. As a result, she defined two categories:
1. Alertness: one-second segments which are labelled by the expert as Active, QWP or QW, and have
corresponding feature vectors which are mapped onto the area of the Kohonen map mostly visited
by the Active, QWP and QW sub-categories and not visited by WIα, WCα and WIθ.
2. Drowsiness: one-second segments labelled WIα, WCα or WIθ whose feature vectors visit the area
of the Kohonen map mostly visited by the WIα, WCα and WIθ sub-categories and not visited by
Active, QWP and QW.
In addition, an extra class of Uncertain was defined, containing the one-second segments whose feature vectors are mapped onto an area of the Kohonen map visited by feature vectors extracted from one-second segments from all vigilance sub-categories. There are approximately 8,000 and 20,000 one-second segments belonging to the Drowsiness and Alertness classes respectively, although the distribution is not uniform amongst the subjects. The distribution of patterns per subject per class is shown in Table 7.2.
Class \ Subject     1     2     3     4     5     6     7
Drowsiness        282  1541   413   804  1817  1218  2116
Intermediate     1220  2038  1978  2262  1181  2084  1749
Alertness        4802  1896  3280  3394  2591  2416  1368
Artefact         1151  2625  2204  2195  2286  4797  2717
Total            7455  8100  7875  8655  7875 10515  7950
Table 7.2: Number of patterns per subject per class in vigilance training database
7.1.1 Pre-processing
Although the data in this database was sampled at 256 Hz, it is down-sampled to 128 Hz in order to keep
the pre-processing filters and AR modelling consistent across all databases in this thesis. Ten reflection
coefficients per second are calculated using the Burg algorithm for each 3-s window with 2-s overlap, as
with the sleep database (see section 6.1.3).
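The extraction step can be sketched in Python; this is a minimal illustration of the Burg (lattice) recursion and the 3-s/1-s-step windowing, not the thesis code, and the decimation from 256 Hz to 128 Hz is assumed to have been done beforehand with appropriate anti-alias filtering:

```python
import numpy as np

def burg_reflection(x, order=10):
    """Reflection (lattice) coefficients of x via Burg's recursion."""
    f = np.asarray(x[1:], dtype=float)   # forward prediction errors
    b = np.asarray(x[:-1], dtype=float)  # backward prediction errors
    ks = []
    for _ in range(order):
        k = -2.0 * (f @ b) / ((f @ f) + (b @ b))
        ks.append(k)
        # lattice update, then shift so f(n) stays aligned with b(n-1)
        f, b = (f + k * b)[1:], (b + k * f)[:-1]
    return np.array(ks)

def eeg_features(x, fs=128, win_s=3, step_s=1, order=10):
    """One 10-coefficient feature vector per second: 3-s window, 2-s overlap."""
    win, step = win_s * fs, step_s * fs
    return np.array([burg_reflection(x[i:i + win], order)
                     for i in range(0, len(x) - win + 1, step)])
```

Each 3-s window advances by 1 s, which gives the 2-s overlap and one feature vector per second of EEG.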
7.1.2 Visualising the vigilance database
Ideally we would take an equal number of Alertness (A) and Drowsiness (D) patterns per subject in order
to have every subject equally represented when training the visualisation algorithm. Unfortunately, some
subjects in the database have a very small number of patterns for the Drowsiness class. If we take 800
patterns per class per subject, 5 out of the 7 subjects can provide this number. A training set is then built
by randomly selecting 800 patterns per class for each subject, or the maximum available when this is not
possible (see Table 7.3).
The visualisation algorithms used in this thesis, the Sammon map and NEUROSCALE, require a small
number of feature vectors for a reasonable convergence time. With approximately 5,000 patterns per
class, a reduction in the size of the training set is needed. Using the K-means clustering algorithm, the
Class \ Subject    1    2    3    4    5    6    7   Total
D                282  800  412  800  800  800  800    4694
A                800  800  800  800  800  800  800    5600
Note that only seven subjects are listed above. The eighth subject was discarded for reasons explained later.
Table 7.3: Number of patterns per subject per class in K-means training set
number of patterns in the training set is reduced to about 200 mean patterns per class (by choosing 14
means per subject per class).
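The reduction step can be sketched as a plain Lloyd's-algorithm K-means; the deterministic spread initialisation below is an assumption, since the thesis does not specify how the means are initialised:

```python
import numpy as np

def kmeans(X, k=14, iters=50):
    """Lloyd's algorithm with a deterministic spread initialisation (an assumption)."""
    X = np.asarray(X, dtype=float)
    centres = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # squared Euclidean distance from every point to every centre
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for j in range(k):
            members = X[assign == j]
            if len(members):              # keep the old centre if a cluster empties
                centres[j] = members.mean(0)
    return centres
```

Running this per subject per class with k = 14, as in the text, yields the reduced set of mean patterns used for the visualisation algorithms.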
The Sammon map and NEUROSCALE algorithms are run independently with the reduced dataset using
the same parameters as for the sleep database (for Sammon map, gradient proportionality factor = 0.06;
and for NEUROSCALE, number of basis functions = 50). The projections of the means produced by
both visualisation techniques, presented in Figs. 7.1 and 7.2, are very similar and show two partially
overlapping clusters, representing the A and D classes respectively. Of course, the overlap does not
necessarily occur in the 10-D space as it does in the 2-D projection, in the same way as the edges of a 3-D
cube may touch each other in a 2-D projection.
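A minimal sketch of the Sammon projection by gradient descent on the stress is given below; the step size plays the role of the gradient proportionality factor quoted above. The initialisation from the first two feature dimensions is an assumption for illustration, and Sammon's original method uses a pseudo-Newton update rather than the plain gradient step shown here:

```python
import numpy as np

def sammon_stress(D, Y):
    """Sammon stress between target distances D and 2-D configuration Y."""
    d = np.sqrt(((Y[:, None] - Y[None]) ** 2).sum(-1))
    iu = np.triu_indices(len(Y), 1)
    return ((D[iu] - d[iu]) ** 2 / D[iu]).sum() / D[iu].sum()

def sammon(X, iters=300, step=0.06, seed=0):
    """2-D Sammon projection by plain gradient descent on the stress."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    c = D[np.triu_indices(len(X), 1)].sum()
    # initialise from the first two feature dimensions (an assumption)
    Y = X[:, :2] + rng.normal(scale=1e-3, size=(len(X), 2))
    Dsafe = D.copy()
    np.fill_diagonal(Dsafe, 1.0)          # dummy value; the diagonal is never used
    for _ in range(iters):
        diff = Y[:, None] - Y[None]       # (n, n, 2) pairwise differences
        d = np.sqrt((diff ** 2).sum(-1))
        np.fill_diagonal(d, 1.0)
        w = (Dsafe - d) / (Dsafe * d)
        np.fill_diagonal(w, 0.0)
        grad = (-2.0 / c) * (w[:, :, None] * diff).sum(1)
        Y -= step * grad
    return Y
```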
Visualising the feature vectors for each subject
The cluster size in the Sammon maps shown in Fig. 7.1, represented by the radius of the circles around the
cluster mean, is calculated by counting the number of feature vectors in the training set which “belong”
to that mean (as defined by the Euclidean distance in 10-D between the feature vector and the cluster
mean). The distribution of patterns per subject can also be investigated by considering only the feature
vectors belonging to a specific subject. The results of using the Sammon and NEUROSCALE algorithms on
each subject individually are shown in Figs. 7.3 and 7.4.
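The circle radii described above amount to a nearest-mean count, which can be sketched as follows (illustrative only):

```python
import numpy as np

def cluster_sizes(features, means):
    """Number of feature vectors whose nearest mean (Euclidean, 10-D) is each mean."""
    d2 = ((features[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    return np.bincount(d2.argmin(1), minlength=len(means))
```

Restricting `features` to the vectors of a single subject gives the per-subject distributions plotted in the figures.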
7.1.3 Discussion
The maps showing the distribution of the patterns per subject reveal some differences between subjects.
One of the subjects in the database, subject 8 (not shown in the tables), was discarded because she was
(a) Both classes
(b) Drowsiness (c) Alertness
Figure 7.1: Vigilance Sammon map
identified by the expert who scored the records as belonging to the minority class α+ (see sections 3.4.1
and 3.4.3), a condition in which the subject’s EEG shows an α-rhythm during eyes-open wakefulness
[87]. Although alpha-plus people represent a significant fraction of the population, the lack of data
and subjects for this category in our database makes it difficult to include it in the rest of the analysis.
However, the data from subject 8 allows us to exploit the advantage that the NEUROSCALE algorithm
has over the Sammon map algorithm. The trained NEUROSCALE network can be used on previously
unseen data, provided that the new data is drawn from the same probability distribution as the training
data. Thus, the NEUROSCALE network trained with the 7-subject training set described in Table 7.3, can
be used with this α+ subject as input in order to visualise the A and D patterns of this subject with
respect to those from the rest of the subjects. Fig. 7.4h clearly shows that the D patterns for subject 8
(a) Means only: "Vigilance NeuroScale, 14 means per class, 50 basis functions, 500 iterations" (classes: Drowsy, Alert)
(b) All patterns: "Projection on Vigilance NeuroScale map (14 means per class)"
Figure 7.2: Vigilance NEUROSCALE map
lie mostly in the area where the A patterns from the other subjects are found. Given that this subject's
EEG differs from the EEG of most of the population, it is very likely that the NEUROSCALE neural network
is extrapolating when presented with this subject’s patterns as they are not represented in its training
set. Another NEUROSCALE neural network is therefore trained, this time with subject 8’s mean patterns
added to the training set. The resulting 2-D plot for this 8-subject training set is shown in Fig. 7.5 and
the projection of subject 8’s patterns using this neural network is shown in Fig. 7.6h.
This figure reveals an interesting phenomenon which could not be seen in the 7-subject NEUROSCALE 2-D
projection. On Fig. 7.6h, the D patterns from subject 8 lie in an area where there are no patterns from
any other subject. Also, subject 8's A patterns overlap completely with her D patterns. This can be traced
to the first five reflection coefficients for the D class which, for subject 8, have mean values different from
those of the other subjects (see Fig. 7.7).
7.2 Visualising vigilance and sleep data together
To explore the relationship in feature space between the sleep and vigilance classes, a NEUROSCALE
neural network is trained with means extracted both from the sleep database classes Wakefulness (W),
REM/Light-sleep (R) and Deep-sleep (S) and from the vigilance categories Alertness (A) and Drowsiness
(D). An equal number of means is extracted for each class from the databases giving a total of 210 means.
The resulting NEUROSCALE plot of the means is shown in Fig 7.8. The maps showing the projection of
the feature vectors for each of the five classes can be seen in Fig 7.9.
A Sammon map was also trained with the means from the combined sleep-vigilance databases. The
results shown in Fig. 7.10 are comparable to those obtained with NEUROSCALE (Figs. 7.8 and 7.9), but
the relation between pairs of classes may be seen more clearly on the Sammon map, as shown in
Fig. 7.11.
7.2.1 Discussion
It can be seen from Fig. 7.11b that the Wakefulness class from the sleep database is broader than the
Alertness category from the vigilance database. Although the Alertness patterns are mostly mapped
onto a region of the map covered by the Wakefulness class, it is not necessarily correct to say that the
Alertness category is a subset of the Wakefulness class. On the one hand, we have the Alert patterns of
sleep-deprived subjects performing a rather boring task (see Appendix C), fighting to remain awake. On
the other hand, we have the Wakefulness patterns from subjects lying in bed, ready to sleep, in a quiet,
dark and comfortable room. It is not known whether these subjects were relaxed or not, but it is very
likely that they were not concentrating their mind on anything in particular. The overlap between these
two classes is understandable but it was also expected that there would be a region for each class not
shared with the other one. It is possible that this region may be represented by three dense Alertness
clusters at the lower edge of this class on the Sammon map, a region not visited by any other class. The
same region is seen in the NEUROSCALE plot as the right-hand side of Alertness category in Fig. 7.9d. It
is also encouraging to find a small area where the Wakefulness patterns on the Sammon map overlap the
Drowsiness patterns but not the Alertness patterns (see Figs. 7.11b and 7.11c).
The spatial relationship between Alertness, Drowsiness and REM/Light Sleep is shown in Figs. 7.11d and
7.11e. There is a large area of overlap between REM/Light Sleep and Drowsiness, but the REM/Light
Sleep area only overlaps Alertness in the area where the latter overlaps Drowsiness. This is reasonable,
as the brain cortex, fully active when the subject is alert, is randomly stimulated during REM sleep. The
Drowsiness area extends onto the Wakefulness area towards the upper-centre border of the map, where
it becomes the dominant class. The centre-left region of the map is dominated by REM/Light Sleep.
Finally, Fig. 7.11f shows two well-defined, completely separated clusters representing the 2-D projections
for Drowsiness and Deep Sleep. This is expected as Drowsiness only includes short bursts of θ rhythm
and no δ rhythm, while Deep Sleep patterns consist mainly of δ waves with some occasional θ rhythm.
From the visualisation maps, the following hypotheses can be formulated:
• A transition from an alert state of mind to sleep may progress from the area exclusive to Alertness,
through the Drowsiness area shared by A, W and D, then the Light Sleep area shared by A, D and R,
and finally into Deep Sleep.
• Another transition, from a relaxed state of Wakefulness to sleep, starts from the region of Wakefulness
not shared with Alertness, moves towards the Drowsiness area shared by W and D, then into the Light
Sleep area shared by R and D only, eventually reaching Deep Sleep.
7.3 Conclusions
In this chapter, we have analysed the EEG recordings from the vigilance database, which consists of 2-
hour recordings from seven healthy sleep-deprived subjects performing vigilance tasks. Two vigilance
categories were defined, namely Alertness and Drowsiness, and used to label 1-s EEG segments based on
the scores from a human expert. The EEG was processed in the same way as for the sleep database. A
near-balanced training set was built from the vigilance database by randomly selecting an equal num-
ber of patterns per subject and per class. Visualisation of the data distribution in the feature space
revealed inter-subject variability in both the Alertness and Drowsiness classes. An interesting example
was discussed, namely an α+ subject whose Drowsiness patterns seem to be different from the rest of the
feature vectors in the training set.
A further visualisation study was carried out integrating the sleep and vigilance categories in one training
set. From this analysis we may draw the following conclusions:
1. The Alertness and Drowsiness patterns give rise to two well-defined but partially overlapping clus-
ters.
2. Wakefulness (from the sleep database) is a very broad class that includes some alert patterns as
well as some drowsy ones.
3. There is a small but relatively dense area beyond Wakefulness occupied by Alertness only.
4. The area shared only by Wakefulness and Drowsiness patterns may represent sleep onset not
included in the REM/Light Sleep region.
5. The REM/Light Sleep and Drowsiness classes overlap significantly but not totally with obvious areas
not represented by any other class.
6. Deep Sleep is a separate class, of relatively low importance for the study of vigilance.
It is obvious that the vigilance categories, Alertness and Drowsiness, are not fully represented by any of
the sleep classes and therefore require a separate neural network analysis.
(a) All subjects (b) Subject 1
(c) Subject 2 (d) Subject 3
(e) Subject 4 (f) Subject 5
(g) Subject 6 (h) Subject 7
Figure 7.3: Vigilance Sammon map showing each subject's distribution (Alertness in red and Drowsiness in blue)
(a) Subject 1   (b) Subject 2   (c) Subject 3   (d) Subject 4
(e) Subject 5   (f) Subject 6   (g) Subject 7   (h) Subject 8
Figure 7.4: Vigilance NEUROSCALE map projections for each subject (Alertness in magenta and Drowsiness in blue)
(a) Means only: "8-subject vigilance NeuroScale, 192 means per class, 50 basis functions, 500 iterations" (classes: Drowsy, Alert)
(b) All patterns: "Projection on the 8-subject Vigilance NeuroScale map"
Figure 7.5: Vigilance NEUROSCALE map trained with all subjects, including the α+ subject
(a) Subject 1   (b) Subject 2   (c) Subject 3   (d) Subject 4
(e) Subject 5   (f) Subject 6   (g) Subject 7   (h) Subject 8
Figure 7.6: Vigilance NEUROSCALE trained with all subjects, including the α+ subject (Alertness in magenta and Drowsiness in blue)
(Histograms of reflection coefficients 1 to 6 for the Drowsiness patterns)
Figure 7.7: Subject 8 reflection coefficient histogram (green) in relation to the rest of the subjects in the training set (magenta)
"NeuroScale with vigilance and sleep data, 42 means per class, 50 basis functions, 500 iterations"
(classes: Wakefulness, REM/light-sleep, Deep-sleep, Drowsiness, Alertness)
Figure 7.8: Vigilance and sleep NEUROSCALE map
(a) Wakefulness   (b) REM/light sleep   (c) Deep sleep   (d) Alertness   (e) Drowsiness
Figure 7.9: Vigilance and sleep NEUROSCALE projections for all the patterns in each class (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue)
(a) All classes (b) Wakefulness
(c) REM/light sleep (d) Deep Sleep
(e) Alertness (f) Drowsiness
Figure 7.10: Vigilance and Sleep Sammon map (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue)
(a) All classes (b) Alertness and wakefulness
(c) Wakefulness and drowsiness (d) REM/light sleep and drowsiness
(e) REM/light sleep and alertness (f) Deep Sleep and drowsiness
Figure 7.11: Vigilance and Sleep Sammon map (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue)
Chapter 8
Training a neural network to track the alertness-drowsiness continuum
At the end of the previous chapter, we showed that a neural network used to assess the level of drowsi-
ness in OSA patients should be trained using exclusively vigilance labelled patterns. In this chapter we
train and test a neural network to track the alertness-drowsiness continuum using single-channel EEG
recordings from control subjects performing vigilance tests.
8.1 Neural Network training
The visualisation techniques applied to vigilance data in section 7.1.2 showed a high degree of overlap
between the A and D classes in the 2-D projection of the vigilance database feature vectors. Despite this
overlap, a neural network may be able to resolve the differences using 10-D feature vectors as inputs. We
expect that a two-class neural network trained exclusively with patterns from the extreme conditions of
fully alert (A) and fully drowsy (D), will be able to interpolate when a pattern belonging to an interme-
diate stage is presented at the input. In this way the vigilance continuum may be tracked by an output
fluctuating between the full alertness and the full drowsiness levels.
8.1.1 The training database
The 7-subject vigilance database described in chapter 7 is used for the neural network training and
testing. The set of 10 reflection coefficients extracted from the A and D patterns used for visualisation in
§7.1.2 is now used in this chapter for the training process.
8.1.2 The neural network architecture
An MLP neural network is selected for the same reasons as in chapter 6. As with the sleep-wake contin-
uum network, the cross-entropy error function and the scaled conjugate gradient optimisation algorithm
are used during the training process. Given that only one output is required in a two-class problem1, the
configuration for the MLP is 10-J-1, the output representing the posterior probability of the input vector
belonging to the alertness class. The estimate of the number of hidden units J given by Eq. 5.72, i.e.
the geometric mean of the number of inputs and the number of outputs, is √(10 × 1) ≈ 3.16. Hence a
search for the optimum J is done by training 10-J-1 MLPs with values of J from 2 to 15. As before, the
problem of over-fitting the network is dealt with by introducing regularising terms νz and νy, one for
each weight layer. Based on results from a preliminary investigation, the values of the regularisation
parameters are varied between 10−3 and 1 for the input-to-hidden layer νz, and from 10−7 to 10−5 for
the hidden-to-output layer νy, increasing in powers of ten. To avoid being trapped in a local minimum,
three different random weight initialisations are used. Cross-validation is used, as before, to optimise the
MLP architecture and regularisation parameters.
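A minimal numpy sketch of such a 10-J-1 network is given below, trained on the cross-entropy error with the two weight-decay terms νz and νy. Plain batch gradient descent is used here in place of the scaled conjugate gradient algorithm, so this is an illustration of the architecture rather than the thesis implementation:

```python
import numpy as np

class VigilanceMLP:
    """10-J-1 MLP: tanh hidden units, single sigmoid output = P(Alertness | x)."""

    def __init__(self, n_in=10, J=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(n_in), (J, n_in))
        self.b1 = np.zeros(J)
        self.w2 = rng.normal(0.0, 1.0 / np.sqrt(J), J)
        self.b2 = 0.0

    def forward(self, X):
        self.Z = np.tanh(X @ self.W1.T + self.b1)
        return 1.0 / (1.0 + np.exp(-(self.Z @ self.w2 + self.b2)))

    def step(self, X, t, lr=0.2, nu_z=1e-3, nu_y=1e-6):
        """One gradient step on cross-entropy + weight decay; returns the loss."""
        y = self.forward(X)
        n = len(t)
        d2 = (y - t) / n                          # dE/da at the output unit
        g_w2 = self.Z.T @ d2 + nu_y * self.w2     # hidden-to-output decay nu_y
        g_b2 = d2.sum()
        d1 = np.outer(d2, self.w2) * (1.0 - self.Z ** 2)
        g_W1 = d1.T @ X + nu_z * self.W1          # input-to-hidden decay nu_z
        g_b1 = d1.sum(0)
        self.w2 -= lr * g_w2; self.b2 -= lr * g_b2
        self.W1 -= lr * g_W1; self.b1 -= lr * g_b1
        eps = 1e-12
        return -np.mean(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
```

Since the single sigmoid output estimates P(A | x), the drowsiness posterior follows directly as P(D | x) = 1 − P(A | x).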
8.1.3 Choosing training, validation and test sets
Ideally, balanced training and validation sets should be assembled for the cross-validation tests, assuming
equal prior probabilities for both classes. However, inter-subject differences were found when visualising
the vigilance database (§7.1.2). All the subjects should be equally represented in the training and valida-
tion sets, but as is shown in Table 7.2, the distribution of A and D patterns among the vigilance database
is very uneven. Using the same criterion as for the NEUROSCALE training set, 800 patterns (or fewer
when this is not possible) were drawn per class for each subject, yielding 5,494 patterns for Alertness
and 5,822 for Drowsiness.
1 P(D | x) = 1 − P(A | x)
Assigning these patterns to two equal-sized sets, we obtain approximately 2,800 patterns per class in
each set. With as few as 7 subjects in our database and the high degree of inter-subject variability seen
in the visualisation studies, the best strategy for training and testing the MLP will be the leave-one-out
method [159]. This involves leaving one subject out of the training and validation sets, so that it
can be used as a test subject, and repeating this for each subject in turn. This method leads to 7 different
partitions of the data, as shown in Table 8.1.
Partition   Training and validation subjects     A      D     Tr     Va   Test subject
1           2, 3, 4, 5, 6, 7                   4800   4412   4606   4606       1
2           1, 3, 4, 5, 6, 7                   4800   3894   4347   4347       2
3           1, 2, 4, 5, 6, 7                   4800   4282   4541   4541       3
4           1, 2, 3, 5, 6, 7                   4800   3894   4347   4347       4
5           1, 2, 3, 4, 6, 7                   4800   3894   4347   4347       5
6           1, 2, 3, 4, 5, 7                   4800   3894   4347   4347       6
7           1, 2, 3, 4, 5, 6                   4800   3894   4347   4347       7
Table 8.1: Partitions and distribution of patterns in training (Tr) and Validation (Va) sets
The MLP training and parameter optimising process can be summarised as follows:
1. Build training and validation sets on (n − 1) subjects using 800 patterns per subject per class (or
as many as are available). Repeat this for each subject2.
2. For each partition:
(a) Normalise training, validation and test sets with respect to the training set statistics.
(b) For each set of values of the network parameters (J, νz, νy) and weight initialisation seed,
train a 10-J-1 MLP using the cross-entropy error function and the scaled conjugate gradient
optimisation algorithm.
(c) Choose the optimal MLP based on the performance on the validation set.
(d) Test the optimal MLP on the nth subject. Compare with the expert assessment.
2Here n = 7
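Steps 1 and 2(a) above can be sketched as follows; the dictionary layout of the data (subject mapped to class mapped to an array of 10-D feature vectors) is a hypothetical arrangement assumed for illustration:

```python
import numpy as np

def loo_partitions(data, cap=800, seed=0):
    """Leave-one-subject-out partitions, up to `cap` patterns per class per subject.

    `data` maps subject -> {"A": (n, 10) array, "D": (m, 10) array}; this
    layout is a hypothetical arrangement assumed for illustration.
    """
    rng = np.random.default_rng(seed)
    subjects = sorted(data)
    for test_subject in subjects:
        X_parts, t_parts = [], []
        for s in subjects:
            if s == test_subject:
                continue
            for target, cls in ((1, "A"), (0, "D")):
                pats = data[s][cls]
                take = rng.permutation(len(pats))[:cap]
                X_parts.append(pats[take])
                t_parts.append(np.full(len(take), target))
        X = np.concatenate(X_parts)
        t = np.concatenate(t_parts)
        order = rng.permutation(len(X))
        half = len(X) // 2                       # two equal-sized halves
        tr, va = order[:half], order[half:2 * half]
        mu, sd = X[tr].mean(0), X[tr].std(0)     # training-set statistics
        yield (test_subject,
               ((X[tr] - mu) / sd, t[tr]),
               ((X[va] - mu) / sd, t[va]))
```

With the per-subject Drowsiness counts of Table 7.2, the training-set sizes produced by this sketch reproduce the Tr column of Table 8.1.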
Hence, the MLP parameter optimisation involves the training of the following number of networks: (7
partitions) × (3 weight initialisations) × (14 values of J) × (4 values of νz) × (3 values of νy) = 3,528
networks.
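This count can be checked by enumerating the search grid directly:

```python
from itertools import product

partitions = range(1, 8)                 # 7 leave-one-out partitions
seeds = range(3)                         # 3 random weight initialisations
J_values = range(2, 16)                  # J from 2 to 15 (14 values)
nu_z_values = (1e-3, 1e-2, 1e-1, 1.0)    # input-to-hidden decay, powers of ten
nu_y_values = (1e-7, 1e-6, 1e-5)         # hidden-to-output decay, powers of ten

grid = list(product(partitions, seeds, J_values, nu_z_values, nu_y_values))
# 7 * 3 * 14 * 4 * 3 = 3,528 configurations, one network trained per entry
```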
8.1.4 Optimal (n − 1)-subject MLP per partition
The optimisation of the MLP parameters yields the results shown in Table 8.2. Fig. 8.1 shows the average
variation in misclassification error for the validation set with respect to the number of hidden units J .
The optimum value for J is clearly between 3 and 4 for the majority of the partitions as estimated using
Eq. 5.72. Partition 5 is the only one which has a higher value for optimum J . The best 10-3-1 MLP for
partition 5 has a classification error of 18.79% on the validation set, but the optimum value of J = 13
was kept for this partition. Fig. 8.2 shows the average variation in misclassification error in the validation
set with respect to the regularisation parameters (νz,νy) for the 10-3-1 MLPs. Either from the plot or
from the table, it can be seen that the optimum regularisation parameters occur towards the end of the
ranges (10−3, 10−7) in many of the partitions, suggesting that the search could have been continued in
that direction. However, a previous investigation found that network performance on the validation set
drops significantly for smaller values of νz and νy. This is expected since the regularisation terms become
negligible with a consequent loss in generalisation.
partition J νz νy Tr error Va error
1 3 10−2 10−6 21.02 20.73
2 3 10−3 10−7 20.54 19.65
3 3 10−3 10−6 19.69 19.22
4 3 10−3 10−6 19.37 19.39
5 13 10−2 10−5 18.36 18.73
6 3 10−3 10−7 22.15 22.22
7 4 10−3 10−7 19.62 20.31
Table 8.2: Optimum MLP parameters per partition and percentage classification error for training (Tr) and validation (Va) sets
("Vigilance 6-subject MLP optimisation": average classification error [%] against the number of hidden units)
Figure 8.1: Average misclassification error for the validation set vs. number of hidden units J for the (n − 1)-subject MLP
("Vigilance 6-subject MLP optimisation (J=3)": average classification error [%] against log10(νz) and log10(νy))
Figure 8.2: Average misclassification error on the validation set with respect to regularisation parameters (νz, νy) for the (n − 1)-subject MLP with J = 3 (linear interpolation used between 12 values)
8.2 Testing on the nth subject
The optimal MLP for each partition is tested using the nth subject. Given that the main goal is not classi-
fication, but the tracking of the alertness-drowsiness continuum, the assessment of MLP performance on
test data is carried out on the time course of the MLP output instead of on a number of randomly selected
1-s segment feature vectors. The time course of the MLP output is compared with the expert assessment
of the subject’s vigilance according to the Alford et al. scale described in section 3.4.2. The time courses
of the MLP output and expert scores are shown in Figs. 8.3 to 8.9. Given that the expert scored the EEG
on a 15-s basis, the MLP output is filtered using a 15-pt median filter. This allows comparison with the
expert’s discretised representation of the alertness-drowsiness continuum.
8.2.1 Qualitative correlation with expert labels
A visual inspection of the time courses and corresponding expert labels reveals that, in all cases, the time
course of the MLP output follows the fluctuations in the vigilance scale fairly closely.
There is no difference between the MLP outputs corresponding to labels Active and QWP, for which it is
almost always 1.0. Subject 1’s time course shows that the MLP is not reaching the lower values associated
with the WIθ category. The 2-D projections of this subject’s feature vectors in Figs. 7.3 and 7.4 may give a
possible explanation. It can be seen in the figures that the D patterns of this subject lie in the overlapping
area between the A and D classes. The MLP is not always able to resolve the difference between the two
classes in this area, hence the posterior probabilities of belonging to either class are approximately equal
(MLP output ≈ 0.5). Subject 2 is not affected by this problem, the network performance being generally
as expected as the MLP output sweeps the [0-1] range in synchronism with the expert labels. Large
fluctuations remain, even after the filtering, but this is expected from an individual who goes from being
totally drowsy to being fully active several times during the recording. Subject 3 is similar to subject
1, as the MLP output does not reach the drowsiness levels. His A and D patterns in the 2-D feature
space projection also lie in an area of high overlap. The performance for subject 4 is poor, the output
remaining persistently high despite the multiple occurrences of the WIθ label. There are fewer problems
with label WIα, the MLP output reaching a value of around 0.5. This subject’s A patterns seem to be
divided into two clusters far apart in the 7-subject Sammon map (Fig. 7.3), and some of the corresponding means are not
visited by the A patterns of other subjects. In contrast, the MLP analysis yields good results in general
for the next three subjects, as with subject 2. Subject 5’s MLP output matches the expert labels with only
two major exceptions, around times 00:42 and 01:57 (42 and 117 minutes), in which the MLP output
is low when the expert labels are WIα-QWP. Similar errors can be found in the time course of the MLP
output for subject 6, when for brief periods of time (around 00:25, 00:47 and 01:18), the MLP output
is high when the subject labels are WIθ-WIα. Note that this subject’s A and D patterns are the furthest
away in the 2-D projection of the feature space, lying in areas of little or no overlap between classes. One
possible reason for the segments with the poor correlation in the MLP output time course is the presence
of artefacts, as occurs during the interval centered on 01:18. Subject 7’s MLP performance also shows
a good correlation with the expert labels, with just two segments at times 0:17 and 1:20 for which the
output fails to indicate an intermediate to high level of alertness.
8.2.2 Quantitative correlation with expert labels
To give a more objective measure of MLP performance on each test subject in turn, the 15-pt median fil-
tered MLP output range was divided into three sub-intervals. Values between 0.0 and 0.3 are considered
to match the drowsy labels WCα, WIθ and WCθ. The second interval, bounded between 0.3 and 0.7,
represents the intermediate state WIα, and values between 0.7 and 1.0 correspond to the alert
states Active, QWP and QW. Correlation of the median-filtered MLP output with the expert labels, on a
1-s basis, according to this assignment, reinforces the visual assessment (see Table 8.3). The gap between
the best and the worst values is as narrow as 16.4%, the worst correlation being found for subject 4, as
expected, and the best for subjects 1 and 6.
partition      1      2      3      4      5      6      7
correlation  60.93  53.04  50.82  44.56  47.45  58.38  49.13
Table 8.3: Percentage correlation between 1-s segments of the 15-pt median filtered MLP output and 15-s based expert labels
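The banded correlation measure can be sketched as follows; the grouping of expert labels into the three bands follows the text above, while the label strings themselves are assumed spellings:

```python
import numpy as np

# assumed grouping of the Alford et al. labels into the three output bands
BAND_OF_LABEL = {
    "WCα": 0, "WIθ": 0, "WCθ": 0,        # drowsy: output in [0.0, 0.3)
    "WIα": 1,                             # intermediate: [0.3, 0.7)
    "Active": 2, "QWP": 2, "QW": 2,       # alert: [0.7, 1.0]
}

def output_band(y):
    """Band index (0, 1 or 2) of each filtered MLP output value."""
    return np.digitize(np.asarray(y, dtype=float), [0.3, 0.7])

def percent_correlation(mlp_output, expert_labels):
    """Percentage of 1-s segments whose output band matches the expert label band."""
    pred = output_band(mlp_output)
    true = np.array([BAND_OF_LABEL[lab] for lab in expert_labels])
    return 100.0 * np.mean(pred == true)
```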
8.3 Training an MLP with n subjects
The results in the last section show that an MLP trained with the vigilance database is able to track the
alertness-drowsiness continuum. The set of optimal MLPs for all the partitions could be used to analyse
new data as a committee of networks (see §5.3.3). The new data would be presented to all the networks
and the average of the outputs used as an estimate of the alertness posterior probability P (A | x).
However, this average may conceal rapid changes in the alertness-drowsiness continuum which may be
important in the assessment of sleepiness in OSA patients. An alternative and easier approach is to train
an MLP using all seven subjects in the vigilance database in order to analyse subsequent test data. This
neural network will be referred to as the 7-subject MLP in the sections and chapters which follow.
The sequence of steps is as follows:
1. Build training and validation sets on n subjects using 800 patterns per subject per class (or as
many as are available). This yields a total of 2,347 D and 2,800 A patterns (see
Table 7.3) per set.
2. Normalise training and validation sets with respect to the training set statistics.
3. For each set of values of the network parameters (J, νz, νy) and weight initialisation seed, train a
10-J-1 MLP using the cross-entropy error function and the scaled conjugate gradient optimisation
algorithm. Use three different random initialisations for the weights to increase the chance of
finding a better minimum for the error function during the training process.
4. Choose the optimal MLP based on the performance on the validation set.
The range for the MLP regularisation parameters is the same as for the MLP trained with (n−1) subjects.
The number of hidden units J is varied from 2 to 10. Thus, the total number of 7-subject MLPs trained to
find the optimum parameters is 324. Fig. 8.10 shows the average misclassification error for the validation
set against the number of hidden units J . The optimal MLP is found at J = 3, with regularisation
parameters (νz, νy) optimal at (10−3, 10−6). The best classification error on the validation set is 20.24%,
with a corresponding error of 20.67% on the training set.
8.4 Summary and conclusions
In this chapter, the vigilance database has been used to train a single-output MLP in order to track the
alertness-drowsiness continuum. Wakefulness EEG is more susceptible to artefacts and rapid changes
than sleep EEG. When the high degree of overlap between the Alertness and Drowsiness classes is also
considered, this makes the analysis of the vigilance EEG a more difficult problem. As there are only 7 subjects
available and it was known from the visualisation studies that there existed a large amount of inter-subject
variability in the feature vectors, the leave-one-out method was used to train the MLP. For
a 7-subject database this method yields 7 data partitions, each with 6 subjects. Training and optimisation
of the MLP parameters was carried out for each partition, and the optimal network tested in each case
with the nth subject. The correlation between the MLP output and the expert labels varies from 44.6% to
60.9% across the subjects, showing that an optimal MLP trained with (n − 1) subjects from the vigilance
database is capable of tracking the variations in the level of alertness of the nth (test) subject. For further
use with unseen data, the MLP is re-trained using all the subjects in the 7-subject database. Its use in the
evaluation of test data acquired from other subjects is considered in the following chapter.
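The leave-one-subject-out partitioning used here is straightforward to express in code; a minimal sketch follows, where `subjects` stands for any list of per-subject data sets (an illustrative helper, not taken from the thesis):

```python
def leave_one_out_partitions(subjects):
    """Yield (train_subjects, test_subject) pairs.

    For a 7-subject database this gives 7 partitions, each training
    on the remaining 6 subjects and testing on the held-out one.
    """
    for i, test_subject in enumerate(subjects):
        train_subjects = subjects[:i] + subjects[i + 1:]
        yield train_subjects, test_subject
```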
Figure 8.3: Time course of the MLP output for vigilance subject 1 (raw output above, 15-pt median-filtered output with expert labels below; time in minutes)
Figure 8.4: Time course of the MLP output for vigilance subject 2
Figure 8.5: Time course of the MLP output for vigilance subject 3
Figure 8.6: Time course of the MLP output for vigilance subject 4
Figure 8.7: Time course of the MLP output for vigilance subject 5
Figure 8.8: Time course of the MLP output for vigilance subject 6
Figure 8.9: Time course of the MLP output for vigilance subject 7
Figure 8.10: Average misclassification error for the validation set vs. number of hidden units J for the 7-subject MLP
Chapter 9
Testing using the vigilance-trained network
The MLP trained with the vigilance database can now be used to track the vigilance continuum in new
OSA patients. This chapter presents the use of the 7-subject vigilance MLP with new data obtained during
a separate vigilance study in OSA patients.
9.1 Vigilance test database
A physiological vigilance study carried out by the Osler Chest Unit staff at the Churchill Hospital, Oxford, provides frontal EEG records from ten OSA subjects with varying degrees of severity of the sleep disorder. The EEG was recorded during a vigilance test which lasted for a maximum of 40 minutes, the duration depending on the degree of sleepiness of the subject during the test. The test, performed in a sleep-promoting environment, requires the subject to respond (by pushing a button) each time he sees a light-emitting diode (LED) flash for about 1 s. The LED flashes every 3 seconds and the test finishes after the subject misses 7 consecutive stimuli. More details about the test and clinical details of
the patients’ sleep disorders can be found in Appendix D. No expert scores are provided for this database,
just the button signal for every test. This can be used as a performance measure to validate the analysis
of the EEG. A summary of this database follows:
Number of subjects: 10
Condition: diagnosed with OSA
Description: 4 to 6 vigilance tests per subject, denoted with the letters A to F in chronological order
Electrode montage: frontal
Sampling frequency: 128 Hz
Number of expert scorers: none, but a performance measure is available
9.2 Running the 7-subject vigilance MLP with test data
9.2.1 Pre-processing
EEG signal: The EEG data was pre-processed with the 19-point low-pass FIR filter and the mean removed, as described in previous chapters. Feature extraction, using 10th-order reflection coefficients calculated with Burg's algorithm within a sliding 3s window with 2s overlap, yields a 10-D vector for each second of EEG. The complete set of these feature vectors will be referred to as the LED test set from now on.
Visual identification of artefacts in the EEG was performed to mark and discard from the analysis those
segments contaminated with saturation and artefacts caused by poor electrode contact. Subject 10’s tests
B and E were excluded from the analysis that follows, due to artefacts or the lack of regular response to
the stimuli.
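The framing of the EEG into overlapping analysis windows can be sketched as below. The AR fit itself (Burg's algorithm) is not shown, and the function name and signature are illustrative only:

```python
def sliding_windows(x, fs=128, win_s=3, step_s=1):
    """Slice an EEG signal into 3 s analysis windows advancing by 1 s
    (i.e. a 2 s overlap), yielding one window -- and hence one 10-D
    reflection-coefficient vector once the AR model is fitted -- per
    second of EEG.
    """
    win, step = win_s * fs, step_s * fs
    return [x[i:i + win] for i in range(0, len(x) - win + 1, step)]
```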
Button signal: The pulse signal from the button was filtered and used to extract a performance measure
related to the number of missed stimuli. No trigger signal was provided, hence the start of each test was
set to be the second at which the subject starts pressing the button with regularity, every 3s, assuming that
the LED flashed every 3s from that moment on. A missed stimulus is then recorded as occurring when
the button is found not to have been pressed during the three seconds between flashes. The number of
consecutive missed stimuli is calculated on a 3s basis, synchronised with the stimuli, i.e. if the subject has missed n consecutive hits at time ta seconds, then 1 missed hit was recorded at (ta − 3(n−1)) seconds, 2 missed hits at (ta − 3(n−2)) seconds, . . . , (n−1) missed hits at (ta − 3) seconds, and finally n missed hits at ta.
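The reconstruction of the performance measure from the button signal can be sketched as follows, under the simplifying assumption that the filtered pulse signal has already been reduced to a list of press times in seconds (the thesis works from the raw pulse train; the function and its name are illustrative):

```python
def consecutive_missed_hits(press_times, test_len_s, period_s=3):
    """Per-stimulus count of consecutive missed hits.

    The test is divided into 3 s bins, one stimulus per bin; a bin is
    a 'hit' if the button was pressed within it.  The counter grows by
    one for each consecutive miss and resets to zero on a hit, matching
    the (ta - 3(n-1)), ..., (ta - 3), ta accumulation described above.
    """
    pressed_bins = {int(t // period_s) for t in press_times}
    counts, run = [], 0
    for b in range(test_len_s // period_s):
        run = 0 if b in pressed_bins else run + 1
        counts.append(run)
    return counts
```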
9.2.2 MLP analysis
Normalisation of the reflection coefficients extracted from the EEG signals acquired during the LED tests
was performed, using the 7-subject training set statistics, and the normalised patterns were presented
to the 7-subject 10-3-1 vigilance MLP (see §8.3). Figures 9.1 to 9.22 show the MLP output time courses
along with the missed stimuli performance measure for each test for each patient. None of the MLP
outputs shown have been median-filtered. Note also the different time scales for each figure, depending
on the length of each test.
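A sketch of the normalisation step, assuming per-coefficient z-scoring with the training-set mean and standard deviation (the thesis states that the 7-subject training statistics are reused; the exact scheme is assumed here):

```python
import numpy as np

def normalise_with_training_stats(test_vectors, train_vectors):
    """Zero-mean, unit-variance normalisation of the LED feature
    vectors using the *training* set statistics, so that test data
    are scaled exactly as the data the MLP was trained on."""
    mu = train_vectors.mean(axis=0)
    sigma = train_vectors.std(axis=0)
    return (test_vectors - mu) / sigma
```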
Visual inspection of the time courses does not reveal a consistent pattern of correlations between the
MLP output and the performance measure across patients, and not even between different tests for the
same patient. For instance, test A for subject 1 shows a paradoxically low value of the MLP output for the
first 3 minutes of the test, when the subject missed no more than 3 stimuli, and an increase in the MLP
output towards the second half of the test as the subject starts to miss more and more button hits. The
MLP output for the other two tests (C and D) suggests a drowsy subject struggling to keep himself awake
throughout the test, with little or no correlation with the actual performance, with the exception of the
last few seconds of test D, at which point the output goes close to zero while the missed stimuli measure
shows a severe decrease in the subject’s performance.
The next example, subject 2 test A, shows very good correlation between the MLP output and the perfor-
mance measure. The MLP output, generally high during the first half of the test, when the performance is
good, suddenly decreases, just before the performance starts to deteriorate, and remains close to drowsi-
ness levels towards the end of the test. However, the MLP output in subsequent tests for the same
subject suggests a drowsier subject, remaining under 0.5 even when the performance is good. Some iso-
lated peaks and other oscillations close to intermediate values may indicate the subject’s struggle against
drowsiness. Subject 3 is another example of good correlation in the first and fourth tests but not in the
other three tests, for which the MLP output is highly oscillatory with no apparent connection with the
button hits. The fourth test is similar to the first one, showing a decreasing trend as the performance
worsened.
Subject 4 endured the four tests with excellent performance, and the MLP output was correspondingly high throughout. Unfortunately, this subject provides no data for the “drowsy stages”, since he never missed more than two hits. The flat output during the 33rd minute of test C for subject 4 is an artefact due to a loose electrode connection. Subject 5 has a medium-to-low MLP output, with a trend that tends
to match the decrease in the performance in all the tests.
The MLP output for subject 6 shows dramatic oscillations between the extreme values of drowsiness and
alertness (0 and 1) as the performance decreases. The period of the oscillations is of the order of minutes,
and for most of the cases when the number of missed hits rises above a value of two, a dip in the MLP
output which is approximately 20s-long precedes the increase in the number of missed stimuli. Although
this indicates a better correlation than for the subjects discussed up to now, a low value in the MLP output
at the beginning of tests A, B, C and E, when the performance is very good, spoils the overall correlation
between the MLP output and the performance measure for this subject.
Subject 7 starts with a very short test, during which the MLP output is constantly low despite a first minute of perfect responses to the stimuli. His second test shows the expected correspondence between
the MLP output and the number of missed hits. His third test is similar to the second one, but longer,
suggesting that the subject was drowsier than in the previous tests, perhaps performing reasonably well
because he had learned to cope with the test in spite of his increasing drowsiness. His fourth test differs
from the rest in that he did not fall asleep. The MLP output suggests that he was drowsy at the start of
the test, progressively gaining a better degree of vigilance, until he starts missing several LED flashes, and
then struggles between drowsiness and alertness throughout the rest of the test, although maintaining
good performance until the end.
The 8th subject shows an MLP output close to 1.0 throughout, with a few dips, most of them matching
the loss of the ability to respond to the stimuli. The one exception to this pattern is the 3-minute period at the
beginning of test C, where the MLP output sweeps through a much wider range. It is worthwhile noting
that subject 8 has the lowest value for the subjective measure of sleepiness (ESS) of any patient in this
database (see Table D.2), and also one of the lowest oxygen saturation dip rates during the overnight
sleep study (see Table D.3), values comparable to those of subject 4, whose performance was the best
as he did not fall asleep during any of the tests. These two subjects could be considered to be at the
lower end of the spectrum of OSA severity, and the MLP output seems to corroborate the clinical results.
However, we shall see later that subject 8's EEG differs greatly from the rest of the EEGs in the LED
database and in the original vigilance database of chapter 8, which explains why the output is almost
constant at the upper end of the scale.
The next subject, number 9 in the database, has the worst ESS value and a relatively high index of
night-time O2 de-saturations, and fell asleep very quickly in every test. The MLP output is generally well
correlated with the performance measure, in trend and locally, as for example, when the subject recovers
from a peak in the number of missed stimuli. The last subject in the database, subject 10, performed
very short tests, and there is a degree of correlation between the MLP output and the performance
measure in the first two tests. His third test shows a notch in the MLP output preceding a 9s-long lapse in
performance. His fourth test, however, does not show any significant decrease in the MLP output when
the performance drops towards the end.
To corroborate these comments, the MLP output values were plotted against the number of consecutive
missed stimuli for each LED subject. Only the values at the times at which the stimuli occur have been
considered, as we cannot make any assumption about the subject’s vigilance state in the absence of the
stimulus.
The scatter plots in Figs. 9.23 and 9.24 show that the MLP output tends to take on values below 0.5 as
the number of missed hits increases, especially in subjects 1, 2, 5 and, to some extent, in subjects 7 and
9. This cannot be said of subjects 3, 6, 8, and 10, however, although results for subject 8 can be discarded
Figure 9.1: LED subject 1 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.2: LED subject 1 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.3: LED subject 2 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.4: LED subject 2 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.5: LED subject 3 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.6: LED subject 3 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.7: LED subject 3 MLP output and missed hits time courses (test E)
Figure 9.8: LED subject 4 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.9: LED subject 4 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.10: LED subject 5 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.11: LED subject 5 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.12: LED subject 6 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.13: LED subject 6 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.14: LED subject 6 MLP output and missed hits time courses (test E)
Figure 9.15: LED subject 7 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.16: LED subject 7 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.17: LED subject 8 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.18: LED subject 8 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.19: LED subject 9 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.20: LED subject 9 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.21: LED subject 10 MLP output and missed hits time courses ((a) test A; (b) test C)
Figure 9.22: LED subject 10 MLP output and missed hits time courses ((a) test D; (b) test F)
as we will see in the next section. Subject 4 lacks any data for more than 2 missed hits. It is important to
note that the reliability of the results for high values of missed hits is low, as not enough data points are
available, given that the test finishes whenever the subject fails to respond to 7 consecutive LED flashes.
Figure 9.23: LED subjects MLP output vs missed hits scatter plots (subjects 1 to 5)
It is also clear that, when the subject responds to the stimuli (i.e. no missed hits), the MLP output can
take on any value over the whole range, with a distribution that varies from unimodal to bimodal to
uniform, as is shown in Fig. 9.25. This figure is particularly puzzling for subjects 1 and 2, as well as
subject 5, since all these have a unimodal distribution around zero or a very low value of MLP output.
Subjects 3, 7 and 9 on the other hand, present a very uniform distribution. This suggests that severe OSA
patients may perform reasonably well for some time when their brain is in a ”drowsy” state. However,
they cannot maintain this level of performance indefinitely, and it drops off sooner or later depending on
the severity of the disorder.
It can also be said, when reviewing the MLP outputs for all the subjects, that the transition to Drowsiness
can happen in a progressive manner as well as in sudden dips. Two examples of the former are shown
Figure 9.24: LED subjects MLP output vs missed hits scatter plots (subjects 6 to 10)
in Figs. 9.11 and 9.3, and two examples of the latter can be found in Figs. 9.12 and 9.6. It is not clear
why this should be, but it is probably dependent on the subject rather than on the condition under which
the test is performed (e.g. the time at which the test takes place), as no subject was found to exhibit
both types of behaviour in the LED tests. The time courses of the MLP outputs from the normal sleep-
deprived subjects (Figs. 8.3 to 8.9 in the previous chapter) showed predominantly a gradual transition
to drowsiness, with a few occasional dips (for example, subject 2 at times 55 minutes and 100 minutes)
but the lack of a suitable performance measure for these records prevents us from being able to draw a
definite conclusion.
9.3 Visualisation analysis
In order to get a deeper insight into the EEG data corresponding to good performance (i.e. no missed hits
as for the histograms of Fig. 9.25), the distribution of the vectors in feature space is investigated using
the 7-subject vigilance NEUROSCALE map of section 7.1.2.
Figure 9.25: LED subjects no-missed-hits MLP output histograms
9.3.1 Projection onto the 7-subject vigilance NEUROSCALE map
The LED feature vectors are normalised using the mean and variance of the 7-subject vigilance database,
and then presented to the NEUROSCALE map previously trained on the same vigilance database. Figs. 9.26
and 9.27 show the projection of the LED patterns (thick dots, grey for data points with 0 and 1 missed
hits, yellow for data points with 2, 3 or 4 missed hits, and red for data points corresponding to 5, 6 or 7
missed hits) in the 7-subject vigilance 2-D map (A means in magenta ×’s and D means in blue o’s). The
scale is the same for Figs. 9.26 to 9.28, slightly expanded on Fig. 9.29 and completely different for subject
8 on Fig. 9.30. In each case, the percentage of outliers not shown on the map is indicated in brackets.
The 2-D projections show that for more than half of the LED subjects, the patterns lie in the same area as
the vigilance A and D means, with a few patterns lying outside, mainly in the middle-lower section of the
plots and in the outer area around the A means. The LED database, in contrast with the vigilance database
which used the central electrode montage, was acquired with frontal electrodes and it was found that
Figure 9.26: Patterns from LED subjects 1 and 2 projected onto the 7-subject vigilance NEUROSCALE map ((a) LED subject 1, outliers 3.4%; (b) LED subject 2, outliers 0%)
EEG contaminated by blinking artefacts or movement artefacts produces patterns that lie on the periphery
of the A means. EEG recorded with frontal electrodes is prone to these two kinds of artefacts, usually
absent in central EEG, and this could explain the outliers in the 2-D projections of the patterns for LED
subjects 4, 5, 6, 7 and 10. In contrast, subjects 1, 2, 3 and 9 produced EEG patterns which are very
likely to come from the same distribution as the vigilance database, given their projections in the 2-D
NEUROSCALE map. Subject 8’s patterns represent the other end of the range, his 2-D projection showing
a high percentage of patterns lying far away from the A and D means, towards the lower-right corner of
the map.
Qualitative correlation with the MLP results
Subject 1 Most of his 0-1 missed hits patterns lie in the region of overlap between alertness and
drowsiness. There is a tendency towards the area in the map which represents drowsiness, so that it can
Figure 9.27: Patterns from LED subjects 3 and 5 projected onto the 7-subject vigilance NEUROSCALE map ((a) LED subject 3, outliers 1.6%; (b) LED subject 5, outliers 0.62%)
be said that this subject was mainly drowsy during the tests. The histogram in Fig. 9.25 for the 0-1 missed
hits MLP output is unimodal, with a peak near zero, indicating that this subject performs well while
drowsy, although he is the second most severe case of OSA in the database according to the overnight
sleep study, the severity of the disorder being mostly assessed here according to the number of dips in
oxygen saturation per hour (see Appendix D).
Subject 2 This subject has a similar distribution of data points over the map, with more vectors in the
alertness region than subject 1, corresponding to the second peak in the bimodal distribution of the 0-1
missed hits MLP output values in Fig. 9.25. From this, it can be said that this subject is generally drowsy
but sometimes alert. The MLP output time courses reveal that the subject only presents these “alert”
patterns during the first test, then he seems to have trained himself to perform in “automatic mode”1, i.e.
while being deeply drowsy. His overnight study suggests that he also has a severe case of OSA.
Subject 3 This subject’s patterns are evenly distributed over the region of overlap on the map, corre-
lating with the uniform distribution of the 0-1 missed hits MLP outputs in Fig. 9.25. These results and
the MLP output time courses suggest that the level of vigilance for this subject varies between drowsiness
and alertness. He had an average number of oxygen de-saturations the previous night.
Subject 4 This subject’s patterns in general, and especially those for 0-1 missed hits, lie mostly within the alertness region and the region of overlap (suggesting that he was alert during the tests),
with a significant number of outliers. The histogram for the MLP output in Fig. 9.25 shows a unimodal
distribution with a peak at 1.0. This peak is due not only to the patterns within the region of alertness
but also to the outliers. This subject is the second mildest case of OSA in the database.
Subject 5 His patterns are distributed in the region of overlap on the map, with a tendency
towards alertness, so that it can be said that he was slightly more alert than drowsy. The histogram in
Fig. 9.25 shows a unimodal distribution with a mean at around 0.3. This subject represents the mildest
degree of OSA in the database.
Subject 6 The patterns for this subject are unusual in that they lie mostly in the lower centre area
of the map (a region of overlap) with a large number of outliers. The distribution for the MLP output for
0-1 missed hits in Fig. 9.25 shows a peak which could be due to outliers, the distribution being uniform
otherwise (and suggesting a level of vigilance between drowsy and alert). Indeed, the MLP output time
courses show continuous fluctuations between drowsiness and alertness. The sleep study categorised this
subject as having a serious case of OSA.
1Automatic behaviour is a phenomenon reported by the sleep-deprived in which they perform relatively routine behaviour without having any memory of doing so [31].
Subject 7 Although they lie mostly in the region of overlap between drowsiness and alertness,
a proportion of the patterns for 0-1 missed hits lies in the alertness area, explaining the second peak in
the 0.9-1.0 bin of the histogram in Fig. 9.25. The first peak occurs around 0.25. This suggests that this
subject was more alert than drowsy, but is also able to perform well when drowsy, as his first and third
test reveal in the MLP time courses.
Subject 8 Note the completely different scale used for this subject: a large proportion of his
patterns are outliers, with a few lying in the alertness-dominated region. The impulse-like histogram in
Fig. 9.25 probably owes its peak at 1.0 to the outliers.
A visual inspection of subject 8’s EEG reveals a signal rich in high frequencies. The raw signal was strongly
contaminated with mains interference, removed by the filtering process prior to analysis. Nevertheless,
the filtered signal still shows frequencies in the upper β band, i.e. as high as 25-30 Hz, characteristic of
a very alert state. The vigilance database was obtained from normal subjects who were sleep-deprived,
and who were probably drowsy enough that the higher β frequencies did not appear in their EEG. These
frequencies are therefore absent in the training database, and subject 8's patterns appear as outliers in
the NEUROSCALE map of Fig. 9.30.
Subject 9 The few patterns available from this subject lie in the region of overlap on the map, with some
of them spreading towards the alertness region. This correlates well with the nearly uniform distribution
for the histogram of MLP outputs for 0-1 missed hits in Fig. 9.25, and suggests that this subject’s vigilance
was somewhere between alertness and drowsiness. This subject, who fell asleep very quickly in every
test, rated his level of sleepiness as the worst possible (ESS in Table D.2); the table also shows a large
number of oxygen de-saturations during the night (severe OSA).
Subject 10 Almost all the patterns for this subject lie in the alertness area of the map, including those
for 5-7 missed hits. This suggests that while subject 10 was mostly alert during the tests, he failed to
respond to 5, 6 and 7 LED flashes when alert! Although his histogram of MLP outputs for 0-1 missed
hits is as expected, the scatter plot in Fig. 9.24 shows no correlation between the MLP output and the
performance measure. This corresponds to what is shown in the NEUROSCALE map. It is important to
note that this subject fell asleep very quickly each time and is the most severe case of OSA in the database,
with a rate of oxygen de-saturations more than double the second most severe case, and with the highest
number of movement arousals during the night.
9.4 Discussion
The results of the visualisation analysis have shown that for some subjects in the LED database, such as
subject 8 and, to a lesser extent, subjects 4 and 6, the EEG does not have the same characteristics as
the EEG in the training database, and hence the results from the MLP are not reliable. Also, the low
average value of the MLP output for most of the subjects may be an influential factor in the decreasing
exponential trend in the scatter plots of subjects 1, 2, 5, 7 and 9, as the statistical significance of the
plot decreases (fewer data points) with an increase in the number of consecutive missed hits. Except for
subject 6, the projection in the NEUROSCALE map shows little difference in the distribution of the feature
vectors for 0–1, 2–4 and 5–7 missed hits. Although this does not necessarily imply the same overlap in
the 10-D feature space, it is another factor to bear in mind in the interpretation of the MLP results and
when considering the correlation between the MLP output and the performance measure for the subjects
in the LED database.
9.5 Summary and conclusions
The MLP trained with the 7-subject vigilance database has been used to analyse new data from a vigilance
study in OSA patients. The study, consisting of 4 to 6 vigilance tests, provided frontal EEG recordings
and a performance measure at regular intervals. The MLP output, representing a continuum between
drowsiness (0) and alertness (1), was calculated for each test.
Visual inspection of the MLP output time courses does not show consistent correlation with the performance
measure. Scatter plots of the MLP output against the performance measure reveal that the MLP
output is generally low when the performance measure indicates deep drowsiness, as expected, but takes
any value between 0 and 1 when the performance measure suggests alertness. Similar results have been
found by other researchers in a random visual stimulus response test [88][32]. OSA patients seem to
perform relatively well even when their electrophysiological signals indicate drowsiness or even light
sleep. Kecklund and Akerstedt have also found that lorry drivers seem to be able to drive in spite of the
appearance of alpha activity in their EEG [83]. However, the reduction in the statistical significance of
the correlation between MLP output and performance measure as the performance deteriorates prevents
us from making any strong statement about the “drowsy” EEG of OSA subjects.
The NEUROSCALE visualisation technique was also applied to the database tested in this chapter in order
to validate the MLP results. The analysis strongly suggests that the EEG of one of the subjects in the
database, subject 8, is very different from that in the vigilance database and therefore the MLP results
for this subject should be discarded, as the MLP produces no reliable results when it extrapolates. Of
the remaining subjects in the database, 7 out of 9 seem to have EEG patterns belonging to the same
distribution as that found in the vigilance database, and hence the MLP trained with normal subjects can
be used in the study of these OSA patients.
Figure 9.28: Patterns from LED subjects 7, 9 and 10 projected onto the 7-subject vigilance NEUROSCALE map. Each subject's patterns are shown in three panels (0-1, 2-4 and 5-7 missed hits) on Drowsy-Alert axes; outliers: subject 7, 1.1%; subject 9, 0.12%; subject 10, 2.1%.
Figure 9.29: Patterns from LED subjects 4 and 6 projected onto the 7-subject vigilance NEUROSCALE map. Each subject's patterns are shown in three panels (0-1, 2-4 and 5-7 missed hits) on Drowsy-Alert axes; outliers: subject 4, 8.9%; subject 6, 0.61%.
Figure 9.30: Patterns from LED subject 8 projected onto the 7-subject vigilance NEUROSCALE map, in three panels (0-1, 2-4 and 5-7 missed hits) on Drowsy-Alert axes; outliers: 0.54%.
Chapter 10
Conclusions and future work
10.1 Overview of the thesis
The main objective of the research described in this thesis has been to develop neural network methods
to study the sleep and wake EEG of subjects with the severe breathing disorder OSA. In chapter 6, which
describes the analysis of sleep studies, an MLP neural network was trained with AR model reflection
coefficients as inputs. These were extracted from a single channel of EEG recorded during the sleep of
normal subjects and the network was trained to track the sleep-wakefulness continuum. An automated
system based on this MLP output and a set of logical rules was developed and tested with OSA sleep EEG.
The results, validated against scores from a human expert, show that the automated system is able to
detect most of the μ-arousals in the EEG of these patients with accuracy, not only in occurrence (with a
median sensitivity of 0.97, and a median positive predictive accuracy of 0.94), but also in starting time
and duration (with a median correlation index of 0.82). In chapter 7 visualisation algorithms applied
to the sleep EEG database and the wake EEG database (acquired from sleep-deprived normal subjects),
showed the need for another MLP network to analyse the wake EEG, as its characteristics differ from
those of the EEG in the sleep database. The vigilance analysis, covered in chapters 8 and 9, was again
carried out by training MLP neural networks with AR model reflection coefficients at the input. This time,
these were extracted from a single channel of EEG recorded from normal sleep-deprived subjects and the
network was trained to track the alertness-drowsiness continuum. The trained MLPs were tested with
data from normal sleep-deprived subjects as well as from OSA patients performing a visual vigilance task.
The test on normal subjects was correlated with a human expert assessment of the EEG, and the mean
correlation was found to be 52.0% (sd 5.9%). A performance measure was used to evaluate the MLP
output on the EEG from OSA subjects. The results of this analysis, although not totally conclusive, have
raised important questions about the effect of OSA on the EEG.
10.2 Discussion of results
While the sleep studies yielded very good results in general, the correlation between the MLP output
and the performance measure in OSA subjects was highly variable. It is well known, however, that the
effectiveness of performance measures in the assessment of sleepiness depends largely on the task char-
acteristics. There is no perfect task to evaluate the decrease of vigilance. The physiological-behavioural
link is not straightforward, the task itself being intrusive in the natural process of drowsiness. Many fac-
tors such as motivation, circadian rhythm and habituation can make a very drowsy subject perform well
or better than otherwise. Pivik [130] stressed the relevance of long-practice effects, which can improve
the performance on a given task without an improvement in the physiological condition. Also, not all
investigations of sleep loss have shown adverse effects on performance [139]. The effects of sleep loss are
similar to those of OSA, as the latter fragments sleep and diminishes its total time. Dinges et al. review
the literature in the area [42] and conclude that performance variance increases with sleep loss, that
habituation to a repetitive task is augmented in a sleepy brain, that performance depends non-linearly
on sleep loss and time of day (related to the circadian rhythm), and that motivation or “willingness
to perform” may have a distinct effect on the capacity to perform. The attentional task used to build
the LED database is repetitive and could have caused habituation to the task after a few minutes of the
first test or in subsequent tests, as is the case for subjects 1 and 2. Variance in performance as the subject
gets drowsy might explain the poor correlation found for subjects 3, 6 and 10 (see Figs. 9.23 and 9.24).
Circadian rhythm and/or motivation may explain why subject 7 fell asleep after two and a half minutes
in his first test at midmorning, and performed without falling asleep for 40 minutes in the last test, early
in the afternoon.
EEG inter-subject variability was a problem encountered many times in this thesis. The μ-arousal
automated system results were very satisfactory for 5 out of 7 subjects, but disappointing for the other
two. One of these two subjects presents mixed-frequency EEG at the time of the μ-arousal and the
other shows an EEG with much higher content in the upper frequency bands than the rest of the
database. In a recent paper [46] Drinnan et al. have found that μ-arousal inter-scorer agreement tends
to be poor when the μ-arousal occurs embedded in high-frequency EEG. Also, in the vigilance EEG study
(chapter 7), the case of an α+ subject highlighted the variation of wake EEG patterns in the general
population, and showed the need for special consideration of these subjects, who represent a
non-negligible fraction of the total population.
10.3 Main research results
As mentioned in section 1.1, before the research described in this thesis there had been no computerised
analysis, within a single framework, of both sleep disturbance and vigilance from the EEG. In the course
of this research, several findings have been made. Amongst these are:
1. A compromise was found between the stationarity requirements of AR modelling and the variance
of the AR reflection coefficients by using a 3-second analysis window with a 2-second overlap.
The AR model is still able to follow rapid changes whose duration is 3 seconds or more. To our
knowledge, this is the first time that AR modelling has been used in μ-arousal detection.
2. A μ-arousal may cause an increase in the δ rhythms of the EEG at the same time as it causes an
increase in the amplitude of the higher frequencies (α and/or β bands). Hence a μ-arousal is
not necessarily just a “shift” in frequency, as often described in the literature related to μ-arousal
detection.
3. The visualisation analysis described in chapter 7 revealed that Alertness and Drowsiness in vigilance
tests are not the same as Wakefulness and REM/Light Sleep in a sleep-promoting environment.
4. MLP analysis and visualisation techniques applied to wake EEG in OSA subjects show that these
patients can present “drowsy” EEG while performing well during a visual vigilance test.
5. The MLP analysis of the wake EEG in OSA patients has shown that the transition to Drowsiness may
occur progressively as well as in sudden dips.
6. The Alertness EEG patterns of OSA subjects may not be the same as the Alertness patterns of normal
sleep-deprived subjects. Instead they seem to resemble more closely the Drowsiness patterns of these
normal subjects.
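The windowing compromise in finding 1 can be sketched as follows. This is an illustrative reconstruction, not the code used in the thesis: Burg's method is one standard way of estimating AR reflection coefficients, and the 128 Hz sampling rate and order-10 model below are assumptions for the example.

```python
import numpy as np

def burg_reflection_coeffs(x, order):
    """Reflection (lattice) coefficients of an AR model via Burg's method."""
    f = x.astype(float).copy()    # forward prediction errors
    b = x.astype(float).copy()    # backward prediction errors
    k = np.zeros(order)
    for m in range(order):
        fm, bm = f[1:], b[:-1]    # align forward/backward error sequences
        k[m] = -2.0 * np.dot(fm, bm) / (np.dot(fm, fm) + np.dot(bm, bm))
        f, b = fm + k[m] * bm, bm + k[m] * fm
    return k

def sliding_features(eeg, fs, win_s=3.0, step_s=1.0, order=10):
    """3-second analysis windows with a 2-second overlap (1-second step)."""
    win, step = int(win_s * fs), int(step_s * fs)
    return np.array([burg_reflection_coeffs(eeg[i:i + win], order)
                     for i in range(0, len(eeg) - win + 1, step)])

fs = 128                                # assumed sampling rate (Hz)
rng = np.random.default_rng(0)
eeg = rng.standard_normal(10 * fs)      # 10 s of surrogate "EEG"
feats = sliding_features(eeg, fs)       # one 10-D feature vector per second
print(feats.shape)                      # (8, 10)
```

Burg's recursion guarantees that every reflection coefficient has magnitude at most 1, which keeps the feature vectors in a bounded range for the MLP inputs.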
10.4 Conclusions
From the sleep study results we conclude that the automatic scoring system, based upon a neural network
trained with normal sleep EEG, can be reliably used as a supporting diagnostic tool in the detection of
μ-arousals in OSA patients.
As for the vigilance study, the neural network proved useful in the assessment of the EEG of these patients
in relation to their performance during the task. More work needs to be done to improve the statistical
significance of these results and to validate them against the scores from a human expert, as it is well
known that correlation of EEG with performance measures is not consistent, the task performance being
influenced by many factors other than physiological sleepiness.
In summary, the use of neural network methods, namely the NEUROSCALE algorithm for EEG data
visualisation and the MLP network for the description of the EEG state in sleep and in vigilance, has led
to a better understanding of the effects of the OSA disorder on the sleep EEG, as we have obtained more
insight into the changes during a μ-arousal. Also, we have found that the EEG alertness patterns of OSA
subjects may present similar characteristics to those of the drowsiness patterns in normal subjects (after
minor sleep deprivation).
10.5 Future work
Several issues are left open, which should be considered if further work is to be carried out in the
analysis of the EEG in OSA patients. The most important of these are:
1. There is a missing link between alertness in sleep-deprived normal subjects, alertness in OSA sub-
jects and wakefulness in normal subjects prior to sleep onset. To fill this gap, a study should be
carried out to acquire EEG data from normal fully alert subjects performing vigilance tasks.
2. The various databases used in the work described in this thesis come from three different sleep
laboratories. This has some repercussions on the results as the databases differ in several aspects.
For instance, the wake EEG signals acquired with the frontal electrode montage show variations
in the α and θ content of the signal with respect to those acquired with the central electrode
montage. The vigilance training database was acquired using the central electrode montage while
the test database was recorded using frontal electrodes. This could have a significant effect on the
assessment of alertness by the neural network. Human expert scoring based on the same scale
as used for the training database is desirable on the LED database, in order to validate the MLP
results, given the controversy surrounding the reliability of performance measures in the evaluation
of drowsiness. Also, the task should be redesigned in order to control the habituation factor, and to
increase the amount of data for low performance.
3. The wake EEG from some OSA subjects presented patterns which differ to some degree from
normality. Is the EEG of OSA patients when they are awake different from that of normal subjects? In
other words, is their alertness EEG more like the drowsiness EEG for normal individuals? Are they
constantly drowsy, behaving as if they were alert as a result of habituation? Little or no attention
has so far been given to this issue which has a major effect on the quality of life for many people.
4. There is a good deal of controversy surrounding the treatment of OSA [173]. Some evidence has
been found to support the use of nasal continuous positive airway pressure (nCPAP) therapy [59][32] as a
means of keeping the airways open during sleep. Pre- and post-treatment analysis of the EEG and
its relation to performance is suggested to validate nCPAP as an effective therapy for OSA. Once
this is done, the results could be used to find out if the EEG recovers after treatment (so that both
alertness and drowsiness patterns become similar to those of normal subjects) or whether there is
an irreversible long-term effect on the EEG.
5. The α content of the wake EEG has been described as distractive by clinicians [154], as it largely
differs across the population, and shows inconsistent variations from alertness to drowsiness. Some
of the work done in vigilance assessment [170] (and reviewed in section 3.4) uses α power with
eyes open and eyes closed, as a reference. We suggest that this procedure should be considered
(making sure that the subject is alert during this “calibration”), in order to pave the way for subject-
adaptive vigilance analysis. This would imply re-training the neural network using eyes-open and
eyes-closed α power as reference in order to adjust the results for inter-subject α differences.
6. The work in this thesis has taken the application of linear models as far as possible, but the
assumption of stationarity must break down regularly for wake EEG. Therefore, non-linear features, such
as complexity as in the work of Rezek et al. [137], time-delay embedding and ICA as in the work
of Lowe [98], should be investigated, as they have been reported as giving better discrimination
than linear methods in preliminary studies on the changes in the EEG of subjects either asleep or
performing vigilance tasks.
7. A more generalised approach in learning theory, called Support Vector Machine (SVM), has been
used for regression and classification [169], reporting better generalisation than neural networks
[144]. SVMs have been discarded for the work presented in this thesis as they lack probabilistic
outputs. However, a Bayesian framework has recently been developed for SVM, introducing the
Relevance Vector Machine (RVM) which does not suffer from the above disadvantage, and demon-
strates comparable generalisation performance to SVM [162]. Future work should explore RVM as
an alternative to the use of neural networks in posterior probability estimation for the sleep and the
vigilance problems.
Appendix A
Discrete-time stochastic processes
A.1 Definitions
The definitions presented in this section have been taken from [63] and [62].
Stochastic Process: A statistical phenomenon that evolves in time according to probabilistic laws. From the definition of a stochastic process one may be tempted to interpret it as a function of the discrete time variable n¹, when in fact it represents an infinite number of different realisations ξ of the process u(n, ξ). An ensemble represents a set of realisations of the same process.

Time Series: A realisation ξ₀ of a discrete-time stochastic process is called a time series, u(n), consisting of a set of observations generated sequentially in time. A time series of interest is a sequence of observations u(n), u(n − 1), ..., u(n − M) generated at discrete and uniformly spaced instants of time n, n − 1, ..., n − M.

Statistical description of a discrete-time stochastic process: Consider a stochastic process represented by the ensemble shown in Fig. A.1. Each time series uᵢ(n) represents a random variable along the time axis, but the set of observations at a specific time n₁ represents a random variable as well, in this case across the ensemble.

First- and second-order moments may be defined across the process (ensemble). The mean-value function of the process is defined as:
μ(n) = E[u(n)] (A.1)
The autocorrelation function of the process may be defined as:
r(n, n − k) = E[u(n)u(n − k)], k = 0,±1,±2, ... (A.2)
Another second order moment, the autocovariance function is defined as:
c(n, n − k) = E[(u(n) − μ(n))(u(n − k) − μ(n − k))], (A.3)
for k = 0,±1,±2, ...
The autocorrelation and the autocovariance functions are related by:
c(n, n − k) = r(n, n − k) − μ(n)μ(n − k) (A.4)
¹For convenience, time is normalised with respect to the sampling period.
Figure A.1: Stochastic process ensemble (schematic: realisations u₁(n), ..., u₅(n) of the process, with the samples uᵢ(n₁) taken across the ensemble at time n₁)
So, for a partial characterisation of a stochastic process through its first and second moments, it is sufficient to specify the mean value and either the autocorrelation or the autocovariance function.
Stationary process: A stochastic process is stationary in the strict sense if all of its moments are constant in time. For example, the mean value becomes:
μ(n) = μ (A.5)
For such a process the autocorrelation and autocovariance functions depend only on the lag k:

r(n, n − k) = r(k),  c(n, n − k) = c(k)   (A.6)
Note that for a stationary process the autocorrelation function at k = 0 equals the mean-square value:

r(0) = E[ |u(n)|² ]   (A.7)

and the autocovariance at k = 0 equals the variance:

c(0) = E[ |u(n) − μ|² ] = σ²ᵤ   (A.8)
If the first and second moments of a process satisfy the conditions described above, the process is at least stationary to second order, and if the variance is finite, the conditions for wide-sense stationarity are satisfied.
Ergodicity: Consider a wide-sense stationary process whose time averages are constant as well, and equal to their equivalents across the process. This is very convenient because it allows us to characterise the process with suitable measurements of a single one of its time series.

We may estimate the mean of the process by computing the time average of one of its realisations:

μ(N) = (1/N) Σ_{n=0}^{N−1} u(n)   (A.9)
where N is the number of observations or samples of the time series u(n). We expect this time average to converge to the ensemble mean as N increases. The mean-square error defines a criterion for this convergence:

lim_{N→∞} (μ − μ(N))² = 0   (A.10)

If we repeat the estimation for some more realisations and find the expected value of the square error, we may find that:

lim_{N→∞} E[ |μ − μ(N)|² ] = 0   (A.11)

In this case it can be said that the process is mean ergodic. In other words, a wide-sense stationary process is mean ergodic in the mean-square error sense if the mean-square value of the error between the ensemble mean μ and the time average μ(N) approaches zero as the number of samples N approaches infinity.
This criterion may be extended to other time averages of the process. The estimate used for the autocorrelation function is:

r(k, N) = (1/N) Σ_{n=0}^{N−1} u(n)u(n − k),  for 0 ≤ k ≤ N − 1   (A.12)

In this case, the process is correlation ergodic in the mean-square error sense if the mean-square value of the difference between the ensemble autocorrelation r(k) and the time estimate r(k, N) approaches zero as the number of samples approaches infinity.
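The ergodic estimates of Eqs. A.9 and A.12 can be illustrated numerically. The sketch below (an illustration, not part of the original text) checks the time averages of a single long AR(1) realisation against its known ensemble mean and autocorrelation:

```python
import numpy as np

def time_mean(u):
    """Time-average estimate of the ensemble mean, Eq. A.9."""
    return u.mean()

def time_autocorr(u, k):
    """Biased time-average estimate of r(k), in the spirit of Eq. A.12."""
    N = len(u)
    return np.dot(u[k:], u[:N - k]) / N if k else np.dot(u, u) / N

# AR(1) process u(n) = a*u(n-1) + w(n), with w(n) unit-variance white noise:
# stationary and ergodic, with mean 0 and r(k) = a**k / (1 - a**2), k >= 0.
a, N = 0.6, 200_000
rng = np.random.default_rng(1)
w = rng.standard_normal(N)
u = np.zeros(N)
for n in range(1, N):
    u[n] = a * u[n - 1] + w[n]

r0 = time_autocorr(u, 0)
print(abs(time_mean(u)) < 0.05)                  # time mean ~ ensemble mean 0
print(abs(r0 - 1 / (1 - a * a)) < 0.1)           # r(0) ~ 1.5625
print(abs(time_autocorr(u, 1) / r0 - a) < 0.05)  # r(1)/r(0) ~ a
```

All three checks hold because, for this process, the time averages of one realisation converge to the ensemble values as N grows, which is exactly the mean- and correlation-ergodicity described above.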
Transmission of a discrete-time stationary process through a linear filter: Let the time series y(n) be the output of a discrete-time, shift-invariant linear filter with unit-sample response h(n) and input u(n). Assume that u(n) represents a single realisation of a wide-sense stationary discrete-time process. Then y(n) also represents a single realisation of a wide-sense stationary discrete-time process, with autocorrelation r_y(k) given by:

r_y(k) = Σ_{i=−∞}^{∞} Σ_{ℓ=−∞}^{∞} h(i) h(ℓ) r_u(ℓ − i + k)   (A.13)
Correlation matrix: Let the M × 1 vector u(n) represent the time series as:

u(n) = [ u(n), u(n − 1), ..., u(n − M + 1) ]ᵀ   (A.14)

where the superscript T denotes transposition. The M × M correlation matrix R may then be defined as:

R = E[ u(n)uᵀ(n) ]   (A.15)

Expanding this expression:

R(n) =
  [ E[u(n)u(n)]        E[u(n)u(n−1)]        …  E[u(n)u(n−M+1)]     ]
  [ E[u(n−1)u(n)]      E[u(n−1)u(n−1)]      …  E[u(n−1)u(n−M+1)]   ]
  [ ⋮                   ⋮                    ⋱  ⋮                   ]
  [ E[u(n−M+1)u(n)]    E[u(n−M+1)u(n−1)]    …  E[u(n−M+1)u(n−M+1)] ]   (A.16)
If the process is wide-sense stationary:

R =
  [ r(0)       r(1)       …  r(M−1) ]
  [ r(−1)      r(0)       …  r(M−2) ]
  [ ⋮          ⋮          ⋱  ⋮      ]
  [ r(−M+1)    r(−M+2)    …  r(0)   ]   (A.17)

From the property of the autocorrelation function of a wide-sense stationary process:

r(−k) = r(k)   (A.18)

we find that the matrix R is symmetric. Accordingly, only M values of the autocorrelation function r(k) are needed to calculate the correlation matrix R:

R =
  [ r(0)       r(1)       …  r(M−1) ]
  [ r(1)       r(0)       …  r(M−2) ]
  [ ⋮          ⋮          ⋱  ⋮      ]
  [ r(M−1)     r(M−2)     …  r(0)   ]   (A.19)

As can be seen from Eq. A.19, the correlation matrix of a wide-sense stationary process is Toeplitz, i.e. all the elements along each diagonal are equal. Conversely, a Toeplitz correlation matrix guarantees wide-sense stationarity.
A general property, valid for all stochastic processes, is that the correlation matrix is always nonnegative definite and almost always positive definite. If it is positive definite it is also nonsingular. The rare condition of a singular correlation matrix represents linear dependency between the elements of the time series, which arises only when the process u(n) consists of a sum of K ≤ M sinusoids. Although this situation is almost impossible in practice, the correlation matrix may be ill-conditioned if its determinant is very close to zero.
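A minimal sketch of Eq. A.19 (illustrative only): building the Toeplitz correlation matrix from the M autocorrelation values and confirming the symmetry and positive-definiteness properties just described, here for an assumed AR(1) autocorrelation sequence:

```python
import numpy as np

def correlation_matrix(r):
    """Build the M x M Toeplitz correlation matrix of Eq. A.19 from r(0..M-1)."""
    M = len(r)
    idx = np.abs(np.arange(M)[:, None] - np.arange(M)[None, :])
    return np.asarray(r)[idx]                 # R[i, j] = r(|i - j|)

# Autocorrelation of an AR(1) process with a = 0.5: r(k) = a**k / (1 - a**2)
a = 0.5
r = a ** np.arange(4) / (1 - a * a)
R = correlation_matrix(r)

print(np.allclose(R, R.T))                    # symmetric, by Eq. A.18
print(np.all(np.linalg.eigvalsh(R) > 0))      # positive definite, hence nonsingular
```

Since this autocorrelation sequence comes from a genuine stationary process, the resulting Toeplitz matrix is positive definite and therefore nonsingular, as the text states.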
Gaussian processes: A particular strictly stationary stochastic process, common in the physical sciences, is the Gaussian process, which has the property that it can be fully statistically characterised by only its first and second moments. We may call a process u(n) Gaussian if any linear functional of u(n) is a Gaussian-distributed random variable. A linear functional is defined by Eq. A.20 as:

Y = ∫₀ᵀ g(t)u(t) dt   (A.20)

where g(t) is a weighting function such that the mean-square value of the random variable Y is finite.

For a discrete-time stochastic process, the linear functional becomes a linear function of all the samples of u(n) up to time n, Y = Σ_{i=0}^{n} gᵢ u(i). The Gaussian probability density function f_Y(y) is shown in Eq. A.21:

f_Y(y) = (1/√(2πσ²_Y)) exp( −(y − μ_Y)² / (2σ²_Y) )   (A.21)
where μ_Y is the mean and σ²_Y the variance of the random variable Y. A Gaussian process is usually denoted N(μ, R). As the mean is a constant value that can be subtracted from the time series, we will consider only zero-mean Gaussian processes, N(0, R).

The joint probability density function of N samples of a Gaussian process is described by:

f_U(u) = ( 1 / ((2π)^{N/2} det(R)^{1/2}) ) exp( −½ uᵀR⁻¹u )   (A.22)
Note that f_U(u) is N-dimensional for a real-valued process. For the case N = 1, the matrix R becomes the variance of the process, σ². One particularly interesting property of a Gaussian process, derived from its definition, is that if a Gaussian process u(n) is applied to a stable linear filter, the output of the filter is a Gaussian process as well.
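As a quick numerical check of Eq. A.22 (a sketch, not from the thesis), the joint density can be evaluated directly and, for N = 1, compared with the scalar Gaussian density of Eq. A.21 with zero mean:

```python
import numpy as np

def gaussian_joint_pdf(u, R):
    """Joint density of N samples of a zero-mean Gaussian process, Eq. A.22."""
    N = len(u)
    norm = (2 * np.pi) ** (N / 2) * np.sqrt(np.linalg.det(R))
    return float(np.exp(-0.5 * u @ np.linalg.solve(R, u)) / norm)

# For N = 1 the matrix R reduces to the variance sigma^2 and Eq. A.22
# reduces to the scalar Gaussian density of Eq. A.21 (with mu_Y = 0).
sigma2, y = 2.0, 0.7
joint = gaussian_joint_pdf(np.array([y]), np.array([[sigma2]]))
scalar = np.exp(-y * y / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
print(np.isclose(joint, scalar))              # True
```

The agreement confirms that the multivariate expression collapses to the familiar univariate Gaussian when only one sample is considered.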
Appendix B
Conjugate gradient optimisation algorithms
The description of the algorithms in this Appendix can be found in [19]. For further details the reader is directed to [53] and [113].
B.1 The conjugate gradient directions
Let us assume that line searching takes place along the direction d(τ). At the minimum, the derivative of E in the direction d(τ) vanishes:

(d/dλ) E(w(τ) + λ d(τ)) = 0   (B.1)
Let us set the new weight vector w(τ+1) at this minimum along d(τ). Eq. B.1 implies that the gradient vector at w(τ+1) is orthogonal to the search direction. Adopting g ≡ ∇E as a short-hand notation for the gradient of the error function, we can write the orthogonality property as:

g(τ+1)ᵀ d(τ) = 0   (B.2)
We would like to find a new search direction d(τ+1) such that the property described in Eq. B.2 holds at all points along this new direction:

g(w(τ+1) + λ d(τ+1))ᵀ d(τ) = 0   (B.3)
By using the first-order expansion of g around w(τ+1):

(g(τ+1) + λ g′(τ+1) d(τ+1))ᵀ d(τ) = 0  ⇒  g(τ+1)ᵀ d(τ) + λ d(τ+1)ᵀ g′(τ+1) d(τ) = 0   (B.4)
The first term on the left-hand side vanishes by the property given in Eq. B.2, and g′ is none other than the Hessian matrix, so we can write Eq. B.4 as:

d(τ+1)ᵀ H d(τ) = 0   (B.5)
The directions d(τ+1) and d(τ) are said to be non-interfering or conjugate. Suppose that we can find a set of W vectors which are mutually conjugate with respect to H, so that:

dⱼᵀ H dᵢ = 0,  i ≠ j   (B.6)
It can be shown [19, pp. 277] that these vectors are linearly independent if H is positive definite, and that they form a complete, non-orthogonal basis set in W. Starting at w₁, we can write the difference between the minimum w∗ in W and the point w₁ as:

w∗ − w₁ = Σ_{i=1}^{W} αᵢ dᵢ   (B.7)
If we define wⱼ as:

wⱼ = w₁ + Σ_{i=1}^{j−1} αᵢ dᵢ   (B.8)
an iterative equation can be written in the form:

wⱼ₊₁ = wⱼ + αⱼ dⱼ   (B.9)
Eq. B.9 represents a succession of line-search steps along the conjugate directions, with the jth step length controlled by the parameter αⱼ. To find the parameters αⱼ, let us assume the quadratic form for the error function:

E(w) = E_Q(w) = c + bᵀw + ½ wᵀHw   (B.10)
with constant parameters c, b and H, where the latter is a positive definite matrix, and whose gradient g(w) is given by:

g(w) = b + Hw   (B.11)
which vanishes at the minimum w∗. For this error function, let us pre-multiply Eq. B.7 by dⱼᵀH:

dⱼᵀHw∗ − dⱼᵀHw₁ = Σ_{i=1}^{W} αᵢ dⱼᵀHdᵢ   (B.12)
Given that b + Hw∗ = 0, and by using the orthogonality property described in Eq. B.6, we can write Eq. B.12 as:

−dⱼᵀ(b + Hw₁) = αⱼ dⱼᵀHdⱼ   (B.13)
from which we can express αⱼ as:

αⱼ = − dⱼᵀ(b + Hw₁) / (dⱼᵀHdⱼ)   (B.14)
By proceeding in a similar way with Eq. B.8 we find the relationship:

dⱼᵀHwⱼ = dⱼᵀHw₁   (B.15)
which can be used in the numerator of the expression for αⱼ to yield:

αⱼ = − dⱼᵀ(b + Hwⱼ) / (dⱼᵀHdⱼ) = − dⱼᵀ g(wⱼ) / (dⱼᵀHdⱼ)   (B.16)
By noting that:

gⱼ₊₁ − gⱼ = H(wⱼ₊₁ − wⱼ) = αⱼ Hdⱼ   (B.17)
and substituting the value of αⱼ found in Eq. B.16 into Eq. B.17 and pre-multiplying by dⱼᵀ, we find that:

dⱼᵀ gⱼ₊₁ = 0   (B.18)
Similarly, if we pre-multiply Eq. B.17 by dₖᵀ, with k < j ≤ W, we get:

dₖᵀ gⱼ₊₁ = dₖᵀ gⱼ,  for k < j ≤ W   (B.19)
It can easily be shown by induction that:

dₖᵀ gⱼ = 0,  for k < j ≤ W   (B.20)

Eq. B.20 shows that, for a quadratic error function, at every step the gradient at wⱼ is orthogonal to all the previous conjugate directions dₖ, and the minimum is reached in W steps.
Using the relationships found for this quadratic error function, we can build a set of mutually conjugate directions by choosing the first one as the negative gradient:

d₁ = −g₁   (B.21)
Once the minimum w_2 along d_1 is found, the next direction can be chosen as a linear combination of the previous one and the gradient at the new point; in general:

d_{j+1} = −g_{j+1} + β_j d_j     (B.22)
The parameters β_j can be found by pre-multiplying Eq. B.22 by d_j^T H and imposing the conjugacy condition d_j^T H d_{j+1} = 0:

β_j = g_{j+1}^T H d_j / (d_j^T H d_j)     (B.23)
To avoid the computation of the Hessian, we can use Eq. B.17 in the equation for β_j, getting:

β_j = g_{j+1}^T (g_{j+1} − g_j) / (d_j^T (g_{j+1} − g_j))     (B.24)
This expression can be simplified further by using Eq. B.22 and the orthogonality properties in Eqs. B.18 and B.20:

β_j = g_{j+1}^T (g_{j+1} − g_j) / (g_j^T g_j)     (B.25)
This last formula, known as the Polak-Ribiere form, gives better results than the other forms because it tends to reset the conjugate direction to the direction of the negative gradient when the algorithm is making little progress (i.e. g_{j+1} ≈ g_j), restarting in this way the conjugate gradient procedure. A caveat of this algorithm is that, for a general non-linear error surface, the Hessian matrix can be negative definite in some regions of weight space. In this case, a robust procedure should make sure that the error does not increase at any step.
B.1.1 The conjugate gradient algorithm
A description of the algorithm follows:
1. Choose an initial set of weights w1
2. Evaluate the gradient g1
3. Set d1 =−g1
4. Initialise j=1
5. Find the minimum of the error function along dj and call this point wj+1
6. If E(w_{j+1}) < ε, stop the procedure and set the neural network weights to w_{j+1}; otherwise continue.
7. Evaluate the gradient gj+1
8. If j is a multiple of W then reset the procedure by setting d_{j+1} = −g_{j+1} and go to step 11.
9. Compute βj using the Polak-Ribiere formula (Eq. B.25)
10. Calculate the new direction as dj+1 =−gj+1 + βjdj
11. Increment j by one and go back to step 5.
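As an illustration, the steps above can be sketched in Python. The sketch is not part of the thesis: the exact line minimisation on a quadratic and the stopping test on the squared gradient norm (used in place of the E(w_{j+1}) < ε test of step 6) are simplifying assumptions.

```python
import numpy as np

def conjugate_gradient(grad, w, n_dims, line_min, tol=1e-12, max_iter=200):
    """Sketch of the conjugate gradient procedure of Section B.1.1.

    grad     -- function returning the gradient g(w)
    line_min -- function returning the minimum of E along d, starting at w
    """
    g = grad(w)
    d = -g                                    # step 3: d_1 = -g_1
    for j in range(1, max_iter + 1):
        w = line_min(w, d)                    # step 5: line minimisation along d_j
        g_new = grad(w)
        if np.dot(g_new, g_new) < tol:        # step 6 (gradient-norm variant)
            break
        if j % n_dims == 0:                   # step 8: periodic restart
            d = -g_new
        else:                                 # steps 9-10: Polak-Ribiere update (Eq. B.25)
            beta = np.dot(g_new, g_new - g) / np.dot(g, g)
            d = -g_new + beta * d
        g = g_new
    return w

# Quadratic test case E(w) = b^T w + (1/2) w^T H w, with the exact step of Eq. B.16
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 1.0])
grad = lambda w: b + H @ w
line_min = lambda w, d: w - (d @ grad(w)) / (d @ H @ d) * d
w_min = conjugate_gradient(grad, np.zeros(2), 2, line_min)
# For a quadratic, the minimum (where b + H w = 0) is reached in W = 2 steps
```

For this two-dimensional quadratic the procedure terminates after two line searches, as predicted by Eq. B.20.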
B.2 Scaled conjugate gradients
In Eq. B.14, the product of the vector d_j with the Hessian matrix (defined as H ≡ ∇(∇E)) can be approximated by substituting v for d_j in the following equation:

v^T H = v^T ∇(∇E) = [∇E(w + εv) − ∇E(w)] / ε + O(ε)     (B.26)

where O(ε) is a residual term of the order of ε. This residual term can be reduced by one order by using central differences:

v^T H = v^T ∇(∇E) = [∇E(w + εv) − ∇E(w − εv)] / (2ε) + O(ε^2)     (B.27)
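The central-difference product of Eq. B.27 needs only two gradient evaluations and never forms the Hessian explicitly. A minimal sketch (the function name is illustrative, not from the thesis):

```python
import numpy as np

def hessian_vector(grad, w, v, eps=1e-6):
    """Approximate the product H v by central differences of the gradient
    (Eq. B.27), without computing the Hessian explicitly."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2.0 * eps)

# Check against a known quadratic, where grad(w) = b + H w and H v is exact
H = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
grad = lambda w: b + H @ w
v = np.array([0.3, -0.7])
hv = hessian_vector(grad, np.zeros(2), v)
# hv agrees with H @ v (for symmetric H, v^T H is the transpose of H v)
```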
However, in the case of a non-quadratic error function, the conjugate gradient approach can lead to an increase in the error if the Hessian matrix is not positive definite. In such a case, the product v^T H v will not be positive. To make sure that the denominator of Eq. B.14 remains positive even for a negative definite Hessian, the matrix H can be replaced by:
Hmod = H + λI (B.28)
where I is the identity matrix and λ is a scaling factor. The condition on λ is that it make the denominator in Eq. B.14 positive:

d_j^T H d_j + λ ||d_j||^2 > 0     (B.29)
Since the size of the step α_j depends inversely on the scaling factor, λ also controls the step size. If the value of λ is too small, the search region is large. This can be a problem if the error function is far from quadratic in the search region: if the quadratic approximation is not valid, the conjugate gradient formulae may not be effective in the search for the minimum, and the step size should be reduced. Conversely, if the approximation is good the step size can be safely increased. Hence, the scaling factor has two functions: to make sure that the error decreases when the Hessian is negative definite, and to control the search region based on a measure of the goodness of the quadratic approximation. Its value is adjusted at each iteration j.
Starting with λ_1 = 0, the denominator of Eq. B.14 can be written as:

DEN_j = d_j^T H_j d_j + λ_j ||d_j||^2     (B.30)
Note that the Hessian now carries the subindex j, indicating that in general it is not constant: its value may change at each step.
If DEN_j < 0 the value of λ_j should be increased. Denoting the new values of λ_j and of the denominator with an overbar, we have:

DEN̄_j = d_j^T H_j d_j + λ̄_j ||d_j||^2 = DEN_j + (λ̄_j − λ_j) ||d_j||^2     (B.31)
To make DEN̄_j > 0 the new scaling factor should satisfy:

λ̄_j > λ_j − DEN_j / ||d_j||^2     (B.32)
By choosing λ̄_j as double the value of the right-hand side of inequality B.32 we get:

DEN̄_j = −d_j^T H_j d_j     (B.33)
Then, replacing the denominator in Eq. B.14 by the right-hand side of Eq. B.33, we can calculate the value of the step α_j. To check whether the quadratic approximation is valid, the following index has been proposed [53]:
Δ_j = [E(w_j) − E(w_j + α_j d_j)] / [E(w_j) − E_Q(w_j + α_j d_j)]     (B.34)
where E_Q(w) is the local quadratic approximation of the error function in the neighbourhood of w_j, given by:
E_Q(w_j + α_j d_j) = E(w_j) + α_j d_j^T g_j + (1/2) α_j^2 d_j^T H_j d_j     (B.35)
It is clear from the above equation that Δ_j will be close to 1 if the approximation is good, and close to zero if the error function differs greatly from the quadratic assumption made. If the approximation is good then the value of the scaling factor can be decreased for the next iteration. On the contrary, if the value of Δ_j is very small, then the value of λ for the next iteration should be increased. A negative index Δ_j indicates that the step would move the weights to a point where the Hessian matrix is negative definite; therefore the weights should not be updated, and the value of λ should be increased accordingly¹ before re-calculating the step α_j.
Recalling the definition of α_j (Eq. B.14), the expression for E_Q(w_j + α_j d_j) can be written as:

E_Q(w_j + α_j d_j) = E(w_j) + (1/2) α_j d_j^T g_j     (B.36)
Substituting this expression in Eq. B.34 yields:

Δ_j = 2{E(w_j) − E(w_j + α_j d_j)} / (−α_j d_j^T g_j)     (B.37)
¹An increase given by λ̄_j = λ_j + DEN_j (1 − Δ_j) / ||d_j||^2 has been suggested [113].
Lower and upper thresholds for Δ_j can be 0.25 and 0.75 respectively, for example [53]. The factor by which λ is increased or decreased is also chosen arbitrarily. An example of the quadratic approximation quality check could be:
• If Δj >0.75, the approximation is good, decrease the scaling factor, λj+1 = λj/4
• If Δj <0.25, the approximation is poor, increase the scaling factor, λj+1 = 4λj
• If 0.25≤Δj ≤0.75, leave the scaling factor as it is, λj+1 = λj
• If Δ_j < 0, the Hessian has become negative definite with the step α_j; increase the scaling factor as shown in footnote (1), then recalculate the modified Hessian and α_j and check Δ_j again.
The scaling technique has been called the model trust region method because the model, in this case the quadratic, is only trusted within a region defined by the scaling factor.
B.2.1 The scaled conjugate gradient algorithm
The scaled conjugate gradient algorithm can be summarised as follows:
1. Choose an initial set of weights w1
2. Set λ1 = 0
3. Choose a very small value for ε
4. Evaluate the gradient g1
5. Set d1 =−g1
6. Initialise j=1
7. Estimate dTj H by central differences
8. Evaluate the denominator DEN_j; if it is negative, increase λ_j to yield DEN̄_j
9. Calculate αj
10. Check the quality of the quadratic approximation and modify λ_j correspondingly. If Δ_j < 0 go back to step 8, otherwise continue.
11. If E(wj+1) < ε, stop the procedure and set the neural network weights to wj+1,otherwise continue.
12. Evaluate the gradient gj+1
13. If j is a multiple of W then reset the procedure by setting d_{j+1} = −g_{j+1} and go to step 16.
14. Compute βj using the Polak-Ribiere formula (Eq. B.25)
15. Calculate the new direction as dj+1 =−gj+1 + βjdj
16. Increment j by one and go back to step 7.
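The procedure above can be sketched compactly in Python. Again this is not from the thesis: the stopping test uses the squared gradient norm, the λ update constants follow the example thresholds given earlier, and ε for the central differences is scaled by ||d_j|| as a practical assumption.

```python
import numpy as np

def scg(E, grad, w, max_iter=200, tol=1e-12):
    """Sketch of the scaled conjugate gradient procedure of Section B.2.1."""
    n, lam = w.size, 0.0                      # step 2: lambda_1 = 0
    g = grad(w)
    d = -g                                    # step 5: d_1 = -g_1
    for j in range(1, max_iter + 1):
        # step 7: estimate H d by central differences (Eq. B.27)
        eps = 1e-6 / max(np.linalg.norm(d), 1e-12)
        Hd = (grad(w + eps * d) - grad(w - eps * d)) / (2.0 * eps)
        den = d @ Hd + lam * (d @ d)          # DEN_j (Eq. B.30)
        if den <= 0.0:                        # step 8: force a positive denominator
            lam = 2.0 * (lam - den / (d @ d))
            den = -(d @ Hd)                   # Eq. B.33
        alpha = -(d @ g) / den                # step 9 (Eq. B.16)
        # step 10: quality of the quadratic approximation (Eq. B.37)
        Delta = 2.0 * (E(w) - E(w + alpha * d)) / (-alpha * (d @ g))
        if Delta > 0.75:
            lam /= 4.0                        # good approximation: widen trust region
        elif Delta < 0.25:
            lam += den * (1.0 - Delta) / (d @ d)   # poor approximation: shrink it
        if Delta <= 0.0:
            continue                          # reject the step; retry with larger lambda
        w = w + alpha * d                     # accept the step
        g_new = grad(w)
        if g_new @ g_new < tol:               # step 11 (gradient-norm variant)
            break
        if j % n == 0:                        # step 13: periodic restart
            d = -g_new
        else:                                 # steps 14-15: Polak-Ribiere (Eq. B.25)
            beta = g_new @ (g_new - g) / (g @ g)
            d = -g_new + beta * d
        g = g_new
    return w

# Quadratic test case: the minimum satisfies b + H w = 0, and Delta stays close to 1
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 1.0])
E = lambda w: b @ w + 0.5 * w @ H @ w
w_min = scg(E, lambda w: b + H @ w, np.zeros(2))
```

On a quadratic error surface the scaling factor stays at zero and the iteration reduces to the plain conjugate gradient procedure of Section B.1.1.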
Appendix C
Vigilance Database
The central-channel EEG of eight healthy young adults, performing various vigilance tasks for more than 2 hours, was recorded and digitised with 12-bit precision at a 256 Hz sampling rate. The subjects were asked to stay awake the night before and to abstain from caffeine or any other stimulatory substances for 24 hours before and during the tests. Subject ages and genders are shown in Table C.1. The recording montage consisted of electrode pairs C4−A1 (central right), C3−A2 (central left) and A1−A2 (mastoid), EOG left, EOG right and submental EMG.
Subject   ID number   Gender   Age [years]
   1          3       female       20
   2          4       female       19
   3          6       male         24
   4          7       male         18
   5          9       female       23
   6         10       female       21
   7         11       male         20
   8         12       female       21

Table C.1: Bristol subjects
The test consisted of three different vigilance tasks: a tracking task, a reaction time task and a serial attention task. In the tracking task the subject is asked to follow a rectangle on a computer screen by moving a pointing device; the rectangle moves randomly. In the reaction time task, the subject has to press the space bar of the computer keyboard every time a 3x3 mm red square appears on the screen. The square appears at random intervals at an average rate of 18 times per minute. The serial attention task consists of a digit display with values within the [−9,+9] interval. The value decreases or increases at random times and the subject has to hit the left or right button of the mouse to keep it at zero value. Performance indices taken for these tasks are:
• Tracking error for the tracking task, or deviation of the position indicator from the rectangle.
• Reaction time, or time interval in milliseconds between the appearance of the red square and the pressing of the space bar in the reaction time task.
• Missed stimuli, the number of times the subject did not react when the red square showed up in the reaction time task.
• Serial attention task error, the absolute value of the display in the serial attention task.
A previous study [50] found very little or no correlation between these performance indices and the expert scoring of the EEG. For instance, the reaction time remains almost constant across all the vigilance sub-categories, while the increase in the tracking error is not significant as the subject gets drowsy. Although it is well known that lapses of alertness due to sleepiness or fatigue lead to decreased performance, quantifying the loss of performance and correlating it with physiological measures of sleepiness have proved to be a difficult task [7][157]. Unrelated factors like motivation and distractions may affect the results [121][130][35][42]. Therefore the performance indices in the vigilance database are not used in this thesis.
Appendix D
LED Database
D.1 Method
A frontal-channel EEG¹ was recorded from ten OSA patients performing a behavioural version of the Maintenance of Wakefulness Test (MWT), and digitised with 12-bit precision at a sampling rate of 128 Hz. Each subject performed at least four tests on the same day, at 9:00, 11:00, 13:00 and 15:00 hours, in a darkened room with the subject lying on a couch at 45 degrees. The subjects were asked to stay awake for as long as possible. Each test lasts for a maximum of 40 minutes. A light-emitting diode flashes a red light which is displayed for approximately one second every three seconds throughout the test. The subject is asked to touch a button on a hand-piece every time the light flashes. Each flash that a subject fails to respond to is recorded. When seven flashes in succession are not responded to (a total time of 21 s) the test is terminated automatically and the subject is considered to have fallen asleep. The subject wears headphones through which white noise is played from a pre-recorded tape, to reduce any interference due to background noise. All subjects were asked to abstain from alcohol for 24 hours, and from coffee and tea for 12 hours, prior to the study. Also, subjects were asked not to sleep during the day of testing. Results from the tests are shown in Table D.1.
D.2 Demographic data
The patients have a mean age of 50.4 years (standard deviation, sd, of 11.3 years) and an average body mass index (BMI)² of 41.1 (sd 8.6). All 10 subjects had been diagnosed with OSA, with an Epworth Sleepiness Scale (ESS) score greater than 10 indicating subjective daytime sleepiness (mean ESS 16.7, sd 4.6), and a positive overnight sleep study (performed the night before the vigilance tests) with a number of oxygen saturation (SaO2) dips of greater than 4% per hour (mean SaO2 dips/hour of 30.4, sd 19.7) and a number of movements per hour of sleep with a mean of 76.9 (sd 40.4). Full data for each subject can be found in Tables D.2 and D.3.

¹Other channels recorded are the right and left mastoid, with the reference on either mastoid.
²BMI is determined by dividing the weight in kilograms by the square of the height in metres.
                                    Test
Subject   09:00       11:00       13:00       15:00       comments
   1      06:21 (A)   29:12 (B)   16:57 (C)   26:00 (D)
   2      25:57 (A)   17:39 (B)   15:00 (C)   12:09 (D)
   3      03:06 (A)   10:00 (B)   07:12 (D)   05:12 (E)   Repeat as the patient said
                                  08:27 (C)               he didn't fall asleep
   4      40:00 (A)   40:00 (B)   40:00 (C)   40:00 (D)
   5      20:03 (A)   13:00 (B)   17:12 (C)   19:21 (D)
   6      13:27 (A)   10:24 (B)   11:39 (C)         (D)
                                              10:45 (E)   Repeat as the patient fell asleep at start
   7      02:33 (A)   08:45 (B)   07:33 (C)   40:00 (D)
   8      21:30 (A)   32:30 (B)   09:03 (C)   31:30 (D)
   9      07:51 (A)    -(a) (B)   00:21 (C)   00:51 (D)   Falling asleep all the time
  10      05:18 (A)   00:21 (B)   02:57 (D)   05:21 (F)   Repeat as the patient said
                                  06:03 (C)   01:33 (E)   he didn't fall asleep

(a) too short

Table D.1: Time of falling asleep (in mm:ss), measured by the clinician from the start of the MWT test. The letter used in this thesis to refer to a given test is shown in brackets.
Subject   Age [years]   Height [m]   Weight [kg]   BMI [kg/m²]   ESS
   1          57           1.75        152.40          50         15
   2          33           1.85        112.94          33         20
   3          56           1.78        118.40          37         12
   4          55           1.83        159.00          48         11
   5          41           1.73        111.00          37         16
   6          52           1.78        115.20          36         20
   7          53           1.83        150.32          45         19
   8          72           1.75         87.90          29         10
   9          37           1.70        127.50          44         24
  10          48           1.75        110.00          36         20

Table D.2: Subject demographic details
Subject   O2 dip rate [hr⁻¹]   Movement [hr⁻¹]
   1             55ance              77
   2             30.5                12
   3             16.3               107
   4             14.3                61
   5             13.7                12
   6             30.1               100
   7             17.4                58
   8             18.0               107
   9             36.4               116
  10             72.7               119

Table D.3: Overnight sleep study results
Bibliography
[1] H.D.I. Abarbanel, T.W. Frison, and L.Sh. Tsimring. Obtaining order in a world of chaos. IEEE Signal Processing Magazine, pages 49–65, May 1998.
[2] P. Achermann, R. Hartmann, A. Gunzinger, W. Guggenbuhl, and A.A. Borbely. All-night sleep EEG and artificial stochastic control signals have similar correlation dimensions. Electroencephalogr. Clin. Neurophysiol., 90(5):384–7, May 1994.
[3] L.A. Aguirre, V.C. Barros, and A.V. Souza. Nonlinear multivariable modeling and analysis of sleep apnea time series. Comput Biol Med, 29(3):207–28, 1999. Abstract available at: http://www.websciences.org/cftemplate/NAPS/indiv.cfm?ID=19992088.
[4] T. Akerstedt. Work hours, sleepiness and the underlying mechanisms. J. Sleep Res., 4(Suppl. 2):15–22, Apr 1995.
[5] T. Akerstedt and M. Gillberg. Subjective and objective sleepiness in the active individual. Int. J. Neurosci., 52(1-2):29–37, May 1990.
[6] C. Alford. EEG, performance and subjective sleep measures are not the same: implications for assessment of daytime sleepiness. In Abstracts: British Sleep Society 4th Annual Meeting, page 10. British Sleep Society, 1992.
[7] C. Alford, C. Idzikowski, and I. Hindmarch. Are electrophysiological measures of sleep tendency related to subjective state and performance? In Abstracts: British Sleep Society 3rd Annual Meeting, page 31. British Sleep Society, 1991.
[8] C. Alford, N. Rombaut, J. Jones, S. Foley, and C. Idzikowski. Acute effects of hydroxyzine on nocturnal sleep and sleep tendency the following day: a C-EEG study. Human Psychopharmacology, 7, 1992.
[9] P. Anderer, S. Roberts, A. Schlogl, G. Gruber, G. Klosch, W. Herrmann, P. Rappelsberger, O. Filz, M.J. Barbanoj, G. Dorffner, and B. Saletu. Artifact processing in computerized analysis of sleep EEG - a review. Neuropsychobiology, 40(3):150–7, Sep 1999.
[10] N.O. Andersen. On the calculation of filter coefficients for maximum entropy spectral analysis. Geophysics, 19(1):69–72, 1970.
[11] Atlas Task Force of the American Sleep Disorders Association. EEG arousals: scoring rules and examples. Sleep, 15(2):174–184, 1992.
[12] B. Kemp. A proposal for computer-based sleep/wake analysis. J. Sleep Res., 2(3):179–85, 1993. Consensus Report.
[13] I.N. Bankman, V.G. Sigillito, R.A. Wise, and P.L. Smith. Feature-based detection of the K-complex wave in the human electroencephalogram using neural networks. IEEE Transactions on Biomedical Engineering, 39(12):1305–10, Dec 1992.
[14] J. S. Barlow. Methods of analysis of nonstationary EEGs, with emphasis on segmentation techniques: A comparative review. Journal of Clinical Neurophysiology, 2(3):267–304, 1985.
[15] R. Baumgart-Schmitt, W.M. Herrmann, and R. Eilers. On the use of neural network techniques to analyze sleep EEG data. Third communication: robustification of the classificator by applying an algorithm obtained from 9 different networks. Neuropsychobiology, 37(1):49–58, 1998.
[16] R. Baumgart-Schmitt, W.M. Herrmann, R. Eilers, and F. Bes. On the use of neural network techniques to analyse sleep EEG data. First communication: application of evolutionary and genetic algorithms to reduce the feature space and to develop classification rules. Neuropsychobiology, 36(4):194–210, 1997.
[17] M.A. Bedard, J. Montplaisir, F. Richer, and J. Malo. Nocturnal hypoxemia as a determinant of vigilance impairment in sleep apnea syndrome. Chest, 100(2):367–70, Aug 1991.
[18] L.S. Bennett, B.A. Langford, J.R. Stradling, and R.J.O. Davies. Sleep fragmentation indices as predictors of daytime sleepiness and NCPAP response in OSA. The Osler Chest Unit, Churchill Hospital, Headington, Oxford, England.
[19] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
[20] M.H. Bonnet and D.L. Arand. We are chronically sleep deprived. Sleep, 18(10):908–11, Dec 1995.
[21] G. E. P. Box and G. M. Jenkins. Time series analysis: forecasting and control. Holden-Day series in time series analysis. Holden-Day, San Francisco, rev. edition, 1976.
[22] G. Bremer, J.R. Smith, and I. Karacan. Automatic detection of the K-complex in sleep electroencephalograms. IEEE Transactions on Biomedical Engineering, 17(4):314–23, Oct 1970.
[23] D.M. Brittenham. Artifacts: Activities not arising from the brain. In Daly and Pedley [37].
[24] P. Brown and C.D. Marsden. What do the basal ganglia do? The Lancet, 351:1801–4, June 1998.
[25] J. P. Burg. Maximum entropy spectral analysis. PhD thesis, Stanford University, Stanford, California, 1975.
[26] J. P. Burg, D. G. Luenberger, and D. L. Wenger. Estimation of structured covariance matrices. Proceedings of the IEEE, 70(9):963–974, Sep 1982.
[27] M.A. Carskadon and W.C. Dement. Daytime sleepiness: quantification of a behavioral state. Neurosci. Biobehav. Rev., 11(3):307–17, 1987.
[28] R. Caton. The electric currents of the brain. British Medical Journal, (2):278, 1875.
[29] K. Cheshire, H. Engleman, I. Deary, C. Shapiro, and N.J. Douglas. Factors impairing daytime performance in patients with sleep apnea/hypopnea syndrome. Arch. Intern. Med., 152(3):538–41, Mar 1992.
[30] S. Chokroverty, editor. Sleep disorders medicine: basic science, technical considerations, and clinical aspects. Butterworth-Heinemann, Oxford, 2nd edition, 1999.
[31] Circadian Technologies Inc. Alertness Technologies, 2000. Available at: http://www.circadian.com.
[32] R. Conradt, U. Brandenburg, T. Penzel, J. Hasan, A. Varri, and J.H. Peter. Vigilance transitions in reaction time test: a method of describing the state of alertness more objectively. Clin. Neurophysiol., 110(9):1499–509, Sep 1999.
[33] R. Conradt, T. Penzel, U. Brandenburg, and J.H. Peter. Description of vigilance in the EEG during reaction time test in patients with sleep apnea. In Proceedings of the European Medical and Biological Engineering Conference EMBEC'99, volume 1, pages 414–15, Vienna, Austria, Nov 1999. International Federation for Medical and Biological Engineering.
[34] J.W. Cooley and J.W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, Apr 1965.
[35] M. Corsi-Cabrera, J. Ramos, C. Arce, M.A. Guevara, M. Ponce de Leon, and I. Lorenzo. Changes in the waking EEG as a consequence of sleep and sleep deprivation. Sleep, 15(6):550–5, Dec 1992.
[36] A.C. da Rosa, A. L. N. Fred, and J. M. N. Leitao. Stochastic model of awake and sleep EEG. In M. Holt, C. Cowan, P. Grant, and W. Sandham, editors, Signal Processing VII: Theories and Applications. European Association for Signal Processing, 1994.
[37] D.D. Daly and T.A. Pedley, editors. Current practice of clinical electroencephalography. Raven Press, 1990.
[38] R.S. Daniel. Alpha and theta EEG in vigilance. Perceptual and Motor Skills, 25:697–703, 1967.
[39] R.J. Davies, P.J. Belt, S.J. Roberts, N.J. Ali, and J.R. Stradling. Arterial blood pressure responses to graded transient arousal from sleep in normal humans. J. Appl. Physiol., 74(3):1123–30, Mar 1993.
[40] F. De Carli, L. Nobili, P. Gelcich, and F. Ferrillo. A method for the automatic detection of arousals during sleep. Sleep, 22(5):561–72, Aug 1999.
[41] D. F. Dinges. An overview of sleepiness and accidents. J. Sleep Res., 4(Suppl. 2):4–14, 1995.
[42] D.F. Dinges and N. Barone-Kribbs. Performing while sleepy: effects of experimentally-induced sleepiness. In Monk [114], pages 97–128.
[43] K. Doghramji. Maintenance of wakefulness test. In Chokroverty [30].
[44] N. J. Douglas. The sleep apnoea/hypopnoea syndrome and snoring. In C. M. Shapiro, editor, ABC of Sleep Disorders. BMJ, 1993.
[45] N. J. Douglas. The sleep apnoea/hypopnoea syndrome. In R. Cooper, editor, Sleep. Chapman and Hall Medical, 1994.
[46] M.J. Drinnan, A. Murray, G.J. Gibson, and C.J. Griffiths. Interobserver variability in recognizing arousal in respiratory sleep disorders. Am. J. Respir. Crit. Care Med., 158(2):358–62, 1998.
[47] M.J. Drinnan, A. Murray, J.E. White, A.J. Smithson, G.J. Gibson, and C.J. Griffiths. Evaluation of activity-based techniques to identify transient arousal in respiratory sleep disorders. J. Sleep Res., 5:173–180, 1996.
[48] M.J. Drinnan, A. Murray, J.E. White, A.J. Smithson, C.J. Griffiths, and G.J. Gibson. Automated recognition of EEG changes accompanying arousal in respiratory sleep disorders. Sleep, 19(4):296–303, 1996.
[49] J. Durbin. The fitting of time series models. Revue de l'Institut international de statistique, 28:233–44, 1960.
[50] M. Duta. The Study of Vigilance using Neural Networks Analysis of EEG. PhD thesis, University of Oxford, 1998.
[51] Nervous system. In Encyclopædia Britannica Online, page <http://search.eb.com/bol/topic?eu=119939&sctn=1>. Encyclopædia Britannica, Inc., 1994-2000. [Accessed 23 June 2000].
[52] J. Fell, J. Roschke, K. Mann, and C. Schaffner. Discrimination of sleep stages: a comparison between spectral and nonlinear EEG measures. Electroencephalogr. Clin. Neurophysiol., 98(5):401–10, May 1996.
[53] R. Fletcher. Practical methods of optimization. Wiley, Chichester, 2nd edition, 1987.
[54] J.M. Gaillard, M. Krassoievitch, and R. Tissot. Automatic analysis of sleep by a hybrid system: new results. Electroencephalography and Clinical Neurophysiology, 33(4):403–10, Oct 1972.
[55] I. Gath and E. Bar-On. Computerized method for scoring of polygraphic sleep recordings. Comput. Programs Biomed., 11(3):217–23, Jun 1980.
[56] C. F. George and A. Smiley. Sleep apnea and automobile crashes. Sleep, 22(6):790–5, 1999.
[57] C.J. Goeller and C.M. Sinton. A microcomputer-based sleep stage analyzer. Computer Methods and Programs in Biomedicine, 29(1):31–6, May 1989.
[58] C. Guilleminault, M. Partinen, M.A. Quera-Salva, B. Hayes, W.C. Dement, and G. Nino-Murcia. Determinants of daytime sleepiness in obstructive sleep apnea. Chest, 94(1):32–7, Jul 1988.
[59] M. Hack, R.J. Davies, R. Mullins, S.J. Choi, S. Ramdassingh-Dow, C. Jenkinson, and J.R. Stradling. Randomised prospective parallel trial of therapeutic versus subtherapeutic nasal continuous positive airway pressure on simulated steering performance in patients with obstructive sleep apnoea. Thorax, 55(3):224–31, 2000.
[60] P. Halasz, O. Kundra, P. Rajna, I. Pal, and M. Vargha. Micro-arousals during nocturnal sleep. Acta Physiologica Academiae Scientiarum Hungaricae, 54(1):1–12, 1979.
[61] J. Hasan, K. Hirvonen, A. Varri, V. Hakkinen, and P. Loula. Validation of computer analysed polygraphic patterns during drowsiness and sleep onset. Electroencephalogr. Clin. Neurophysiol., 87(3):117–27, Sep 1993.
[62] S. S. Haykin. Communication Systems. Wiley, New York, 3rd edition, 1994.
[63] S. S. Haykin. Adaptive Filter Theory. Information and systems sciences series. Prentice-Hall, New Jersey, 3rd edition, 1996.
[64] H. Head. The conception of nervous and mental energy II. Vigilance: a physiological state of the nervous system. Br. J. Psychol., 14:125–147, 1923.
[65] R. Hess. The electroencephalogram in sleep. Electroenceph. clin. Neurophysiol., 16:44–55, 1964.
[66] S.L. Himanen and J. Hasan. Limitations of the Rechtschaffen and Kales. Sleep Medicine Reviews, 4(2):149–67, Apr 2000.
[67] B. Hjorth. EEG analysis based on time domain properties. Electroencephalography and Clinical Neurophysiology, 29:306–310, 1970.
[68] C.A. Holzmann, C.A. Perez, C.M. Held, M. San Martin, F. Pizarro, J.P. Perez, M. Garrido, and P. Peirano. Expert-system classification of sleep/waking states in infants. Medical and Biological Engineering and Computing, 37(4):466–76, 1999.
[69] J. Horne. Why we sleep: the functions of sleep in humans and other mammals. Oxford University Press, Oxford, 1988.
[70] J.A. Horne. Dimensions to sleepiness. In Monk [114], pages 169–96.
[71] E. Huupponen, A. Varri, J. Hasan, J. Saarinen, and K. Kaski. Sleep arousal detection with neural network. Medical & Biological Engineering & Computing, 34(suppl. 1):219–20, 1996.
[72] K. Inoue, K. Kumamaru, S. Sagara, and S. Matsuoka. Pattern recognition approach to human sleep EEG analysis and determination of sleep stages. Memoirs of the Faculty of Engineering, Kyushu University, 42(3):177–95, Sep 1982.
[73] J. Wu, E.C. Ifeachor, E.M. Allen, and N.R. Hudson. A neural network based artefact detection system for EEG signal processing. In Proceedings of the International Conference on Neural Networks and Expert Systems in Medicine and Healthcare, pages 257–66, Plymouth, UK, 1994. Univ. Plymouth.
[74] B. H. Jansen. Time series analysis by means of linear modelling. In R. Weitkunat, editor, Digital Biosignal Processing. Elsevier Science Publishers, 1991.
[75] B.H. Jansen, A. Hasman, and R. Lenten. Piecewise analysis of EEGs using AR-modeling and clustering. Comput. Biomed. Res., 14(2):168–78, Apr 1981.
[76] H.H. Jasper. The 10-20 system of the international federation. Electroencephalography and Clinical Neurophysiology, 10:371–5, 1958.
[77] G. M. Jenkins and D. G. Watts. Spectral analysis and its applications. Holden-Day series in time series analysis. Holden-Day, San Francisco, 1968.
[78] M. Jobert, H. Escola, E. Poiseau, and P. Gaillard. Automatic analysis of sleep using two parameters based on principal component analysis of electroencephalography spectral data. Biological Cybernetics, 71(3):197–207, 1994.
[79] T. Jokinen, T. Salmi, A. Ylikoski, and M. Partinen. Use of computerized visual performance test in assessing day-time vigilance in patients with sleep apneas and restless sleep. Int. J. Clin. Monit. Comput., 12(4):225–30, 1995.
[80] T.P. Jung, S. Makeig, M. Stensmo, and T.J. Sejnowski. Estimating alertness from the EEG power spectrum. IEEE Transactions on Biomedical Engineering, 44(1):60–69, 1997.
[81] S. M. Kay and S. L. Marple. Spectrum analysis - a modern perspective. Proceedings of the IEEE, 69(11):1380–1419, November 1981.
[82] S.M. Kay. Recursive maximum likelihood estimation of autoregressive processes. IEEE Transactions on Acoustics, Speech, and Signal Processing, 31(1):56–65, Feb 1983.
[83] G. Kecklund and T. Akerstedt. Sleepiness in long distance truck driving: an ambulatory EEG study of night driving. Ergonomics, 36(9):1007–17, Sep 1993.
[84] S.A. Keenan. Polysomnographic technique: An overview. In Chokroverty [30].
[85] P. Kellaway. An orderly approach to visual analysis: characteristics of the normal EEG of adults and children. In Daly and Pedley [37].
[86] B. Kemp, E. W. Groneveld, A. J. M. W. Jansen, and J. M. Franzen. A model-based monitor of human sleep stages. Biological Cybernetics, 57:365–378, 1987.
[87] L. G. Kiloh, A. G. McComas, and J. W. Osselton. Clinical Electroencephalography. Butterworths, fourth edition, 1981.
[88] K. Kinnari, J.H. Peter, A. Pietarinen, L. Groete, T. Penzel, A. Varri, P. Laippala, A. Saastamoinen, W. Cassel, and J. Hasan. Vigilance stages and performance in OSAS patients in a monotonous reaction time task. Clinical Neurophysiology, 111(6):1130–6, 2000.
[89] J.R. Knott, F.A. Gibbs, and C.E. Henry. Fourier transform of the electroencephalogram during sleep. J. Exp. Psychol., 31:465–77, 1942.
[90] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59–69, 1982.
[91] M.H. Kryger and P.J. Hanly. Cheyne-Stokes respiration in cardiac failure. In Sleep and Respiration, pages 215–26. Wiley-Liss, Inc., 1990.
[92] M. Kubat, G. Pfurtscheller, and D. Flotzinger. AI-based approach to automatic sleep classification. Biol. Cybern., 70(5):443–8, 1994.
[93] St. Kubicki, W.M. Herrmann, and L. Holler. Critical comments on the rules by Rechtschaffen and Kales concerning the visual evaluation of EEG sleep records. In St. Kubicki and W.M. Herrmann, editors, Methods of sleep research, pages 19–35. Gustav Fischer Verlag, Stuttgart, 1985.
[94] A. Kumar. A real-time system for pattern recognition of human sleep stages by fuzzy system analysis. Pattern Recognition, 9(1):43–6, Jan 1977.
[95] N. Levinson. The Wiener RMS (root-mean-square) error criterion in filter design and prediction. Journal of Mathematics and Physics, 25:261–278, 1947.
[96] A.L. Loomis, E.N. Harvey, and G.A. Hobart III. Cerebral stages during sleep, as studied by human brain potentials. J. exp. Psychol., 21:127–144, 1937.
[97] I. Lorenzo, J. Ramos, C. Arce, M.A. Guevara, and M. Corsi-Cabrera. Effect of total sleep deprivation on reaction time and waking EEG activity in man. Sleep, 18(5):346–54, Jun 1995.
[98] D. Lowe. Feature space embeddings for extracting structure from single channel wake EEG using RBF networks. In Neural Networks for Signal Processing VIII. Proceedings of the 1998 IEEE Signal Processing Society Workshop, pages 428–37, New York, 1998. IEEE.
Bibliography 247
[99] R. Luthringer, R. Minot, M. Toussaint, F. Calvi-Gries, N. Schaltenbrand, and J.P. Macher. All-nightEEG spectral analysis as a tool for the prediction of clinical response to antidepressant treatment.Biol. Psychiatry, 38(2):98–104, Jul 1995.
[100] P.M. Macey, J.S. Li, and R.P. Ford. Deterministic properties of apnoeas in an abdominal breathingsignal. Med. Biol. Eng. Comput., 37(3):335–43, May 1999.
[101] P.M. Macey, J.S.J. Li, and R.P.K. Ford. Expert system for the detection of apnoea. EngineeringApplications of Artificial Intelligence, 11(3):425–38, Jun 1998.
[102] D.J.C. MacKay. The evidence framework applied to classification networks. Neural Computation,4(5):720–36, Sep 1992.
[103] D.J.C. MacKay. A practical bayesian framework for backpropagation networks. Neural Computa-tion, 4(3):448–72, May 1992.
[104] S. Makeig and T.P. Jung. Changes in alertness is principal component of variance in the EEGspectrum. NeuroReport, 7:213–216, 1995.
[105] J. Makhoul. Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4):561–580, 1975.
[106] M. Matsuura, K. Yamamoto, H. Fukuzawa, Y. Okubo, H. Uesugi, T. Kojima M. Moriiwa, and Y. Shi-mazono. Age development and sex differences of various EEG elements in healthy children andadults, quantification by a computerized wave form recognition method. Electroencephalogr. Clin.Neurophysiol., 60(5):394–406, May 1985.
[107] W.T. McNicholas. Sleep apnoea and driving risk. european respiratory society task force on ”publichealth and medicolegal implications of sleep apnoea” [editorial]. Eur. Respir. J., 13(6):1225–7,Jun 1999.
[108] L. T. McWhorter and L. L. Scharf. Nonlinear maximum likelihood estimation of autoregressivetime series. IEEE Transactions on Signal Processing, 43(12):2909–2919, 1995.
[109] R.G. Miller. The jackknife, a review. Biometrika, 61(1):1–15, Apr 1974.
[110] A. Mitchell. Liquid genius. New Scientist, 13 March 1999.
[111] M.M. Mitler, K.S. Gujavarty, and C.P. Browman. Maintenance of wakefulness test: a polysomno-graphic technique for evaluation treatment efficacy in patients with excessive somnolence. Elec-troencephalogr. Clin. Neurophysiol., 53(6):658–61, 1982.
[112] M.M. Mitler, J.S. Poceta, and B.G. Bigby. Sleep scoring technique. In Chokroverty [30].
[113] M. Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4):525–33, 1993.
[114] T.H. Monk, editor. Sleep, sleepiness and performance. Human performance and cognition. John Wiley & Sons, Chichester, England, 1991.
[115] M. Moore-Ede. We have ways of keeping you alert. New Scientist, pages 30–5, Nov. 13th 1993.
[116] MTI Research’s Alertness Technology. Alertness Monitor Technical summary. Available at:http://www.mti.com.
[117] S.S. Narayan and J.P. Burg. Spectral estimation of quasi-periodic data. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(3):512–518, March 1990.
[118] R.D. Ogilvie, D.M. McDonagh, S.N. Stone, and R.T. Wilkinson. Eye movements and the detection of sleep onset. Psychophysiology, 25(1):81–91, Jan 1988.
[119] M.M. Ohayon and C. Guilleminault. Epidemiology of sleep disorders. In Chokroverty [30].
[120] B.S. Oken and K.H. Chiappa. Short-term variability in EEG frequency analysis. Electroencephalogr.Clin. Neurophysiol., 69(3):191–8, Mar 1988.
[121] J.P. Howe on behalf of the Council of Scientific Affairs. Fatigue, sleep disorders, and motor vehicle crashes. Technical Report CSA Report 1-A-96, American Sleep Disorders Association, 1996.
[122] A.V. Oppenheim and R.W. Schafer. Digital Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.,1975.
[123] J. Pardey, S.J. Roberts, L. Tarassenko, and J. Stradling. A new approach to the analysis of the human sleep-wakefulness continuum. Journal of Sleep Research, pages 201–210, 1996.
[124] B. Parks, M. Olsen, and P. Resnik. WordNet: A machine-readable lexical database organized by meanings. Available at: http://work.ucsd.edu:5141/cgi-bin/http webster, 1991-98.
[125] T.A. Pedley and R.D. Traub. Physiological basis of the EEG. In Daly and Pedley [37].
[126] T. Penzel and R. Conradt. Computer based sleep recording and analysis. Sleep Medicine Reviews,4(2):131–48, Apr 2000.
[127] T. Penzel and J. Petzold. A new method for the classification of subvigil stages, using the Fourier transform, and its application to sleep apnea. Comput. Biol. Med., 19(1):7–34, 1989.
[128] P. Philip, J. Taillard, C. Guilleminault, M.A. Quera-Salva, B. Bioulac, and M. Ohayon. Long distance driving and self-induced sleep deprivation among automobile drivers. Sleep, 22(4):475–80, Jun 1999.
[129] D. Pitson, N. Chhina, S. Knijn, M. van Herwaaden, and J. Stradling. Changes in pulse transit time and pulse rate as markers of arousal from sleep in normal subjects. Clin. Sci. Colch., 87(2):269–73, Aug 1994.
[130] R.T. Pivik. The several qualities of sleepiness: psychophysiological considerations. In Monk [114],pages 3–37.
[131] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge, 2nd edition, 1994.
[132] J.C. Principe, S.K. Gala, and T.G. Chang. Sleep staging automaton based on the theory of evidence. IEEE Transactions on Biomedical Engineering, 36(5):503–9, May 1989.
[133] J.C. Principe and J.R. Smith. SAMICOS – a sleep analyzing microcomputer system. IEEE Transactions on Biomedical Engineering, 33(10):935–41, Oct 1986.
[134] P.F. Prior and D.E. Maynard. Monitoring cerebral function: long-term monitoring of EEG and evoked potentials. Elsevier, 1986.
[135] R. Cooper, C.D. Binnie, and C.J. Fowler. Origins and technique. In C.D. Binnie and J.W. Osselton, editors, Clinical Neurophysiology: EMG, nerve conduction and evoked potentials / EEG technology. Butterworth-Heinemann Ltd, Oxford, 1995.
[136] A. Rechtschaffen and A. Kales. A Manual of Standardized Terminology, Techniques and Scoring System for Sleep Stages of Human Subjects. Public Health Service, U.S. Government Printing Office, Washington D.C., 1968.
[137] I.A. Rezek and S.J. Roberts. Stochastic complexity measures for physiological signal analysis. IEEE Transactions on Biomedical Engineering, 45(9):1186–91, 1998. Available at: http://www.robots.ox.ac.uk/ sjrob/pubs.h.
[138] B.D. Ripley. Statistical theories of model fitting. In NATO Advanced Study Institute on Generalization in Neural Networks and Machine Learning, volume 168 of NATO ASI Series F: Computer and Systems Sciences. Springer, Cambridge, U.K., August 1998.
[139] S. Roberts, I. Rezek, R. Everson, H. Stone, S. Wilson, and C. Alford. Automated assessment of vigilance using committees of radial basis function analysers. IEE Proceedings – Science, Measurement and Technology, 147(6):333–338, 2000.
[140] T. Roth, T.A. Roehrs, and L. Rosenthal. Measurement of sleepiness and alertness: Multiple sleep latency test. In Chokroverty [30].
[141] J. W. Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers,C-18(5):401–409, 1969.
[142] J. Santamaria and K.H. Chiappa. The EEG of drowsiness in normal adults. J. Clin. Neurophysiol.,4(4):327–82, Oct 1987.
[143] N. Schaltenbrand, R. Lengelle, and J.P. Macher. Neural network model: application to automatic analysis of human sleep. Comput. Biomed. Res., 26(2):157–71, Apr 1993.
[144] B. Scholkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans. Signal Processing, 45:2758–65, 1997. Available at http://www.kernel-machines.org/papers/AIM-1599.ps.
[145] F.W. Sharbrough. Electrical fields and recording techniques. In Daly and Pedley [37].
[146] F.Z. Shaw, R.F. Chen, H.W. Tsao, and C.T. Yen. Algorithmic complexity as an index of cortical function in awake and pentobarbital-anesthetized rats. J. Neurosci. Methods, 93(2):101–10, Nov 1999.
[147] D.K. Siegwart, L. Tarassenko, S.J. Roberts, J.R. Stradling, and J. Partlett. Sleep apnoea analysis from neural network post-processing. In Proceedings of the Fourth International Conference on Artificial Neural Networks, pages 427–32, London, UK, 1995. IEE.
[148] D.W. Skagen. Estimation of running frequency spectra using a Kalman filter algorithm. Journal of Biomedical Engineering, 10(3):275–9, May 1988.
[149] J.R. Smith. Automated analysis of sleep EEG data. In F.H. Lopes da Silva, W. Storm van Leeuwen, and A. Remond, editors, Handbook of Electroencephalography and Clinical Neurophysiology, volume 2. Elsevier Science Publishers, 1986.
[150] J.R. Smith and I. Karacan. EEG sleep stage scoring by an automatic hybrid system. Electroencephalography and Clinical Neurophysiology, 31(3):231–7, Sep 1971.
[151] J.R. Smith, I. Karacan, and M. Yang. Automated analysis of the human sleep EEG. Waking andSleeping, 2:75–82, 1978.
[152] E. Stanus, B. Lacroix, M. Kerkhofs, and J. Mendlewicz. Automated sleep scoring: a comparative reliability study of two algorithms. Electroencephalogr. Clin. Neurophysiol., 66(4):448–56, Apr 1987.
[153] M.B. Sterman, G.J. Schummer, T.W. Dushenko, and J.C. Smith. Electroencephalographic correlates of pilot performance: simulation and in-flight studies. In Electric and Magnetic Activity of the Central Nervous System: Research and Clinical Applications in Aerospace Medicine, pages 31/1–16, Neuilly sur Seine, France, Feb 1988. AGARD.
[154] J.R. Stradling. Personal communication.
[155] J.R. Stradling. Handbook of Sleep-Related Breathing Disorders. Oxford University Press, Oxford,1993.
[156] J.R. Stradling, D.J. Pitson, L. Bennett, C. Barbour, and R.J.O. Davies. Variation in the arousal pattern after obstructive events in obstructive sleep apnea. Am. J. Respir. Crit. Care. Med., 159(1):130–6, Jan 1999.
[157] K. Swingler and L.S. Smith. Producing a neural network for monitoring driver awareness. Neural Computing and Applications, 4:96–104, 1996.
[158] T. Shimada, T. Shiina, and Y. Saito. Detection of characteristic waves of sleep EEG by neural network analysis. IEEE Transactions on Biomedical Engineering, 47(3):369–79, 2000.
[159] L. Tarassenko. A Guide to Neural Computing Applications. Arnold, London, 1998.
[160] L. Tarassenko, J. Pardey, S. Roberts, H. Chia, and M. Laister. Neural network analysis of sleep disorders. In Proceedings of ICANN'95, Paris, Oct 1995. European Neural Network Society.
[161] J. Teran-Santos, A. Jimenez-Gomez, and J. Cordero-Guevara. The association between sleep apnea and the risk of traffic accidents. Cooperative Group Burgos-Santander. New England Journal of Medicine, 340(11):847–51, 1999.
[162] M.E. Tipping. The relevance vector machine. In S.A. Solla, T.K. Leen, and K-R. Muller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, Cambridge, Mass, 2000. Available at http://www.kernel-machines.org/papers/upload 10444 rvm nips.ps.
[163] M.E. Tipping and D. Lowe. Shadow targets: a novel algorithm for topographic projections by radial basis functions. Neurocomputing, 19(1-3):211–22, Mar 1998.
[164] L. Torsvall and T. Akerstedt. Extreme sleepiness: Quantification of EOG and spectral EEG parameters. Intern. J. Neuroscience, 38:435–441, 1988.
[165] N. Townsend and L. Tarassenko. Micro-arousals in human sleep: An initial evaluation of automatic detection. Robotics Research Group, Department of Engineering Science, Oxford University, Oxford, 1996.
[166] U. Trutschel, R. Guttkuhn, C. Ramsthaler, M. Golz, and M. Moore-Ede. Automatic detection of microsleep events using a neuro-fuzzy hybrid system. In 6th European Congress on Intelligent Techniques and Soft Computing, EUFIT'98, volume 3, pages 1762–6, Verlag Mainz, Aachen, Germany, 1998.
[167] S. Uchida, I. Feinberg, J.D. March, Y. Atsumi, and T. Maloney. A comparison of period amplitude analysis and FFT power spectral analysis of all-night human sleep EEG. Physiol. Behav., 67(1):121–31, Aug 1999.
[168] S. Uchida, M. Matsuura, S. Ogata, T. Yamamoto, and N. Aikawa. Computerization of Fujimori's method of waveform recognition. A review and methodological considerations for its application to all-night sleep EEG. J. Neurosci. Methods, 64(1):1–12, Jan 1996.
[169] V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9, pages 281–287. MIT Press, Cambridge, Mass, 1997. Available at http://www.kernel-machines.org/papers/vapgolsmo96.ps.
[170] A. Varri, K. Hirvonen, J. Hasan, P. Loula, and V. Hakkinen. A computerized analysis system for vigilance studies. Comput. Methods Programs Biomed., 39(1-2):113–24, Sep-Oct 1992.
[171] R. Venturini, W.W. Lytton, and T.J. Sejnowski. Neural network analysis of event related potentials and electroencephalogram predicts vigilance. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 651–658. Morgan Kaufmann Publishers, San Mateo, CA, 1992.
[172] M.L. Vis and L.L. Scharf. A note on recursive maximum likelihood for autoregressive modeling.IEEE Transactions on Signal Processing, 42(10):2881–3, Oct 1994.
[173] J. Wright, R. Johns, I. Watt, A. Melville, and T. Sheldon. Health effects of obstructive sleep apnoea and the effectiveness of continuous positive airways pressure: a systematic review of the research evidence. British Medical Journal, 314:851–60, Mar 1997.
[174] G.U. Yule. On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers. Philosophical Transactions of the Royal Society of London, A226:267–98, 1927.
[175] M. Zamora. How disturbed is your sleep? The study of arousals using neural networks. In Neural Computing Application Forum Meeting, Oxford, England, Sep 1998. NCAF.
[176] M. Zamora and L. Tarassenko. The study of micro arousals using neural network analysis of the EEG. In IEE Ninth International Conference on Artificial Neural Networks, volume 2, pages 625–30, Edinburgh, Scotland, Sep 1999. IEE.