The Study of the Sleep and Vigilance Electroencephalogram Using Neural Network Methods
Mayela E. Zamora
St Cross College
Supervisor: Prof. L. Tarassenko
Sponsor: Universidad Central de Venezuela
A thesis submitted to the
Department of Engineering Science, University of Oxford,
in fulfilment of the requirements for the degree of Doctor of Philosophy.
Hilary Term, 2001
Declaration
I declare that this thesis is entirely my own work and, except where otherwise stated, describes my own research.
M. E. Zamora, St Cross College
Mayela E. Zamora, St Cross College
Doctor of Philosophy, Hilary Term, 2001
The Study of the Sleep and Vigilance Electroencephalogram
Using Neural Network Methods
Abstract
This thesis describes the use of neural network methods for the analysis of the electroencephalogram
(EEG), primarily in subjects with a severe sleep disorder known as Obstructive Sleep Apnoea (OSA). This
is a condition in which breathing stops briefly and repeatedly during sleep, causing frequent awakening
as the subject gasps for breath. Day-time sleepiness is the main symptom of OSA, but current methods
of assessing the level of drowsiness are either time-consuming (e.g. scoring the EEG) or unreliable
(e.g. subjective measures of the person’s sense of sleepiness, performance in vigilance tasks, etc.). The work presented
in this thesis is two-fold. In the first part, a method for the automatic detection of micro-arousals from
features extracted from single-channel EEG is developed and tested. Autoregressive (AR) modelling is used
to extract the features from the EEG. A compromise was found between the stationarity requirements of
AR modelling and the variance of the AR estimates by using a 3-second analysis window with a 2-second
overlap. The EEG features are then used as the inputs to a multi-layer perceptron (MLP) neural network
trained to track the sleep-wake continuum. It was found that a micro-arousal may cause an increase in the
slow rhythms (δ band) of the EEG at the same time as it causes an increase in the amplitude of the higher
frequencies (α and/or β bands). The automated system shows high sensitivity Se (median 0.97) and
positive predictive accuracy PPA (median 0.94) when validated against a human expert’s scores. This
is the first time that AR modelling has been used in micro-arousal detection. Visualisation analysis of the
EEG features revealed that Alertness and Drowsiness in vigilance tests are not the same as Wakefulness
and REM/Light Sleep in a sleep-promoting environment. The second part of the thesis describes the
application of another MLP neural network, trained to track the alertness-drowsiness continuum from
single-channel EEG, on OSA patients performing a visual attentional task. It was found that OSA subjects
may present “drowsy” EEG while performing well during the visual vigilance test. Also, the MLP analysis
of the wake EEG with these subjects showed that the transition to drowsiness may occur progressively
as well as in sudden dips. Correlation of the MLP output with a measure of task performance and
visualisation of EEG patterns in feature space show that the alertness EEG patterns of OSA subjects may
be closely related to the drowsiness EEG patterns of normal sleep-deprived subjects.
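The feature-extraction scheme summarised above (an AR model fitted to a 3-second analysis window sliding with a 2-second overlap, i.e. one feature vector per second of EEG) can be sketched as follows. This is an illustrative sketch only: the thesis fits AR models with the Burg algorithm (see Chapter 4), whereas this sketch solves the Yule-Walker equations directly, and the sampling rate, 10th model order and synthetic test signal are assumptions made here for the example.

```python
import numpy as np

def ar_coefficients(x, order):
    """Estimate AR coefficients by solving the Yule-Walker equations
    with a biased autocorrelation estimate (illustrative, not Burg)."""
    x = x - x.mean()
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz autocorrelation matrix R and right-hand side r[1..order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def sliding_ar_features(eeg, fs, order=10, win_s=3.0, overlap_s=2.0):
    """One AR feature vector per step: 3-s window, 2-s overlap (1-s step)."""
    win = int(win_s * fs)
    step = int((win_s - overlap_s) * fs)
    return np.array([ar_coefficients(eeg[i:i + win], order)
                     for i in range(0, len(eeg) - win + 1, step)])

# Synthetic "EEG": a 10 Hz (alpha-band) rhythm plus broadband noise
fs = 128
rng = np.random.default_rng(0)
t = np.arange(30 * fs) / fs
eeg = np.sin(2 * np.pi * 10 * t) + 0.5 * rng.standard_normal(len(t))

features = sliding_ar_features(eeg, fs)
print(features.shape)  # → (28, 10): one 10-coefficient vector per 1-s step
```

Each row of `features` would then serve as one input pattern to a classifier such as the MLP described in the thesis; the short window keeps the quasi-stationarity assumption plausible while the overlap restores a 1-second output rate.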
Acknowledgments
I am most grateful to Prof Lionel Tarassenko for supervising this work. Many thanks to my collaborators
at the Osler Chest Unit, Churchill Hospital, Dr John Stradling, Dr Melissa Hack, Dr Robert Davies and Dr
Lesley Bennett for providing the test data and the valuable clinical support. I would also like to thank Dr
Chris Alford for his helpful comments on the clinical aspects of this work.
To the Consejo de Desarrollo Cientifico y Humanistico de la Universidad Central de Venezuela, I extend
my sincere gratitude for the financial support, and to the staff of its Departamento de Recursos Humanos
for the quality of service that they gave me during my stay in the UK.
I am also very appreciative of all my fellow labmates, especially Dr Mihaela Duta, Dr Ruth Ripley, David
Clifton, Gari Clifford, Dr Simukai Utete, Dileepan Joseph, Iain Strachan, Dr Steve Collins, Dr Taigang He
and Dr Neil Townsend for their friendship and help. Special thanks to Jan Minchington for the efficient
office support and natural kindness. To all my friends in Oxford, and in Caracas, a million thanks.
Finally, and most importantly, endless gratitude to my parents, to my sisters, and to Neal for their
continuous support, cheering and love.
Contents
1 Introduction 1
1.1 Overview of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Sleep and day-time sleepiness 3
2.1 Sleep, wakefulness, sleepiness and alertness . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 The process of falling asleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.3 Going on to a deeper sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Breathing and sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Normal sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 Obstructive Sleep Apnoea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Daytime sleepiness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Causes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Sleepiness/fatigue related accidents . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.3 Correlation between OSA and accidents . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Measuring the sleep-wake continuum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.1 Measuring sleepiness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.2 Measuring sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Previous work on EEG monitoring for micro-arousals and day-time vigilance 12
3.1 The EEG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Origin of the brain electrical activity . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.2 Description of the EEG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.3 Recording the EEG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.4 Extracerebral potentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Analysis of the EEG during sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.1 Changes in the EEG from alert wakefulness to deep sleep . . . . . . . . . . . . . . . 20
3.2.2 Visual scoring method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.3 Computerised analysis of the sleep EEG . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Analysis of the EEG for the detection of micro-arousals . . . . . . . . . . . . . . . . . . . . 29
3.3.1 Cortical arousals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.2 ASDA rules for cortical arousals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.3 Computerised micro-arousal scoring . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.4 Using physiological signals other than the EEG . . . . . . . . . . . . . . . . . . . . 33
3.3.5 Using the EEG in arousal detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Analysis of the EEG for vigilance monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.1 Changes in the EEG from alertness to drowsiness . . . . . . . . . . . . . . . . . . . 35
3.4.2 EEG analysis in vigilance studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.3 Vigilance monitoring algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4 Parametric modelling and linear prediction 43
4.1 Spectrum estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.1 Deterministic continuous-time signals . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.2 Stochastic signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Autoregressive Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 AR parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Asymptotic stationarity of an AR process . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Yule-Walker equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.3 Using an AR model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Linear Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.1 Wiener Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.2 Linear Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5 Maximum entropy method (MEM) for power spectrum density estimation . . . . . . . . . 66
4.6 Algorithms for AR modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.6.1 Levinson-Durbin recursion to solve the Yule-Walker equation . . . . . . . . . . . . . 67
4.6.2 Other algorithms for AR parameter estimation . . . . . . . . . . . . . . . . . . . . . 72
4.6.3 Sensitivity to additive noise of the AR model PSD estimator . . . . . . . . . . . . . 78
4.7 Modelling the EEG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5 Neural network methods 81
5.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1.1 The error function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.1.2 The decision-making stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.1.3 Multi-layer perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 Optimisation algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2.1 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2.2 Conjugate gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3 Model order selection and generalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.1 Regularisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3.2 Early stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3.3 Performance of the network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.4 Radial basis function neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4.1 Training an RBF network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.4.2 Comparison between an RBF and an MLP . . . . . . . . . . . . . . . . . . . . . . . 108
5.5 Data visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.5.1 Sammon map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.5.2 NeuroScale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6 Sleep Studies 115
6.1 Using neural networks with normal sleep data: benchmark experiments . . . . . . . . . . 115
6.1.1 Previous work on normal sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.1.2 Data Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.1.3 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1.4 Assembling a balanced database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.1.5 Data visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.1.6 Training a Multi-Layer Perceptron neural network . . . . . . . . . . . . . . . . . . . 123
6.1.7 Sleep analysis using the trained neural networks . . . . . . . . . . . . . . . . . . . 126
6.2 Using the neural networks with OSA sleep data . . . . . . . . . . . . . . . . . . . . . . . . 130
6.2.1 Data description, pre-processing and feature extraction . . . . . . . . . . . . . . . . 130
6.2.2 MLP analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2.3 Detection of μ-arousals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2.4 The choice of threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7 Visualisation of the alertness-drowsiness continuum 146
7.1 The vigilance database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.1.1 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.1.2 Visualising the vigilance database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.1.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.2 Visualising vigilance and sleep data together . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.2.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
7.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
8 Training a neural network to track the alertness-drowsiness continuum 164
8.1 Neural Network training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.1.1 The training database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.1.2 The neural network architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.1.3 Choosing training, validation and test sets . . . . . . . . . . . . . . . . . . . . . . . 165
8.1.4 Optimal (n − 1)-subject MLP per partition . . . . . . . . . . . . . . . . . . . . . . . 167
8.2 Testing on the nth subject . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.2.1 Qualitative correlation with expert labels . . . . . . . . . . . . . . . . . . . . . . . . 169
8.2.2 Quantitative correlation with expert labels . . . . . . . . . . . . . . . . . . . . . . . 170
8.3 Training an MLP with n subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.4 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9 Testing using the vigilance trained network 181
9.1 Vigilance test database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
9.2 Running the 7-subject vigilance MLP with test data . . . . . . . . . . . . . . . . . . . . . . 182
9.2.1 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
9.2.2 MLP analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
9.3 Visualisation analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
9.3.1 Projection onto the 7-subject vigilance NEUROSCALE map . . . . . . . . . . . . . . . 210
9.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
9.5 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
10 Conclusions and future work 220
10.1 Overview of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
10.2 Discussion of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.3 Main research results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
10.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
10.5 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
A Discrete-time stochastic processes 227
A.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
B Conjugate gradient optimisation algorithms 232
B.1 The conjugate gradient directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
B.1.1 The conjugate gradient algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
B.2 Scaled conjugate gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
B.2.1 The scaled conjugate gradient algorithm . . . . . . . . . . . . . . . . . . . . . . . . 237
C Vigilance Database 238
D LED Database 240
D.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
D.2 Demographic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
List of Figures
2.1 The human brain showing its main structures . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 A conventional all night sleep classification plot from one normal subject . . . . . . . . . . 11
3.1 A simplified neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 The 10-20 International System of Electrode Placement . . . . . . . . . . . . . . . . . . . . 17
3.3 Conventional electrode positions for monitoring sleep . . . . . . . . . . . . . . . . . . . . . 19
3.4 Sleep EEG stages (taken from [69]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Apnoeic event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1 Stochastic process model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Autoregressive filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Moving Average filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4 Moving Average Autoregressive filter (b0 = 1, q = p − 1) . . . . . . . . . . . . . . . . . . . 53
4.5 Time series of the synthesised AR process . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6 Autocorrelation function of the synthesised AR process . . . . . . . . . . . . . . . . . . . 59
4.7 Second order AR process generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.8 Second order AR process analyser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.9 AR coefficients estimates’ mean and variance . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.10 Filter problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.11 Prediction filter of order p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.12 Prediction-error filter of order p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.13 Prediction-error filter of order p rearranged to look like an AR analyser . . . . . . . . . . 65
4.14 Lattice filter of first order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.15 Lattice filter of first order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 The classification process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2 An artificial neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3 Hyperbolic tangent and Sigmoid functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 An I−J−K neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.5 Early stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6 A radial basis function network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.1 The neural network’s wakefulness P(W), REM/light sleep P(R) and deep sleep P(S) outputs; and measure of sleep depth P(W)-P(S) (from Pardey et al. [123]) . . . . . . . . . . 117
6.2 Mean error and covariance matrix trace for reflection coefficients computed with the Burg algorithm (wakefulness and Sleep stage 4) vs data length N . . . . . . . . . . . . . . . . 121
6.3 Sammon map for the balanced sleep dataset; classes W, R and S . . . . . . . . . . . . . . . 123
6.4 NEUROSCALE map for the balanced sleep dataset; classes W, R and S . . . . . . . . . . . . 124
6.5 Average performance of the MLPs vs number of hidden units . . . . . . . . . . . . . . . . . 125
6.6 Performance of the 10-6-3 MLP vs regularisation parameters . . . . . . . . . . . . . . . . . 127
6.7 MLP outputs, P(W), P(R) and P(S) for subject 9’s all-night record, showing a 12-minute segment in detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.8 Sleep database subject 9 P(W)-P(S), raw (a) and 31-pt median filtered (b), compared to the human expert scored hypnogram (c) . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.9 OSA sleep MLP outputs for subjects 3 and 8 . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.10 [P(W)-P(S)] output for OSA sleep subjects 3 (top) and 8 (middle), and for normal sleep subject 9 (bottom) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.11 μ-arousal detection procedure. Upper trace: [P(W)-P(S)] and a 0.5 threshold; middle trace: thresholding result; lower trace: μ-arousal automatic score with ASDA timing criteria 133
6.12 μ-arousal validation. Upper trace: automated score for 0.7 threshold; middle trace: automated score for 0.8 threshold; lower trace: visually scored signal . . . . . . . . . . . 134
6.13 Se, PPA and Corr vs threshold for OSA subjects . . . . . . . . . . . . . . . . . . . . . . . 137
6.14 [P(W)-P(S)] output for OSA sleep subject 2 (top) and amplitude histogram showing the two main clusters, surrounded by a circle of one standard deviation, and the EDM threshold (bottom) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.15 Se, PPA and Corr for the best threshold (blue), the EDM threshold (red), and a 0.5 fixed threshold (green) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.16 OSA subject 5 EEG and [P(W )-P(S)] output during a typical μ-arousal for this subject (24s) 141
6.17 Spectrogram of the EEG segment shown in Fig. 6.16 calculated with 1s resolution using 10th-order AR modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.18 OSA subject 5 EEG and [P(W)-P(S)] output during a μ-arousal missed by the automated scoring system (24s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.19 OSA subject 8 [P(W )-P(S)] output and human expert scores (2 minutes) . . . . . . . . . . 144
6.20 Sleep database subject 9 raw P(W)-P(S) using a 1-s analysis window (a) and using a 3-s analysis window (b), compared to the human expert scored hypnogram (c) . . . . . . . . 145
7.1 Vigilance Sammon map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.2 Vigilance NEUROSCALE map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.3 Vigilance Sammon map showing subject’s distribution (Alertness in red and Drowsiness in blue) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.4 Vigilance NEUROSCALE map projections for each subject (Alertness in magenta and Drowsiness in blue) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.5 Vigilance NEUROSCALE map trained with all subjects, including the α+ subject . . . . . . . 158
7.6 Vigilance NEUROSCALE trained with all subjects, including α+ subject (Alertness in magenta and Drowsiness in blue) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.7 Subject 8 reflection coefficient histogram (green) in relation to the rest of the subjects in the training set (magenta) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.8 Vigilance and sleep NEUROSCALE map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.9 Vigilance and sleep NEUROSCALE projections for all the patterns in each class (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue) . . . . . . . . . . . . . . . . . . . . 161
7.10 Vigilance and Sleep Sammon map (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.11 Vigilance and Sleep Sammon map (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.1 Average misclassification error for the validation set vs. number of hidden units J for the (n − 1)-subject MLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.2 Average misclassification error on the validation set with respect to regularisation parameters (νz, νy) for the (n − 1)-subject MLP with J = 3 (linear interpolation used between 12 values) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.3 Time course of the MLP output for vigilance subject 1 . . . . . . . . . . . . . . . . . . . . 173
8.4 Time course of the MLP output for vigilance subject 2 . . . . . . . . . . . . . . . . . . . . 174
8.5 Time course of the MLP output for vigilance subject 3 . . . . . . . . . . . . . . . . . . . . 175
8.6 Time course of the MLP output for vigilance subject 4 . . . . . . . . . . . . . . . . . . . . 176
8.7 Time course of the MLP output for vigilance subject 5 . . . . . . . . . . . . . . . . . . . . 177
8.8 Time course of the MLP output for vigilance subject 6 . . . . . . . . . . . . . . . . . . . . 178
8.9 Time course of the MLP output for vigilance subject 7 . . . . . . . . . . . . . . . . . . . . 179
8.10 Average misclassification error for the validation set vs. number of hidden units J for the7-subject MLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
9.1 LED subject 1 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 186
9.2 LED subject 1 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 187
9.3 LED subject 2 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 188
9.4 LED subject 2 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 189
9.5 LED subject 3 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 190
9.6 LED subject 3 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 191
9.7 LED subject 3 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 192
9.8 LED subject 4 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 193
9.9 LED subject 4 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 194
9.10 LED subject 5 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 195
9.11 LED subject 5 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 196
9.12 LED subject 6 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 197
9.13 LED subject 6 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 198
9.14 LED subject 6 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 199
9.15 LED subject 7 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 200
9.16 LED subject 7 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 201
9.17 LED subject 8 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 202
9.18 LED subject 8 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 203
9.19 LED subject 9 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 204
9.20 LED subject 9 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . . 205
9.21 LED subject 10 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . 206
9.22 LED subject 10 MLP output and missed hits time courses . . . . . . . . . . . . . . . . . . . 207
9.23 LED subjects MLP output vs missed hits scatter plots . . . . . . . . . . . . . . . . . . . . . 208
9.24 LED subjects MLP output vs missed hits scatter plots . . . . . . . . . . . . . . . . . . . . . 209
9.25 LED subjects no-missed hits MLP output histogram . . . . . . . . . . . . . . . . . . . . . . 210
9.26 Patterns from LED subjects 1 and 2 projected onto the 7-subject vigilance NEUROSCALE map 211
9.27 Patterns from LED subjects 3 and 5 projected onto the 7-subject vigilance NEUROSCALE map 212
9.28 Patterns from LED subjects 7, 9 and 10 projected onto the 7-subject vigilance NEUROSCALE map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
9.29 Patterns from LED subjects 4 and 6 projected onto the 7-subject vigilance NEUROSCALE map 218
9.30 Patterns from LED subject 8 projected onto the 7-subject vigilance NEUROSCALE map . . . 219
A.1 Stochastic process ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
List of Tables
3.1 The Rechtschaffen and Kales standard for sleep scoring. . . . . . . . . . . . . . . . . . . . 22
3.2 The vigilance sub-categories and their definition . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1 AR coefficients estimates’ mean and variance . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Feedback coefficients in terms of the reflection coefficients . . . . . . . . . . . . . . . . . . 72
4.3 Reflection coefficients in terms of the feedback coefficients . . . . . . . . . . . . . . . . . . 72
6.1 Mean error and trace of covariance matrix for synthesised EEG reflection coefficients (wakefulness) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.2 Mean error and trace of covariance matrix for synthesised EEG reflection coefficients (stage 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3 Misclassification error (expressed as a percentage) for the best three MLPs . . . . . . . . . 126
6.4 Se, PPA and Corr per subject for various threshold values . . . . . . . . . . . . . . . . . . 136
6.5 Optimal threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.6 Equi-distance to means (EDM) threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.7 Fixed (0.5) threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.1 Alford et al. vigilance sub-categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.2 Number of patterns per subject per class in vigilance training database . . . . . . . . . . . 149
7.3 Number of patterns per subject per class in K-means training set . . . . . . . . . . . . . . 150
8.1 Partitions and distribution of patterns in training (Tr) and Validation (Va) sets . . . . . . . 166
8.2 Optimum MLP parameters per partition and percentage classification error for training (Tr) and validation (Va) sets . . . 167
8.3 Percentage correlation between 1-s segments of the 15-pt median filtered MLP output and 15s-based expert labels . . . 170
C.1 Bristol subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
D.1 Time of falling asleep (in mm:ss) measured by the clinician from the start of the MWT test. The letter used in this thesis to refer to a given test is shown in brackets . . . 241
D.2 Subject demographic details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
D.3 Overnight sleep study results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Chapter 1
Introduction
Obstructive Sleep Apnoea (OSA) is a condition in which breathing stops briefly and repeatedly during
sleep, causing frequent awakening as the subject gasps for breath. Day-time sleepiness is the main symp-
tom of OSA. Diagnosis of the disorder includes an over-night sleep study to count the number of arousals,
and a day-time sleepiness assessment. Changes from wakefulness to deep sleep and from alertness to
drowsiness are reflected in many physiological signals and behavioural measures. Among the physiologi-
cal variables, the electroencephalogram (EEG) is one of the most relevant, but traditional methods, based
on visual assessment of this signal (for example, counting the number of micro-arousals during sleep)
are time-consuming or not reliable. Most of the changes in the EEG associated with the transition from
alertness to drowsiness and to sleep are in the frequency domain, and many attempts to computerise the
EEG analysis are based on frequency-domain methods.
1.1 Overview of thesis
The focus of this thesis will be on sleep disturbance (micro-arousals in OSA patients) and its effect on
day-time performance, as assessed with vigilance monitoring. Definitions of terms used in this thesis
and a description of the OSA disorder and its implications in society can be found in chapter 2. Clinical
background and a literature review on computerised methods are presented in chapter 3. Section 3.2.3
of that chapter shows that little research has been done on the computerised analysis of disturbed sleep.
Furthermore, there is no prior work on the computerised analysis of both sleep disturbance and vigilance
from the EEG. This thesis will describe the research undertaken in order to develop such a framework
using AR modelling for frequency-domain analysis and neural network methods for clustering and for
classification.
AR modelling theory and algorithms are the subject of chapter 4. Chapter 5 is a review of neural net-
work methods. Experiments carried out on AR modelling to find a compromise between the stationarity
requirements and the variance of the AR estimates are described in chapter 6, which also presents the
use of neural networks with sleep data to track the sleep-wake continuum from single-channel EEG. An
automated system is developed to detect micro-arousals in OSA sleep EEG, based on the neural network
outputs. Results, compared with an expert’s scores, show high sensitivity with a low number of false pos-
itives, and a good similarity in starting time and duration. A case study shows that the EEG may present
a mixed-frequency pattern during a micro-arousal, instead of a shift in frequency as usually described in
the literature.
In chapter 7 we explain the reasons why a different network is needed to map the alertness-drowsiness
continuum. Visualisation analysis of the EEG features revealed differences between Alertness and Drowsi-
ness in vigilance tests, with respect to Wakefulness and REM/Light Sleep in a sleep-promoting environ-
ment. Chapter 8 deals with the training of neural networks with vigilance data to track the alertness-
drowsiness continuum using single-channel EEG only. Finally, chapter 9 presents the results of the trained
network with data from OSA patients performing a visual attentional task. This study shows that OSA
subjects may present “drowsy” EEG while performing well. Also, the MLP analysis of the wake EEG with
these subjects shows that the transition to Drowsiness may occur progressively as well as in sudden dips.
Correlation of the MLP output with a measure of task performance and visualisation of EEG patterns in
feature space show that the alertness EEG patterns of OSA subjects may be more closely related to the
drowsiness EEG patterns of normal sleep-deprived subjects than to the alertness patterns of those subjects.
Chapter 2
Sleep and day-time sleepiness
2.1 Sleep, wakefulness, sleepiness and alertness
2.1.1 Definitions
Although the above words are part of almost everyone’s daily conversations we will define them in the
sense that they are to be used in this thesis. Sleep is a natural and periodic state of rest during which
consciousness of the world is suspended while its counterpart, wakefulness, is a periodic state during
which one is conscious and aware of the world [124]. Between sleep and wakefulness is the transitional
state of sleepiness [130], which has been defined as a physiological drive towards sleep [4], usually
resulting from sleep deprivation, or as a subjective feeling or state of sleep need. Wrongly used as a
synonym for wakefulness, alertness is the process of paying close and continuous attention, a state of
readiness to respond [124], an optimal activated state of the brain [115]. Vigilance, another word of
similar meaning, was first introduced in the literature by Head in 1923, who differentiated the stages of
awareness [64].
2.1.2 The process of falling asleep
Two theories try to explain how we fall asleep. Oswald in 1962 suggested that the fall of cerebral vigilance
does not occur as a steady decline but occurs briefly over and over again, with frequent surges of cerebral
vigilance to, once more, a high level. This depicts sleep onset as a punctuated rather than gradual
process. However, more and more evidence has been found recently using recordings of the brain’s
electrical activity and respiration signals that show a gradual oscillatory descent into sleep [6][130].
Sleepiness has an ultradian (> once/day) modulation, with three times of day when this condition is most
common: just after awakening in the morning, in mid-afternoon (the so-called “post-lunch dip”, which is
nevertheless not related to the ingestion of food) and just prior to sleep. The post-lunch dip correlates
with the occurrence of siestas and an increase in the incidence of automobile accidents [166][41][130].
The drive to sleep can be overridden by motivation, especially in life-threatening situations, but it cannot
be suppressed indefinitely [42].
2.1.3 Going on to a deeper sleep
Once the sleep state is reached, physical signs of this condition are lack of movement, reduced postural
muscle tone, closed eyes, lack of response to limited stimuli, and more regular and relaxed breathing,
usually accompanied by an increase in upper airway noise. During sleep, the eyes can move repeat-
edly and rapidly. This condition is called rapid eye movement (REM) sleep and is usually associated
with the act of dreaming. A normal subject follows cycles or periodic patterns of REM and non-REM
(NREM) sleep during the night, going from the wakefulness stage to the deep sleep stage and then to
REM sleep, for a time longer than 20 minutes but usually no more than 1 hour, descending again into a
deep sleep stage, and repeating the REM-NREM sleep 90-minute cycle for about 4 or 5 times (see Fig 2.2
in section 2.4.2)[155].
2.2 Breathing and sleep
2.2.1 Normal sleep
When a normal subject is awake, ventilation is controlled by two pathways, one driven by the brain-
stem respiratory control centre and the other by the cortex (see Fig. 2.1). The one which is controlled
by the brain-stem is a vagal reflex and is more related to oxygen and carbon dioxide concentration
control. During sleep, this respiratory centre remains active but the cortex drive disappears causing
regular breathing as well as a fall in ventilation and a rise in the CO2 concentration. The reduction
in muscular tone causes a similar effect. The intercostal muscles stop their breathing motion and the
tubular pharynx, which relies on tonic and phasic muscle activity to stay open, narrows as it and related
muscles lose tone. This pharyngeal narrowing increases the upper airway resistance. At the same time,
the loss in tone of the intercostal muscles increases the chest wall compliance, allowing the diaphragm to
elevate the rib-cage more easily. The overall effect is that the breathing looks more relaxed and the ratio
of abdominal contribution to rib-cage contribution decreases, at least in NREM sleep [155].
Figure 2.1: The human brain showing its main structures
The further reduction in tone experienced by the intercostal muscles during tonic REM sleep brings
another fall in ventilation followed by a recovery in phasic REM sleep, when the randomly excited cortex
is able to drive the breathing again, making it less regular. The abdominal contribution increases to a
higher level than when the subject is awake [155].
2.2.2 Obstructive Sleep Apnoea
An obstructive apnoea occurs when the airflow in the ventilation system stops for more than 10 s due
to an obstruction in the upper airways. A hypopnoea occurs when the normal flow is
reduced by 50% or more for more than 10s [91]. The number of apnoea and hypopnoea events per hour,
called the respiratory disturbance index (RDI) or apnoea/hypopnoea index (AHI), is used to determine
whether breathing patterns are normal or abnormal. Usually, an AHI of 5 or more is considered abnormal
[119].
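The AHI calculation defined above is simple enough to state directly. The sketch below is illustrative (the function names are not from any standard library); it computes events per hour of sleep and applies the conventional abnormality threshold of 5.

```python
# Apnoea/hypopnoea index (AHI): apnoea + hypopnoea events per hour of sleep.
# An AHI of 5 or more is conventionally considered abnormal.

def apnoea_hypopnoea_index(n_apnoeas, n_hypopnoeas, sleep_hours):
    """Events per hour of sleep (function name is illustrative)."""
    if sleep_hours <= 0:
        raise ValueError("sleep duration must be positive")
    return (n_apnoeas + n_hypopnoeas) / sleep_hours

def is_abnormal(ahi, threshold=5.0):
    return ahi >= threshold

ahi = apnoea_hypopnoea_index(n_apnoeas=28, n_hypopnoeas=14, sleep_hours=7.0)
print(f"AHI = {ahi:.1f}, abnormal: {is_abnormal(ahi)}")  # → AHI = 6.0, abnormal: True
```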
Some subjects develop a sleep disorder called Obstructive Sleep Apnoea (OSA) in which apnoea or hypop-
noea events occur when the upper airway, usually crowded by obesity, enlarged glands, or other kinds of
obstruction, collapses under the negative pressure created by inspiration as the muscles lose their tone.
Then, the subject increases his respiratory efforts gradually until the intrathoracic pressure drops to a
subatmospheric value. Only when the carbon dioxide level rises and the oxygen level falls enough to
awake the cortex respiratory mechanism, does the returning tone unblock the upper airways and restore
ventilation. Recently, some studies [155] have pointed out the possibility that the increase in respiratory
effort is responsible for the cortex arousal. Whatever the cause, this arousal is short in duration, sometimes
referred to in the literature as a “micro-arousal”1, and the patient is rarely conscious of it [45].
If the apnoea/hypopnoea event is followed by an overshoot of hyper-ventilation, then the threshold of
the carbon dioxide level to provoke spontaneous ventilation can fall, and the next apnoea will have a
period when no respiratory effort is being made [155].
Micro-arousals
An arousal is a mechanism of the organism to increase the level of alertness in order to respond more
effectively to danger, whether it be external or internal and whether actual or perceived. In terms of
sleep, arousal not only refers to waking up but also to a series of physiological changes in autonomic
balance (i.e. heart rate, blood pressure, skin potential) and brain cortex activity [45].
Arousals caused by an apnoea/hypopnoea event are a short duration response caused by an internal
stimulus. Their length can range from just 3 to 5 seconds up to 20 seconds [11], and they can be barely
noticeable or can end in a choking sensation or panic [45]. Fifteen or more micro-arousals per hour are
enough to diagnose OSA with confidence, but the number of arousals can be greater than 400 during the
night [155], and some studies found up to 100 per hour [45]. This fragmentation decreases the quality of
the sleep by diminishing the effective sleep time. Progressive sleepiness during daytime is a consequence,
starting with some loss of vigilance when the subject is performing a boring task, but soon leading him or
1The term micro-arousal was first introduced by Halasz in 1979 [60]
her to fall asleep while doing other activities such as reading, watching TV, sitting as a passenger in a car
or train or taking a bath. In the worst case, the subject may fall asleep while driving a machine at work or
a car, causing shunting accidents, and more serious crashes [155]. The deterioration in daytime function
correlates with the frequency of the micro-arousals rather than the extent of the reduced arterial oxygen
saturation [44].
Arousals can have causes other than obstructive sleep apnoea (OSA), for instance, ageing, leg movements,
pain, some forms of insomnia, but the most common cause is OSA [30]. OSA has a prevalence of 1-4% in
the overall population, 85% of the sufferers being males, and is highest in the 40-59 year age group, the
percentage of those affected rising to 4-8% [107] [44]. The problem usually arises in middle age, when
the muscles become less rigid and decreased activity leads to weight gain [155].
2.3 Daytime sleepiness
2.3.1 Causes
Sleep deprivation is one of the most common causes of sleepiness in our society. Studies on sleep de-
privation have found that a reduction in nocturnal sleep of as little as 1.3 to 1.5 hours per night results
in a reduction of daytime alertness by as much as 32% as measured by the multiple sleep latency test
(see section 2.4.1 for a description of this test)[20]. Physiological and psychological functions deteriorate
progressively over accumulating hours of sleep loss as well as over periods of fragmented sleep [35][97].
A second cause of sleepiness is OSA, the most common sleep disorder to cause day-time sleepiness,
even though the subjects affected by this disorder often report sleeping quite well [70] [107]. The
sleepiness of OSA sufferers is reflected in neuro-physiological impairment in originality, logical order
in visual scanning, recent memory, word fluency, flattening of affect in speech, and spatial orientation.
They become easily distracted by irrelevant stimuli, and have difficulties in ordering temporally changing
principles (card sorting or digit symbol substitution) [70]. It has been recommended that diagnosis of
OSA should not only depend on the AHI but also on functional sleepiness [107].
2.3.2 Sleepiness/fatigue related accidents
Fatigue and sleepiness are often used as synonyms. The term fatigue is also used to indicate the effects
of working too long, or taking too little rest, and being unable to sustain a certain level of performance
on a task [41]. Fatigue as well as sleepiness is related to motivation; the capability of performing a given
task; and past, cumulative day-by-day arrangements and durations of sleep and work periods [121].
Loss of performance usually means decreased ability to maintain visual vigilance and to have quick re-
actions, as well as to respond to unique, emergency-type situations. The loss of performance brought
by fatigue and sleepiness can be fatal when driving, piloting, monitoring air traffic control or radar or
when operating dangerous machinery. It appears that the incidence of sleepiness-related fatal crashes
may be as high as 40% of all the accidents on long stretches of motorway [41]. 20-25% of drivers having
motorway accidents appear to do so as a result of falling asleep at the wheel [107]. Sleepiness influences
people’s perception of risk [41]. Drivers do not always recognise the signs of fatigue/drowsiness or may
choose to ignore them [121]. Evidence has been found that lorry drivers on 11-hour hauls show in their
physiological signals increased signs of marked drowsiness during the last three hours of their drive [83].
Long-distance driving, youth and sleep restriction are frequently associated with sleep-related accidents
[128].
Sleepiness is the major complaint of shift-workers. Displaced hours of work are in conflict with the
basic biological principles regulating the timing of rest and activity (i.e. the circadian and homeostatic
regulatory systems) [4]. Sleepiness may be the cause of more than 2% of all the serious accidents in
industry [41].
2.3.3 Correlation between OSA and accidents
As OSA is one of the most common causes of day-time sleepiness [161], the link between this sleep
disorder and motorway accidents is obvious. OSA patients show a high dispersion in reaction times [79],
and evidence has been found that OSA impairs driving [59]. Recent polls have revealed that 24% of
OSA patients reported falling asleep at least once per week while driving [107], so it is not a surprise to
find that OSA sufferers have a 5 to 7 fold greater risk of road accidents than normal subjects. Long-haul
lorry drivers belong to the highest-risk group [107]. Lorry drivers with OSA have twice as many crashes
per mile driven as the normal group [121]. However, more recent studies have noted that increased
automobile accidents in OSA sufferers may be restricted to cases with severe apnoea (AHI > 40) [56].
2.4 Measuring the sleep-wake continuum
2.4.1 Measuring sleepiness
Many attempts to measure sleepiness/alertness have been made, and several scales are currently in
use. Subjective measures like the Stanford Sleepiness Scale (SSS), with 7 statements of feelings of
sleepiness from “wide awake” to “cannot stay awake” [130]; the Visual Analogue Scale (VAS), that
uses 10cm-lines anchored between the extremes of the states or moods under study [5][130]; and the
Activation-Deactivation Adjective Check List (ADACL), which consists of a series of adjectives describing
feelings at the moment and a four point scale – definitely feel, feel slightly, cannot decide and definitely
do not feel – have been used in a wide range of vigilance studies [130], in parallel with more objective
measures that provide means of verifying the subjective feelings of loss of alertness [5].
Several tests have been developed to provide an objective, repeatable quantification of sleepiness like
the multiple sleep latency test (MSLT) [27][140], that places the subject in a sleep promoting situation
and measures the latency to onset of sleep. In the MSLT, subjects in a sleep-promoting environment are
instructed to try to fall asleep while other similar tests differ in the instructions, like the maintenance of
wakefulness test (MWT) [111][43], which instructs the subject to resist sleep.
Loss of alertness or sleepiness has been related to diminished response capability, in which a decrease in
performance will indicate the presence of this condition. Therefore, quantifiable behavioural responses
or performance measures have been used also as objective ways to measure vigilance. The most popular
ones are reaction time, tracking error and stimulus detection error. The use of vigilance tasks to measure
sleepiness has the problem that the tasks are intrusive with respect to the natural process of sleepiness
[130]. Task complexity and knowledge of results (feedback) can mitigate the effects of sleep loss [42].
Other non-task related factors that affect the process are motivation, distraction and comprehension of
instructions [130][35][42].
Physiological measures
As a result of the degree of isomorphism between physiological and behavioural systems, diminished
response capabilities associated with sleepiness will be reflected in distinctive variations in physiological
measures. Below is a list of some of the changes in physiological variables associated with sleepiness
[130][118]:
• slower, more periodic breathing,
• decrease in cardiovascular activity (heart rate, blood pressure),
• decreased eye blinks and increased slow eye movements,
• decreased but variable skin conductance responses,
• decreased body temperature,
• electroencephalogram (EEG) changes in amplitude, frequency and patterning.
2.4.2 Measuring sleep
Loomis and collaborators first showed in 1937 that sleep is not a uniform or steady state, and they
therefore classified sleep in stages [96]. Following this, sleep classification was further refined until, in
1968, a committee chaired by Rechtschaffen and Kales (R & K) compiled a set of rules that soon became
the standard in sleep staging [136]. From wakefulness or REM sleep to deep sleep, R & K analysis distinguishes
four intermediate stages for NREM sleep (see Fig. 2.2 for a typical all-night sleep classification plot which
is known as a hypnogram). Visual assessment of the subject is not enough for the characterisation of these
stages. Physiologically, the EEG, the electromyogram (EMG) and the electrooculogram (EOG) provide a
higher level of quantification in the description of the different sleep stages. Measures of sleepiness based
on the EEG and details of the sleep stages are given in chapter 3.
[Figure: hypnogram with sleep stage (Awake, REM, 1-4) on the vertical axis against hours of sleep (1-8) on the horizontal axis]
Figure 2.2: A conventional all night sleep classification plot from one normal subject
Chapter 3
Previous work on EEG monitoring for micro-arousals and day-time vigilance
As we have seen in chapter 2, many physiological processes change at the time of sleep onset. Monitoring
these changes provides means of detecting arousals during sleep for OSA diagnosis (see section 2.2.2)
and day-time sleepiness. However, the organ that shows the clearest changes during sleep and from
alertness to sleepiness is the brain. Not only is the brain the organ that contains the mechanisms for
sleeping and being awake, its electrical activity is relatively easy to monitor and reflects the changes in
the sleep/wake continuum [69].
3.1 The EEG
The electroencephalogram or EEG is a graphical record of the electrical activity of the brain which was
first measured non-invasively in humans and described in 1929 by Hans Berger1. It can be measured with
electrodes located near, on or within the cortex. Depending on the location of the recording electrodes
the EEG can be called scalp EEG, cortical EEG or depth EEG. The first one is recorded with electrodes
placed on the scalp while the last two refer to electrodes in contact with the brain cortex [145]. From
now on we will use EEG to mean scalp EEG.
1The first recording of the electrical activity of the brain was made by Caton in 1875 using rabbits, monkeys and other small animals [28]
3.1.1 Origin of the brain electrical activity
The human nervous system is responsible for taking the information from internal and external or en-
vironmental changes, analysing it and acting upon it in order to preserve the integrity, well-being, and
status quo of the organism. Its most prominent and important organ is the brain (see Fig. 2.1). The hu-
man brain contains approximately 10⁹ nerve cells or neurons interconnected in a very intricate network
within which the information is transmitted by electro-chemical impulses [51]. Most neurons consist of
a cell body, or soma, with several receiving processes, or dendrites; the soma prolongs into a nerve fibre,
or axon, that branches at the other end (see Fig. 3.1).
Figure 3.1: A simplified neuron
As in any other cell in the human body, there is an electrical potential difference between the inner and
the outer side of the neuron. This potential, called the resting potential, is due to differences in extracel-
lular and intracellular ion concentration, maintained by the cell membrane structure and ion pumping
mechanisms. Neurons can respond to stimuli strong enough to initiate a series of charge changes that
leads to membrane depolarisation and reverse polarisation, which reaches a peak and repolarises back
to the resting potential. This sudden activity resembles a spike in shape and is called an action potential.
Typically it has a peak-to-peak amplitude of 90 mV and a duration of 1 ms.
Neurons also interact with each other by chemical secretions in the dendrite-axon gaps (synapses) be-
tween them. The action potential in the pre-synaptic neuron (transmitting neuron) travels from the soma
along the axon. When it reaches the end it releases a chemical neurotransmitter at the axon terminals,
which are very close to the dendrites of other neurons. Then the post-synaptic neuron (receiving neuron)
receptors for this chemical release ions inside the cell that change the membrane polarisation, originating
a post-synaptic potential. Post-synaptic potentials are much lower in amplitude than the action potentials,
but they last much longer (15 - 200 ms or more) and the extracellular current flow associated with them
is much more widely distributed than that corresponding to action potentials. It has been estimated that
one neuron can influence up to 5000 of its neighbours. For these reasons it is believed that the EEG re-
flects the summation of post-synaptic potentials of the pyramidal cells rather than the spatial summation
of individual action potentials [135] [145] [125]. Pyramidal cells are neurons located very close and
perpendicularly to the cortex surface, so the ion current flow generates electrical potential changes that
are maximum in the plane parallel to the cortex [135].
If post-synaptic potentials coming from the dendrites of one neuron, summed in time and space, exceed
a certain threshold, the soma generates a new nerve impulse, an action potential, that is then transmit-
ted to the neurons at the end of its axon, passing in this way the stimulus response from one neuron to
another [51] [135]. Post-synaptic potentials can be of varied peak amplitude but, in general, a single one
is not enough to trigger the action potential [135]. Because of their chemical origin, the potentials gen-
erated in the brain are very limited in amplitude and the ionic currents travel slower (1ms per synapse)
than currents in metals. The axon membrane is not a perfect insulator, some extracellular current flows
and diffuses the information around the neuron, speeding up the signal transmission [135] [110]. The
cerebro-spinal fluid and the dura membrane act as strong attenuators for the EEG, with the scalp itself
having less effect. EEG waves seen at the scalp, therefore, represent a kind of a “spatial average” of
electrical activity from a limited area of the cortex [125].
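The summation-and-threshold mechanism described above can be illustrated with a toy sketch. This is not a physiological model from the thesis: the 15 mV threshold and the PSP amplitudes are made-up illustrative numbers, chosen only to show that a single post-synaptic potential is too small to fire the soma while several summed together can be.

```python
# Toy sketch of spatial/temporal summation at the soma: an action potential
# fires only if the summed post-synaptic potentials (PSPs) reach a threshold.

def soma_fires(psp_amplitudes_mv, threshold_mv=15.0):
    """Fire iff the summed PSPs reach threshold (all values illustrative)."""
    return sum(psp_amplitudes_mv) >= threshold_mv

print(soma_fires([3.0]))                  # → False: one PSP cannot trigger a spike
print(soma_fires([3.0, 4.0, 5.0, 4.5]))  # → True: summed inputs cross the threshold
```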
3.1.2 Description of the EEG
The EEG is a very complex quasi-rhythmical spatio-temporal signal within a time-frequency band of 0.1
- 100 Hz and an amplitude of the order of hundreds of microvolts at the scalp [135]. The effective
frequency range is 0.5 - 50 Hz and is divided for clinical purposes into the following main bands in which
the power of the signal is concentrated [87][69][24]:
1. Delta (δ) activity: [0.5 - 3.5] Hz2
2δ rhythm is limited to the [0.5 - 2) Hz range in sleep studies
2. Theta (θ) activity: [4 - 8) Hz
3. Alpha (α) activity: [8 -13] Hz
4. Beta (β) activity: [15 - 25] Hz
5. Gamma (γ) activity: [30-50] Hz
with “]” meaning “inclusive” and “)” meaning “exclusive”.
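The band convention above, including the inclusive/exclusive boundaries and the gaps between some bands, can be encoded directly. The sketch below is illustrative (names and structure are not from the thesis); frequencies falling in an inter-band gap return None.

```python
# Clinical EEG bands listed above: (name, low edge, high edge, high-inclusive).
# "]" = inclusive, ")" = exclusive; low edges are all inclusive.
BANDS = [
    ("delta", 0.5, 3.5, True),    # [0.5 - 3.5] Hz
    ("theta", 4.0, 8.0, False),   # [4 - 8) Hz
    ("alpha", 8.0, 13.0, True),   # [8 - 13] Hz
    ("beta", 15.0, 25.0, True),   # [15 - 25] Hz
    ("gamma", 30.0, 50.0, True),  # [30 - 50] Hz
]

def eeg_band(freq_hz):
    """Return the band name for a frequency, or None in a gap/outside the range."""
    for name, lo, hi, hi_inclusive in BANDS:
        if lo <= freq_hz and (freq_hz <= hi if hi_inclusive else freq_hz < hi):
            return name
    return None

print(eeg_band(10.0))  # → alpha
print(eeg_band(8.0))   # → alpha (8 Hz is excluded from theta but included in alpha)
print(eeg_band(14.0))  # → None  (gap between the alpha and beta bands)
```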
EEG records are sometimes described as just “slow” or “fast” if the dominant frequency is below or above
the α band. The amplitude of the waves tends to drop as the frequency increases. Although there are
indications of several sources of rhythmical activity in the brain, their role in the generation of the EEG
rhythms is not yet fully understood [145] [125]. Clear oscillatory behaviours in the nervous system occur
in various situations, like in rhythmic motor functions (chewing, swimming) as well as in pathological
conditions (clonic muscular jerking, rhythmic eye blinks), but most of them serve unknown functions.
Some may be related to biological clocks or establishing windows of time during which information flows
[125]. The bands described above correspond to the main frequencies of these physiological pacemakers.
These frequencies do not tend to overlap with the frequency content of the neighbouring bands, hence
the gaps between some of the bands.
It has been suggested that the distributed, but related, cortical γ activity in the forebrain provides the
physiological basis for focused attention that links input to output, i.e. relating voluntary effort and/or
sensory input to a calling up and operation of a sequence of movements or thoughts. This form of
attention occurs normally during wakefulness, but can also be present during disordered sleep, in patients
who talk or walk during sleep [24]. Activity over 50 Hz is not considered of clinical value in scalp EEG
because it is mostly masked by background noise. Apart from the background rhythmical activity, there
are other components in the EEG of transient nature, usually described in terms of their duration and
waveform. For instance, a monophasic wave of less than 80ms duration is called a spike, while one of
80-200ms is called a sharp wave. Other transient forms are the spindles and K-complexes (see Fig. 3.4 later
in this chapter). All EEG components fluctuate spontaneously, in response to stimuli, or as a consequence
of changes in the subject’s state of mind (i.e. sleep/wake control and psychoaffective status) and brain
metabolic status. They can also be changed by the use of drugs or by traumas or pathological conditions
[87] [85].
EEG patterns are different from one individual to another. Factors like gender, early stimuli, minor or
major brain damage, etc. can affect the development of the EEG. Once a subject reaches adulthood, their
EEG characteristics “stabilize” over time. This means that the EEG patterns for different conditions such
as eyes open, eyes closed, auditory stimulation and task performance remain remarkably similar for the
same individual as their age increases [85].
3.1.3 Recording the EEG
The EEG is recorded by amplifying the potential differences between two electrodes located on the scalp.
An electrode is a liquid-metal junction used to make the connection between the conducting fluid of the
tissue in which the electrical activity is generated and the input circuit of the amplifier [135]. The most
commonly used system for the placement of the electrodes is the so-called “10-20 International System
of Electrode Placement”, [76], represented in Figure 3.2. An orderly array of EEG channels constitutes
a montage. When all the channels are referenced to the same electrode (usually the mastoid processes, A1
for the left side of the scalp and A2 for the right side, or a common site located at the nose
or at the chin) the montage is called “referential”. If all the channels represent the difference potential
between two consecutive electrodes on the scalp, the montage is said to be “bipolar” [145].
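The arithmetic relating the two montage types can be sketched briefly. A bipolar channel is the difference between two neighbouring electrodes, so the common reference cancels: (F3 − ref) − (C3 − ref) = F3 − C3. The electrode chain and sample values below are illustrative, not data from the thesis.

```python
# Deriving a bipolar montage from referential recordings.
chain = ["Fp1", "F3", "C3", "P3", "O1"]  # illustrative left parasagittal chain

# Toy referential recordings: a few samples per electrode vs a common reference.
referential = {
    "Fp1": [10.0, 12.0, 11.0],
    "F3":  [8.0, 9.0, 7.5],
    "C3":  [5.0, 6.0, 5.5],
    "P3":  [3.0, 2.5, 3.5],
    "O1":  [1.0, 1.5, 0.5],
}

def bipolar_montage(referential, chain):
    """Difference between consecutive electrodes; the common reference cancels."""
    return {
        f"{a}-{b}": [x - y for x, y in zip(referential[a], referential[b])]
        for a, b in zip(chain, chain[1:])
    }

bipolar = bipolar_montage(referential, chain)
print(list(bipolar))     # → ['Fp1-F3', 'F3-C3', 'C3-P3', 'P3-O1']
print(bipolar["F3-C3"])  # → [3.0, 3.0, 2.0]
```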
The EEG signal is traditionally recorded on paper, or, more commonly now, electronically. It is subse-
quently analysed in order to extract useful information about the physiology or pathology of the brain.
This analysis is usually done by an expert by visual inspection of the signal.
3.1.4 Extracerebral potentials
In addition to the EEG, the scalp electrodes can also pick up other signals whose sources are not in the
brain, but are near or strong enough to interfere with its electrical activity. These signals can totally
[Figure 3.2 shows the electrode positions viewed from above, with the nasion at the front and the inion at the back. The labels denote: Fp1,2 pre-frontal; F3,4 frontal; F7,8 anterior temporal; Fz frontal mid-line; C3,4 central; Cz central vertex; T3,4 mid-temporal; T5,6 posterior temporal; P3,4 parietal; Pz parietal mid-line; O1,2 occipital; A1,2 mastoid.]
Figure 3.2: The 10-20 International System of Electrode Placement
obscure the EEG, making the recording uninterpretable. They can subtly mimic normal EEG activity or
distort normal activity, leading to misinterpretation [23]. Although called artefacts (or artifacts) they do
not always come from man-made devices. The main sources of artefacts are:
1. The recording instrument
2. The interface between the recording instruments and the scalp
3. Extraneous environmental sources
4. Other bio-electrical signals that do not originate from the brain and are not of interest in this
context, and can therefore be considered to be unwanted influences.
Muscle and heart activity as well as eye and tongue movements are among the bio-electrical signals
which, in this context, are considered artefacts because they obscure the EEG. They are classified as:
1. Electrocardiographic (ECG) signals and signals due to breathing
2. Electrooculographic (EOG) signals (signals due to eye movement)
3. Glossokinetic signals (signals from the movement of the tongue)
4. Electromyographic (EMG) signals (signals induced by muscle activity)
5. Electrodermal signals due to altered tissue impedance (see above).
Usually the above influences appear within the EEG frequency range and consequently cannot be elim-
inated by filtering. If the interference renders the EEG useless, then the affected sections of EEG are
ignored, unless the presence of the interfering signal gives important information about the brain status
as is sometimes the case in visual scoring to determine alertness or in visual sleep staging.
Artefacts during sleep
Blink artefacts occur only during wakefulness, or in combination with slow eye movements during
drowsiness. Rapid eye movements (REM) are seen during waking but are characteristic of the “dreaming”
sleep stage that was named after them. For EEG recorded with the reference electrode positioned on
the opposite side of the body, vertical eye movements affect mostly the frontopolar sites (Fp1 and Fp2
electrodes), with an exponential decrease of the effect towards the occipital sites, while, for horizontal
eye movements, the maximum effect is found at the frontotemporal sites. EMG artefacts are uniformly
distributed within REM sleep, but are concentrated at the beginning and the end of non-REM sleep
periods. As expected, the deeper the sleep stage the lower the EMG activity, although REM sleep is
marked by skeletal muscle atonia. ECG artefacts may or may not be present during sleep as they do not
depend on the non-REM sleep stage. Phasic electrodermal artefacts can occur upon sudden arousal from
light sleep stages. Chest movements due to respiration may induce head movements that compress some
of the electrodes against the pillow, resulting in slow potential shifts in them [9].
The best way of dealing with artefacts is avoiding or minimising their occurrence during the recording
[23] [9]. When this is not possible (e.g. after the recording) other alternatives like digital filtering may be
applied. However, digital low-pass filtering for reducing muscle and mains artefacts, or high-pass filtering
for reducing sweating and respiration artefacts may severely distort both the EEG and the artefact signal.
EMG artefacts may resemble cerebral activity after filtering (mostly β activity, but also epileptic spikes and
rhythmic α activity). The last alternative is to reject EEG segments contaminated with artefacts [9]. This
is performed in most sleep laboratories by visual inspection, but some automatic detection can also be
performed, like out-of-range checks, and lately using some more sophisticated methods of identification
based on artefact-free models.
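As an illustration of the out-of-range checks mentioned above, a minimal sketch follows; the 300 μV limit and 1 s epoch length are hypothetical choices, and a real system would tune both to the recording.

```python
import numpy as np

def reject_out_of_range(eeg, fs, epoch_s=1.0, limit_uv=300.0):
    """Flag epochs whose peak absolute amplitude exceeds a plausibility
    limit (in microvolts). Returns True for epochs to keep."""
    n = int(epoch_s * fs)
    n_epochs = len(eeg) // n
    keep = np.empty(n_epochs, dtype=bool)
    for i in range(n_epochs):
        keep[i] = np.max(np.abs(eeg[i * n:(i + 1) * n])) < limit_uv
    return keep

np.random.seed(0)
fs = 100
eeg = 50.0 * np.random.randn(3 * fs)   # toy background EEG, ~50 uV rms
eeg[150] = 1000.0                      # electrode-pop-like spike in epoch 1
keep = reject_out_of_range(eeg, fs)    # epoch 1 is rejected
```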
3.2 Analysis of the EEG during sleep
The R & K [136] technique for sleep scoring has become the gold standard throughout the world since
its publication in 1968. The scoring is based on the recording of several physiological signals, called the
polysomnograph (PSG). Typically a PSG record consists of 5 to 11 signals, including two EEG channels, one
mentalis-submentalis (chin) EMG channel, one or two EOG channels and one ECG channel (see Fig. 3.3). In
a clinical study to detect sleep related breathing disorders, special transducers are used to include nasal-
oral airflow, respiratory effort recorded both at the level of the chest and the abdomen, and oximetry
(oxygen saturation levels). When the number of channels is restricted to one, one of the EEG channels
C4 − A1, C3 − A2 or Cz − Oz is recommended for a single-channel EEG recording. Paper or magnetic
tape used to be the outputs of a PSG device, but are nowadays replaced by digital storage and display of
the digital PSGs [84].
Figure 3.3: Conventional electrode positions for monitoring sleep (EMG, EEG electrode C4, right and left EOG electrodes, with separate reference electrodes for the EEG and for the EOGs).
A description of the R & K rules for sleep scoring is given briefly in Table 3.1 and Fig. 3.4, and in more
detail in the following section.
3.2.1 Changes in the EEG from alert wakefulness to deep sleep
During wakefulness, the EEG waves of an adult show a low-amplitude, high-frequency and apparently random
characteristic, generally contaminated with muscular activity from the temporal or other skeletal muscles.
When the subject closes his eyes and relaxes or when he becomes drowsy, there is usually a reduction
in any muscle and eye movement potentials, plus an increase in the EEG α activity. The slow (< 1 Hz)
rolling of the eyes upwards and the shutting and opening of the eyelids a few times are also signs of
drowsiness [69].
As the subject becomes more drowsy, the α rhythm may be interrupted by periods of relatively low voltage
during which slow lateral eye movements often occur. The slightest stimulus during these periods of
low voltage EEG activity will cause immediate reappearance of the α rhythm. Note that this indicates
not drowsiness but an increase in alertness, for which reason this is called a paradoxical α response.
Alternating periods of low voltage activity and of higher voltage α activity occur for a few minutes,
with the duration of the former progressively increasing until the latter no longer appears, along with a
progressive increase in θ activity indicating that the subject is lightly asleep. Stimuli insufficient to cause
arousal may be strong enough to produce an electronegative sharp wave at the top of the head or vertex
(V-wave). This is defined by R & K as Sleep Stage 1 [87].
Stage 2 is characterised by the appearance of sleep spindles, short (0.5-3s) bursts of 12-14Hz activity con-
sisting of approximately 6-25 complete waves, as well as K-complexes. A K-complex is a large amplitude
biphasic wave of approximately one second duration, maximal at the vertex. K-complexes can have two
different origins. One is as a response to an external stimulus (e.g. a noise) and the other is as an early
manifestation of the slow waves typical of deeper sleep stages [155].
The other two stages, Sleep Stage 3 and Sleep Stage 4, are well distinguished from the rest by the appear-
ance of high amplitude (≥ 75μVpp) δ waves. This feature gives to these stages the name of slow wave
sleep (SWS) or δ sleep. The difference between the two stages lies in the percentage of this slow pattern
in the analysed segment: from 20% to 50% for stage 3 and greater than 50% for stage 4 [155].
Figure 3.4: Sleep EEG stages (taken from [69])
Figure 3.4 shows typical EEG segments for each sleep stage.
So-called REM sleep can be divided into three phases. The first phase is characterised by the decrease
or even total disappearance of the EMG activity, which has already experienced a decline in going from
wakefulness to deep NREM sleep. After a few minutes the slow waves, spindles and K-complexes in the
EEG are replaced by rapid, low-amplitude waves, as in wakefulness or in the first sleep stage, with the
exception that the α mode does not dominate the EEG. This is the second phase, which only lasts a few
minutes, giving way to the third phase that comes with a burst of rapid eye movements, spikes in EMG
and sometimes visible twitching of the limbs. When REM sleep has a high density of these bursts of eye
movements it is known as phasic REM, while a low density type of REM sleep has received the name of
tonic REM. Tonic REM typically occurs at the beginning of the night whilst phasic REM is usually found
late at night. REM and non-REM periods alternate on a 90-minute cycle through the night, although the
duration of REM increases across the night.
Sleep stage     Characteristics
Wakefulness     Low amplitude, high frequency EEG activity (β and α activity); EEG sometimes with EMG artefact
Sleep stage 1   Increased θ activity; slow eye movements (SEM); vertex sharp waves; transition stage that lasts only a few minutes
Sleep stage 2   EEG presents spindles (bursts of α activity) and K-complexes
Sleep stage 3   EEG with high amplitude, low frequency activity; δ activity appears
Sleep stage 4   EEG is dominated by δ activity
REM sleep       EEG presents high frequency, low amplitude waves; EMG generally inhibited; bursts of rapid eye movements (REM) also appear together with spikes in the EMG

Table 3.1: The Rechtschaffen and Kales standard for sleep scoring.
The hypnogram shown in Fig. 2.2, section 2.4.2, illustrates the transitions between sleep stages, the main
features of the NREM-REM cycle, and the proportion of each stage found in a young adult. Sleep stage 1,
being a transitional stage between wakefulness or drowsiness and true sleep (stage 2 or deeper), usually
occupies only 5% of the night. The bulk of human sleep, around 45% of it, is made up of stage 2. Stage
3, another transitional phase, constitutes only about 7% of the sleep, while stage 4 makes up about 13%.
The rest of the total sleep time (20%–30%) is taken up by REM sleep [69].
3.2.2 Visual scoring method
The transition from fully alert wakefulness to deep sleep is a gradual process and it would be very difficult
to determine what the level of sleep is at any moment without dividing the PSG record into epochs of
duration which may be anything from 10s to 2min. The standardised use of 15mm/s and 10mm/s as the
PSG paper speed made the use of 20s or 30s epochs quite convenient, as each epoch is then one page
long. The scorer uses the R & K set of rules to determine the sleep stage per epoch, regardless of the
level of sleep in the previous record or in subsequent records. Eight hours of sleep produce about 400m
of paper. If the record is segmented in 20-30s epochs, this gives approximately 1000-1400 epochs to be
scored visually, which takes an experienced technician over 2 hours, or more if the record has transient
pathological events [155].
Limitations of the R & K scoring rules
In spite of being widely used, the R & K rules have never been appropriately validated, and they were
never designed for scoring pathological sleep [66]. They suffer from major limitations such as:
• the 6-value discrete scale to represent a process that is essentially continuous,
• the 20-30s time scale offering a very poor resolution, so that transient events shorter than this are
missed,
• the bias introduced due to the failure to address non-sleep related individual variability of charac-
teristics such as α rhythm,
• the failure to address important sleep/wake related physiological processes such as respiratory and
cardiovascular processes and corresponding disorders.
The rules therefore have to be adapted and extended. In the 30 years since their publication, the methods
of analysis have changed with the advent of the personal computer. The task group on Signal Analysis of the European
Community Concerted Action “Methodology for the analysis of the sleep-wakefulness continuum” [12],
generated guidelines for a computer-based sleep analyser that would overcome the limitations of the
manual standard scoring and the R & K standard set of rules. They proposed a 1s time resolution, and
the tracking of the NREM sleep/wake process along a continuous scale with the 0% level corresponding to
wakefulness and the 100% level corresponding to the deepest SWS, as well as an on/off output indicating
REM sleep. They felt that quantification of REM/NREM should be based only on EEG, EOG and chin-
EMG in order to avoid bias between the inter-individual and intra-individual non-sleep-related differences
such as α rhythm, vertex and sawtooth3 waves and slow eye movements. They also considered additional
outputs to complement the REM/NREM sleep/wake process, such as a micro-arousal on/off output.
3.2.3 Computerised analysis of the sleep EEG
The need for automatic classification has been widely recognised [12]. Attempts to classify sleep EEG
automatically were made soon after the release of the R & K rules for manual scoring [150]. Different
approaches have been developed, most of them trying to emulate the R & K standard, with or without
overcoming its limitations. Many of them include the analysis of several PSG signals, like the EEG, EOG
and EMG [150] [151] [55] [133] [86] [152] [57] [132] [143] [68], and cardiorespiratory signals [54]
[92].
A computerised classification system is usually fed with artefact-free PSG signal segments of fixed or
adaptive length (typical range 1s-30s), or has an artefact marking/rejection procedure prior to the anal-
ysis block. Sometimes the EEG is the only input signal used [94] [72] [75] [78] [147] [160] [123] [16].
A number of features are then extracted using one or more of several methods (time domain, frequency
domain, non-linear dynamics, etc). A classification block combines the features and estimates the sleep
level using a set of rules (decision tree, linear discriminant, fuzzy logic), or an equivalent procedure
(neural networks).
Most of the approaches to classification in the past have used either period analysis (time domain) [22]
[94] [149] [133] [67] [13] [92] [48] [68] [167] or spectral analysis (frequency domain) [164] [57]
[132] [143] [78] [99] [16] [158]. Alternatively other techniques have been introduced, such as wavelet
analysis, autoregressive (AR) modelling [55] [75] [72] [86] [74] [123], principal component analysis
(PCA) [78] and more recently, nonlinear dynamic analysis [2] [52] [137]. The parameters, or features,
obtained have been combined in many different ways to yield classification. One of the most used is
the knowledge-based approach [150] [94] [55] [72] [132] [57] [92] [68]. A Markov-chain maximum
3 Sawtooth waves (or λ-like waves), random electropositive waves of 20μV amplitude or less, normally related to visual activity while scanning a picture, are sometimes seen while the subject is in REM/light sleep.
likelihood model was developed in 1987 by Kemp and collaborators [86], and cluster analysis has also
been investigated [75], with neural network techniques joining the list in the last decade [13] [143]
[147] [160] [123] [16] [158].
Most of these systems show a reasonable discrimination for sleep stage 2 and slow-wave sleep, but all of
them are poor at discriminating REM from wake and stage 1. Holzmann et al. found a high percentage
of disagreement in light sleep scoring when experts revisited their own scoring (intra-rater) [68]. Many authors
use EOG and/or EMG to help with the identification of REM [126]. The percentage of agreement with visual
scorers varies between 67% and 90% with a typical value of 83% in artefact-free segments. Some studies
aggregate sleep stages 3 and 4 together and that elevates the percentage of agreement to over 90%
[72] [143]. Another problem, probably inherited from the R & K set of rules that most of the systems
try to emulate, is that classification systems work almost perfectly in healthy subjects but do not work
sufficiently well in sleep-disturbed patients [152] [155] [126]. If an automated system is able to give the
same level of intra-rater and inter-rater agreements as the clinical experts manage (usually about 86% for
the inter-rater agreement, and 91% for intra-rater agreement) then it can be said to be of use for clinical
purposes. Although several commercially available systems can perform sleep staging, visual scoring is
the only reliable method available at the moment when scoring disrupted sleep EEG [112]. It is clear
that computerised analysis cannot fully replace expert opinion, therefore results of an automatic scoring
system require inspection by a trained polysomnographer [126].
Time-domain analysis
Visual analysis of the EEG is based on the identification of patterns and the assessment of mean amplitude
and dominant frequency. Therefore, time-domain measures of the EEG have been among numerous
features used in computerised analysis. Zero-crossing and maximum peak-to-peak amplitude are the
most popular of this kind of descriptors [94] [133]. The zero-crossing count is the number of times
that the signal crosses the base-line and is related to the EEG mean frequency. In 1973, Hjorth [67]
presented several EEG descriptors calculated directly from the time series as an alternative to frequency
analysis. The descriptors are calculated from the derivatives of the EEG signal and have a correspondence
to spectral descriptors. Hjorth descriptors, which measure the standard deviation of the signal (activity),
and ratios of the standard deviations of the signal and its first two derivatives (mobility and complexity)
have often been used in the analysis of the sleep EEG [92] [47]. The signal is band-pass filtered prior
to the calculation of time-domain descriptors which are then used to detect particular types of sleep
patterns [133]. Bankman et al. [13] added measures of slope to the already mentioned time-domain
features for K-complex detection. More recently, Uchida and co-workers [168] have investigated the use
of histogram methods of waveform recognition in sleep EEG. The method, which measures the period
and amplitude of a wave, has the advantage of detecting the frequency, amplitude and duration of single
and superimposed waves.
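Hjorth's descriptors can be computed directly from the sampled signal. A minimal sketch, using the common variance-based formulation with derivatives approximated by first differences:

```python
import numpy as np

def hjorth(x):
    """Hjorth time-domain descriptors of a 1-D signal.
    activity   : variance of the signal
    mobility   : std(dx)/std(x), an estimate of mean frequency
    complexity : mobility(dx)/mobility(x), a measure of bandwidth"""
    dx = np.diff(x)
    ddx = np.diff(dx)
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / np.var(x))
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity

fs = 128
t = np.arange(4 * fs) / fs
slow = np.sin(2 * np.pi * 2 * t)    # delta-range tone
fast = np.sin(2 * np.pi * 12 * t)   # spindle-range tone
# mobility tracks the dominant frequency: higher for the faster tone
```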
Uchida and collaborators used period-amplitude analysis of the sleep EEG and compared their analysis
with spectral methods. They found that for some frequency bands the time-domain method does not
detect the waves while the Fourier transform methods perform very well over the entire EEG frequency
range [167]. In contrast, Holzmann et al. found that a zero-crossing strategy gives better results for
slow-wave (< 2Hz and > 75μVpp) detection than Fourier transform methods [68].
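The zero-crossing strategy for slow-wave detection can be sketched as follows. This is a generic illustration of the idea, with half-wave duration and amplitude thresholds matching the < 2Hz, > 75μVpp criterion above, not Holzmann et al.'s actual algorithm; the peak-to-peak excursion is approximated as twice the half-wave peak.

```python
import numpy as np

def slow_waves_zero_crossing(eeg, fs, f_max=2.0, min_pp=75.0):
    """Detect slow waves from the intervals between zero crossings.
    A half-wave counts if it is long enough (frequency below f_max Hz)
    and its approximate peak-to-peak amplitude (twice the half-wave
    peak) exceeds min_pp microvolts. Returns (start, end) index pairs."""
    zc = np.where(np.diff(np.signbit(eeg)))[0]       # zero-crossing indices
    waves = []
    for a, b in zip(zc[:-1], zc[1:]):
        duration = (b - a) / fs                      # half-wave duration, s
        peak = np.max(np.abs(eeg[a:b + 1]))
        if duration >= 1.0 / (2 * f_max) and 2 * peak >= min_pp:
            waves.append((a, b))
    return waves

fs = 100
t = np.arange(2 * fs) / fs
delta = 60.0 * np.sin(2 * np.pi * 1 * t)   # 1 Hz, 120 uVpp: qualifies
waves = slow_waves_zero_crossing(delta, fs)
```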
Frequency-domain analysis (bank of filters, FFT, AR modelling)
Given that the sleep process involves gradual shifts in the EEG dominant frequency (see section 3.2.1),
the power spectrum of the signal conveys useful information. There are several approaches to estimate
the power spectrum of a stationary signal ([63] pp.147-8). Although the EEG is non-stationary, it may
be considered piece-wise stationary (for a detailed description of this issue, see chapter 4). The most
popular methods for estimating the power spectrum are the Fourier transform and AR modelling. A bank
of band-pass filters is another popular approach to obtain the power of the EEG frequency bands (see
section 3.1.2) [150].
The Fourier transform has been in use for nearly six decades in the spectral analysis of sleep EEG [89].
Its use increased dramatically in the 60’s with the development of the Fast Fourier Transform algorithm
(FFT) [34], which speeds the calculation up by a factor of N²/(N log N), with optimum performance when N is a
power of two4. It has a disadvantage when N is small, as the variance of the spectrum estimate is high
for low values of N , but this can be improved by the use of smoothing windows and averaging. Another
disadvantage is that the Fourier transform is only calculated for discrete values of frequency, multiples
of fs/N , where fs is the sampling frequency. Features can be taken directly from the power density
spectrum, as coefficients for frequencies with the highest variance in the sleep continuum [158], or as
peak-frequencies, but the practical norm is to calculate the power (absolute or relative) accumulated in
the EEG bands [57] [132] [143] [99] [16]. PCA has also been applied to the spectrum coefficients to
find out which are the most significant [78].
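The practical norm described above, accumulating power in the classical EEG bands, can be sketched with a simple periodogram. The band edges follow the conventional δ/θ/α/β ranges; exact limits vary between laboratories.

```python
import numpy as np

BANDS = {"delta": (0.5, 4.0), "theta": (4.0, 8.0),
         "alpha": (8.0, 13.0), "beta": (13.0, 30.0)}

def band_powers(eeg, fs):
    """Relative power per EEG band from a simple periodogram."""
    x = eeg - np.mean(eeg)
    psd = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    total = np.sum(psd[(freqs >= 0.5) & (freqs < 30.0)])
    return {name: np.sum(psd[(freqs >= lo) & (freqs < hi)]) / total
            for name, (lo, hi) in BANDS.items()}

fs = 128
t = np.arange(4 * fs) / fs
alpha_eeg = np.sin(2 * np.pi * 10 * t)   # dominant 10 Hz component
powers = band_powers(alpha_eeg, fs)
# powers["alpha"] is close to 1 for this pure 10 Hz tone
```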
AR modelling offers a more interesting alternative to FFT methods for power density spectrum estimation.
It yields a lower variance estimate if the model order is kept low, and is continuous in frequency. It
combines the versatility of picking up broad band signals and pure tones with relatively high accuracy,
which makes it suitable for the analysis of the EEG, a signal that may present bursts of waves as well
as background activity. Its relatively high computational complexity is no longer a problem with the
current state of the art in computing technology. Features can be extracted from the power spectrum estimate as
relative or absolute powers in EEG bands [55] [72], or directly from the model parameters [75] [152]
[147] [160] [123] [137]. Smoothing is usually applied to the coefficients to get a better estimate when
the number of samples has to be kept low (stationarity requirement). Chapter 4 will cover this method
in detail.
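A minimal sketch of AR feature extraction, solving the Yule-Walker equations by the Levinson-Durbin recursion. The order-2 model and toy data here are purely illustrative; the choice of model order and window length for the sleep EEG is the subject of chapter 4.

```python
import numpy as np

def yule_walker(x, order):
    """AR prediction coefficients a[1..p] and residual variance for the
    model x[n] = a1*x[n-1] + ... + ap*x[n-p] + e[n], via the biased
    autocorrelation estimate and the Levinson-Durbin recursion."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    r = np.array([x[:n - k] @ x[k:] for k in range(order + 1)]) / n
    a = np.zeros(0)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - a @ r[1:i][::-1]) / err   # reflection coefficient
        a = np.concatenate([a - k * a[::-1], [k]])
        err *= 1.0 - k * k                    # prediction error update
    return a, err

# Toy AR(2) process with known coefficients (1.2, -0.6)
np.random.seed(0)
N = 2100
x = np.zeros(N)
e = np.random.randn(N)
for i in range(2, N):
    x[i] = 1.2 * x[i - 1] - 0.6 * x[i - 2] + e[i]
a, err = yule_walker(x[100:], 2)   # a is close to [1.2, -0.6]
```

The estimated coefficients (or powers read off the AR spectrum they define) can then serve directly as the per-window features mentioned in the text.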
Non-linear analysis
Although some evidence has been found that mathematically the EEG signal resembles much more a
stochastic process with changing conditions than a non-linear deterministic process with a chaotic attrac-
tor [2], chaos theory offers ways to determine signal complexity.
Shaw et al. used an algorithmic complexity measure as an index of cortical function in rats [146]. Rezek
4 N is the number of signal samples.
and Roberts [137] compared four stochastic complexity measures for the EEG, namely AR model order,
spectral entropy, approximate entropy and fractional spectral radius, obtaining best results with the last
one when attempting to detect disturbed sleep with the central EEG channel.
Fell et al. found that non-linear measures discriminate better between sleep stages 1 and 2, while spectral
measures do so with sleep stage 2 and SWS. None of the investigated measures were able to discriminate
between REM sleep and sleep stage 1. The measures were relative δ power, spectral edge, spectral
entropy and first spectral moment (spectral measures), and correlation dimension D2, largest Lyapunov
exponent L2 and approximate Kolmogorov entropy K2 (non-linear methods).
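Of the complexity measures listed, spectral entropy is the most straightforward to sketch: the Shannon entropy of the normalised power spectrum, low for rhythmic narrow-band EEG and high for broadband activity. This is a generic formulation, not necessarily Rezek and Roberts' exact implementation.

```python
import numpy as np

def spectral_entropy(x):
    """Normalised Shannon entropy of the power spectrum: near 0 for a
    pure tone, near 1 for a flat (white-noise-like) spectrum."""
    x = x - np.mean(x)
    psd = np.abs(np.fft.rfft(x)) ** 2
    psd = psd[1:]                        # drop the DC bin
    p = psd / np.sum(psd)
    p = p[p > 0]                         # avoid log(0)
    return -np.sum(p * np.log(p)) / np.log(len(psd))

fs = 128
t = np.arange(4 * fs) / fs
np.random.seed(1)
tone = np.sin(2 * np.pi * 10 * t)        # rhythmic, alpha-like segment
noise = np.random.randn(len(t))          # broadband segment
# spectral_entropy(tone) is much smaller than spectral_entropy(noise)
```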
Classification techniques
The set of features extracted from either frequency-domain or time-domain analysis or a mixture of
both are usually combined in a deterministic way to determine the R & K sleep stage. However, other
approaches involving self-learning classifiers have also been investigated. In 1981, Jansen and co-workers
used AR features and cluster analysis for sleep staging [75]. Later, Kemp et al. developed a model based
maximum likelihood classifier [86]. Kubat and collaborators presented in 1994 an artificial intelligence
approach with automatic induction of decision trees [92]. At the same time, several investigators have
used probability-based approaches such as Bayesian classifiers [72] and neural network classifiers, with
the introduction of a new approach in tracking the sleep continuum in the work of Pardey et al. [123].
Knowledge-based methods We have already pointed out that many of the attempts to perform automatic
sleep scoring emulate the visual scoring process of the R & K rules. As a result, numerous
knowledge-based classification systems have been developed. The implementation varies from hybrid
analog-digital logic arrays of “ANDs” and “ORs” [22] [150] and algorithmic “IF-ELSE” rules [132] to
fuzzy-logic systems [94] [55] [68]. Most of the systems extract additional information from the experts,
but heuristic approaches like the one developed by Smith and co-workers, who tried different adjustments
to increase the agreement with visual scoring [151], can be found in the literature.
Neural network methods Neural networks have been used in the detection of characteristic sleep
waves (i.e. spindles, K-complexes, etc) which can then be used to help automatic sleep staging. Shimada
et al. have trained a 3-layer neural network for the detection of these waves using a time-frequency 2D
array, consisting of 11 sets of 12 FFT coefficients in a 3.84s window [158]. Wu and co-workers developed
EEG artefact rejection by training a neural network to recognise the typical artefact patterns [73].
Neural networks have also been used to classify sleep according to the R & K scale. Baumgart-Schmitt
and collaborators [16] [15] used a mixture of experts to classify sleep using 31 power spectral features
and nine 3-layer neural networks, each one trained with data from a different healthy subject. They
obtained good discrimination of REM with respect to Wakefulness and Sleep Stage 1.
Previous work in the group in which the research described in this thesis was carried out [123]
used a neural network to track the dynamic development of sleep on a continuous scale from deep
sleep (stage 4) to wakefulness on a second-by-second basis. The neural network output has the ability
to pinpoint short-time events and more cyclic events like Cheyne-Stokes respiration using only one EEG
channel and AR features.
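The continuous-scale tracking approach can be illustrated in miniature: a small MLP with a sigmoid output, trained by gradient descent to map feature vectors to a value between 0 (deep sleep) and 1 (wakefulness). Everything below (architecture, data, learning rate) is a toy illustration, not the network used in this thesis.

```python
import numpy as np

np.random.seed(0)

# Synthetic dataset: 2-D "AR-like" features -> continuous vigilance level
X = np.random.randn(200, 2)
y = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - X[:, 1])))   # smooth target in (0, 1)

# One-hidden-layer MLP, sigmoid output, trained by batch gradient descent
W1 = 0.5 * np.random.randn(2, 8); b1 = np.zeros(8)
W2 = 0.5 * np.random.randn(8);    b2 = 0.0

def forward(X):
    h = np.tanh(X @ W1 + b1)                        # hidden layer
    out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))      # continuous output
    return h, out

mse_init = np.mean((forward(X)[1] - y) ** 2)
lr = 1.0
for _ in range(3000):
    h, out = forward(X)
    d_out = (out - y) * out * (1 - out) / len(X)    # MSE grad wrt pre-sigmoid (up to a factor of 2)
    d_h = np.outer(d_out, W2) * (1 - h ** 2)        # backprop through tanh
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum()
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)

mse_final = np.mean((forward(X)[1] - y) ** 2)       # well below mse_init
```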
3.3 Analysis of the EEG for the detection of micro-arousals
3.3.1 Cortical arousals
There are several types of arousals. Some sleep disturbances do not reach the brain cortex; these are
called “sub-cortical” or “autonomic” arousals, and they can be detected by monitoring the heart rate and
the beat-to-beat blood pressure, looking for an increase in the pulse rate and blood pressure along with
an increase in the respiratory effort, following an apnoea/hypopnoea event. There are also those which
arise in the brain cortex, the so-called “cortical” or EEG arousals, which as their name suggests, can be
detected by monitoring the EEG. Cortical arousals can have several causes, not all related to OSA, for
instance external noise, changes in light, snoring, leg movements, bowel disturbance, bladder distension
and gastroesophageal reflux to mention a few of them. Pain and some forms of insomnia can also be
causes of arousals, but the most common cause is OSA. Ageing is another strong factor in the tendency
to arousal [155].
During a PSG all the external variables like light and noise can be reasonably controlled, and monitoring
leg movements (by transducers located on the legs, or video recording) and snoring (by a microphone
taped on the neck) may help to discard those arousals produced by causes other than OSA [112].
Bennet and colleagues [18] found that detection of autonomic activation is as good as detecting cortical
arousal for predicting daytime sleepiness in OSA patients, but it does not convey any extra information.
As a rule patients with sleep disorders or excessive daytime sleepiness have normal electrophysiological
EEG characteristics both in frequency and amplitude [61]. The OSA sleep disorder does not alter the
physiology of sleep but has a pronounced effect on the sequence of states. Recent studies [123] [48]
have claimed that the EEG provides sufficient information to identify most micro-arousals. A cortical
arousal caused by an apnoeic/hypopnoeic event usually looks like an increase in the frequency in the
EEG (see Fig. 3.5). Note that we will discuss this in more detail in section 6.2.5.
Figure 3.5: Apnoeic event
R & K criteria for sleep scoring allow the scoring of arousals longer than 10s as well as the so-called
“movement arousals”, but the set of rules was not designed for the scoring of transient events (shorter
than 10s). If a 30s epoch has more than 15s of slow waves plus a short arousal, the epoch will be scored
as stage 3 or 4, ignoring the presence of the arousal. In this way, a night sleep record may look “normal”,
whereas the fact is that the subject has experienced hundreds of micro-arousals [155].
The American Sleep Disorders Association (ASDA) attempted to overcome the R & K deficiency in the
scoring of micro-arousals when it published a set of rules for EEG arousal scoring. The rules are indepen-
dent of the R & K criteria, and are summarised in the next section.
3.3.2 ASDA rules for cortical arousals
Technically, ASDA defined an arousal as an abrupt shift in EEG frequency, which may include θ, α and/or
frequencies greater than 16Hz but not spindles, subject to the following summary of rules and conditions
[11]:
1. A minimum of 10 continuous seconds of sleep in any stage must occur prior to an EEG arousal for
it to be scored as an arousal. That is a consequence of the first and second rules of the EEG arousal
scoring set of rules. The first one establishes that the subject must be asleep. The second one is
such as to prevent the scoring of two related arousals as independent arousals.
2. The minimum duration is 3 seconds. There is both a physiological basis and a methodological
reason for this choice: reliable scoring of events shorter than this is difficult to achieve visually.
3. To score an arousal in REM sleep there must be a concurrent increase in submental EMG.
4. Artefacts (including pen blocking or saturations), K-complexes or δ waves are not scored as arousals
unless accompanied by a frequency shift in another EEG channel. If they precede the frequency
shift, they are not included in the three seconds criterion. Indeed, δ wave bursts are not necessarily
related to arousals and as a result, more evidence should be used, for example, respiratory tracing.
5. To score 3 seconds of α sleep as an arousal, it must be preceded by 10 or more seconds of α-free
sleep.
6. Transitions from one stage to another must meet the criteria indicated above to be scored as an
arousal.
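Rules 1 and 2 above lend themselves to a simple run-length check. Given a per-second boolean sequence marking whether an abrupt EEG frequency shift is present, candidate arousals are runs of at least 3s preceded by at least 10s without a shift. The sketch below covers only these two timing rules; the remaining rules require additional channels (e.g. submental EMG for REM).

```python
def score_arousals(shift, min_len=3, min_gap=10):
    """Return (start, length) of candidate arousals from a per-second
    boolean list `shift` (True = abrupt EEG frequency shift present).
    Implements only the >=3 s duration and >=10 s preceding-sleep rules."""
    arousals = []
    i, n = 0, len(shift)
    while i < n:
        if shift[i]:
            j = i
            while j < n and shift[j]:
                j += 1
            run = j - i
            # require 10 s of shift-free sleep immediately before the run
            if run >= min_len and i >= min_gap and not any(shift[i - min_gap:i]):
                arousals.append((i, run))
            i = j
        else:
            i += 1
    return arousals

# 60 s of "sleep" with a 4 s shift at t=30 s and a 2 s shift at t=50 s
shift = [False] * 60
for s in range(30, 34):
    shift[s] = True
for s in range(50, 52):
    shift[s] = True
events = score_arousals(shift)   # only the 4 s event qualifies
```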
The time scale for arousal scoring is much shorter (changes of 3s or more) than the 20-30s for visual
sleep scoring; visual arousal scoring is therefore more time-consuming than visual sleep staging, and
also less accurate. In spite of the efforts of the ASDA, the scoring of micro-arousal events is still difficult
as inter-rater variability is very high, especially when the μ-arousal occurs during REM or light sleep
[46]. Townsend and Tarassenko [165] evaluated the agreement between three scorers of EEG micro-
arousals on an 11-patient database and found very little agreement (0-10%) over a mean of 70 arousals
per patient when counting the number of arousals scored. Indeed the figure got worse if the starting time
and duration of the arousals were also considered, as for some recordings none of the experts scored the
same event as an arousal.
3.3.3 Computerised micro-arousal scoring
As can be deduced from the above rules, the detection of arousals is not easy. Arousals may occur during
any sleep stage, and are particularly difficult to detect in REM sleep when the EEG is the only signal
used in the analysis. There is also some controversy concerning the comprehensiveness of the ASDA
rules. Townsend and Tarassenko [165] as well as Drinnan and co-workers [48] have questioned the
absence of a “gold-standard” definition for arousals. While the ASDA definition is in widespread use,
EEG changes which do not meet the criteria have been associated with daytime sleepiness [29]. Other
signals have been suggested as indices of arousals, like blood pressure [129] [39], but these indicators
correlate well with EEG arousal. Hypoxemia (reduced arterial oxygen saturation) has been found to
play a role in the capacity to stay awake, rather than in the propensity to fall asleep, while indices of
sleep disruption correlate with both [17]. Guilleminault et al. [58] evaluated the role of respiratory
disturbance, oxygen saturation, body mass and nocturnal sleep time in daytime sleepiness, but did not
find significant correlation between them. They concluded that the best predictor of the excessive daytime
sleepiness frequently found in OSA patients is the nocturnal PSG and the sleep structure abnormalities
found in the brain activity recording.
Stradling and collaborators [156] found that the relationship between the severity of the OSA measured
by a sleep study and the daytime sleepiness of the subject is poor. They suggested that the importance of
a micro-arousal is related to both its duration and the depth of sleep prior to the arousal. Accordingly, it
would be desirable to extract this information automatically from a computerised arousal scoring system.
3.3.4 Using physiological signals other than the EEG
Recently, Aguirre and co-workers [3] modelled blood oxygen saturation, heart rate and respiration signals
from a patient with OSA, using a nonlinear AR moving average model with exogenous inputs in which
the blood oxygen saturation is the output of the model and the other two signals are the inputs. They
successfully reconstructed the respiration signal from the other two, suggesting that the dynamics underlying
these signals are nonlinear and deterministic. However, while these signals are very well correlated with
each other, they do not bear a unique relationship to the changes in the EEG following an apnoeic event,
as was found by Townsend and Tarassenko [165], who investigated pulse transit time (a measure of
beat-to-beat blood pressure) and heart rate for micro-arousal detection. They found that increases in
heart rate and decreases in pulse transit time occur many times during the night at fairly regular intervals,
independently of the occurrence of micro-arousals.
Drinnan et al. [47] investigated the relation between movement or respiration signals (wrist movement,
ankle movement, left and right tibial electromyogram and phase change in ribcage-abdominal move-
ment) and cortical arousals. Their conclusions were that arousal was accompanied by movement only
on a minority of occasions; in some subjects, the number of movement events exceeded the number of
arousals, and some arousals were accompanied by more than one movement. This may explain the poor
relationship that they found between movement signals and arousals. Ribcage-abdominal phase was the
only index which showed a significant relation with cortical arousals, but despite the high correlation, in
some obese subjects the sensitivity and the positive predictive accuracy for phase were as poor as for the
other investigated signals, due to the loose coupling between the sensors used and the diaphragmatic motion.
Other subjects showed phase changes opposite to those expected. Macey and collaborators [100][101]
found similar results when using time-domain features and neural network methods to detect
apnoea events from the abdominal breathing signal in infants with central apnoeas.
3.3.5 Using the EEG in arousal detection
Drinnan and collaborators [48] investigated 10 possible indices of arousal using the EEG derivation
Cz/Oz. Two of the indices were related to amplitude (Hjorth’s activity and the CFM, the cerebral function
monitor) and eight to frequency (α power, zero crossing rate, δ crossing rate, i.e. the zero crossing rate
of the EEG’s first derivative, Hjorth’s mobility, frequency peak, frequency mean, frequency mean of the
CFM-filtered EEG and Hjorth’s complexity). Of all these, they found that three offered good discrimination
in terms of identifying arousals: the zero crossing rate, Hjorth’s mobility and the frequency mean.
Huupponen et al. [71] used a single channel of EEG and a neural network to detect arousals. Second-
by-second changes in the power of the EEG bands relative to the average power in a 30s segment were
chosen as input features for the neural network. Amongst the detected arousals, there were only 41%
true positives and a very high number of false positives. A recent study, performed by Di Carli and co-
workers [40] used two EEG channels and one EMG channel for the automatic detection of arousals in
a set of 11 patients with various pathologies, including OSA. They used wavelets to analyse the EEG
in the time-frequency domain, then measured the relative powers in the EEG bands and calculated the
ratios between these powers, computed for both short-term and long-term averages. These indices were
used along with the other measures from the EMG as the inputs to a linear discriminant function, whose
free parameters were set to maximise the sensitivity and selectivity for detecting arousals previously
scored by two experts. They made a distinction between “definite” arousals and “possible” arousals in
the visual scoring, and also post-analysed their results correlating the starting time and duration of the
micro-arousals detected by the computer with the ones detected by the experts. The automatic detection
yielded an average 57% agreement with the experts, while the agreement between experts reached 69%.
The percentage of man-machine agreement increased to approximately 75% when only definite arousals
were considered. Both Huupponen’s and Di Carli’s detectors took into account the context, mimicking
the visual scoring according to the ASDA rules.
3.4 Analysis of the EEG for vigilance monitoring
The study of the wake EEG is much more difficult than that of the sleep EEG as the signals are more
prone to artefacts and show subtle changes as the level of alertness of the individual varies. It is also
very difficult to validate the alertness measures derived from the EEG with others, like task performance,
because training and motivation play an important role in the ability of the subject to perform well while
drowsy.
3.4.1 Changes in the EEG from alertness to drowsiness
The fully awake, responsive state is associated in the EEG with the absence of any rhythmic activity.
The EEG has low amplitude and a random pattern. Also, multiple EMG artefacts are present. The
physiological explanation for this is that the responses to alerting stimuli are mediated by the ascending
reticular activating system of the brain stem [65] which also desynchronises the cortical activity [134].
As the individual relaxes, rhythmical activity appears, most commonly as α wave activity, the amplitude
of the EEG increases and the muscle activity diminishes. The α rhythm is almost always found in the
EEG of healthy, awake, unanaesthetised subjects. Its amplitude, however, is usually very low and it is
only picked up by recorders when it becomes strong as the person becomes drowsy or closes their eyes.
The relationship between the occurrence of an α wave and the brain status is intricate; most often, the α
rhythm appears in individuals relaxed and prone to sleepiness, i.e. drowsy. With further advance towards
drowsiness there is an α activity drop, the α sequences becoming less and less continuous, eventually
giving way to θ activity at the onset of sleep. θ activity is most commonly found in the 6-7 Hz band and
is stronger at the onset of drowsiness [85].
Spatially, the most important changes are in the amplitude of the α activity which occur predominantly
at the occipital sites, while an increase in the slow, mainly θ, activity is more diffuse [142]. EEG changes
do not appear until the subjective symptoms of sleepiness become manifest [5].
Slow eye movement (SEM) is probably the most sensitive variable for differentiating between
sleepiness and alertness [88][118][142][170]. However, in practice it is very difficult to score SEM
since blinks and rapid eye movements interfere [5]. An increase in motor activity (EMG) is shown in
subjects struggling against imminent drops in alertness [32].
The appearance of α rhythm does not necessarily indicate complete eye closure or blurred vision; sometimes
it may be associated with the perception of “being sleepy with open eyes” [88]. α rhythm is
particularly problematic for reasons that are not yet fully understood. Conradt and co-workers found
differences in “fast” α and “low” α activities and in reaction times, but these differences are very difficult
to detect [33]. The changes in α activity on a small time scale are somewhat different depending on
whether the eyes were initially open or closed [32].
Individual EEG differences
A particular problem for vigilance studies is the difference between individuals. Almost all the vigilance
studies using EEG report problems with a proportion of the subjects exhibiting abnormal EEG. Some
individuals are unable to maintain α activity for more than 30s with closed eyes, while others show much
α activity with eyes open even when at maximum alertness. Moreover, these “α-plus” subjects do not
experience the normal increase in α activity when losing alertness; instead, their α waves decrease with
sleepiness. Sometimes their α activity spreads into the θ band. These observations suggest the
need for individual calibration of sleepiness effects on the EEG [5]. This will be discussed in more detail
in section 10.5.
3.4.2 EEG analysis in vigilance studies
The central referential electrode montage C3-A2 is widely used to record the EEG in vigilance studies
[35] [61] [166] [106] as it is recommended by the standard manual for sleep stage scoring [136]. The
manual also recommends an epoch length of 15-30s, and this has also been adopted for alertness scoring
[38] [8] [157]. However, episodes of stage 1 or “micro-sleep” periods as brief as 1-10s have been
identified [142] [130] [126].
Alford et al. developed a sleepiness scale based entirely on PSG measures using 15s-epochs. The scale
has 6 waking categories and one sleep category (see table 3.2) [8]. This scale will be considered in more
detail in chapter 7 of this thesis. Given that vigilance stages may change within seconds, the EEG in
a 30s window is not stationary in terms of vigilance [127], and a correct statement can no longer be
made with respect to any information averaged over 30s [93]. Penzel and Petzold [127] scored the EEG
in variable length segments according to the patterning or rhythmicity and Varri et al. used an adaptive
segmentation algorithm for the EEG prior to visual scoring, resulting in segments of 0.5s to 2s [170].
With their technique, a 90 min vigilance test may consume an entire day of work for a technician (2-3
hours from preparation to the removal of the electrodes, plus 5 hours scoring the PSG) [126]. As wake
EEG is more complex than sleep EEG, the inter-rater agreement for vigilance EEG scoring is usually lower
(≈72% [61]) than in the sleep case (86% according to [126]).
Vigilance sub-category: Description
Active Wakefulness (Active): active/alert pattern; more than 2 eye movements per epoch; increased/definite body movement
Quiet Wakefulness Plus (QWP): active/alert pattern; more than 2 eye movements per epoch; average/possible/no body movements
Quiet Wakefulness (QW): alert pattern; less than 2 eye movements per epoch; average/reduced/definitely no body movements
Wakefulness with Intermittent α (WIα): definite burst of α rhythm for less than half of an epoch
Wakefulness with Continuous α (WCα): definite burst of α rhythm for more than half of an epoch
Wakefulness with Intermittent θ (WIθ): definite burst of θ rhythm for less than half of an epoch (plus α rhythm, if present)
Wakefulness with Continuous θ (WCθ): definite burst of θ rhythm for more than half of an epoch (stage 1 of sleep)
Table 3.2: The vigilance sub-categories and their definition
Spectral methods
Changes associated with sleepiness in the EEG are mainly in the patterns and rhythms of the signal.
Therefore it seems that the signal is better analysed in the frequency domain, either by power spectrum
estimation or by band-pass filtering, using the standard EEG frequency bands to define the filter bound-
aries. The rhythms most affected by drowsiness are θ, δ and α in that order. However, they do not change
in the same way, nor are all of the changes linear with respect to the decrease in performance. In the late
1960s, Daniel found that θ waves dropped significantly prior to failures in a detection task, and that the
occurrence of α waves was not necessarily correlated with errors [38]. Later on, Lorenzo et al., using central
electrodes, found a linear increase in θ power as a result of sleep deprivation which was also linked to
deterioration in performance [97]. Da Rosa et al. modelled the awake and sleep EEG with sufficient accu-
racy using the linearisation and simplification of a nonlinear distributed parameter physiological model
[36]. Studies on a minute scale showed that α power declines with drowsiness, while θ power increases
linearly with the loss in performance.
Flight simulations and in-cockpit studies have found correlation between EEG power-spectrum and pilot
performance, except for the α band [153]. Makeig and Jung [104] found that the second eigenvector of
the normalised EEG log spectrum is highly correlated with variations in drowsiness and sleep onset.
3.4.3 Vigilance monitoring algorithms
Attempts to implement an alertness monitor follow two major trends in pattern classification, the rule-
based type and the neural network approach. The signals most commonly used in these algorithms are
the EEG, the EOG and the EMG, but one of the prototypes for driver performance monitoring uses a non-
physiological signal, a measurement of the vehicle’s steering (see sub-section Neural Network methods
below). Some of the prototypes have been used only on simulators, while others have also been tested
in real conditions. Ambiguous data and inter-subject variability seem to be a common problem in all of
them.
As in sleep, the existing systems for automatic vigilance scoring are not yet suitable for clinical work,
requiring supervision from a skilled technician. Results in patients with EEG alterations are not reliable
unless their abnormality has been taken into account when developing the system.
Rule-based algorithms
In 1989 Penzel and Petzold [127] developed a sub-vigil state rule-based classifier based on frequency
domain features extracted from 2s segments of EEG. They achieved 84.4% agreement with consensus-labelled
data and noted that the inter-rater variability defines the limit of what can be achieved for man-machine
agreement. The inter-rater variability was 76% and the intra-rater variability was 81.6% on
their data set. The algorithm was used on OSA data and yielded “good results” in detecting arousals.
Varri et al.’s [170] rule-based computerised system for alertness scoring used more inputs: two EEG
channels, two EOG channels and one EMG channel. The system applied adaptive signal segmentation based on mean amplitude
and mean frequency measures, and a bank of filters provided the means of calculating the power within
each EEG band. A similar sub-system detected eye movements and EMG power. The effect of inter-subject
variability was reduced by recording 3 minutes of EEG with the eyes open in an alert condition and 3
minutes with the eyes closed in a quiet condition to provide reference values for the power in each EEG
band. They found that eye movement can play a very important role in alertness monitoring. The system
gave a 61.6% man-machine agreement. Hasan et al. [61] used the system with new data, having to
perform “prior minor adjustments” to compensate for the differences with the training data. They divided
the group into low/high α activity. They also found a value of 61.8% for the man-machine agreement, for
an inter-rater agreement of 71.9%, and noted that visual scorers had difficulties in correctly identifying
all the bursts of brain waves, especially θ.
Neural Network methods
As in many other classification problems, neural network methods have been applied to the problem of
alertness/drowsiness estimation. In 1992 Venturini et al. [171] attempted to perform real-time estima-
tion of alertness on a minute scale using one EEG channel and a neural network. The power at 5 significant
frequencies was used as the input features. The neural network had difficulties in achieving good generalisation
due to the small size of the available data set, and therefore the jack-knife method was used for training
(see chapter 8 for a description of this methodology). Results were “good”, reported as being better than
a linear discriminator on subjects who missed more than 40% of the target sounds on an auditory vigi-
lance task. They also tried to develop a similar system based on event related potentials (ERP) getting
an accuracy of 96% on data averaged over 28 minutes and of 90% on data averaged over 2 minutes.
However, ERP has two great disadvantages: firstly, it requires the introduction of a distracting sound,
and secondly it cannot be performed on a second-by-second basis because an ERP requires averaging
of a series of repetitive stimuli over at least a 2-min long window to be extracted from the background
EEG. Jung and Makeig [80] refined the system by using 2 EEG channels and a neural network using the
power spectrum as features and PCA to reduce the dimensionality of the feature space. They obtained a
reasonable match with respect to the predictions made by an a priori model and using linear regression.
More recently, Roberts et al. [139] attempted to predict the level of vigilance using multivariate AR
modelling of 2 symmetric channels of EEG (T3 and T4) and the blink rate from 2 channels of EOG
as input features to a committee of neural networks known as Radial Basis Function (RBF) networks
using thin-plate splines as basis functions. They made a comparative study, training the neural networks
for regression and for classification, the latter using only extreme-value labels. They trained the neural
networks in a Bayesian framework that allows integration over the unknown parameters (see [102] and
[103] for more detail) and which provides error bars for the results of the neural network analysis. They
obtained “reasonable” correlation with the smoothed human-expert assessment.
Trutschel et al. [166] combined the neural network approach with fuzzy logic when developing a neuro-
fuzzy hybrid system to detect micro-sleep events. The device consisted of 4 neural networks, one for each
of four EEG channels, and a fuzzy-logic combiner. They used the system to monitor alertness in a driving
simulation study, obtaining “high” correlation between the number of micro-sleeps detected per hour and
the accident statistics per hour during the night.
Physiological signals are not the only sources of information which can provide measures of alertness.
Performance measures give an indirect way of monitoring alertness. A vehicle based signal, the steering
measure, has been used to track driver performance and alertness [157]. Power spectrum, mean and
variance were chosen as input features in a neural network. θ-plus individuals5 were excluded from the
study. The system only worked with 75% of the drivers. Poor results may, however, have been due to
contradictory data. For instance, the experts who labelled the data using EEG, EOG and EMG channels,
scored one subject as being asleep for nearly two hours of driving. Results indicate that steering measure
and alertness are not 100% correlated.
Shortcomings
As mentioned above, EEG and SEM are the most significant physiological signals in alertness assessment.
However, SEM is very difficult to measure, and the EEG presents two disadvantages: firstly, the inherent
complexity of the wake EEG, which is affected by many factors such as task characteristics, motivation and mood;
and secondly, the inter-subject EEG variability.
EEG characteristics are widely distributed among the population, even within groups with the same
gender and age range. Matsuura et al. [106] found a large inter-individual variability, especially with
respect to age. The percentage of α time and α continuity were greater in males than in females after
adolescence, the percentage of θ time was greater in females than in males during childhood, and the
percentage of β time was higher in females than in males at all ages.
As we said in section 3.4.1, in about 10% of the population, visual inspection of the EEG shows α rhythm
during wakefulness, while for the other 90% the EEG only shows α rhythms when the subjects are in
eyes-shut wakefulness or in the first sleep stages. Another 10% of the population shows very low or no α
activity with eyes closed. The first group is known as α-plus (α+) or P-type while the other is the M-type,
P being used for persistent and M for minimal [87]. One of the vigilance studies found one α-plus subject
whose α activity decreased when becoming drowsy instead of the normal increase experienced by the
rest of the subjects [5].
A study of short-term EEG variability using the FFT suggests that interpretation of relative measures of
δ, θ and β in individual spectra may be dependent on absolute α power [120]. Varri et al. [170] divide
5Their EEG displays θ waves while they are awake.
the data into low or high α to adapt their algorithm to the “normal” differences in α activity. As already
mentioned, Hasan et al. [61] had to perform “prior minor adjustments” to compensate for the differences
with the training data. They also found that subjects with poorly defined occipital α activity constitute a
special problem in the detection of drowsiness [61].
A third problem in alertness/drowsiness scoring using the EEG comes from the standard procedures
followed to score the sleep EEG. The standard set of rules for sleep scoring [136] recommends a length
of 15-30s for the EEG epochs. However, Kubicki et al. argue that it is often difficult to make a distinction
between an “α-sleep type” and pre-arousals (micro-arousals) on this time scale [93].
Portable devices
A few commercial alertness monitoring devices, based on one or several measures such as eye-tracking,
pupillometry, eyelid closures, head motion detectors, electrophysiological and skin measures and performance
deterioration, are currently available [31][116]. A specialized company [31] advertises a micro-sleep/fatigue
detection algorithm that uses an advanced neural network and fuzzy logic hybrid system for
detecting and predicting the occurrence of micro-sleeps, a description that coincides with the system
developed by Trutschel et al. [166]. The same company offers integrated systems combining alertness
monitoring with alertness stimulation/micro-sleep suppression technologies, i.e. vibration, aroma, lighting,
sound and interactive performance systems combined with automatic micro-sleep/fatigue detection.
Chapter 4
Parametric modelling and linear prediction
This chapter reviews the theories of auto-regressive (AR) modelling and linear prediction, after an intro-
ductory section on spectrum estimation. A more detailed review of AR modelling can be found in [63].
Noise classification can be found in [62], and filter structures in [122].
4.1 Spectrum estimation
4.1.1 Deterministic continuous in time signals
Let x(t) be a deterministic continuous signal with finite energy. Its Fourier transform Xc(f) is given by
Eq. 4.1:
X_c(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt \qquad (4.1)
where the subscript c is used to distinguish it from its counterpart in the discrete-time domain.
Given the Fourier transform Xc(f), the signal x(t) can be recovered using the inverse Fourier transform:
x(t) = \int_{-\infty}^{\infty} X_c(f)\, e^{j 2\pi f t}\, df \qquad (4.2)
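For sampled data, the corresponding discrete transform pair can be checked numerically. The sketch below is illustrative only: it uses direct O(N²) sums rather than an FFT, and verifies that the inverse transform recovers the original sequence.

```python
import cmath

def dft(x):
    """Discrete analogue of Eq. 4.1: X(k) = sum_n x(n) e^{-j 2 pi k n / N}."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Discrete analogue of Eq. 4.2 (with the 1/N normalisation of the DFT pair)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

x = [1.0, 2.0, 0.5, -1.0]
x_back = idft(dft(x))
print([round(v.real, 10) for v in x_back])  # [1.0, 2.0, 0.5, -1.0]
```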
4.1.2 Stochastic signals
Many physical phenomena occur in such a complicated way that, even if they are governed by deterministic
laws, the almost infinite number of interactions and the noise present in the sensors make the use
of a probabilistic model more sensible. Stochastic signals1 carry an infinite amount of energy, and the
Fourier transform integral as defined in Eq 4.1 normally does not exist. They are not periodic, so the
Fourier series expansion does not apply either. Instead of the energy content, we may be interested in
the power (time average of energy) distribution with frequency. If the generating process is stationary1,
second order averages like the autocorrelation and the autocovariance offer an alternative to performing
the time-frequency transform. Normally, the autocovariance tends to zero as the lag increases, but if the
process is zero-mean, the autocorrelation equals the autocovariance and therefore shows the same trend.
This is a sufficient condition for the existence of the Fourier transform of the autocorrelation, given by
Eq 4.3:
R(f) = \int_{-\infty}^{\infty} r(\tau)\, e^{-j 2\pi f \tau}\, d\tau, \qquad R(\omega) = \int_{-\infty}^{\infty} r(\tau)\, e^{-j \omega \tau}\, d\tau \qquad (4.3)
The autocorrelation at lag zero, which is equal to the average power of the signal, is related to the Fourier
transform R(f) by the Wiener-Khinchin theorem:
r(0) = E[x(t)^2] = \int_{-\infty}^{\infty} R(f)\, df \qquad (4.4)
Therefore, the function R(f) represents the distribution of the power in the frequency domain, as a result
of which it has been named power spectral density (PSD) or power spectrum of the signal, often denoted
as S(f):
S(f) = R(f), \qquad S(\omega) = R(\omega) \qquad (4.5)
The PSD has several properties which are reviewed in [62, pp.254-56].
Estimating the power spectrum
Autocorrelation function estimators The autocorrelation function is an average over the ensemble
x(t, ξ)1. Usually only a single realisation x(t) (i.e. fixed ξ) of a given process x(t, ξ) is available leaving
1for a definition and a review of stochastic processes see Appendix A
us unable to estimate r(τ) unless ergodicity is assumed. If the process is ergodic, the autocorrelation
function of the process equals the time average over a single realisation, given by the right-hand side of
Eq. 4.6:
r(\tau) = E[x(t)\, x(t+\tau)] = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} x(t)\, x(t+\tau)\, dt \qquad (4.6)
However, in most of the cases, the signal x(t) is only available during a limited interval of time. Then
the autocorrelation function can only be estimated. Denoting x′(t) as the signal x(t) truncated by a
rectangular window of length 2T , we can estimate r(τ) as:
r(\tau) = \frac{1}{2T - |\tau|} \int_{-T+|\tau|/2}^{T-|\tau|/2} x\left(t + \frac{|\tau|}{2}\right) x\left(t - \frac{|\tau|}{2}\right) dt \qquad (4.7)
Eq. 4.7 is valid for |τ| < 2T; for |τ| ≥ 2T the estimate r(τ) is set to zero. This is an unbiased estimator
(i.e. its mean value is the real value of r(τ)), but its variance increases as |τ| increases, because of the
factor 2T − |τ| in the denominator. Instead, the estimator r′(τ):
r'(\tau) = \frac{2T - |\tau|}{2T}\, r(\tau) \qquad (4.8)
has smaller variance, and although it is a biased estimator, it is more commonly used because its Fourier
transform is related to the energy density spectrum of the truncated signal x′(t). Indeed, r′(τ) is equal
to:
r'(\tau) = \frac{1}{2T}\, x'(\tau) * x'(-\tau) \qquad (4.9)
where the symbol ∗ represents convolution in τ .
The periodogram: Fourier estimate of the PSD Invoking the Fourier transform property of convolution
in time, and noting that the transform of x′(−τ) is X′(−f), the Fourier transform of r′(τ) is:
R'(f) = \frac{1}{2T}\, X'(f)\, X'(-f) = \frac{1}{2T} |X'(f)|^2 \qquad (4.10)
Then the PSD estimate using the estimator r′(τ) for the autocorrelation is:
S'(f) = \frac{1}{2T} |X'(f)|^2 = \frac{1}{2T} \left| \int_{-T}^{T} x(t)\, e^{-j 2\pi f t}\, dt \right|^2 = \int_{-2T}^{2T} r'(\tau)\, e^{-j 2\pi f \tau}\, d\tau \qquad (4.11)
The function S′(f) is called the periodogram. It is an asymptotically unbiased estimator but its variance
increases with T . This surprising result is due to the integral in τ of the estimator r′(τ) with increasing
variance as |τ | approaches 2T . In the limit T → ∞ the periodogram tends to be a white-noise process
with mean S(f). Smoothing windows have been widely used as palliatives to overcome this behaviour,
either applied to the autocorrelation estimate r′(τ) to deemphasize the unreliable values at the borders,
or convolved with the periodogram to reduce the variance directly.
Discrete-in-time stationary stochastic processes If the signal is sampled in time, the equations above
change accordingly. The autocorrelation function r(m) is now a function of an integer lag m. Its Fourier
transform R(ω) is periodic, as a result of the sampling in time, and the total power can be found simply
by integrating over a period of R(ω):
P = r(0) = \frac{1}{2\pi} \int_{-\pi}^{\pi} R(\omega)\, d\omega \qquad (4.12)
where R(ω) is:
R(e^{j\omega}) = \sum_{m=-\infty}^{\infty} r(m)\, e^{-j \omega m} \qquad (4.13)
If only N samples of the time series x(n) have been taken, the discrete version of the autocorrelation
estimator in Eq. 4.8 can be calculated as:
r′(m) =1N
N−|m|−1∑n=0
x(n)x(n + |m|) (4.14)
This estimator presents the same characteristics as its continuous-time version. The expected value of
r′(m) is (N − |m|)/N times r(m), but it is asymptotically unbiased, as the bias tends to zero as N increases. Also,
its variance increases as m approaches N . A full expression for this variance is very difficult to find
for non-Gaussian processes [122]. However, Jenkins and Watts [77] conjecture that, in many cases, the
mean-square error of r′(m) is less than for the unbiased estimator.
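The estimator of Eq. 4.14 is straightforward to compute, and the bias factor can be exhibited directly: for a constant unit sequence, r(m) = 1 at every lag, so the estimate evaluates exactly to (N − |m|)/N. A minimal sketch, with an illustrative test sequence:

```python
def autocorr_biased(x, max_lag):
    """Biased autocorrelation estimate of Eq. 4.14: r'(m) = (1/N) sum x(n) x(n+|m|)."""
    n = len(x)
    return [sum(x[t] * x[t + m] for t in range(n - m)) / n for m in range(max_lag + 1)]

# For a constant unit sequence, r(m) = 1 for every lag, so the estimate
# equals the bias factor (N - |m|)/N exactly.
r = autocorr_biased([1.0] * 10, 3)
print(r)  # [1.0, 0.9, 0.8, 0.7]
```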
Discrete-in-time periodogram Based on the Jenkins and Watts conjecture, Eq. 4.14 is used in Eq. 4.13:
R'(e^{j\omega}) = \frac{1}{N} \sum_{m=-N+1}^{N-1} \sum_{n=0}^{N-|m|-1} x(n)\, x(n+|m|)\, e^{-j \omega m} \qquad (4.15)
After some mathematical manipulation [122, pp.542-3]:
R'(e^{j\omega}) = \frac{1}{N} \sum_{k=0}^{N-1} \sum_{n=0}^{N-1} x(n)\, x(k)\, e^{-j \omega (k-n)} = \frac{1}{N} \sum_{k=0}^{N-1} x(k)\, e^{-j \omega k} \sum_{n=0}^{N-1} x(n)\, e^{j \omega n} = \frac{1}{N}\, X(e^{j\omega})\, X(e^{-j\omega}) = \frac{1}{N} |X(e^{j\omega})|^2 \qquad (4.16)
where X(ejω) is the Fourier transform of the finite-length time series x(n). Note that R′(ejω) is the PSD
estimator S′(ejω) known as the periodogram.
S'(e^{j\omega}) = \frac{1}{N} |X(e^{j\omega})|^2 \qquad (4.17)
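Eq. 4.17 can be checked numerically against the power interpretation of the PSD: by Parseval's theorem, the average of the periodogram ordinates over the N DFT bins equals the signal's average power, i.e. r′(0). A minimal sketch with illustrative data and a direct DFT:

```python
import cmath

def periodogram(x):
    """S'(k) = |X(k)|^2 / N at the N DFT frequencies (Eq. 4.17, direct DFT)."""
    n = len(x)
    X = [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
         for k in range(n)]
    return [abs(Xk) ** 2 / n for Xk in X]

x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
S = periodogram(x)
avg_power = sum(v * v for v in x) / len(x)      # r'(0), the average power
print(abs(sum(S) / len(S) - avg_power) < 1e-9)  # True: Parseval's relation
```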
It can be proved ([122] pp.542-3) that, using either the unbiased estimator of the autocorrelation or the
biased estimator proposed by Jenkins and Watts, the periodogram for a discrete-in-time stationary process
is a biased estimator of the PSD. As in the continuous-in-time case, its variance does not tend to zero
as N increases. Again, this result can be improved by smoothing techniques, one of which divides the
time series into smaller, overlapping segments to perform an average over the periodograms, but this
compromises the resolution in frequency.
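The segment-averaging idea can be sketched as follows. For simplicity this sketch uses non-overlapping, rectangular-windowed segments; practical implementations usually add overlap and tapering, which are omitted here:

```python
import cmath

def segment_periodogram(x):
    """|X(k)|^2 / N at the N DFT frequencies of one segment (direct DFT)."""
    n = len(x)
    X = [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
         for k in range(n)]
    return [abs(Xk) ** 2 / n for Xk in X]

def averaged_periodogram(x, seg_len):
    """Average the periodograms of consecutive non-overlapping segments.
    The variance of each spectral ordinate drops roughly as 1/K for K segments,
    at the cost of frequency resolution (seg_len bins instead of len(x))."""
    segs = [x[i:i + seg_len] for i in range(0, len(x) - seg_len + 1, seg_len)]
    per_seg = [segment_periodogram(s) for s in segs]
    k = len(per_seg)
    return [sum(p[b] for p in per_seg) / k for b in range(seg_len)]
```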
Parametric modelling methods A model is any attempt to describe the laws which yield a given phe-
nomenon. Once a model is selected and its parameters estimated from the data, it can be used to generate
as many realisations as are needed to calculate the averages over the ensemble, or even better, it can be
used to calculate the PSD directly without having to use the Fourier transform. The choices for a model
are infinite, but using a priori knowledge about the data, the range can be reduced considerably. Assumptions
like zero values outside the observation window can be avoided. However, some assumptions
always have to be made for the characterisation of the model.
Yule [174] proposed in 1927 the use of a deterministic linear filter to represent a stochastic process. The
filter is driven by a sequence of statistically independent random variables with a zero-mean, constant-
variance Gaussian distribution2. This purely random series is known as white Gaussian noise because
its autocorrelation function is zero for all lag except for the origin, where it equals the variance of the
Gaussian process. The corresponding power spectral density is therefore a constant for all frequencies,
like the optical spectrum of white light. The filter performs a linear transformation on this uncorrelated
sequence to generate a highly correlated series x̂ that statistically matches the data x from the process
under analysis, as is shown in Fig. 4.1. The modelling procedure consists of the calculation of the filter
parameters.
[Figure 4.1: Stochastic process model — white Gaussian noise v(n) drives a discrete-time linear filter whose output is y(n) = x̂(n)]
The input-output relation of the filter has the general form:

(present value of model output) + (linear combination of past values of model output) =
(present value of model input) + (linear combination of past values of model input)
This can be written as a linear difference equation relating the input driving sequence v(n) to the output y(n):

y(n) = Σ_{m=0}^{q} b_m v(n-m) - Σ_{k=1}^{p} a_k y(n-k)    (4.18)
² Being the most common distribution found in physical phenomena, and given that the output of a linear filter driven by a Gaussian random process is another Gaussian process, it is the most convenient distribution at the filter input for a vast range of applications.
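The difference equation (Eq. 4.18) can be run directly to turn white noise into a correlated series. The sketch below is only an illustration of that recursion; the name `arma_output` and the zero-initial-condition convention are my assumptions, not the thesis's code.

```python
import numpy as np

def arma_output(b, a, v):
    """Run the difference equation of Eq. 4.18:
    y(n) = sum_m b[m] v(n-m) - sum_k a[k] y(n-k-1),
    with b = [b_0..b_q], a = [a_1..a_p], and zero initial conditions."""
    y = np.zeros(len(v))
    for n in range(len(v)):
        acc = sum(bm * v[n - m] for m, bm in enumerate(b) if n - m >= 0)
        acc -= sum(ak * y[n - k - 1] for k, ak in enumerate(a) if n - k - 1 >= 0)
        y[n] = acc
    return y
```

For example, with b = [1, 0.5] and no feedback terms, unit-variance white noise is turned into an MA(1) series with output variance 1 + 0.5² = 1.25 and lag-1 autocorrelation 0.5.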
Given that the proposed filter is linear and time-invariant, linear filter theory applies. Taking the z-transform³ of both sides of Eq. 4.18:

Y(z) = Σ_{m=0}^{q} b_m V(z) z^{-m} - Σ_{k=1}^{p} a_k Y(z) z^{-k}    (4.19)
Rearranging Eq. 4.19 to leave only Y(z) on the left-hand side:

Y(z) = [Σ_{m=0}^{q} b_m z^{-m}] / [1 + Σ_{k=1}^{p} a_k z^{-k}] V(z)    (4.20)
The z-transform of the unit-sample response of the filter, h(n), can be found by setting v(n) = δ(n), whose z-transform V(z) equals 1:

H(z) = Y(z)|_{V(z)=1} = [Σ_{m=0}^{q} b_m z^{-m}] / [1 + Σ_{k=1}^{p} a_k z^{-k}]    (4.21)
Using the substitution z = e^{jω}, the Fourier transform of the filter unit-sample response follows from Eq. 4.21:

H(e^{jω}) = [Σ_{m=0}^{q} b_m e^{-jωm}] / [1 + Σ_{k=1}^{p} a_k e^{-jωk}]    (4.22)
Let us now consider the input sequence as white Gaussian noise with variance σ_v^2. Its autocorrelation function equals σ_v^2 δ(n) and its Fourier transform equals σ_v^2 for all frequencies. Using the relation between the input and output autocorrelation functions given in Eq. A.13:

r_y(m) = Σ_{i=-∞}^{∞} Σ_{k=-∞}^{∞} h(i) h(k) σ_v^2 δ(k - i + m)    (4.23)

and taking the Fourier transform of both sides:

S_y(e^{jω}) = S_x(e^{jω}) = σ_v^2 |H(e^{jω})|^2    (4.24)

it can be seen that the PSD of the output of the filter can be obtained from the filter parameters {b_i, a_i} and the input noise variance.
³ Defined as Z[g(n)] = G(z) = Σ_{n=-∞}^{∞} g(n) z^{-n}. See [122] for properties.
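Eq. 4.24 can be evaluated numerically on a frequency grid: the PSD is just σ_v² times the squared magnitude of Eq. 4.22. The sketch below is a minimal illustration; the name `filter_psd` is mine, with {b_m, a_k} following the sign convention of Eq. 4.18.

```python
import numpy as np

def filter_psd(b, a, sigma2_v, n_freq=512):
    """PSD of the filter output (Eq. 4.24): S(w) = sigma_v^2 |H(e^{jw})|^2,
    with H as in Eq. 4.22. b = [b_0..b_q], a = [a_1..a_p]; w spans [0, pi)."""
    w = np.linspace(0.0, np.pi, n_freq, endpoint=False)
    num = sum(bm * np.exp(-1j * w * m) for m, bm in enumerate(b))
    den = 1.0 + sum(ak * np.exp(-1j * w * (k + 1)) for k, ak in enumerate(a))
    return w, sigma2_v * np.abs(num / den) ** 2
```

Two sanity checks: with b = [1] and no feedback the spectrum is flat at σ_v² (white noise), while an all-pole filter with a_1 = -0.9 concentrates power near ω = 0, where |H|² = 1/(1-0.9)² = 100.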
Whether the input-output relation uses a linear combination of past output values, a linear combination of past input values, or both, defines the following types of filter:
1. Autoregressive (AR) model: no linear combination of past values of the input is used.
2. Moving average (MA) model: no linear combination of past values of the output is used.
3. Autoregressive-moving average (ARMA) model: includes all the terms shown in Eq. 4.18.
The choice of model depends on the nature of the process. A description of the models follows in the next section.
4.2 Autoregressive Models
Let the time series y(n), y(n-1), ..., y(n-p) represent a realisation of an autoregressive process of order p. Then it satisfies the following difference equation:

y(n) + a_1 y(n-1) + a_2 y(n-2) + ... + a_p y(n-p) = v(n)    (4.25)

where the constants a_1, a_2, ..., a_p are the parameters of the model (the AR coefficients), and [v(n)] is a white Gaussian noise process.
The term “autoregressive” comes from the similarity between the AR model equation and the regression model equation:

y = Σ_{k=1}^{p} w_k u_k + v    (4.26)
The regression equation relates a dependent variable y to a set of independent variables u_1, u_2, ..., u_p plus an error term v; y is said to be regressed on u_1, u_2, ..., u_p. In a similar way the current sample of the AR process, y(n), is regressed on previous values of itself (hence “auto”), as can be seen by rewriting the AR equation as:

y(n) = Σ_{k=1}^{p} w_k y(n-k) + v(n)    (4.27)

where w_k = -a_k.
Transforming Eq. 4.25 to the z-domain we get:

Y(z)[1 + a_1 z^{-1} + a_2 z^{-2} + ... + a_p z^{-p}] = V(z)    (4.28)

Therefore the transfer function of an AR filter is:

H(z) = Y(z)/V(z) = 1 / [1 + a_1 z^{-1} + a_2 z^{-2} + ... + a_p z^{-p}]    (4.29)
The use of previous samples of the filter output is depicted with feedback paths, as shown in Fig. 4.2. The AR filter is an infinite impulse response (IIR), or all-pole, filter of order p. It can be stable or unstable depending on the location of its poles: if one or more poles lie outside the unit circle the filter is unstable. The p poles may be calculated from the characteristic equation of the filter:

1 + a_1 z^{-1} + a_2 z^{-2} + ... + a_p z^{-p} = 0    (4.30)
[Figure 4.2: Autoregressive filter — white noise v(n) drives a feedback structure of delay elements weighted by a_1, ..., a_p to produce the AR process y(n)]
Moving Average Models
Moving average filters are described by:

y(n) = v(n) + b_1 v(n-1) + b_2 v(n-2) + ... + b_q v(n-q)    (4.31)

where the constants b_1, b_2, ..., b_q are the MA parameters of the model and [v(n)] is a white Gaussian noise process.
This type of filter is an all-zero filter, inherently stable, with a finite impulse response (FIR). For this kind of discrete filter the order equals q, since q is the minimum number of delay units needed to implement it (see Fig. 4.3). The term “moving average” refers to the weighted average of the input time series v(n).
[Figure 4.3: Moving average filter — a tapped delay line on the white noise input v(n), with weights b_1, ..., b_q, produces the MA process y(n)]
Autoregressive-Moving Average Models
This model combines the features of the AR and MA filters. The difference equation which describes it is:

y(n) + a_1 y(n-1) + ... + a_p y(n-p) = v(n) + b_1 v(n-1) + ... + b_q v(n-q)    (4.32)

where the constants a_1, ..., a_p, b_1, ..., b_q are the ARMA parameters of the model and [v(n)] is a white Gaussian noise process. For this kind of IIR filter with direct transmission from the input, the order is said to be the pair (p, q). AR and MA models are special cases of an ARMA model.
[Figure 4.4: ARMA filter (b_0 = 1, q = p - 1)]
Wold decomposition
Wold’s decomposition theorem states that any stationary discrete-time stochastic process [u(n)] may be decomposed into the sum of a general linear process and a predictable process, the two being uncorrelated with each other. Accordingly, the process [u(n)] may be expressed as:

u(n) = y(n) + s(n)    (4.33)

The term s(n) is the predictable process, i.e. the sample s(n) can be predicted from its own past values with zero prediction variance. The term y(n) is the general linear process, which may be represented by the MA model:

y(n) = v(n) + Σ_{k=1}^{∞} b_k v(n-k)    (4.34)

where Σ_{k=1}^{∞} |b_k|^2 < ∞.
The white noise term v(n) which drives the general linear process y(n) is uncorrelated with the predictable process s(n), i.e. E[v(n)s(k)] = 0 for every pair (n, k). The general linear process may equally well be an AR process; all we have to do is ensure that the impulse response of the AR filter equals the impulse response of the MA filter. That is:
h(n) = Σ_{k=0}^{∞} b_k δ(n-k)    (4.35)

where b_0 = 1.
AR models have gained more popularity than the MA and the ARMA models. The reason lies in the
computation of the filter parameters, which leads to a system of equations that is linear for AR filters and
nonlinear for MA and ARMA filters [81][105].
4.3 AR parameter estimation
4.3.1 Asymptotic stationarity of an AR process
The classical solution to the AR difference equation (Eq. 4.25) separates the homogeneous solution from the particular solution. The particular solution is the AR model difference equation shown in Eq. 4.27, but the homogeneous solution y_h(n) is of the form:

y_h(n) = B_1 z_1^n + B_2 z_2^n + ... + B_p z_p^n    (4.36)

where z_1, z_2, ..., z_p are the roots of the characteristic equation of the filter (Eq. 4.30). The constants B_1, B_2, ..., B_p may be determined from the set of p initial conditions y(0), y(-1), ..., y(-p+1). For arbitrary values of the constants B_k, it is clear from Eq. 4.36 that the homogeneous solution decays to zero as n approaches infinity if and only if:

|z_k| < 1, for all k    (4.37)
In other words, all the poles of the AR filter lie inside the unit circle in the z-plane. A system which is able to “forget” its initial values in this way is said to be asymptotically stationary.
The autocorrelation function of such a system satisfies the homogeneous difference equation of the model. This may be found by rewriting Eq. 4.27 as:

Σ_{k=0}^{p} a_k y(n-k) = v(n)    (4.38)
where a_0 = 1. Multiplying both sides by y(n-m) and taking the expectation we get:

E[Σ_{k=0}^{p} a_k y(n-k) y(n-m)] = E[v(n) y(n-m)]    (4.39)

This may be simplified by noting that the expectation E[y(n-k)y(n-m)] equals the autocorrelation function at lag (m-k), and that the expectation of v(n)y(n-m) is zero for m > 0, since the sample y(n-m) depends only on input samples up to time (n-m). Hence:

Σ_{k=0}^{p} a_k r(m-k) = 0, for m > 0    (4.40)
Expanding the last equation gives the desired result:

r(m) = w_1 r(m-1) + w_2 r(m-2) + ... + w_p r(m-p), for m > 0    (4.41)

where w_k = -a_k. We may express the general solution of this equation as:

r(m) = Σ_{k=1}^{p} C_k z_k^m    (4.42)

where the C_k are constants and the z_k are the roots of the characteristic equation (Eq. 4.30). It follows that the autocorrelation function of an asymptotically stationary AR process approaches zero as the lag tends to infinity. This autocorrelation function decays exponentially if the dominant root is real and positive, alternates in sign as it decays if the dominant root is real and negative, and is a damped sine wave if the dominant roots are a complex conjugate pair.
4.3.2 Yule-Walker equations
Writing Eq. 4.41 for m = 1, 2, ..., p yields a set of p simultaneous equations for the unknowns a_1, a_2, ..., a_p, assuming that the autocorrelation function r(m) is known at least for lags 1 to p:

[ r(0)    r(1)    ...  r(p-1) ] [ w_1 ]   [ r(1) ]
[ r(1)    r(0)    ...  r(p-2) ] [ w_2 ]   [ r(2) ]
[  ...     ...    ...   ...   ] [ ... ] = [  ... ]
[ r(p-1)  r(p-2)  ...  r(0)   ] [ w_p ]   [ r(p) ]    (4.43)

where w_k = -a_k. This set of equations is known as the Yule-Walker equations. In matrix form:

R w = r    (4.44)
where R is the p × p autocorrelation matrix, w = [w_1, w_2, ..., w_p]^T and r = [r(1), r(2), ..., r(p)]^T. Its solution is:

w = R^{-1} r    (4.45)
It can be seen from Eq. 4.45 that the set of AR coefficients may be uniquely determined from the first p+1 samples of the autocorrelation function of the process x(n) being modelled. If we evaluate Eq. 4.39 for m = 0 and y(n) equal to the data time series x(n), we get:

E[Σ_{k=0}^{p} a_k x(n-k) x(n)] = E[v(n) x(n)]    (4.46)
The right-hand side of Eq. 4.46 is:

E[v(n) x(n)] = E[v(n) (Σ_{m=1}^{p} w_m x(n-m) + v(n))]
             = Σ_{m=1}^{p} w_m E[v(n) x(n-m)] + E[v(n) v(n)]
             = E[v(n) v(n)]    (4.47)
The right-hand side of this equation is the variance of the input noise, σ_v^2. This variance may be determined from the set of AR coefficients and the first p+1 samples of the autocorrelation function:

σ_v^2 = Σ_{k=0}^{p} a_k r(k)    (4.48)
Eq. 4.44 can be solved by Gaussian elimination. However, the Toeplitz structure of the matrix R can be exploited to find the parameters a_k more efficiently; a recursive algorithm to solve the Yule-Walker equations will be presented in section 4.6.1.
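The direct (non-recursive) solution of Eqs. 4.44 and 4.48 can be sketched in a few lines. This is an illustration only, with `yule_walker` as an assumed name; a Toeplitz-aware solver would be more efficient for large p.

```python
import numpy as np

def yule_walker(r):
    """Solve R w = r (Eq. 4.44) given autocorrelation samples
    r = [r(0), r(1), ..., r(p)]. Returns (a, sigma2_v), where
    a = [a_1..a_p] = -w and sigma2_v follows Eq. 4.48."""
    r = np.asarray(r, dtype=float)
    p = len(r) - 1
    # Toeplitz autocorrelation matrix R, built explicitly for clarity
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    w = np.linalg.solve(R, r[1:])
    a = -w
    sigma2_v = r[0] + a @ r[1:]     # Eq. 4.48 with a_0 = 1
    return a, sigma2_v
```

Fed with r(0) = 1, r(1) = 0.5, r(2) = 0.85, this reproduces the second-order worked example of section 4.3.3: a = [-0.1, -0.8] and σ_v² = 0.27.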
4.3.3 Using an AR model
An AR model may be used for synthesis or for analysis. In synthesis, a stationary stochastic process y(n), characterised by its variance σ_y^2 and the parameters of its AR model (i.e. the AR filter coefficients), is given, and we want to generate a time series of the process. In analysis, we want to model a stochastic process given a time series x(n), by estimating the set of AR parameters for a model order p and the input noise variance, assuming that p is the optimum model order⁴. Next, we present a second-order example of a synthesis problem and of an analysis problem.
Second order AR process synthesis
Assume that we want to synthesise a real-valued, second-order stationary AR process y(n) with unit variance. The difference equation of the AR model is:

y(n) + a_1 y(n-1) + a_2 y(n-2) = v(n)    (4.49)

As a condition for asymptotic stationarity, we need to ensure that the roots of the characteristic equation of the model lie inside the unit circle in the z-plane:

1 + a_1 z^{-1} + a_2 z^{-2} = 0    (4.50)

⇒ z_{1,2} = [-a_1 ± sqrt(a_1^2 - 4 a_2)] / 2    (4.51)
where z_1 and z_2 are the roots of Eq. 4.50. Satisfying the asymptotic stationarity condition:

|z_1| < 1 and |z_2| < 1    (4.52)

requires the following restrictions on the AR parameters:

-1 ≤ a_2 + a_1,  -1 ≤ a_2 - a_1,  -1 ≤ a_2 ≤ 1    (4.53)

which define a triangular region in the (a_1, a_2) plane with corners at (-2, 1), (0, -1) and (2, 1). Let us arbitrarily choose the following values for a_1 and a_2 from this region:

a_1 = -0.1,  a_2 = -0.8    (4.54)

This gives roots at:

z_1 = 0.9458,  z_2 = -0.8458    (4.55)
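The pole locations and the stationarity check can be sketched numerically; multiplying Eq. 4.50 through by z^p gives an ordinary polynomial whose roots are the poles. The name `ar_poles` is illustrative.

```python
import numpy as np

def ar_poles(a):
    """Poles of the AR filter: roots of z^p + a_1 z^{p-1} + ... + a_p,
    i.e. Eq. 4.50 multiplied through by z^p."""
    return np.roots([1.0] + list(a))

# Check the example values a_1 = -0.1, a_2 = -0.8 for stationarity
poles = ar_poles([-0.1, -0.8])
stationary = bool(np.all(np.abs(poles) < 1))
```

For a = [-0.1, -0.8] the poles come out at approximately 0.9458 and -0.8458, both inside the unit circle, so the chosen process is asymptotically stationary.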
⁴ We will not cover in this section the problem of finding the optimum model order.
The positive root z_1 dominates the autocorrelation function. In order to calculate the input noise variance we need to find the first three samples of the autocorrelation function r(m):

σ_v^2 = r(0) + a_1 r(1) + a_2 r(2)    (4.56)
From the Yule-Walker equations:

[ r(0)  r(1) ] [ w_1 ]   [ r(1) ]
[ r(1)  r(0) ] [ w_2 ] = [ r(2) ]    (4.57)

where w_1 = -a_1 and w_2 = -a_2. We know that r(0) = σ_y^2 = 1, and hence we can find the other two samples of r(m). Substituting in Eq. 4.57:

[ 1     r(1) ] [ 0.1 ]   [ r(1) ]
[ r(1)  1    ] [ 0.8 ] = [ r(2) ]    (4.58)

⇒ [  0.2  0.0 ] [ r(1) ]   [ 0.1 ]
  [ -0.1  1.0 ] [ r(2) ] = [ 0.8 ]    (4.59)

⇒ r(1) = 0.5,  r(2) = 0.85,  σ_v^2 = 0.27    (4.60)
To generate the time series, we substitute the values of the AR parameters into Eq. 4.49 and run the difference equation with v(n) drawn from N(0, 0.27) and zero initial values for y(n):

y(n) = 0.1 y(n-1) + 0.8 y(n-2) + v(n)    (4.61)

A time series generated in this way is shown in Fig. 4.5. The autocorrelation function plotted in Fig. 4.6 has been calculated by applying Eq. 4.41 with the initial values r(0), r(1) and r(2) found above:

r(m) = 0.1 r(m-1) + 0.8 r(m-2), for m > 2    (4.62)
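The synthesis step can be sketched directly from Eq. 4.61. This is a minimal sketch, not the thesis's generator: the name `synthesise_ar2` and the use of a seeded random generator are my assumptions.

```python
import numpy as np

def synthesise_ar2(n_samples, a1=-0.1, a2=-0.8, sigma2_v=0.27, seed=0):
    """Run the difference equation y(n) = -a1*y(n-1) - a2*y(n-2) + v(n)
    (Eq. 4.61 with the example values) with v(n) ~ N(0, sigma2_v)
    and zero initial conditions."""
    rng = np.random.default_rng(seed)
    v = rng.normal(0.0, np.sqrt(sigma2_v), n_samples)
    y = np.zeros(n_samples)
    for n in range(n_samples):
        y_1 = y[n - 1] if n >= 1 else 0.0
        y_2 = y[n - 2] if n >= 2 else 0.0
        y[n] = -a1 * y_1 - a2 * y_2 + v[n]
    return y
```

With the default parameters the sample variance of a long series should approach the design value σ_y² = 1, and the lag-1 autocorrelation should approach r(1) = 0.5.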
[Figure 4.5: Time series of the synthesised AR process]

[Figure 4.6: Autocorrelation function of the synthesised AR process]

[Figure 4.7: Second-order AR process generator]

Second order AR process analysis

Assume that we have 128 samples of a time series from a stationary stochastic process [y(n)]. We will see whether this process can be modelled as an AR process of order 2. The Yule-Walker equations may be used to estimate the AR parameters:

w = R^{-1} r    (4.63)

but first we need to estimate the first three samples of the autocorrelation function from the available data. Using the sample autocorrelation estimator (Eq. 4.14) with N = 128:

r'(m) = (1/128) Σ_{n=0}^{127-|m|} y(n) y(n-m)    (4.64)
we may estimate the first three samples of r(m) and form the matrix R:

R' = [ r'(0)  r'(1) ]
     [ r'(1)  r'(0) ]    (4.65)
If the matrix R' is nonsingular, we may find a_1 and a_2 from the Yule-Walker matrix equation. The input noise variance may be estimated from the r'(m) sequence using Eq. 4.48. To test the model we may use the inverse filter to see whether it is capable of “whitening” the given time series: if the model fits the data well, the output of this whitening filter will be white Gaussian noise with zero mean and variance σ_v^2. The direct filter has the transfer function:

H(z) = 1 / [1 + a_1 z^{-1} + a_2 z^{-2}]    (4.66)

so the whitening filter transfer function is:

H_W(z) = H^{-1}(z) = 1 + a_1 z^{-1} + a_2 z^{-2}    (4.67)

The whitening filter is also called the AR process analyser, and its impulse response has finite duration (FIR). If the process is not truly autoregressive, if the model order is not p, or if the error in the estimation of the autocorrelation is high, the output of the inverse filter will be coloured noise.
As an example, we may use the AR process generator of the previous section to produce a 128-sample time series to feed into the AR analyser. For one such time series the first three estimated values of r(m) were 0.7045, 0.1963 and 0.5612. The matrix R' is therefore:

R' = [ 0.7045  0.1963 ]
     [ 0.1963  0.7045 ]
[Figure 4.8: Second-order AR process analyser]
and the vector r' = [0.1963, 0.5612]^T. Applying Eq. 4.63 we obtain the estimate w = [0.0614, 0.7795]^T. A better approximation to the true value w = [0.1, 0.8]^T can be obtained by increasing the number of samples, or by running the generator several times to collect several time series of the same process (i.e. an ensemble), analysing each and averaging the results. Table 4.1 and Fig. 4.9 show the mean and variance of the AR coefficients estimated using this procedure for an ensemble of 500 time series, with the number of samples per time series ranging from 16 to 1024.
N             16       32       64       128      256      384      512      1024
a_1 mean      -0.1622  -0.1311  -0.1126  -0.1045  -0.1028  -0.1039  -0.1022  -0.1002
a_1 variance   0.0700   0.0258   0.0094   0.0038   0.0018   0.0012   0.0009   0.0004
a_2 mean      -0.5065  -0.6448  -0.7181  -0.7588  -0.7777  -0.7827  -0.7881  -0.7949
a_2 variance   0.0408   0.0180   0.0084   0.0034   0.0016   0.0011   0.0008   0.0004

Table 4.1: Mean and variance of the AR coefficient estimates
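The analysis procedure just described can be sketched compactly. The name `estimate_ar2` is illustrative; the sketch uses the sample autocorrelation estimator (Eq. 4.64) followed by the Yule-Walker solution (Eq. 4.63).

```python
import numpy as np

def estimate_ar2(y):
    """Estimate [a_1, a_2] of an AR(2) model from data: sample
    autocorrelation (Eq. 4.64 generalised to length N), then w = R^{-1} r
    (Eq. 4.63), then a_k = -w_k."""
    N = len(y)
    r = np.array([np.dot(y[m:], y[:N - m]) / N for m in range(3)])
    R = np.array([[r[0], r[1]], [r[1], r[0]]])
    w = np.linalg.solve(R, r[1:])
    return -w
```

Fed with a long realisation of the example process y(n) = 0.1 y(n-1) + 0.8 y(n-2) + v(n), the estimate converges towards [-0.1, -0.8], mirroring the trend in Table 4.1.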
4.4 Linear Prediction
4.4.1 Wiener Filters
A typical statistical linear filtering problem consists of an input time series x(n), a linear filter characterised by its impulse response b_0, b_1, b_2, ..., and an output sequence y(n). This output is an estimate of a desired response d(n) (Fig. 4.10). Defining the estimation error as:

e(n) = d(n) - y(n)    (4.68)
[Figure 4.9: Mean and variance of the AR coefficient estimates — ensemble means and variances of the a_1 and a_2 estimates versus the number of samples N in the time series]
[Figure 4.10: Filter problem — a linear discrete-time filter with coefficients b_0, b_1, b_2, ... produces the output y(n) from the input x(n); the estimation error is e(n) = d(n) - y(n)]
the filter can be optimised by minimising the cost function J:

J = E[e(n) e(n)] = E[|e(n)|^2]    (4.69)

by setting its gradient, in the space constituted by the filter coefficients, equal to zero:

∇J = 0    (4.70)

Solving Eq. 4.70 yields the following result:

E[x(n-k) e_o(n)] = 0, for k = 0, 1, 2, ...    (4.71)

where e_o(n) denotes the estimation error of the filter operating in its optimum condition. Substituting Eq. 4.68 into Eq. 4.71 gives the following set of equations, known as the Wiener-Hopf equations:

R b_o = c    (4.72)

where the p × p correlation matrix R has been defined in Eq. A.15, and:

b_o = [b_{o0}, b_{o1}, ..., b_{o,p-1}]^T,  c = [c(0), c(-1), ..., c(1-p)]^T    (4.73)

where c(-k) = E[x(n-k) d(n)].
4.4.2 Linear Prediction
One of the most common uses of Wiener filters is to predict a future sample of a stationary stochastic process, given a set of past samples of the process. The Wiener-Hopf equations may be used to optimise the predictor in the mean-square sense. Assume that a time series of the process x(n-1), x(n-2), ..., x(n-p) is available. The estimate of the sample at time n, x̂(n), is a linear function of the previous samples:

x̂(n) = Σ_{k=1}^{p} b_k x(n-k)    (4.74)
The desired response is the true value of the sample x(n):

d(n) = x(n)    (4.75)

Then the prediction error e(n) for this filter is:

e(n) = x(n) - x̂(n)    (4.76)
The vector b_o in the Wiener-Hopf equations becomes b_o = [b_{o1}, b_{o2}, ..., b_{op}]^T. Note the shift by one in the indices of the coefficients b_{ok} with respect to Eq. 4.73, because the input sequence starts at sample n-1 instead of n. The input sequence provides the data for the estimation of the first p+1 samples of the autocorrelation function r(m), which may be used to find the p × p correlation matrix R and the vector c. The latter is possible because the desired response is a sample of the input time series:

    [ E[x(n-1) x(n)] ]   [ r(1) ]
c = [ E[x(n-2) x(n)] ] = [ r(2) ]
    [      ...       ]   [  ... ]
    [ E[x(n-p) x(n)] ]   [ r(p) ]    (4.77)
If the matrix R is nonsingular, the solution of the Wiener-Hopf set of equations gives the optimum linear predictor, characterised by the set of parameters b_{oi} for i = 1, ..., p. Fig. 4.11 shows a linear predictor of order p.
[Figure 4.11: Prediction filter of order p]
Note that the number of delay units is (p-1) while the number of filter parameters remains p. The apparent incongruity between the number of delay units, the number of parameters and the model order disappears when the linear predictor is related to the Wiener filter, as shown in Fig. 4.12.
[Figure 4.12: Prediction-error filter of order p]
Relationship between AR models and Linear Prediction
The filter of Fig. 4.12 may be rearranged to have the same structure as the AR analyser filter of Fig. 4.8. The resulting filter is shown in Fig. 4.13.
[Figure 4.13: Prediction-error filter of order p rearranged to look like an AR analyser]
The filters in Fig. 4.13 and Fig. 4.8 show the equivalence of the linear prediction-error filter and the AR analyser. Both filters are fed with a time series from a stochastic process, and both are expected to produce an uncorrelated random sequence at the output, e(n) or v(n) respectively. For the linear predictor, this output has been minimised in the mean-square prediction-error sense by solving the Wiener-Hopf equations:

R b_o = c    (4.78)
with c = [r(1), r(2), ..., r(p)]^T in Eq. 4.78. The coefficients b_i of the linear predictor are related to the parameters a_i of the AR model through:

a_i = -b_i, for i = 1, 2, ..., p    (4.79)

The set of AR parameters may be calculated using the Yule-Walker equations:

R w = r    (4.80)

with w = [-a_1, -a_2, ..., -a_p]^T and r = [r(1), r(2), ..., r(p)]^T. Therefore, the set of AR parameters found by solving the Yule-Walker equations is optimum in the mean-square prediction-error sense.
4.5 Maximum entropy method (MEM) for power spectrum density estimation
The Yule-Walker equations for AR modelling (or linear prediction) can be used to find the parameters of the filter that models the stochastic process x(n), and hence to estimate its PSD using Eq. 4.24. But the goodness of the estimator S still depends on the statistical characteristics of the estimator of the autocorrelation function r(m). The periodogram in Eqs. 4.11 and 4.17 assumes that the unknown values of the autocorrelation (for lags greater in modulus than the data length) are zero, which leads to smearing of the PSD estimate. Burg [25] applied the principle of maximum entropy to the estimation of the unknown autocorrelation lags of a Gaussian stochastic process. In this sense, the maximum entropy autocorrelation estimate is the one with the most random autocorrelation series, i.e. the maximum entropy estimator adds no information to the estimate. The solution for a set of 2p + 1 known autocorrelation lags is:
r_MEM(m) = r(m), for |m| ≤ p
r_MEM(m) = Σ_{k=1}^{p} b_{p,k} r_MEM(m-k), for |m| > p    (4.81)

where the coefficients b_{p,k} are none other than the parameters of the p-order linear predictor, and therefore equal to minus the a_{p,k} parameters of a p-order AR filter for the known autocorrelation lags. The
MEM PSD estimate, obtained as the Fourier transform of r_MEM, is:

S_MEM(ω) = P_{e_p} / |1 - Σ_{k=1}^{p} b_{p,k} e^{-jωk}|^2    (4.82)

where P_{e_p} denotes the average prediction error power E[|e_p(n)|^2] of the p-order linear predictor, which is equivalent to the input noise variance σ_{v,p}^2 of the p-order AR model. In terms of the AR model parameters, Eq. 4.82 is:

S_MEM(ω) = σ_{v,p}^2 / |1 + Σ_{k=1}^{p} a_{p,k} e^{-jωk}|^2    (4.83)
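The maximum-entropy extension of Eq. 4.81 can be sketched directly: the known lags are kept, and later lags follow the predictor recursion (with b_{p,k} = -a_{p,k} when the AR parameters are given). The name `extend_autocorrelation` is my assumption.

```python
import numpy as np

def extend_autocorrelation(r_known, a, n_lags):
    """Maximum-entropy extension of the autocorrelation (Eq. 4.81):
    keep the known lags r(0)..r(p), then extrapolate with
    r(m) = -sum_k a_k r(m-k), since b_{p,k} = -a_{p,k}."""
    r = list(r_known)
    p = len(a)
    for m in range(len(r_known), n_lags):
        r.append(-sum(a[k - 1] * r[m - k] for k in range(1, p + 1)))
    return np.array(r)
```

For the second-order example of section 4.3.3 (r = [1, 0.5, 0.85], a = [-0.1, -0.8]) this reproduces the recursion of Eq. 4.62: r(3) = 0.485, r(4) = 0.7285.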
4.6 Algorithms for AR modelling
4.6.1 Levinson-Durbin recursion to solve the Yule-Walker equation
The Levinson-Durbin algorithm [95][49] uses the symmetric Toeplitz structure of the autocorrelation matrix R to provide an efficient solution of Eq. 4.44, requiring only of the order of p² operations for a model order p, instead of the p³ computations required by Gaussian elimination. The algorithm also reveals fundamental properties of AR processes. It recursively computes the filter parameters and input variance {a_{m,k}, σ_m^2} for model orders m = 1, 2, ..., p.
The algorithm proceeds as follows:

1. Initialisation:

a_{1,1} = -r(1)/r(0)    (4.84)

σ_1^2 = (1 - |a_{1,1}|^2) r(0)    (4.85)

2. Recursion, for m = 2, 3, ..., p:

a_{m,m} = -[r(m) + Σ_{k=1}^{m-1} a_{m-1,k} r(m-k)] / σ_{m-1}^2    (4.86)

a_{m,k} = a_{m-1,k} + a_{m,m} a_{m-1,m-k}, for k = 1, 2, ..., m-1    (4.87)

σ_m^2 = (1 - |a_{m,m}|^2) σ_{m-1}^2    (4.88)
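The recursion of Eqs. 4.84-4.88 translates almost line-for-line into code. This is a minimal sketch for real-valued data; the name `levinson_durbin` is illustrative.

```python
import numpy as np

def levinson_durbin(r):
    """Levinson-Durbin recursion (Eqs. 4.84-4.88) given autocorrelation
    samples r = [r(0), ..., r(p)]. Returns (a, sigma2) with
    a = [a_{p,1}, ..., a_{p,p}] and sigma2 the input noise variance."""
    r = np.asarray(r, dtype=float)
    p = len(r) - 1
    a = np.zeros(p)
    a[0] = -r[1] / r[0]                          # Eq. 4.84
    sigma2 = (1.0 - a[0] ** 2) * r[0]            # Eq. 4.85
    for m in range(2, p + 1):
        # sum_{k=1}^{m-1} a_{m-1,k} r(m-k): pair a[0..m-2] with r[m-1..1]
        acc = r[m] + np.dot(a[:m - 1], r[m - 1:0:-1])
        a_mm = -acc / sigma2                     # Eq. 4.86
        a[:m - 1] = a[:m - 1] + a_mm * a[m - 2::-1]  # Eq. 4.87
        a[m - 1] = a_mm
        sigma2 = (1.0 - a_mm ** 2) * sigma2      # Eq. 4.88
    return a, sigma2
```

Applied to the second-order example (r = [1, 0.5, 0.85]) it returns a = [-0.1, -0.8] and σ² = 0.27, the same solution as the direct inversion of Eq. 4.44 but using only order-p² operations.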
The solution {a_{p,k}, σ_p^2} is the same as would be obtained using Eq. 4.44, and the solution sets for the lower model orders provide useful information. If the values r(m) used in the recursion represent a valid autocorrelation sequence, then it can be shown [10] that the last parameter for each model order satisfies⁵:

|a_{m,m}| ≤ 1    (4.89)

Consequently, the input variance obeys:

σ_m^2 ≤ σ_{m-1}^2    (4.90)
Using the analogy with linear predictors, Eq. 4.90 means that the prediction error decreases, or at least remains steady, as the model order increases. This is an advantage if the model order is not known a priori. If the stochastic process x(n) is actually an AR process of order p with known autocorrelation function, then the Levinson-Durbin recursion will reproduce the set {a_{p,k}, σ_p^2} for model orders greater than p. Under real conditions, where either the autocorrelation is unknown or the process is not truly AR, the input variance decreases monotonically as a function of the model order; however, it usually shows a “knee” or turning point, beyond which further increases in the model order do not significantly improve the prediction error.
Lattice form of a linear predictor
The parameters a_{m,m} play an important role in the theory of linear prediction. To see how they are related to the linear predictor of order p, let us define two types of prediction error:

Forward-prediction error: The prediction error shown in Eq. 4.76 for a p-order linear predictor, denoted by e_p(n), is:

e_p(n) = x(n) - Σ_{k=1}^{p} b_{p,k} x(n-k) = x(n) + Σ_{k=1}^{p} a_{p,k} x(n-k)    (4.91)
⁵ In fact, the condition |a_{m,m}| ≤ 1 is necessary and sufficient for the values of r(m) to represent a valid autocorrelation function.
where the b_{p,k} are the parameters of the p-order linear predictor, and the a_{p,k} the p-order AR model parameters. We will continue using the second form to keep consistency with the notation used in the Levinson-Durbin algorithm.

Backward-prediction error: If the data time series were reversed and fed to the p-order linear predictor for the original sequence x(n), the filter would sequentially predict the “past” samples of the original time series. Thus the backward prediction error for the sample x(n-p), denoted by b_p(n)⁶, is:

b_p(n) = x(n-p) + Σ_{k=1}^{p} a_{p,k} x(n-p+k)
       = x(n-p) + a_{p,1} x(n-p+1) + a_{p,2} x(n-p+2) + ... + a_{p,p} x(n)    (4.92)
Using Eq. 4.87, a relationship between the forward and backward prediction errors can be found:

e_p(n) = x(n) + Σ_{k=1}^{p-1} a_{p,k} x(n-k) + a_{p,p} x(n-p)
       = x(n) + Σ_{k=1}^{p-1} (a_{p-1,k} + a_{p,p} a_{p-1,p-k}) x(n-k) + a_{p,p} x(n-p)
       = (x(n) + Σ_{k=1}^{p-1} a_{p-1,k} x(n-k)) + (Σ_{k=1}^{p-1} a_{p,p} a_{p-1,p-k} x(n-k) + a_{p,p} x(n-p))    (4.93)
Noting that the terms within the brackets in Eq. 4.93 are related to the (p-1)-order linear predictor by:

e_{p-1}(n) = x(n) + Σ_{k=1}^{p-1} a_{p-1,k} x(n-k)

b_{p-1}(n-1) = x(n-1-(p-1)) + Σ_{k=1}^{p-1} a_{p-1,k} x(n-1-(p-1)+k)
             = x(n-p) + a_{p-1,1} x(n-p+1) + a_{p-1,2} x(n-p+2) + ... + a_{p-1,p-1} x(n-1)
             = x(n-p) + Σ_{k=1}^{p-1} a_{p-1,p-k} x(n-k)    (4.94)
Eq. 4.93 can thus be written as:

e_p(n) = e_{p-1}(n) + a_{p,p} b_{p-1}(n-1)    (4.95)

Similarly, it can be shown that:

b_p(n) = b_{p-1}(n-1) + a_{p,p} e_{p-1}(n)    (4.96)
⁶ Note that from this point the symbol b denotes an error and not a filter coefficient.
Eqs. 4.95 and 4.96 therefore relate the forward and backward prediction errors of a given model order p to the errors of a linear predictor of model order p-1. They can be used recursively to derive the lattice form of a p-order linear predictor. Renaming the parameters a_{m,m} as:

a_{m,m} = κ_m    (4.97)

and setting the initial values of the forward and backward prediction errors:

e_0(n) = b_0(n) = x(n)    (4.98)

we get:

e_1(n) = e_0(n) + κ_1 b_0(n-1),  b_1(n) = b_0(n-1) + κ_1 e_0(n)    (4.99)
Fig. 4.14 condenses the relationships shown by Eqs. 4.98 and 4.99. This structure resembles the basic
pattern of a lattice.
[Figure 4.14: Lattice filter of first order]
To continue the lattice, Eqs. 4.95 and 4.96 can be generalised by substituting m for p:

e_m(n) = e_{m-1}(n) + κ_m b_{m-1}(n-1),  b_m(n) = b_{m-1}(n-1) + κ_m e_{m-1}(n)    (4.100)

and evaluating them for m = 2, 3, ..., p. The complete filter adopts the structure shown in Fig. 4.15.
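The stage-by-stage recursion of Eqs. 4.98-4.100 can be sketched as follows. This is an illustrative sketch (the name `lattice_errors` and the zero-initial-condition handling of the delayed path are my assumptions).

```python
import numpy as np

def lattice_errors(x, kappa):
    """Propagate forward/backward prediction errors through a lattice
    (Eqs. 4.98-4.100) with reflection coefficients kappa = [k_1..k_p].
    Returns the final-stage error sequences (e_p, b_p)."""
    e = np.array(x, dtype=float)                 # e_0(n) = x(n)
    b = e.copy()                                  # b_0(n) = x(n)
    for k in kappa:
        b_delayed = np.concatenate(([0.0], b[:-1]))   # b_{m-1}(n-1)
        e_new = e + k * b_delayed                 # forward update
        b = b_delayed + k * e                     # backward update
        e = e_new
    return e, b
```

As a consistency check, for κ = [κ_1, κ_2] the final forward error should equal the direct FIR prediction-error filter with feedback coefficients a_{2,1} = κ_1 + κ_1κ_2 and a_{2,2} = κ_2 (Table 4.2), which is what makes the lattice a valid alternative representation of the predictor.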
Note that the transfer function of the lattice linear predictor filter is:

H_LP(z) = 1 + Σ_{k=1}^{p} a_{p,k} z^{-k}    (4.101)
[Figure 4.15: Lattice filter of order p]
which is the inverse of the transfer function of the corresponding AR filter. By analogy with transmission line theory, the parameters κ_m of the lattice filter are called reflection coefficients⁷, while the parameters a_{p,k} may be referred to as feedback coefficients, a term that is self-explanatory from the AR filter structure in Fig. 4.2.
The lattice structure has several advantages over the transversal filter structure shown in Fig. 4.13, or the feedback form shown in Fig. 4.2. Not only does it generate both the forward and backward prediction sequences, it is also modular. The first “step”, with coefficient κ_1 (see Fig. 4.14), represents the first-order linear predictor. Adding a second step κ_2 increments the order by one, yielding a second-order linear predictor without modifying the first step, and so on up to the desired model order. Moreover, each step or module is “decoupled” from the others, since it can be shown that the forward and backward prediction errors are orthogonal, i.e. uncorrelated with each other for stationary input data.
Furthermore, the condition |κ_m| ≤ 1 for m = 1, 2, ..., p is necessary and sufficient to guarantee that all the poles of the AR filter lie within or on the unit circle, which is a condition for stability. If any of the reflection coefficients equals ±1, the Levinson-Durbin recursion terminates with σ_i^2 = 0, where κ_i is the first reflection coefficient with unit modulus. In this case the process is purely harmonic, consisting only of sinusoids.
It is important to note that the set of reflection coefficients κ_1, κ_2, ..., κ_p represents the p-order linear predictor just as the set of feedback coefficients a_{p,1}, a_{p,2}, ..., a_{p,p} does. Using the Levinson-Durbin recursion, it is possible to calculate the a_{p,i} from the κ_i, as shown in Table 4.2 for the first three sets of feedback coefficients.
⁷ The parameters κ_m are also known as PARCOR (partial correlation) coefficients in the statistics literature.
Model order m | a_{m,1}                  | a_{m,2}                      | a_{m,3}
1             | κ_1                      |                              |
2             | κ_1 + κ_1 κ_2            | κ_2                          |
3             | κ_1 + κ_1 κ_2 + κ_2 κ_3  | κ_2 + κ_1 κ_3 + κ_1 κ_2 κ_3  | κ_3

Table 4.2: Feedback coefficients in terms of the reflection coefficients
The inverse Levinson-Durbin recursion in Eq. 4.102 provides the means to calculate the κ_i's as a function
of the a_{p,i}'s. Again, the results for the first three model orders are shown in Table 4.3.

a_{m−1,i} = (a_{m,i} − a_{m,m} a_{m,m−i}) / (1 − |a_{m,m}|²)    (4.102)

Model order m    κ_1                                                                  κ_2                                         κ_3
1                a_{1,1}
2                a_{2,1}/(1 + a_{2,2})                                                a_{2,2}
3                (a_{3,1} − a_{3,2}a_{3,3})/(1 + a_{3,2} − a_{3,1}a_{3,3} − a²_{3,3})  (a_{3,2} − a_{3,1}a_{3,3})/(1 − a²_{3,3})   a_{3,3}

Table 4.3: Reflection coefficients in terms of the feedback coefficients
It is apparent from Tables 4.2 and 4.3 that the reflection coefficients are less correlated with each other
than the feedback coefficients. This is another reason to prefer the set of reflection coefficients over the
set of feedback coefficients as a representation of an AR process.
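The two conversions can be written down directly from the recursions above. The following is an illustrative sketch (not from the thesis; the function names are invented) of the Levinson-Durbin "step-up" recursion and its inverse "step-down" form (Eq. 4.102):

```python
def reflection_to_feedback(k):
    """Step-up recursion a_{m,i} = a_{m-1,i} + k_m a_{m-1,m-i}."""
    a = []  # holds a_{m,1}..a_{m,m} for the current order m
    for m, km in enumerate(k, start=1):
        prev = a[:]                                        # order m-1 coefficients
        a = [prev[i - 1] + km * prev[m - 1 - i] for i in range(1, m)]
        a.append(km)                                       # a_{m,m} = k_m
    return a

def feedback_to_reflection(a):
    """Step-down: the inverse Levinson-Durbin recursion of Eq. 4.102."""
    a = a[:]
    k = []
    for m in range(len(a), 0, -1):
        amm = a[m - 1]
        k.insert(0, amm)                                   # k_m = a_{m,m}
        denom = 1.0 - amm * amm
        a = [(a[i - 1] - amm * a[m - 1 - i]) / denom for i in range(1, m)]
    return k
```

For example, `reflection_to_feedback([0.5, 0.2])` reproduces the order-2 row of Table 4.2, i.e. [κ_1 + κ_1κ_2, κ_2] = [0.6, 0.2], and the two functions are inverses of each other.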
4.6.2 Other algorithms for AR parameter estimation
For short data sets, the estimation of the first p lags of the autocorrelation sequence r(m) limits the
accuracy of the AR parameter estimates obtained with the Levinson-Durbin recursion. Other approaches
use standard statistical estimation directly from the data. The commonly used method of Maximum
Likelihood Estimation (MLE) is too difficult to apply [21], as it leads to a set of nonlinear equations [108].
Approximations to the exact MLE have been sought [26][82][172]; McWhorter and Scharf summarise the
work done on approximate MLE. Unfortunately, hardly any improvement is achieved by using approximate
MLE methods despite their high computational cost. Returning to the least-squares (LS) approach used in
section 4.4.1, we will present three more methods for AR parameter estimation directly from the data.
LS of the forward prediction error
Given a data time series x(n) of length N , the forward prediction error ep(n) shown in Eq. 4.91 can be
written as:
e_p(n) = Σ_{k=0}^{p} a_{p,k} x(n − k),   where a_{p,0} = 1    (4.103)
Computing ep(n) for all the data available, i.e. for n = 0, 1, . . . , N + p − 1, and assuming that the values
of x(n) for n < 0 and for n ≥ N are zero, we get:
e_p(0) = x(0)
e_p(1) = x(1) + a_{p,1}x(0)
e_p(2) = x(2) + a_{p,1}x(1) + a_{p,2}x(0)
    ⋮
e_p(p) = x(p) + a_{p,1}x(p−1) + a_{p,2}x(p−2) + … + a_{p,p}x(0)
    ⋮
e_p(N−1) = x(N−1) + a_{p,1}x(N−2) + a_{p,2}x(N−3) + … + a_{p,p}x(N−p−1)
e_p(N) = a_{p,1}x(N−1) + a_{p,2}x(N−2) + … + a_{p,p}x(N−p)
    ⋮
e_p(N+p−1) = a_{p,p}x(N−1)
    (4.104)
In matrix form, this can be written as:

⎡ e_p(0)      ⎤     ⎡ x(0)                     ⎤ ⎡ 1       ⎤
⎢    ⋮        ⎥     ⎢   ⋮       ⋱              ⎥ ⎢ a_{p,1} ⎥
⎢ e_p(p)      ⎥  =  ⎢ x(p)    ⋯  x(0)          ⎥ ⎢ a_{p,2} ⎥
⎢    ⋮        ⎥     ⎢   ⋮            ⋮         ⎥ ⎢   ⋮     ⎥
⎢ e_p(N−1)    ⎥     ⎢ x(N−1)  ⋯  x(N−p−1)      ⎥ ⎣ a_{p,p} ⎦
⎢    ⋮        ⎥     ⎢           ⋱    ⋮         ⎥
⎣ e_p(N+p−1)  ⎦     ⎣              x(N−1)      ⎦
    (4.105)

e_p = Xa    (4.106)
The forward prediction error energy is the summation of |e_p(n)|² over the whole range of n:

E_p = Σ_n |e_p(n)|² = Σ_n ( Σ_{k=0}^{p} a_{p,k} x(n−k) )²    (4.107)
Minimising E_p with respect to the a_{p,k} results in a set of p equations:

∂E_p/∂a_{p,i} = 0   ⇒   Σ_{k=0}^{p} a_{p,k} ( Σ_n x(n−k) x(n−i) ) = 0,   for 1 ≤ i ≤ p    (4.108)
The minimum error energy, denoted by E_{p,min}, is obtained by expanding Eq. 4.107 and substituting
Eq. 4.108. The result can be shown to be [105]:

E_{p,min} = Σ_{k=0}^{p} a_{p,k} ( Σ_n x(n−k) x(n) )    (4.109)
Using matrix notation for Eqs. 4.108 and 4.109:

X^T X a = [E_{p,min}  0  0  …  0]^T    (4.110)
The matrix X^T X has a Toeplitz structure. In fact, multiplying both sides of Eq. 4.110 by 1/N makes
the equation equivalent to the Yule-Walker equations with the biased autocorrelation estimator given in
Eq. 4.14; hence the name Yule-Walker estimator. The Levinson-Durbin recursion can be used to solve
Eq. 4.110.
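As a concrete illustration (a sketch, not the thesis software; real-valued data and the biased autocorrelation estimator are assumed, and the function names are invented), the Yule-Walker estimator can be implemented as the Levinson-Durbin recursion applied to the estimated autocorrelation sequence:

```python
import random

def autocorr(x, p):
    """Biased autocorrelation estimate r(0)..r(p)."""
    N = len(x)
    return [sum(x[n] * x[n + m] for n in range(N - m)) / N for m in range(p + 1)]

def levinson_durbin(r, p):
    """Solve the Yule-Walker equations; return a_{p,1..p} and the
    prediction error power."""
    a, err = [], r[0]
    for m in range(1, p + 1):
        acc = r[m] + sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = -acc / err                                   # reflection coefficient
        a = [a[i] + k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= (1.0 - k * k)
    return a, err

# sanity check on a simulated AR(2) process
# x(n) = 1.0 x(n-1) - 0.5 x(n-2) + v(n), i.e. a_1 = -1.0, a_2 = 0.5
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(5000):
    x.append(1.0 * x[-1] - 0.5 * x[-2] + rng.gauss(0.0, 1.0))
a_hat, sigma2 = levinson_durbin(autocorr(x[2:], 2), 2)
```

On 5000 simulated samples the estimated feedback coefficients come out close to the true values [−1.0, 0.5].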
However, if we avoid the assumption of zeros for the unknown values of x(n) and restrict the calculation
of e_p(n) to n = p, …, N−1, the matrix X in Eq. 4.110 is replaced by X_cov, defined as one of the
partitions of X:
        ⎡ x(0)                     ⎤       ⎡ X_pre  ⎤
        ⎢   ⋮       ⋱              ⎥       ⎢ ······ ⎥
        ⎢ ························ ⎥       ⎢        ⎥
        ⎢ x(p)    ⋯  x(0)          ⎥       ⎢ X_cov  ⎥
  X  =  ⎢   ⋮            ⋮         ⎥   =   ⎢        ⎥
        ⎢ x(N−1)  ⋯  x(N−p−1)      ⎥       ⎢ ······ ⎥
        ⎢ ························ ⎥       ⎢ X_post ⎥
        ⎢           ⋱    ⋮         ⎥       ⎣        ⎦
        ⎣              x(N−1)      ⎦
    (4.111)
Using the matrix X_cov is essentially equivalent to using the same number of products, N − p, for each
lag in the estimation of the autocorrelation function. Unless the signal is periodic with a period which
is a multiple of N − p, the matrix S = X_{cov}^T X_{cov} will not be Toeplitz. Instead, it is the so-called
sample covariance matrix; hence, the resultant set of equations obtained using Eq. 4.111 is called the
covariance equations. The sample covariance matrix is not always positive definite, and instabilities of
the AR filter may occur. In the case of a periodic signal whose period is a multiple of N − p, the sample
covariance matrix and the true autocorrelation matrix are identical. The covariance estimator is
statistically closer to the MLE estimator than the Yule-Walker estimator, although the latter has a lower
variance, especially if data windowing is applied. It is interesting to note that X_cov can be linked to
non-linear dynamical systems theory if it is seen as a sequence of N − p delay-reconstructed vectors in a
(p + 1)-dimensional embedding space with a one-sample delay [1].
The use of either X_pre or X_post with X_cov leads to the pre-windowed or post-windowed equations
respectively. Algorithms have been developed to solve the covariance equations and the pre-windowed
and post-windowed equations, but none of these perform significantly better than the Yule-Walker
estimator. All the LS methods based on the forward prediction error exhibit line spectral splitting in the
PSD estimator. This phenomenon consists of two peaks appearing in the PSD, very close to each other,
where the real PSD has only one peak. The method that uses the full matrix X of Eq. 4.111 gives the
poorest spectral resolution. Displacement of the peaks and great sensitivity to noise are other common
problems of the LS forward prediction approach. The next sections describe methods based on the
backward as well as the forward prediction error, in an attempt to improve on the results described up to
this point.
Constrained LS of the forward and backward prediction errors
Provided that the process x(n) is stationary the forward and the backward prediction errors are equal in
a statistical sense. Based on this property, Burg [25] proposed the use of both errors in the cost function
for the parameters ap,k, without making any other assumption about the data:
E_p = Σ_{n=p}^{N−1} ( |e_p(n)|² + |b_p(n)|² )

    = Σ_{n=p}^{N−1} ( | Σ_{k=0}^{p} a_{p,k} x(n−k) |²  +  | Σ_{k=0}^{p} a_{p,k} x(n−p+k) |² )    (4.112)
Burg further proposed to minimise the error E_p subject to the constraint that the a_{p,k} parameters
satisfy the Levinson-Durbin recursion:

a_{m,k} = a_{m−1,k} + a_{m,m} a_{m−1,m−k}    (4.113)

for all orders m from 1 to p.
This constraint ensures a stable AR filter. Substituting the recursive expressions for ep(n) and bp(n) shown
in Eq. 4.100, the cost function Ep becomes a function of the reflection coefficient ap,p, defined in §4.6.1,
and the forward and backward prediction errors for the order immediately below, p− 1. Minimising with
respect to ap,p yields:
a_{p,p} = κ_p = −2 Σ_{k=p}^{N−1} b_{p−1}(k−1) e_{p−1}(k)  /  Σ_{k=p}^{N−1} ( |b_{p−1}(k−1)|² + |e_{p−1}(k)|² )    (4.114)
The denominator on the right-hand side of Eq. 4.114 can be found recursively using the relations given
in Eq. 4.100 as:
DEN_p = Σ_{k=p}^{N−1} ( |b_{p−1}(k−1)|² + |e_{p−1}(k)|² )

      = DEN_{p−1}(1 − |κ_{p−1}|²) − |b_{p−1}(N−p)|² − |e_{p−1}(p)|²    (4.115)
The Burg algorithm is then implemented as follows:
1. Initialisation:

   e_0(n) = b_0(n) = x(n)
   DEN_0 = Σ_{k=0}^{N−1} ( |x(k−1)|² + |x(k)|² )

2. Recursion, for m = 1, 2, 3, …, p:

   DEN_m = DEN_{m−1}(1 − |κ_{m−1}|²) − |b_{m−1}(N−m)|² − |e_{m−1}(m)|²
   a_{m,m} = κ_m = ( −2 Σ_{k=m}^{N−1} b_{m−1}(k−1) e_{m−1}(k) ) / DEN_m
   a_{m,k} = a_{m−1,k} + a_{m,m} a_{m−1,m−k},   for k = 1, 2, …, m−1
   e_m(n) = e_{m−1}(n) + κ_m b_{m−1}(n−1)
   b_m(n) = b_{m−1}(n−1) + κ_m e_{m−1}(n)
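A direct transcription of these steps can be sketched as follows (illustrative only, not the thesis software; real-valued data is assumed, and the denominator is computed directly rather than through the DEN recursion):

```python
import random

def burg(x, p):
    """Burg estimate of the feedback coefficients a_{p,1..p}."""
    N = len(x)
    e, b = x[:], x[:]                 # forward/backward errors, order 0
    a = []
    for m in range(1, p + 1):
        num = -2.0 * sum(b[k - 1] * e[k] for k in range(m, N))
        den = sum(b[k - 1] ** 2 + e[k] ** 2 for k in range(m, N))
        k_m = num / den               # reflection coefficient, |k_m| <= 1
        a = [a[i] + k_m * a[m - 2 - i] for i in range(m - 1)] + [k_m]
        e_new, b_new = e[:], b[:]
        for n in range(N - 1, 0, -1): # update error sequences (valid for n >= m)
            e_new[n] = e[n] + k_m * b[n - 1]
            b_new[n] = b[n - 1] + k_m * e[n]
        e, b = e_new, b_new
    return a

# sanity check on a simulated AR(2) process with a_1 = -1.0, a_2 = 0.5
rng = random.Random(1)
x = [0.0, 0.0]
for _ in range(4000):
    x.append(1.0 * x[-1] - 0.5 * x[-2] + rng.gauss(0.0, 1.0))
a_hat = burg(x[2:], 2)
```

Because each κ_m satisfies |κ_m| ≤ 1, the resulting AR filter is guaranteed to be stable.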
As may be noted, the Burg method does not minimise the cost function with respect to all the reflection
coefficients at the same time, but rather minimises the error with respect to the last reflection coefficient
for each model order. This has been pointed out as the cause of the line spectral splitting also observed in
PSD estimators obtained from Burg AR estimates [81]. Several researchers have tried to overcome this
effect by minimising the error function with respect to either all the reflection coefficients or all the
feedback coefficients of the p-order AR filter at the same time (see [81] for a review). The latter approach
removes the constraint (Eq. 4.113) imposed by Burg, and is usually referred to as forward-backward LS.
It requires about 20% more computations than the Burg algorithm, and although it appears to remove the
spectral line splitting [81], the stability of the AR filter is not guaranteed.
Modified Burg algorithm
Narayan and Burg [117] also proposed a solution to the spectrum line splitting problem for quasi-periodic
time series, in a method called covariance Burg. The method requires a priori knowledge of the period of
the signal to estimate the autocorrelation matrix from the sample covariance matrix.
The algorithm is very similar to the Burg method except for a trapezoidal weighting function applied in
the computation of the reflection coefficient:
κ_m = −2 Σ_{k=m}^{N−1} w_m(k) b_{m−1}(k−1) e_{m−1}(k)  /  Σ_{k=m}^{N−1} w_m(k) ( |b_{m−1}(k−1)|² + |e_{m−1}(k)|² )    (4.116)
where w_m(n) is the minimum of n − m + 1, p − m + 1 and N − n:

w_m(n) = min(n − m + 1, p − m + 1, N − n)    (4.117)
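The weighting rises linearly, plateaus at p − m + 1, and falls off again near the end of the record, as this small sketch (illustrative only; the helper is not from the thesis) shows:

```python
# Trapezoidal weighting of Eq. 4.117
def w(n, m, p, N):
    return min(n - m + 1, p - m + 1, N - n)

# for m = 1, p = 3, N = 10 the weights over n = 1..9 form a trapezium
weights = [w(n, 1, 3, 10) for n in range(1, 10)]
# weights == [1, 2, 3, 3, 3, 3, 3, 2, 1]
```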
For high signal-to-noise ratio and periodic data, this estimator will be much closer to the optimum in the
maximum likelihood sense⁸ than the one obtained by the covariance method. However, the algorithm also
works quite well with non-periodic data, although the results are not as close to the MLE optimum as
with periodic data. This surprising feature may be explained by the fact that the method performs a kind
of average over the sample covariance matrices (of dimension (p + 1) × (p + 1)) along the N-point data
segment.
⁸Burg et al. [26] developed an algorithm to find the closest Toeplitz matrix to the sample covariance matrix in the maximum likelihood sense, using the normalised distance measure D_n(S, R) = Trace(R⁻¹S) − ln|R⁻¹S| − (p + 1). The algorithm, called "structured covariance matrices", is an approximation to the MLE and is computationally very expensive.
4.6.3 Sensitivity to additive noise of the AR model PSD estimator
One of the major problems in the use of AR parametric modelling for estimating PSD is the presence of
additive noise in the signal. Although noise has been included in the model to represent the unpredictable
nature of a stochastic signal, the addition of white or coloured noise to the signal will obscure its spectrum
in the PSD estimate of signal plus noise, as shown in Eqs. 4.118 and 4.119 below. Let us
suppose that x(n) is a p-order AR stochastic process contaminated with noise:

x_n(n) = x(n) + ν(n)    (4.118)

Assuming that ν(n) is white uncorrelated noise with variance σ²_ν, the PSD of x_n(n) is:
S_n(ω) = S_x(ω) + σ²_ν

       = σ²_{v,p} / | 1 + Σ_{k=1}^{p} a_{p,k} e^{−jωk} |²  +  σ²_ν

       = ( σ²_{v,p} + σ²_ν | 1 + Σ_{k=1}^{p} a_{p,k} e^{−jωk} |² ) / | 1 + Σ_{k=1}^{p} a_{p,k} e^{−jωk} |²    (4.119)
As can be seen in Eq. 4.119, the process which includes the additive noise is an ARMA process instead of
the original AR process. This distorts the spectrum, with a loss of resolution in the detection of the peaks
in the original process. Additive noise is very common in signal processing as it is the most common
model for sensor noise.
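The loss of contrast can be illustrated numerically (a sketch, not from the thesis; the AR(2) coefficients and noise power are invented). The noise adds a constant σ²_ν to the spectrum, so the ratio between the spectral peak and the valley shrinks:

```python
import cmath

def ar_psd(a, sigma2, omega):
    """PSD of an AR model: sigma2 / |1 + sum_k a_k e^{-j omega k}|^2."""
    h = 1.0 + sum(ak * cmath.exp(-1j * omega * (k + 1)) for k, ak in enumerate(a))
    return sigma2 / abs(h) ** 2

a, sigma2, sigma2_nu = [-1.0, 0.5], 1.0, 2.0   # invented example values
grid = [w / 100.0 for w in range(314)]         # omega in [0, pi)
peak = max(ar_psd(a, sigma2, w) for w in grid)
floor = min(ar_psd(a, sigma2, w) for w in grid)
contrast_clean = peak / floor
contrast_noisy = (peak + sigma2_nu) / (floor + sigma2_nu)
# the additive noise reduces the peak-to-valley contrast of the spectrum
```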
4.7 Modelling the EEG
AR modelling, just like FFT methods, assumes that the signal under analysis is stationary. The EEG can
be considered to be a stochastic process resulting from the summation of numerous sources of electrical
activity. These activities depend on external and internal variables, and on large numbers of inhibitory
and excitatory interactions that can be considered to be random. The central limit theorem states that
the sum of many independent random variables tends to a random variable with a Gaussian distribution.
The EEG sources are not independent, but we may assume them to be statistically independent in origin,
modelling the numerous interactions between neurons as a shaping filter which transforms these sources
into a correlated Gaussian process. Under stable external and internal conditions, we may consider the
EEG process to be stationary and, further, take it to be ergodic.
The question now is: for how long, during continuous recordings, can we assume that the external/internal
conditions remain stable? The answer depends on many factors, but for the kind of data analysed in
this thesis the main factors are the subject's activity and the presence or absence of a pathological
condition. If the subject is healthy and asleep, and if the external conditions are favourable to sleep, then
we can assume that the changes in the statistical properties of the EEG will occur slowly. However, we
know from section 3.1.2 that the sleep EEG may spontaneously present transient waves such as spindles
and vertex waves, which affect the stationarity of the signal even if the brain stays in the same sleep
stage. These events last for about a second. In fact, several authors [14][12] have recommended that the
EEG should be analysed in segments no longer than 1 second to ensure stationarity. However, segments
of 1-s duration may be too short to obtain an accurate autocorrelation estimate. If the subject is awake,
many more factors can influence the patterns in the EEG, some of which are difficult to control under
experimental conditions. For instance, eye closing/opening affects the alpha rhythm in the EEG.
Transitions from alertness to drowsiness bring an important series of changes in the EEG, some of which
may happen very quickly. Varri et al. [170] use EEG segments of variable length, from 0.5 s to 2 s, in
vigilance studies.
Barlow [14] gives a review of methods suitable for detecting EEG non-stationarities. The methods can be
based on fixed or on adaptive intervals. EEG features from a fixed reference window are compared with
the features from the EEG in a moving "test" window, looking for significant changes in the features.
Once a change is detected, a boundary is set at the point where the change occurs, in order to segment
the signal into stationary segments. A new reference window is then placed at the start of the new
segment. The features are usually based on time-domain descriptors, FFT coefficients or AR modelling.
Another method for the analysis of non-stationary EEG is time-varying AR modelling, better known as
Kalman filtering [148]. With this method, the AR parameters are estimated for a short segment at the
beginning of the signal by any conventional algorithm (Yule-Walker, Burg, etc.) and then updated every
sample. The non-stationarities can be tracked from the rate of change of the AR parameters. This method
has been reported to yield better results than FFT or conventional AR modelling in recognising rapid
changes in the frequency of oscillations [148], but the computational cost, and the expansion of the data
instead of its compression, make the method unattractive [14]. Some researchers average the Kalman AR
coefficients over segments of 1 s or more, for smoothing and data compression, but this has the same
effect as using a conventional AR method over the segment. Also, brief disturbances such as artefacts and
transient waves have a lasting effect with Kalman filtering (depending on the gain of the filter), given that
the technique employs information from the recent past to update the current estimate of the model
coefficients. In contrast, in conventional AR modelling, a brief disturbance will only affect the segment
during which it occurs [14].
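The fixed-interval approach eventually adopted in this work (a 3-second analysis window with a 2-second overlap, i.e. a 1-s advance) can be sketched as follows; the sampling rate below is an assumed example value, not one taken from the thesis:

```python
def window_starts(n_samples, fs, win_s=3.0, overlap_s=2.0):
    """Start indices of overlapping fixed-length analysis windows."""
    win = int(win_s * fs)
    step = int((win_s - overlap_s) * fs)   # 1-s advance for 3 s / 2 s overlap
    return list(range(0, n_samples - win + 1, step))

fs = 128                             # assumed sampling rate (Hz)
starts = window_starts(10 * fs, fs)  # 10 s of signal -> 8 full 3-s windows
```

Each window would then be passed to an AR estimator (e.g. Burg) to produce one feature vector per second of EEG.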
Chapter 5
Neural network methods
We have already seen that the EEG signal can be described in terms of its power spectrum or as a set
of filter parameters. These values extracted from the PSD or the AR model are generally called features.
The next step is to find the relationship between the EEG features and the mental state, i.e. deep sleep,
wakefulness, drowsiness, etc. This constitutes a classification task which seeks to partition the input
space into 1-of-K classes. The input space is a mathematical abstraction such that a given set of features
x^o_1, x^o_2, …, x^o_d is assigned to a point x^o in a d-dimensional space with coordinates x_1, x_2, …, x_d.
Classification involves dividing the input space into regions such that points taken from the same region
belong to the same class. The dividing line between two regions is known as a decision boundary.
Classifiers can be divided according to the type of mapping generated. In linear classifiers, the decision
boundaries are hyper-planes (for dimensions greater than three). However, it is well known that real-
world problem datasets show considerable overlap between classes and hence require non-linear decision
boundaries. For the same reason, it is helpful to adopt a probabilistic framework, placing the decision
boundary at the loci for which the probabilities of belonging to either class are equal. There are several
methods for evaluating the posterior probability of belonging to a given class, using either parametric
or non-parametric techniques. With the non-parametric methods, no assumption is made regarding the
probability distribution of the data belonging to each class.
Neural networks are non-linear, non-parametric function approximators which can be used in regression
problems as well as in classification problems. Compared with other types of classifiers, neural networks
offer advantages in problems for which the classification rules are complex and difficult to specify.
Provided that a sufficiently large set of input data is labelled by human experts (called the training set),
a neural network can "learn" the underlying generator of the data and, for a given input, produce an
output in terms of the posterior probabilities of the classes. With new data (or test data), drawn from
the same distribution as the training set, the trained neural network should produce accurate posterior
probabilities of the data belonging to the classes, a property known as generalisation. The set of posterior
probabilities can be fed to a decision-making stage to assign the input to one of the classes. Training a
neural network is a time-consuming task, but once it has been performed, classification is fast and
requires very little computational resource.
[Figure: EEG_n → Feature Extractor → features x_0, …, x_{d−1} → MLP → posterior probabilities P(C_1 | x_n), …, P(C_K | x_n) → Decision Making → classification C_k]

Figure 5.1: The classification process
5.1 Neural Networks
A neural network consists of arrays of interconnected "artificial neurons". The structure of an artificial
neuron is shown in Fig. 5.2. The artificial neuron adds the weighted values of its d inputs x_i, and applies
a non-linear function to this summation in order to produce an output y, whose value is in the range from
0 to 1:
y = g( Σ_{i=0}^{d} w_i x_i )    (5.1)
[Figure: an artificial neuron with inputs x_1, x_2, …, x_d, bias input x_0 = 1, weights w_0, w_1, …, w_d, summation a = Σ w_i x_i and activation function g producing the output y]

Figure 5.2: An artificial neuron
The nonlinear function g(a) is known as the activation function. Several types of activation function can
be used. One such function is the so-called hard-limiter g_h(a), which produces a binary output, 1 or 0,
depending on whether or not the summation exceeds a given threshold, and is defined as:

g_h(a) = { 0  for a < 0,   1  for a ≥ 0 }    (5.2)

where a = Σ_{i=0}^{d} w_i x_i.
The use of the hard-limiter has a physiological background, as it simulates the "all-or-nothing" rule of
real neurons. The weight w_0, associated with the bias input x_0, represents the threshold when g_h(a)
is used, because it sets the minimum value of a for the neuron "to fire".
Other activation functions have continuous outputs between 0 and 1, allowing the output to be
interpreted as a probability. Examples of such functions are the sigmoid function g_σ and the softmax
function g_softmax. The sigmoid function is the hyperbolic tangent function tanh, rescaled to lie
between saturation levels of 0 and 1:
g_σ(a) = 1 / (1 + e^{−a})    (5.3)
This mathematically simple function is widely used in two-class problems [19, pp.231]. The softmax
function, a more generalised form of the sigmoid function also known as the normalised exponential, is
better suited to multiple class problems and will be explained later.
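The relation between the two functions stated above can be checked directly: g_σ(a) = (tanh(a/2) + 1)/2. A small sketch (illustrative only; the function names are invented):

```python
import math

def g_sigma(a):
    """Sigmoid of Eq. 5.3."""
    return 1.0 / (1.0 + math.exp(-a))

def scaled_tanh(a):
    """tanh rescaled from (-1, 1) to saturation levels (0, 1)."""
    return (math.tanh(a / 2.0) + 1.0) / 2.0
```

The two expressions agree to machine precision for any argument a.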
[Figure: the hyperbolic tangent tanh(a), saturating at −1 and 1, and the sigmoid g_σ(a), saturating at 0 and 1 with g_σ(0) = 0.5]

Figure 5.3: Hyperbolic tangent and Sigmoid functions.
The non-linear mapping performed by a neural network can be written as:
y = y(x;w) = G(x;w) (5.4)
where y represents the vector of outputs [y1, . . . , yK ]T , generally representing the probabilities of belong-
ing to class Ck in a classification problem; and w represents the vector of connection weights between the
input nodes and the neurons, between neurons, and between the neurons and the outputs. The process
of finding the weights to perform the mapping correctly is called learning or training the network. During
supervised learning the input patterns or vectors are presented repeatedly to the network, along with
the desired value for the outputs (the target value tk for the kth output). The weights are successively
adjusted in order to minimise a cost function, generally associated with the mean squared error. The
performance of the trained network is then tested on the test set, i.e. a set of patterns not included in
the training set. An over-trained network will fit the noise rather than the data and hence will generalise
poorly.
5.1.1 The error function
The general goal of the neural network is to make the best possible prediction of the target vector t =
[t1, . . . , tK ]T when a new input vector value x is presented. The most general and complete description
of the data is in terms of the joint probability density p(x, t) given by:
p(x, t) = p(t |x)p(x) (5.5)
where p(t | x) represents the probability of t given a particular value of x, and p(x) is the unconditional
probability of x, given by:

p(x) = ∫ p(x, t) dt    (5.6)
The cost function to minimise during training can be arbitrarily defined. A good cost function can be
derived from the likelihood of the training set {x^n, t^n}, which can be written as:

L = Π_n p(x^n, t^n) = Π_n p(t^n | x^n) p(x^n)    (5.7)
For optimisation purposes, it is simpler to take the negative logarithm of the likelihood:

E = −ln L = −Σ_n ln p(t^n | x^n) − Σ_n ln p(x^n)    (5.8)

where E defines a cost function, usually called the error function. The second term on the right-hand side
of Eq. 5.8 can be omitted as it does not depend on the parameters of the neural network, so:

E = −Σ_n ln p(t^n | x^n)    (5.9)
Sum of squares function
If we assume the target variables t_k to be continuous, with independent zero-mean Gaussian
distributions, we can write:

p(t | x) = Π_{k=1}^{K} p(t_k | x)    (5.10)

Furthermore, let us assume that the t_k's are given by some deterministic function of x with added
Gaussian noise ε:

t_k = h_k(x) + ε_k    (5.11)

where the noise distribution is given by:

p(ε_k) = (1/√(2πσ²)) exp( −ε²_k / 2σ² )    (5.12)
As the training process seeks to model the functions h_k(x) with y_k(x; w), we can use the latter in
Eq. 5.11 and substitute ε_k in Eq. 5.12 to give:

p(t_k | x) = (1/√(2πσ²)) exp( −{y_k(x; w) − t_k}² / 2σ² )    (5.13)
Combining Eq. 5.13 and Eq. 5.10 in the expression for the error in Eq. 5.9:

E = −Σ_n ln Π_{k=1}^{K} p(t_k | x)

  = −Σ_{n=1}^{N} Σ_{k=1}^{K} ln [ (1/√(2πσ²)) exp( −{y_k(x; w) − t_k}² / 2σ² ) ]

  = (1/2σ²) Σ_{n=1}^{N} Σ_{k=1}^{K} {y_k(x; w) − t_k}²  +  NK ln σ  +  (NK/2) ln(2π)    (5.14)
where N is the number of input patterns used during the training process. Note that the last two terms
of Eq. 5.14 do not depend on the weights w, so they can be omitted, as can the factor 1/σ² in the first
term. Thus the error function E ends up as:

E = (1/2) Σ_n Σ_k {y_k(x^n; w) − t^n_k}²  =  (1/2) Σ_n ‖ y(x^n; w) − t^n ‖²    (5.15)
The error function in Eq. 5.15 is called the sum-of-squares function. It reduces the optimisation process
to a least-squares procedure. Its use is not restricted to Gaussian-distributed target data, and the sum of
the outputs equals unity (very convenient if we want to interpret the outputs as probabilities); however,
the results cannot distinguish between the true distribution and any other distribution having the same
mean and variance.
Cross-Entropy function
In a classification problem, the target data represent discrete class labels; therefore a more convenient
code for the target data is the "1-of-K" scheme:

t^n_k = δ_{kℓ},   for x^n ∈ C_ℓ    (5.16)

where δ_{kℓ} is the Kronecker delta, which is 1 for k = ℓ and 0 otherwise.
The output is meant to represent the posterior probability of class membership:

y_ℓ = P(C_ℓ | x)    (5.17)

therefore we can write p(t_ℓ | x) = (y_ℓ)^{t_ℓ} and, more generally, assuming that the distributions
p(t^n | x^n) are statistically independent:

p(t^n | x^n) = Π_{k=1}^{K} (y^n_k)^{t^n_k}    (5.18)
Substituting Eq. 5.18 into Eq. 5.9 for the log-likelihood error function:

E = −Σ_n Σ_{k=1}^{K} t^n_k ln y^n_k    (5.19)
This error function has an absolute minimum with respect to the y_k's when y^n_k = t^n_k for all k and n.
At the minimum, E is:

E_min = −Σ_n Σ_{k=1}^{K} t^n_k ln t^n_k    (5.20)
If t_k takes only the values 0 or 1, this minimum is equal to zero, but if t_k is a continuous variable in the
range (0, 1) this minimum does not necessarily reach zero. In fact, it represents the cross entropy [19,
pp.244] between the distributions of the target and the output. Hence, this error function, derived from
the maximum likelihood criterion for a 1-of-K target coding, is called the cross-entropy error function.
To ensure a zero value at the minimum, the value E_min is subtracted from the error function in Eq. 5.19,
giving this modified error:

E = −Σ_n Σ_{k=1}^{K} t^n_k ln( y^n_k / t^n_k )    (5.21)

which is non-negative and equals zero when y^n_k = t^n_k for all k and n.
The cross-entropy error function has some advantages over the sum-of-squares error function. Firstly, it
can be proved [19, pp.235-6] that for an infinitely large data set the outputs y_k are exactly the posterior
probabilities P(C_k | x), and are therefore limited to the (0, 1) range. Secondly, it performs better at
estimating small probabilities. Indeed, if we denote the error at the output y^n_k as ε^n_k, then the
cross-entropy error is:
E = −Σ_n Σ_{k=1}^{K} t^n_k ln( (t^n_k + ε^n_k) / t^n_k )  =  −Σ_n Σ_{k=1}^{K} t^n_k ln( 1 + ε^n_k / t^n_k )    (5.22)
It is clear from Eq. 5.22 that the cross-entropy error function depends on the relative errors of the neural
network outputs, in contrast with the sum-of-squares function, which depends on the squares of the
absolute errors. Therefore, minimisation of the cross-entropy error will tend to give similar relative
errors on both small and large probabilities, while the sum-of-squares error tends to give similar absolute
errors for each pattern, resulting in large relative errors for small output values.
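This difference can be made concrete with a small numerical sketch (the probability values are invented): the same absolute error of 0.01, placed on a small target probability or on a large one, yields identical sum-of-squares errors but a larger cross-entropy error in the first case:

```python
import math

def cross_entropy(y, t):
    """Modified cross-entropy of Eq. 5.21, for a single pattern."""
    return -sum(tk * math.log(yk / tk) for yk, tk in zip(y, t))

def sum_of_squares(y, t):
    """Sum-of-squares error of Eq. 5.15, for a single pattern."""
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))

t = [0.05, 0.95]
y_small = [0.04, 0.95]   # absolute error 0.01 on the small probability
y_large = [0.05, 0.94]   # the same absolute error on the large probability
```

The cross-entropy of a perfect output is exactly zero, and the error placed on the small probability (a 20% relative error) is penalised more than the same absolute error on the large probability (about a 1% relative error).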
Cross-entropy error for a two-class problem For K > 2 it is desirable to have one output per class, so
that each output represents the posterior probability of belonging to one of the classes, but for a 2-class
problem, only one output representing one class is necessary as the probability for the other class can
be found by subtracting the output value from 1. This causes a few changes in the cross-entropy error
function, which will be reviewed here briefly.
Assigning y = P(C_1 | x), the conditional probability p(t | x) can be written as:

p(t | x) = y^t (1 − y)^{1−t}    (5.23)
where t takes the value 1 if x ∈ C_1 and 0 if x ∈ C_2. The cross-entropy error takes the form:

E = −Σ_n { t^n ln y^n + (1 − t^n) ln(1 − y^n) }    (5.24)

Differentiating with respect to y^n:

∂E/∂y^n = (y^n − t^n) / ( y^n (1 − y^n) )    (5.25)
It is easy to see from Eq. 5.25 that the cross-entropy function for a 2-class problem has an absolute
minimum when y^n = t^n for all n. Again, it is convenient to subtract the minimum from the expression
in Eq. 5.24:

E = −Σ_n { t^n ln( y^n / t^n ) + (1 − t^n) ln( (1 − y^n) / (1 − t^n) ) }    (5.26)
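The derivative in Eq. 5.25 is easy to verify numerically for a single pattern; the following sketch (the values are invented for illustration) compares a central finite difference of Eq. 5.24 with the closed form:

```python
import math

def e_binary(y, t):
    """Two-class cross-entropy of Eq. 5.24, for a single pattern."""
    return -(t * math.log(y) + (1.0 - t) * math.log(1.0 - y))

def grad_binary(y, t):
    """Closed-form derivative of Eq. 5.25."""
    return (y - t) / (y * (1.0 - y))

y, t, h = 0.3, 1.0, 1e-6
numeric = (e_binary(y + h, t) - e_binary(y - h, t)) / (2.0 * h)
# numeric agrees with grad_binary(y, t) to high precision
```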
5.1.2 The decision-making stage
To arrive at a classification from the posterior probabilities evaluated at the outputs of the neural network,
the minimum error-rate criterion is usually adopted. To minimise the probability of misclassification, a
new input should be assigned to the class having the largest posterior probability. Several aspects should
be taken into account when training in real-world problems. Firstly, the neural network is trained to
estimate the posterior probabilities of class membership based on the assumption that the input data
has been drawn from the same data distribution as the training set. Hence the output k represents
P(C_k | x, D), where D = {x^n, t^n} is the training set data. Secondly, the proportion of data from each
class in the training set reflects the prior probabilities of the classes. From Bayes' theorem:

P(C_k | x) = p(x | C_k) P(C_k) / p(x)    (5.27)
⇒ p(x | C_k) P(C_k) = P(C_k | x) p(x)    (5.28)

Integrating both sides of Eq. 5.28 yields:

∫ p(x | C_k) P(C_k) dx = ∫ P(C_k | x) p(x) dx    (5.29)

⇒ P(C_k) = ∫ P(C_k | x) p(x) dx    (5.30)
Approximating the integral on the right-hand side of Eq. 5.30 by an average over the training patterns,
which are drawn from p(x):

P(C_k) = ∫ P(C_k | x) p(x) dx ≈ (1/N) Σ_{n=1}^{N} P(C_k | x^n)    (5.31)
Thus, the prior probabilities are approximated as the average of each neural network output over all the
patterns in the training set. Hence, the prior probability P(C_k) should determine the proportion of the
patterns belonging to class C_k in the training set. In some cases this can be problematic, for instance if
the class with the maximum risk of misclassification is very scarce (e.g. when diagnosing a fault or a
disease). In such cases, it is desirable to include as many patterns from the high-risk class as from the
other classes in the training set. Compensation for the different prior probabilities can then easily be
performed by multiplying each output by the ratio of the "true" prior probability to the prior in the
training set, and normalising the corrected outputs so that they sum to unity.
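The correction can be sketched in a few lines (the priors and output values below are invented for illustration): each output is multiplied by the ratio of the "true" prior to the training-set prior, and the results are renormalised to sum to unity:

```python
def compensate_priors(outputs, train_priors, true_priors):
    """Rescale network outputs for a change of prior probabilities."""
    scaled = [y * pt / ptr
              for y, ptr, pt in zip(outputs, train_priors, true_priors)]
    total = sum(scaled)
    return [s / total for s in scaled]

# trained on balanced classes, but the "disease" class is rare in reality
posterior = compensate_priors([0.6, 0.4], [0.5, 0.5], [0.05, 0.95])
```

Even though the raw network output favours the first class, the corrected posteriors reflect its low prior probability.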
5.1.3 Multi-layer perceptrons
A neural network with its neurons arranged in layers is called a perceptron. A single layer of neurons is
therefore called a single-layer perceptron. It has inputs whose values x_1, x_2, …, x_d are written as a
feature vector x. Given that there are no connections from the outputs back to the inputs, a single-layer
perceptron is a feed-forward neural network. It is also a linear classifier, as it partitions the input space
with hyper-planes. The perceptron learning rule [159, pp.11] is guaranteed to find a solution with a
single layer if the input feature vectors are linearly separable.
If more complex decision boundaries are required, two or more layers of neurons should be used. Such
a network is known as a multi-layer perceptron (MLP). Fig. 5.4 shows an I−J−K (2-layer) MLP, with
I-dimensional input patterns, J hidden units z_j, and K outputs y_k. The neurons in the output layer are
simply called "outputs", while the neurons in the intermediate layer are called "hidden units". When
necessary, the superscripts z or y will be used to distinguish the weights w, or the activation-function
inputs a, of the hidden layer from those of the output layer.
It can be shown that a 2-layer perceptron with smooth nonlinearities is able to approximate any arbitrary
function [159, pp.16]. However, the decision boundaries will not be abrupt, as with the hard-limiter
perceptron, but smooth and continuous instead. The approximation accuracy will then depend on the
number of units in the hidden layer. A low number of hidden units will give an insufficiently complex
model for the given problem, while a large number of hidden units will result in an over-fitted model. An
over-trained or over-fitted network is a disadvantage in real-world problems, since most real-world data
is very noisy.

[Figure: a 2-layer feed-forward network with inputs x_1, …, x_I (bias x_0 = 1), hidden units z_1, …, z_J (bias z_0 = 1), weights w_{ij} between inputs and hidden units and w_{jk} between hidden units and the outputs y_1, …, y_K]

Figure 5.4: An I−J−K neural network.
Training an MLP: the error backpropagation algorithm
The training algorithm that underpins the use of multi-layer perceptrons is the so-called error backpropagation algorithm. It uses error gradient information to seek a minimum of the error function. In order to apply this algorithm to an MLP it is necessary to use continuous and differentiable activation functions.
The activation function for the hidden units does not necessarily have to be the same as for the outputs.
Hyperbolic tangent or sigmoid functions are usually chosen as the non-linearity for the hidden units.
If probabilities are to be represented at the outputs, then these units have to be restricted to the [0,1]
range. Hence, the sigmoid function for a 2-class problem, or its generalisation, the softmax function for
a K-class problem (K > 2) are recommended for the outputs. The softmax function is defined as:
\[ g_{\mathrm{softmax}}(a_k) = \frac{e^{a_k}}{\sum_{k'} e^{a_{k'}}} \qquad (5.32) \]
where a_k = \sum_{j=1}^{J} w_{jk} z_j is the weighted sum of the inputs to the kth output neuron and k' is the index of the summation over all the K neurons in the output layer.
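A minimal sketch of Eq. 5.32; subtracting the maximum activation before exponentiating is a standard numerical safeguard (not discussed in the text) that leaves the result unchanged:

```python
import numpy as np

def softmax(a):
    """Softmax activation (Eq. 5.32), computed in a numerically stable way."""
    e = np.exp(a - np.max(a))   # shifting by max(a) avoids overflow
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])
y = softmax(a)
```

The outputs are positive, sum to unity, and are invariant to adding a constant to all activations.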
Let us assume that a neural network is to be trained to solve a classification problem for K > 2 mutually exclusive classes, with a training set of input patterns x^n, n = 1, . . . , N, each represented by I feature values, x^n = [x^n_1, . . . , x^n_I]^T, and a class membership target vector t^n = [t^n_1, . . . , t^n_K]^T coded with a 1-of-K scheme. Assume that the network is a 2-layer MLP with J hidden units z_j with a sigmoidal activation function, and one output per class y_k with a softmax activation function. We would like to find the values of all the weights in the neural network, i.e. the vector w that minimises the error function E(x^n; t^n; w).
The gradient of the error, given by the vector ∇_w E, points in the opposite direction to that of the steepest descent of the error function in weight space. It can, therefore, be used in the search for the minimum of the error function in weight space, by recursive updating of the weights:
\[ w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)} = w^{(\tau)} - \eta \nabla_w E^{(\tau)} \qquad (5.33) \]
where η is called the learning rate and τ denotes the iteration number. Expressing Eq. 5.33 for each weight gives:
\[ w_i^{(\tau+1)} = w_i^{(\tau)} + \Delta w_i^{(\tau)} = w_i^{(\tau)} - \eta \frac{\partial E^{(\tau)}}{\partial w_i} \qquad (5.34) \]
For the reasons given in section 5.1.1 the cross-entropy error function is chosen to optimise the network
parameters. This error function is now written as a function of the weight vector:
\[ E(w) = \sum_n E^n(w) = -\sum_n \sum_{k=1}^{K} t_k^n \ln \frac{y_k^n(w)}{t_k^n} \]
The derivatives of the cross-entropy error function with respect to the weights of the neural network can
easily be found by propagating back the error at the outputs towards the hidden and input layers as will
be shown below.
Derivatives of E with respect to the hidden-to-output weights
The output units' activation function (the softmax g(·) of Eq. 5.32) includes in its denominator the inputs a^y_k for all the outputs y_k, so the weight w^y_{jk} affects all the outputs. Therefore, all of the outputs should be considered when differentiating the error for pattern n with respect to the output weights w^y_{jk}:
\[ \frac{\partial E^n}{\partial w^y_{jk}} = \sum_{k'=1}^{K} \frac{\partial E^n}{\partial y^n_{k'}} \frac{\partial y^n_{k'}}{\partial a^{y,n}_k} \frac{\partial a^{y,n}_k}{\partial w^y_{jk}} \qquad (5.35) \]
The first partial derivative on the right-hand side of Eq. 5.35 is:
\[ \frac{\partial E^n}{\partial y^n_{k'}} = -\frac{t^n_{k'}}{y^n_{k'}} \qquad (5.36) \]
The second partial derivative can be found from Eq. 5.32:
\[ \frac{\partial y^n_{k'}}{\partial a^{y,n}_k} = y^n_{k'}\delta_{kk'} - y^n_{k'} y^n_k \qquad (5.37) \]
and the last derivative in the chain:
\[ \frac{\partial a^{y,n}_k}{\partial w^y_{jk}} = z^n_j \qquad (5.38) \]
which does not depend on k'. Combining the first two partial derivatives and summing over k':
\[ \frac{\partial E^n}{\partial a^{y,n}_k} = \sum_{k'=1}^{K} \frac{\partial E^n}{\partial y^n_{k'}} \frac{\partial y^n_{k'}}{\partial a^{y,n}_k} = y^n_k - t^n_k \qquad (5.39) \]
Then, substituting Eqs. 5.39 and 5.38 into Eq. 5.35 gives:
\[ \frac{\partial E^n}{\partial w^y_{jk}} = \delta^{y,n}_k z^n_j \qquad (5.40) \]
where δ^y_k = y_k − t_k.
Derivatives of E with respect to the input-to-hidden weights
We follow a similar procedure to find the derivatives of the error function with respect to the hidden-layer weights w^z_{ij}, this time noting that the sigmoid activation function of the unit z_j only depends on the inputs to this unit:
\[ \frac{\partial E^n}{\partial w^z_{ij}} = \sum_{k'=1}^{K} \frac{\partial E^n}{\partial y^n_{k'}} \sum_{k=1}^{K} \frac{\partial y^n_{k'}}{\partial a^{y,n}_k} \frac{\partial a^{y,n}_k}{\partial z^n_j} \frac{\partial z^n_j}{\partial a^{z,n}_j} \frac{\partial a^{z,n}_j}{\partial w^z_{ij}} \qquad (5.41) \]
The first two derivatives in Eq. 5.41 have already been found above, and their combination is denoted by δ^{y,n}_k. The third derivative is:
\[ \frac{\partial a^{y,n}_k}{\partial z^n_j} = w^y_{jk} \qquad (5.42) \]
Using Eq. 5.3:
\[ \frac{\partial z^n_j}{\partial a^{z,n}_j} = z^n_j (1 - z^n_j) \qquad (5.43) \]
The last derivative of the chain is:
\[ \frac{\partial a^{z,n}_j}{\partial w^z_{ij}} = x^n_i \qquad (5.44) \]
Combining all of these derivatives according to Eq. 5.41 gives:
\[ \frac{\partial E^n}{\partial w^z_{ij}} = z^n_j (1 - z^n_j)\, x^n_i \sum_{k=1}^{K} \delta^{y,n}_k w^y_{jk} \qquad (5.45) \]
\[ \frac{\partial E^n}{\partial w^z_{ij}} = \delta^{z,n}_j x^n_i \qquad (5.46) \]
where δ^{z,n}_j = z^n_j (1 − z^n_j) Σ_k δ^{y,n}_k w^y_{jk}. As can be seen in the above equation, the weight update for the input-to-hidden weights depends on the weight update for the hidden-to-output weights. The weight errors (the δ's) are propagated backwards to the preceding layer, hence the name given to the algorithm.
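The derivation above can be sketched numerically for one pattern. This is an illustrative implementation, not the thesis software; the network sizes and random weights are arbitrary, and the backpropagated derivative is checked against a central-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 4, 3, 2                                   # inputs, hidden units, outputs

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, Wz, Wy):
    """Wz is J x (I+1), Wy is K x (J+1); the last column of each holds the bias weights."""
    z = sigmoid(Wz @ np.append(x, 1.0))
    y = softmax(Wy @ np.append(z, 1.0))
    return z, y

def gradients(x, t, Wz, Wy):
    z, y = forward(x, Wz, Wy)
    delta_y = y - t                                  # output "errors" (Eq. 5.40)
    dWy = np.outer(delta_y, np.append(z, 1.0))
    delta_z = z * (1 - z) * (Wy[:, :J].T @ delta_y)  # hidden "errors" (Eqs. 5.45-5.46)
    dWz = np.outer(delta_z, np.append(x, 1.0))
    return dWz, dWy

def cross_entropy(x, t, Wz, Wy):
    _, y = forward(x, Wz, Wy)
    return -np.sum(t * np.log(y))

Wz = rng.normal(scale=0.5, size=(J, I + 1))
Wy = rng.normal(scale=0.5, size=(K, J + 1))
x = rng.normal(size=I)
t = np.array([1.0, 0.0])                             # 1-of-K target

dWz, dWy = gradients(x, t, Wz, Wy)

# Sanity check: compare one backpropagated derivative with a central difference.
eps = 1e-6
Wy_p, Wy_m = Wy.copy(), Wy.copy()
Wy_p[0, 0] += eps
Wy_m[0, 0] -= eps
numeric = (cross_entropy(x, t, Wz, Wy_p) - cross_entropy(x, t, Wz, Wy_m)) / (2 * eps)
```

The backpropagated derivative and the finite-difference estimate agree to numerical precision, which is a standard way of verifying a backpropagation implementation.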
Backpropagation for the two-class problem
As we have seen in section 5.1.1, for a 2-class problem only one output is needed, in which case the cross-
entropy error function is slightly different and gives rise to a modified expression for the backpropagation
“errors”. Also, as there is only one output, the sigmoid activation function is used for all units in the
network.
The part of the error function that depends on the weights is:
\[ E(w) = -\sum_n \left\{ t^n \ln y^n(w) + (1 - t^n) \ln(1 - y^n(w)) \right\} \]
Differentiating with respect to the hidden-to-output weights w_j:
\[ \frac{\partial E^n}{\partial w_j} = \frac{\partial E^n}{\partial y^n} \frac{\partial y^n}{\partial a^{y,n}} \frac{\partial a^{y,n}}{\partial w_j} \qquad (5.47) \]
we find that:
\[ \frac{\partial E^n}{\partial y^n} = \frac{y^n - t^n}{y^n (1 - y^n)} \qquad (5.48) \]
\[ \frac{\partial y^n}{\partial a^{y,n}} = y^n (1 - y^n) \qquad (5.49) \]
\[ \frac{\partial a^{y,n}}{\partial w_j} = z^n_j \qquad (5.50) \]
Then we get:
\[ \frac{\partial E^n}{\partial w_j} = \delta^{y,n} z^n_j \qquad (5.51) \]
where δ^y = y − t. This result is exactly the same as for the multiple-class problem.
To find the derivatives of E with respect to the input-to-hidden weights:
\[ \frac{\partial E^n}{\partial w_{ij}} = \frac{\partial E^n}{\partial y^n} \frac{\partial y^n}{\partial a^{y,n}} \frac{\partial a^{y,n}}{\partial z^n_j} \frac{\partial z^n_j}{\partial a^{z,n}_j} \frac{\partial a^{z,n}_j}{\partial w_{ij}} \qquad (5.52) \]
we find that the partial derivatives in the chain are:
\[ \frac{\partial E^n}{\partial y^n} \frac{\partial y^n}{\partial a^{y,n}} = y^n - t^n \qquad (5.53) \]
\[ \frac{\partial a^{y,n}}{\partial z^n_j} = w_j \qquad (5.54) \]
\[ \frac{\partial z^n_j}{\partial a^{z,n}_j} = z^n_j (1 - z^n_j) \qquad (5.55) \]
\[ \frac{\partial a^{z,n}_j}{\partial w_{ij}} = x^n_i \qquad (5.56) \]
Substituting all of them into Eq. 5.52 gives:
\[ \frac{\partial E^n}{\partial w_{ij}} = \delta^{y,n} w_j z^n_j (1 - z^n_j)\, x^n_i \qquad (5.57) \]
\[ \frac{\partial E^n}{\partial w_{ij}} = \delta^{z,n}_j x^n_i \qquad (5.58) \]
where δ^{z,n}_j = δ^{y,n} w_j z^n_j (1 − z^n_j). This result is also exactly the same as for the multiple-class problem with K = 1. This is very convenient because it makes the backpropagation of errors independent of the number of classes in the problem.
5.2 Optimisation algorithms
5.2.1 Gradient descent
As mentioned above, the training of an MLP is performed by minimising the error function E(w) in the weight space W, formed by all the weights in the network, using the steepest descent method, also called gradient descent. This algorithm can be applied in batch fashion or sequentially. The batch version averages the Δw^n over all the patterns and then updates the weights. The sequential version updates the weights after each pattern is presented to the network. In either version a suitable value for the learning rate η needs to be selected. A range for η can be found by using a quadratic approximation of the error function around the minimum at w*:
\[ E(w) \approx E(w^*) + \tfrac{1}{2}(w - w^*)^T H (w - w^*) \qquad (5.59) \]
where H is the Hessian matrix of the error function, with elements H_{ij} = ∂²E/∂w_i∂w_j evaluated at w*, for i, j = 1, 2, . . . , W, and W is the total number of weights.
The gradient of this approximation is:
\[ \nabla_w E = H(w - w^*) \qquad (5.60) \]
The eigenvalue equation for the Hessian matrix is:
\[ H u_i = \lambda_i u_i \qquad (5.61) \]
where the eigenvectors u_i can be used as a basis in W, so we can write:
\[ w - w^* = \sum_i \alpha_i u_i \qquad (5.62) \]
where α_i can be interpreted as the distance from the minimum in the u_i direction. Then, the gradient approximation can be written in terms of the eigenvectors of H:
\[ \nabla_w E = \sum_i \alpha_i \lambda_i u_i \qquad (5.63) \]
and so can the difference between the weights at two consecutive iterations of the algorithm:
\[ w^{(\tau+1)} - w^{(\tau)} = \sum_i \left( \alpha_i^{(\tau+1)} - \alpha_i^{(\tau)} \right) u_i = \sum_i \Delta\alpha_i u_i \qquad (5.64) \]
But since Δw = −η∇_w E^{(τ)}, then:
\[ \sum_i \Delta\alpha_i u_i = -\eta \sum_i \alpha_i^{(\tau)} \lambda_i u_i \;\Rightarrow\; \alpha_i^{(\tau+1)} = (1 - \eta\lambda_i)\, \alpha_i^{(\tau)} \qquad (5.65) \]
After τ_f steps from a starting point w_0, with coefficients α_i^{(0)}:
\[ \alpha_i^{(\tau_f)} = (1 - \eta\lambda_i)^{\tau_f} \alpha_i^{(0)} \qquad (5.66) \]
To reach the minimum, α_i should tend to zero as τ_f increases. The condition on η and the λ_i's is therefore:
\[ |1 - \eta\lambda_i| < 1 \;\Rightarrow\; 0 < \eta\lambda_i < 2 \qquad (5.67) \]
for i = 1, 2, . . . , W.
It can be proved that if λ_i > 0 for all i, the minimum at w* is a global minimum. This is true for a positive definite Hessian matrix. In this case, the condition in Eq. 5.67 gives the following range for η:
\[ 0 < \eta < \frac{2}{\lambda_{\max}} \qquad (5.68) \]
Note that in Eq. 5.66 the contraction factor is constant around the minimum, imposing linear convergence towards the minimum. The convergence speed will be dominated by the minimum eigenvalue, so taking the maximum value allowed for η, we find that the slowest contraction factor is:
\[ 1 - \frac{2\lambda_{\min}}{\lambda_{\max}} \qquad (5.69) \]
The ratio λ_min/λ_max is the reciprocal of the condition number of the Hessian matrix. If this ratio is very small (i.e. the curvature of the error function around the minimum differs greatly between directions), convergence will be extremely slow. This problem can be alleviated by increasing the effective step size, which is achieved by adding an extra term to the weight-update equation.
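The stability condition of Eq. 5.67 can be illustrated on a toy quadratic error with a diagonal Hessian (the matrix and learning rates below are made up for illustration):

```python
import numpy as np

H = np.array([[3.0, 0.0],
              [0.0, 0.5]])              # diagonal Hessian: eigenvalues 3.0 and 0.5
lam_max = 3.0

def gradient_descent(eta, steps=200):
    w = np.array([1.0, 1.0])            # start away from the minimum at w* = 0
    for _ in range(steps):
        w = w - eta * (H @ w)           # gradient of E(w) = 0.5 w^T H w is H w
    return w

w_stable = gradient_descent(eta=0.9 * 2 / lam_max)    # inside 0 < eta < 2/lam_max
w_unstable = gradient_descent(eta=1.1 * 2 / lam_max)  # violates Eq. 5.67
```

With η inside the allowed range every coefficient contracts by |1 − ηλ_i| < 1 per step; just outside it, the component along the largest eigenvalue grows without bound.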
Momentum
Adding a term proportional to the previous change in the weight vector to the weight-update equation may speed up the convergence of w and smooth the oscillations:
\[ \Delta w^{(\tau)} = -\eta \nabla_w E\big|_{w^{(\tau)}} + \mu \Delta w^{(\tau-1)} \qquad (5.70) \]
where μ is called the momentum parameter. If the momentum rate is in the open interval (0, 1), the effect of adding momentum to the weight update in low-curvature error surfaces is an increase in the effective learning rate by the factor:
\[ \frac{1}{1 - \mu} \qquad (5.71) \]
However, in regions of large curvature the momentum term loses its effectiveness, and oscillations around the minimum generally occur. In fact, the gradient descent rule used in the backpropagation algorithm searches for the minimum very inefficiently because, in practice, the error gradient does not point towards the minimum most of the time, causing oscillations. Another disadvantage of this method is the introduction of two parameters, η and μ, for which there are no formal selection criteria.
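A small sketch of Eq. 5.70 on an ill-conditioned quadratic (all values chosen for illustration) shows the speed-up that momentum gives along low-curvature directions:

```python
import numpy as np

H = np.array([[3.0, 0.0],
              [0.0, 0.1]])              # ill-conditioned quadratic E(w) = 0.5 w^T H w

def descend(eta, mu, steps=300):
    w = np.array([1.0, 1.0])
    dw = np.zeros(2)
    for _ in range(steps):
        dw = -eta * (H @ w) + mu * dw   # Eq. 5.70
        w = w + dw
    return w

w_plain = descend(eta=0.3, mu=0.0)
w_momentum = descend(eta=0.3, mu=0.9)
```

After the same number of steps, the run with momentum is much closer to the minimum at the origin, consistent with the effective learning-rate factor 1/(1 − μ) in the slowly converging direction.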
5.2.2 Conjugate gradient
If, instead of moving w a fixed distance along the negative gradient, we search in this direction until we find the minimum of E(w) and take that point as the new weight vector, the step size becomes optimal in the search direction. At the new point, the component of the error gradient along the search direction vanishes. If, additionally, we choose each new search direction so that it does not "spoil" the minimisation achieved in the previous direction, i.e. so that the projection of ∇_w E onto the previous direction remains null, minimise again in this new direction, and repeat the procedure successively, we will, after W steps, reach the minimum w*. The set of non-interfering (or conjugate) directions can be found without any need for extra parameters, as is shown in Appendix B, where this method is described in detail. This represents a definite improvement over the gradient descent method, even if in practice convergence is achieved in more than W steps for general non-linear error functions.
As already stated above, this algorithm has no unspecified parameters and in general converges much faster than gradient descent, but it requires the computation of the first- and second-order partial derivatives of the error function with respect to the weights. The backpropagation formulae for the first-order derivatives still apply. To avoid the use of the Hessian in the computation of α_j, a numerical line-search procedure, or central differences, can be used as an approximation. The latter is used in a modification of this algorithm called the scaled conjugate gradient algorithm, which is described next.
Scaled conjugate gradient
Apart from using an approximation to avoid the calculation of the Hessian, this algorithm overcomes the other two major drawbacks of the conjugate gradient algorithm that arise when the error is far from being quadratic. A technique called the model trust region, based on a quadratic approximation of E, can be applied to make sure that every step leads to a lower error even when the Hessian matrix is not positive definite (otherwise the error may increase with the step). Also, the quality of the quadratic approximation is tested at every step to adjust the parameter of the model trust method. The scaled conjugate gradient method is described in Appendix B.
5.3 Model order selection and generalisation
As stated in section 5.1.3, the number of hidden units J in an MLP determines the accuracy and degree of generalisation that the neural network can achieve. As with any regression problem, too few free parameters will fail to fit the function properly, while too many parameters will over-fit the noisy data. A compromise between accuracy and generalisation has to be found. This can be compared to the trade-off between the bias and the variance of the network. A neural network with zero bias produces zero error on average over all the possible sets of patterns drawn from the same distribution as the training data. Even in this case, if the neural network has a marked sensitivity to a particular set of patterns, then the variance of the network is high. Unfortunately there is no formal means of relating the number of hidden units to the bias and variance of the network. Roughly, the number of hidden units should be close to the geometric mean of the number of inputs I and the number of outputs K:
\[ J = \sqrt{IK} \qquad (5.72) \]
One of the simplest ways to find the optimum J, although very demanding in computational resources, is to train a set of neural networks with a range of numbers of hidden units around the estimate given in Eq. 5.72. Because the error-optimisation algorithm can get stuck in a local minimum, several random weight initialisations should be tried in order to increase the probability of finding a good minimum. The optimum is then selected based on the performance on an independent set of labelled input patterns, known as the validation set.
Another way to “optimise” the number of free parameters in the network is to set a sensibly high value of
J and then penalise the least relevant weights in the network with an appropriate cost function during
training. This method is called regularisation and it will be explained next.
5.3.1 Regularisation
Regularisation is a common technique in regression theory which aims to encourage a smoother fit through the inclusion of a penalty term Ω in the error function:
\[ \tilde{E} = E + \nu \Omega \qquad (5.73) \]
where ν is a control parameter for the penalty term Ω. The penalty function should be such that a good fit produces a small error E, while a smooth fit produces a small value of Ω.
It is well known heuristically that an over-fitted mapping produces large weight values, whereas small weight values drive the activation units mostly into the linear region of the activation function, producing an approximately linear mapping, which is the smoothest possible. Therefore, a good function for Ω is one that increases as the magnitude of the weights increases. The simplest of these is the sum-of-squares, commonly called weight decay:
\[ \Omega = \tfrac{1}{2} \sum_i w_i^2 \qquad (5.74) \]
If a gradient descent procedure is applied to optimise the modified error function Ẽ, whose gradient acquires a term proportional to the weights:
\[ \nabla_w \tilde{E} = \nabla_w E + \nu w \qquad (5.75) \]
then the variation of the weights in "time" due to the penalty term alone can be seen as:
\[ \frac{dw}{d\tau} = -\eta\nu w \;\Rightarrow\; w(\tau) = w(0)\, e^{-\eta\nu\tau} \qquad (5.76) \]
which shows how, as a result solely of the influence of the penalty term, all the weights "decay" exponentially towards zero during training. It can easily be shown, using a second-order approximation of the error function, that the components of the weight vector along the directions of lowest curvature of E in weight space are the most penalised by the regularisation term. This can be expressed as:
\[ \tilde{w}_j = \frac{\lambda_j}{\lambda_j + \nu}\, w^*_j \qquad (5.77) \]
where w̃ is the minimum of the error function with weight decay Ẽ, w* is the minimum of the original error function E, and λ_j is an eigenvalue of the Hessian matrix H evaluated at w*, in a weight space aligned with the eigenvectors of H. Therefore, weight decay will tend to reduce the value of the weights with less influence on the error function. The final result will be a smoother fit than the one achieved with w*.
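Eq. 5.76 can be checked numerically; for small η the discrete gradient steps on the penalty term alone closely track the continuous exponential decay (the parameter values below are illustrative only):

```python
import numpy as np

eta, nu, tau = 0.01, 0.5, 100
w0 = np.array([2.0, -1.0, 0.5])

w = w0.copy()
for _ in range(tau):
    w = w - eta * nu * w                    # gradient step on the penalty term alone

predicted = w0 * np.exp(-eta * nu * tau)    # continuous-time solution, Eq. 5.76
```

After 100 steps the iterated weights agree with the exponential prediction to a fraction of a percent, and every weight has shrunk towards zero.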
The weight decay function in Eq. 5.74 is not consistent with a linear transformation performed on the input data, as it treats weights and biases on an equal footing. A bias unit is an additional unit in the input and hidden layers of an MLP, with a permanent input value of 1, included to compensate for the difference between the mean of the y_k's and the mean of the t_k's. Considering weights from different layers separately and excluding the bias-unit weights from the regularising term solves this consistency problem, as in:
\[ \Omega = \frac{\nu_z}{2} \sum_{w \in W_z} w^2 + \frac{\nu_y}{2} \sum_{w \in W_y} w^2 \qquad (5.78) \]
where W_z are the input-to-hidden weights except for the bias weights w_{0j}, and W_y are the hidden-to-output weights except for the bias weights w_{0k}.
5.3.2 Early stopping
Another way to prevent an MLP with a relatively high number of hidden units from over-fitting the training data is to stop the training process at an early stage. This method, called early stopping, makes use of a validation set: training is stopped when the error on the validation set reaches a minimum, as shown in Fig. 5.5.

Figure 5.5: Early stopping. The training error decreases monotonically with the iteration number τ, while the validation error reaches a minimum at τ_v, where training is halted.
5.3.3 Performance of the network
To evaluate the performance of the network, the error function, or its gradient, can be used. However, given that the goal of training is to learn to discriminate between the classes of the training set in order to interpolate the class membership probabilities on test data, a more suitable measure of performance is the percentage of correctly classified patterns (accuracy) in a given dataset. For a K-class problem and a dataset with N_k patterns from each class, the accuracy of a classifier is defined as:
\[ A = 100 \times \frac{\sum_{k=1}^{K} N^c_k}{\sum_{k=1}^{K} N_k} \qquad (5.79) \]
where N^c_k is the number of correctly classified patterns from class k. A more common measure in pattern recognition is the classification error rate, defined as the percentage of misclassified patterns:
\[ E_{\mathrm{rate}} = 100 - A \qquad (5.80) \]
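Eqs. 5.79 and 5.80 amount to a few lines of code (the labels below are invented for illustration):

```python
import numpy as np

def accuracy(targets, predictions):
    """Percentage of correctly classified patterns (Eq. 5.79)."""
    targets = np.asarray(targets)
    predictions = np.asarray(predictions)
    return 100.0 * np.mean(targets == predictions)

t = [0, 0, 1, 1, 2, 2, 2, 2]     # true class labels
p = [0, 1, 1, 1, 2, 2, 0, 2]     # classifier decisions (e.g. argmax over outputs)
A = accuracy(t, p)
Erate = 100.0 - A                 # Eq. 5.80
```

Here 6 of the 8 patterns are correct, so the accuracy is 75% and the error rate 25%.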
If several networks are being evaluated, the optimal network should be chosen according to the final
validation error, while the performance of the selected network should be measured on data never seen
before, that being the purpose of the test set.
If several networks trained on the same data are very close to each other in terms of validation error, a
committee of networks can be formed, the output of this association being the average of the individual
outputs. It can be proved that a committee of networks statistically performs better, or at least not worse,
than the individual networks [19, pp. 366].
The training, validation and test sets should ideally be of equal size. The number of input patterns in the training set should be at least 10 times greater than the number of free parameters in the network [159, pp.70], a requirement sometimes difficult to meet, all the more so if an equal number of patterns has to be reserved for the validation and test sets. In this case, cross-validation, a technique usually applied in statistics as part of "jack-knife" estimation [109], can be used: the data is split into S subsets, or partitions, of equal size. Each of these subsets is in turn the test set for the neural network, while the remaining S − 1 subsets are used to form the training and validation sets. The S neural networks obtained in this way can then be combined in a committee of networks. If the data is not plentiful enough for division into S subsets, then the leave-one-out method, a variant of the jack-knife method whereby every subset consists of only one sample, can be used.
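The S-fold splitting scheme described above can be sketched as follows (the helper name and sizes are illustrative, not from the thesis software):

```python
import numpy as np

def cross_validation_folds(n_patterns, S, seed=0):
    """Split pattern indices into S partitions; each partition serves once as
    the test set while the remaining S-1 partitions form the training and
    validation data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_patterns)
    folds = np.array_split(idx, S)
    for s in range(S):
        rest = np.concatenate([folds[r] for r in range(S) if r != s])
        yield rest, folds[s]

splits = list(cross_validation_folds(10, S=5))
```

Every pattern appears in exactly one test partition, so the S networks between them are tested on the whole dataset.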
5.4 Radial basis function neural networks
Another kind of neural network which can be used to estimate posterior probabilities is the so-called Radial Basis Function (RBF) network. Its architecture, shown in Fig. 5.6, is very similar in appearance to the MLP, but its operation is very different. Firstly, the activation of the hidden units is not a weighted summation followed by a non-linearity; instead, it is a radially symmetric function φ_j (usually a Gaussian) with a different mean vector μ_j for each unit. In addition, the output units only perform a linear combination of the hidden-unit outputs, without applying any non-linear function to the result. Also, RBF training differs from that of an MLP, since it is performed in two phases instead of one, and a non-linear optimisation process is not required, the equations for minimising the quadratic output error over the second-layer weights being linear.
Figure 5.6: A radial basis function network, with inputs x_i, hidden units Φ_j, second-layer weights w_{jk} and outputs y_k.
Originally developed to perform exact function interpolation, early RBF networks mapped N input vectors in an I-dimensional space onto N target points in a one-dimensional space, through N radial basis functions φ_j(·). In order to obtain better generalisation when fitting noisy data, the number of basis functions was reduced to a number significantly lower than the number of input vectors. The resulting RBF has been widely used not only in noisy interpolation, but also in optimal classification theory.
The mapped points, or RBF outputs, for a K-class problem are given by:
\[ y_k = \sum_{j=1}^{J} w_{jk} \phi_j + w_{0k} \qquad \text{for } k = 1, 2, \ldots, K \qquad (5.81) \]
For the case of a Gaussian basis function, φ_j is defined as:
\[ \phi_j(x) = \exp\left\{ -\tfrac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \right\} \qquad (5.82) \]
where μ_j and Σ_j represent the mean and covariance matrix respectively. In an RBF network, the covariance matrix of the Gaussian basis functions can be considered to be of the form σ²I (hyper-spherical Gaussians) without significant loss of generality. In this case, Eq. 5.82 takes the form:
\[ \phi_j(x) = \exp\left( -\frac{\| x - \mu_j \|^2}{2\sigma_j^2} \right) \qquad (5.83) \]
The Gaussian functions in an RBF are un-normalised, since any multiplicative factors are absorbed in the weights w_{jk} of Eq. 5.81.
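Eqs. 5.81 and 5.83 together give a very short forward pass; the centres, widths and weights below are arbitrary illustrative values (in a real network they would come from the two-phase training described next):

```python
import numpy as np

def rbf_forward(x, mu, sigma, W, w0):
    """Hidden units evaluate Gaussian basis functions (Eq. 5.83); the output
    layer is a purely linear combination (Eq. 5.81)."""
    phi = np.exp(-np.sum((x - mu) ** 2, axis=1) / (2.0 * sigma ** 2))  # (J,)
    return W @ phi + w0                                                # (K,)

mu = np.array([[0.0, 0.0],
               [1.0, 1.0],
               [-1.0, 1.0]])          # centres mu_j (J = 3, I = 2)
sigma = np.array([0.5, 0.5, 0.5])     # widths sigma_j
W = 0.5 * np.ones((2, 3))             # second-layer weights w_jk (K = 2)
w0 = np.zeros(2)                      # bias weights w_0k

y = rbf_forward(np.array([0.0, 0.0]), mu, sigma, W, w0)
```

At the first centre the corresponding basis function is fully activated (φ_1 = 1) while the other two are nearly zero, so both outputs are slightly above 0.5.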
5.4.1 Training an RBF network
The first training phase is used to estimate the parameters of the basis functions φj(·) and no class
information is required for it, hence this phase is unsupervised. Once this phase is completed, the kernel
function activation is determined only by the distance between the input vector x and the mean vector
μj , and the kernel width σj . It can be shown [19] that, after this training phase, the summation of all the
radial basis function outputs is an estimate of the unconditional probability of the data p(x). Posterior
probabilities for each class P (Ck) are estimated at the outputs of the RBF network after the second phase
of training, which adjusts the second-layer weights wjk, this time using the target values tk of each input
vector x in the training set. For this reason, the second phase is called supervised.
Unsupervised phase: cluster analysis
Unsupervised training can be viewed as a clustering problem. Each Gaussian kernel represents a group
of similar vectors in the I-dimensional input space. Since the objective of the initial phase of learning
is to model the unconditional probability density function, the clusters do not necessarily separate data
from different classes. To summarise, the aim of this phase is to find the location of the cluster centres
and the distribution of data within them, in order to determine the mean and variance of each Gaussian (hidden unit of the RBF network). One of the most common clustering algorithms, the so-called K-means algorithm [159], can be used for this purpose. The number of means K is chosen to be equal to the number of hidden units J.
The K-means Algorithm This algorithm seeks to find a partition of the input data set into K regions or
clusters. Usually, the similarity criterion that defines a cluster is the distance between data (Euclidean in
most cases). The algorithm determines, for each of the K clusters, Ck, the location of its centre mk, and
identifies the patterns xi that belong to this cluster. In an iterative optimisation process, the partition is
modified so that the distances between the patterns belonging to a cluster and its centre are minimised.
This can be expressed as the optimisation of a quadratic error function defined by:
\[ E_K^2 = \sum_{k=1}^{K} \sum_{x \in C_k} \| x - m_k \|^2 \qquad (5.84) \]
Random values are initially assigned to the centres mk, and each data point is assigned to the cluster
with the centre nearest to it. Then, each centre mk is changed to be the mean of the data belonging to
the cluster Ck, reducing in this way the value of the error function defined in Eq. 5.84. These two last
steps are repeated until no significant change in centre positions is detected.
The procedure described above is known as the “batch” version of the K-means algorithm, since, at
every step, the centres are modified once all the patterns have been assigned to the clusters. There is an
“adaptive” version whereby the nearest centre is modified each time a pattern is considered, so that the
distance between them is reduced:
\[ m_k^{(\tau+1)} = m_k^{(\tau)} + \eta \left( x - m_k^{(\tau)} \right) \qquad (5.85) \]
where η is a learning-rate parameter. The adaptive version is a stochastic procedure because
the patterns are chosen from the data set randomly, and the algorithm is more prone to becoming trapped
in a local minimum.
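The batch version described above can be sketched directly (a generic illustration with synthetic two-cluster data, not the thesis implementation):

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    """Batch K-means: alternately assign each pattern to its nearest centre,
    then move each centre to the mean of the patterns assigned to it."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=K, replace=False)]   # initialise from data
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)                            # nearest-centre assignment
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centres[k] for k in range(K)])
        if np.allclose(new, centres):                        # no significant change
            break
        centres = new
    return centres, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, (30, 2)),     # cluster around (0, 0)
               rng.normal(3.0, 0.2, (30, 2))])    # cluster around (3, 3)
centres, labels = kmeans(X, K=2)
```

With two well-separated synthetic clusters, the centres converge to the two cluster means.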
The value for K can be found by running the algorithm for k = 1, 2, 3, . . . until a knee in the curve of E_k² versus k is obtained. Typically this curve decreases monotonically (reaching zero for k = N), but the "knee" indicates a substantial change in the rate of decrease of the error function, which is large for small values of k and much smaller for k above the "knee" value. The value of k at the knee can be taken as the optimum value [159, pp.23].
Normalisation Since Euclidean distance is used to set the location of the Gaussian kernels, differences
in dynamic range between features will cause the smallest ones to be ignored by the clustering algorithm.
To avoid this, zero-mean, unit-variance normalisation is applied to the entire data set of N patterns before
the unsupervised phase:
\[ x^n_i \leftarrow \frac{x^n_i - \mu_i}{\sigma_i} \qquad (5.86) \]
where:
\[ \mu_i = \frac{1}{N} \sum_{n=1}^{N} x^n_i \qquad (5.87) \]
\[ \sigma_i^2 = \frac{1}{N-1} \sum_{n=1}^{N} \left( x^n_i - \mu_i \right)^2 \qquad (5.88) \]
Then, a clustering procedure like the one described in the previous section is performed to find a set of J centres. Once the unsupervised phase of the RBF training is complete, the cluster variance σ_j² is found for each cluster C_j:
\[ \sigma_j^2 = \frac{1}{N_j} \sum_{x \in C_j} (x - \mu_j)^T (x - \mu_j) \qquad (5.89) \]
where N_j is the number of patterns that belong to cluster C_j.
Second phase: linear optimisation of weights
The second training phase uses the labelled patterns in supervised learning mode. The output layer receives the information from the hidden-unit outputs, and it can be trained with a data set smaller than the one used for the first training stage. The LMS algorithm is used to minimise the error function:
\[ E(w) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{j=0}^{J} w_{jk} \phi^n_j - t^n_k \right)^2 \qquad (5.90) \]
Differentiating the error with respect to the weights w_{jk} and setting the result to zero to find the minimum gives:
\[ \sum_{n=1}^{N} \left( \sum_{j'=0}^{J} w_{j'k} \phi^n_{j'} - t^n_k \right) \phi^n_j = 0, \qquad \text{for } j = 0, 1, \ldots, J \text{ and } k = 1, \ldots, K \qquad (5.91) \]
These equations are known as the normal equations, and have an explicit solution. Using matrix notation:
\[ \Phi^T \Phi W^T = \Phi^T T \qquad (5.92) \]
with the elements of the matrices defined as (T)_{nk} = t^n_k, (W)_{kj} = w_{jk} and (Φ)_{nj} = φ_j(x^n). The solution is:
\[ W^T = \Phi^{\dagger} T \qquad (5.93) \]
where Φ† represents the pseudo-inverse of Φ, given by:
\[ \Phi^{\dagger} \equiv (\Phi^T \Phi)^{-1} \Phi^T \qquad (5.94) \]
for which Φ†Φ = I always, although ΦΦ† ≠ I in general. When the data is noisy, it is very common to find that the matrix Φ^TΦ is nearly singular. In this case, the singular value decomposition (SVD) algorithm can be used to avoid large values for the weights w_{jk}, since it avoids the accumulation of round-off errors and chooses, from the set of possible solutions, the one with the smallest norm [131].
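The linear phase can be sketched with synthetic, noise-free data (all sizes and weights below are invented for illustration); `np.linalg.lstsq` solves the least-squares problem via the SVD, which copes with a nearly singular Φ^TΦ:

```python
import numpy as np

rng = np.random.default_rng(2)
N, J, K = 50, 4, 3

Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, J))])  # (Phi)_nj, with bias column phi_0 = 1
W_true = rng.normal(size=(K, J + 1))
T = Phi @ W_true.T                                           # noise-free targets (N, K)

# Solve the normal equations (Eq. 5.92) for the second-layer weights.
W_T, *_ = np.linalg.lstsq(Phi, T, rcond=None)
W_hat = W_T.T                                                # recovered (K, J+1) weight matrix
```

With noise-free targets and a full-rank design matrix, the generating weights are recovered to machine precision.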
5.4.2 Comparison between an RBF and an MLP
In general, the performance of an MLP is slightly better than that of an RBF. This is because the MLP's fully supervised non-linear optimisation is in general better than the RBF's unsupervised non-linear clustering process followed by a linear optimisation [159]. Advantages of an RBF are its shorter training time, since training does not require non-linear optimisation, and the absence of the need for a validation set. The hidden-layer representation of an RBF is also more accessible. Since it represents the unconditional probability of the training set, it can be used as a novelty detector on new data: when all the hidden units show very low activation, the RBF network is extrapolating, and therefore no confidence should be given to the result [159].
5.5 Data visualisation
One of the first stages in the solution of a classification problem usually consists of gaining more insight into the structure of the data. If the features extracted do not reveal enough separation between the classes in the feature space, a search for new features should be considered. It is also desirable to obtain details such as inter-subject variability and the incidence of outliers. The visualisation of the data distribution for a number of features L less than or equal to 3 is straightforward; otherwise more sophisticated procedures are required.

The relations of proximity and the organisation of the data in a feature space of dimensionality higher than 3 can be visualised through a non-linear projection from R^L to R^M, with M typically 2 or 3. This is the basis of the Sammon map, which is described next.
5.5.1 Sammon map
Sammon's algorithm seeks to create a mapping such that the distances between the image points in the projection plane are as close as possible to the corresponding distances between the original data points in feature space. The following error function at iteration number τ is defined:
\[ E^{(\tau)} = \frac{1}{\sum_{i}^{N} \sum_{j=i+1}^{N} d_{ij}} \sum_{i}^{N} \sum_{j=i+1}^{N} \frac{\left[ d_{ij} - \delta_{ij}^{(\tau)} \right]^2}{d_{ij}} \qquad (5.95) \]
where N is the number of vectors to be mapped, d_{ij} the Euclidean distance between the vectors x_i and x_j in L-space, and δ_{ij}^{(τ)} the Euclidean distance between the corresponding vectors (images or projections) y_i^{(τ)} and y_j^{(τ)} in M-space:
\[ d_{ij} = \| x_i - x_j \| \qquad (5.96) \]
\[ \delta_{ij} = \| y_i - y_j \| \qquad (5.97) \]
Minimisation of this error function can be achieved, starting from random locations for the image points, by adjusting them in the direction which gives the maximum change in the error function (gradient descent method), as shown in Eq. 5.98:
\[ y_{im}^{(\tau+1)} = y_{im}^{(\tau)} - \alpha \Delta_{im}^{(\tau)} \qquad \text{for } m = 1, \ldots, M \qquad (5.98) \]
where:
\[ \Delta_{im} = \frac{\partial E}{\partial y_{im}} \bigg/ \left| \frac{\partial^2 E}{\partial y_{im}^2} \right| \qquad \text{for } m = 1, \ldots, M \qquad (5.99) \]
and the gradient proportionality factor α is determined empirically to be between 0.3 and 0.4 [141]. The
partial derivatives are given by:
\[ \frac{\partial E}{\partial y_{im}} = \frac{-2}{\sum_{k=1}^{N} \sum_{j=k+1}^{N} d_{kj}} \sum_{\substack{j=1 \\ j \neq i}}^{N} \left[ \frac{d_{ij} - \delta_{ij}}{d_{ij}\, \delta_{ij}} \right] (y_{im} - y_{jm}) \qquad (5.100) \]
\[ \frac{\partial^2 E}{\partial y_{im}^2} = \frac{-2}{\sum_{k=1}^{N} \sum_{j=k+1}^{N} d_{kj}} \sum_{\substack{j=1 \\ j \neq i}}^{N} \frac{1}{d_{ij}\, \delta_{ij}} \left[ (d_{ij} - \delta_{ij}) - \frac{(y_{im} - y_{jm})^2}{\delta_{ij}} \left( 1 + \frac{d_{ij} - \delta_{ij}}{\delta_{ij}} \right) \right] \qquad (5.101) \]
A small number of representative vectors can be extracted (using K-means clustering, for example) to
reduce the number of computations, O(N²), required to complete the mapping.
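The update rule above can be sketched in a few lines of NumPy. This is an illustrative implementation only: the function name is hypothetical, and for simplicity it takes a plain gradient step, omitting the second-derivative scaling of Eq. 5.99.

```python
import numpy as np

def sammon(X, M=2, alpha=0.35, iters=200, seed=0):
    """Project N points in R^L down to R^M by minimising the Sammon stress.

    Image points start at random locations and are moved by gradient
    descent (cf. Eqs. 5.98-5.100); this sketch uses a plain gradient step
    and omits the |d2E/dy2| scaling of Eq. 5.99.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # d_ij (Eq. 5.96)
    iu = np.triu_indices(N, k=1)
    c = D[iu].sum()                        # normalising constant over all pairs
    Y = 0.1 * rng.standard_normal((N, M))  # random initial image points
    for _ in range(iters):
        d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)  # delta_ij (Eq. 5.97)
        np.fill_diagonal(d, 1.0)           # dummy value, masked out below
        Dm = D.copy()
        np.fill_diagonal(Dm, 1.0)
        ratio = (Dm - d) / (Dm * d)        # bracketed term of Eq. 5.100
        np.fill_diagonal(ratio, 0.0)
        diff = Y[:, None, :] - Y[None, :, :]                       # y_im - y_jm
        grad = (-2.0 / c) * (ratio[:, :, None] * diff).sum(axis=1)
        Y = Y - alpha * grad
    d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    stress = ((D[iu] - d[iu]) ** 2 / D[iu]).sum() / c  # classic Sammon stress
    return Y, stress
```

Note that the per-pair terms are bounded, so the fixed-step variant remains stable; the full method of Eq. 5.99 simply converges faster.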
5.5.2 NeuroScale
The Sammon map's main drawback is that it acts as a look-up table: previously unseen data cannot be
located in the projection map without re-running the optimisation procedure. However, a parameterised
transformation yi = G(xi; w), where w is the parameter vector, would allow the desired interpolation.
This parametric transformation can be performed by a neural network. During training, this neural
network has no fixed targets, the outputs and weights being adjusted to minimise an error, or “stress”
measure, related to the Sammon map error function and given by:
E = \sum_{i=1}^{N}\sum_{j=i+1}^{N} \left[d_{ij} - \delta_{ij}\right]^2    (5.102)
where the terms d_{ij} and \delta_{ij} are given by Eqs. 5.96 and 5.97. The training of such a neural network
is said to be relatively supervised as there is no specific output target, but a relative measure of target
separation between each pair {yi,yj}.
For an RBF with H basis functions, the square of the projected distance \delta_{ij} can be expressed as:

\delta_{ij}^2 = \sum_{m=1}^{M} \left(\sum_{h=1}^{H} w_{hm}\left[\phi_h(\| x_i - \mu_h \|) - \phi_h(\| x_j - \mu_h \|)\right]\right)^2    (5.103)
Then, the derivatives of the stress function with respect to the weights for each data point xi are given
by:
\frac{\partial E_i}{\partial w_{hm}} = \frac{\partial E_i}{\partial y_i}\,\frac{\partial y_i}{\partial w_{hm}}    (5.104)

where:

\frac{\partial E_i}{\partial y_i} = -2 \sum_{\substack{j=1 \\ j \neq i}}^{N} \frac{d_{ij} - \delta_{ij}}{\delta_{ij}}\,(y_i - y_j)    (5.105)
Note the difference between the derivative in Eq. 5.105 and the corresponding term for a supervised
problem with sum-of-squares error (see Eq. 5.90), the latter being given by:
\frac{\partial E_i}{\partial y_i} = y_i - t_i    (5.106)
Thus the relatively supervised training procedure has an estimated target vector ti given by:
t_i = y_i - \frac{\partial E_i}{\partial y_i} = y_i + 2 \sum_{\substack{j=1 \\ j \neq i}}^{N} \frac{d_{ij} - \delta_{ij}}{\delta_{ij}}\,(y_i - y_j)    (5.107)
However, the minimisation of the stress measure cannot be performed in one step as in the linear phase
of an RBF training (Eq. 5.93) because the estimated targets are not fixed, but depend upon the current
outputs yi and weights. Instead, the minimum can be sought in an iterative approach with an EM1-like
procedure, which is more efficient than backpropagation in an MLP [163].
1Expectation-Maximisation [19, pp.65] is a two-step procedure to solve the highly non-linear, coupled equations of maximum likelihood optimisation problems
To prevent an increase in the stress during the early stages of the algorithm, when the estimate of the
targets is poor, a learning rate control \eta^{(\tau)} is introduced into Eq. 5.107:

t_i^{(\tau)} = y_i^{(\tau)} - \eta^{(\tau)} \frac{\partial E_i^{(\tau)}}{\partial y_i^{(\tau)}}    (5.108)
where η(τ) is initially set to have a small value and is progressively increased as the stress decreases
during training.
The training algorithm then becomes:
1. Initialise the weights to small random values
2. Initialise η1 to some small value
3. Calculate the pseudo-inverse matrix Φ†
4. Initialise τ = 1
5. Calculate the target vectors t_i^{(\tau)}
6. Solve W^{(\tau)} = \Phi^{\dagger} T^{(\tau)}
7. Calculate the stress
8. • If the stress has increased, decrease η
• If the stress has decreased, increase η
9. If the stopping criterion is not satisfied, return to step 5
The increase and decrease in η in step 8 are arbitrarily set to the range 10–20% [163]. If the stress
measure is comparable to the final stress calculated in a standard Sammon mapping procedure, then the
algorithm can be stopped. This optimisation procedure is called the shadow targets algorithm, and the RBF
described in this section has been referred to as NEUROSCALE [163].
A caveat for the use of the techniques described above is that the lower dimensional projection generated
may show data overlap, which may not be present (or at least not in the same proportion) in the high
dimensional feature space.
5.6 Discussion
So far we have introduced the neural network approach to classification as a non-parametric method
for the estimation of the posterior probabilities of class membership. Non-parametric methods are more
flexible than parametric approaches and are easier to apply than semi-parametric methods, such as Gaus-
sian Mixture Models [19, pp.60]. The probabilistic nature of the neural network outputs gives them an
advantage over other classifiers, like linear discriminants and support vector machines to mention a few
[169].
We presented two kinds of neural networks, the MLP and the RBF network, and stated that the first tends
to outperform the second one for the reasons given in §5.4.2 (see [175] for a comparison between an
MLP and an RBF performance in disturbed sleep analysis). A balanced dataset should be used to assign
the same relevance to all the classes. For MLP training, a validation and a test set should be reserved
from the balanced dataset in order to avoid over-fitting the training data. If the amount of data is not
sufficient to allow this partitioning, the leave-one-out method should be used.
The aim of the work presented in this thesis is to estimate the state of the brain in the sleep context
(for μ-arousal detection) and within the alertness-drowsiness continuum in the vigilance context. Although the data
is labelled according to six or seven discrete classes (see sections 3.2.1 and 3.4.2), the neural network is
capable of performing interpolation between classes. The 1-of-K code is recommended when the targets
are discrete, as is the case in both the sleep and vigilance problems. The cost function associated with
this coding scheme is the cross-entropy error function. Minimisation of the cost function can be achieved
efficiently by using the scaled conjugate gradients algorithm. The performance of the network may be
evaluated using the misclassification error as the criterion.
The trade-off between bias and variance of the network suggests that the search for an optimum network
architecture can be carried out by training several networks with different initial values for the network
parameters. Regularisation techniques can be applied to achieve better generalisation, and although the
values of the regularisation parameters cannot be found analytically, they can be included in an extensive
search for the best generalisation, the latter being evaluated as the classification performance on the
validation set.
The techniques known as Sammon map and NEUROSCALE, introduced in this chapter, can be used to
visualise the relations of proximity between patterns from different classes in the feature space, providing
hints as to which classes should be used for neural network training, and helping to establish what might
be expected from the neural network performance. These visualisation techniques can also help to rule
out outliers, i.e. data from a distribution different from that of the training data, the NEUROSCALE map
being particularly useful when analysing new data.
Chapter 6
Sleep Studies
Prior to the analysis of the sleep of patients with OSA, a deeper understanding of normal sleep should
be acquired. Previous work has shown that neural network methods for data visualisation and classifi-
cation provide useful information on data structure and clustering [123], and these methods have been
successfully applied to sleep staging and tracking [143][123][16][158]. The EEG is the most significant
and reliable physiological measure of sleep, and is relatively easy to acquire as a signal. As we have
seen in Chapter 3, the sleep EEG of OSA sufferers has the same characteristics as that of normal subjects,
the difference being in the higher number of rapid transitions from sleep to wakefulness in OSA patients.
Therefore, it should be possible to use a neural network trained with normal sleep EEG data with the EEG
recorded from OSA patients. In this chapter we report on the training of MLP networks using a database
of normal sleep EEG records and investigate their subsequent performance on OSA sleep EEG (test data).
6.1 Using neural networks with normal sleep data: benchmark experiments
6.1.1 Previous work on normal sleep
In previous work [123], 10th-order AR modelling of 1s EEG segments and a visualisation technique
known as the Kohonen map [90] were applied to give an overall view in 2-D of the AR coefficients for
normal sleep EEG. Kohonen’s map is a self-organising algorithm which projects an entire data set or
input vectors from an L-dimensional space into a relatively few cluster centres or “code-vectors” laid
out on a mesh in a lower, M -dimensional space (usually M = 2), in such a way that the relations of
proximity (topology) between the input vectors are preserved1. This work showed that there were three
well differentiated groups or clusters of data in the sleep EEG database, corresponding to the stages of
wakefulness, REM/light sleep (stage 1) and deep sleep (stage 4). Intermediate stages 2 and 3 did not
form separate clusters, but transient events such as K-complexes and spindles were mapped onto different
regions of the map. This phase of learning is unsupervised since no labels are taken into account when
constructing the Kohonen map, although labels are used later to identify the clusters. Based on the results
obtained with the Kohonen map, a neural network was trained with the same sleep EEG database, the
aim being to classify the sleep EEG into the 3 categories identified in the Kohonen map, by estimation of
the posterior probabilities of class membership. Results on test data showed that the plot of the neural
network outputs over time “tracks” the sleep-wake continuum with a better resolution than the R&K
discrete stages (as the neural network outputs can take any value between 0 and 1) and with a better
resolution in time since 1-s epochs are used to segment the EEG rather than the 30 seconds of the R&K
hypnograms. Fig. 6.1 shows the time course of the three neural network outputs (P(W ) for wakefulness,
P(R) for REM/light sleep, and P(S) for stage 4) for a 7-hour sleep recording. The main features of
the normal sleep-wake cycle can be seen in these plots. The P(W ) output takes a value close to 1 at
the beginning of the night, followed by a rapid descent to zero and remains at this level for about 40
minutes, while the P(R) output rises from zero to a value higher than 0.5 at the same time as the P(W )
output decreases, indicating a transition from fully awake to the first stage of sleep (sleep onset). The
P(S) output, which starts at zero, rises steadily as the P(W ) and P(S) outputs decrease, and stays high for
the remaining 40 minutes of the first hour of the night. For the rest of the night, the P(R) and the P(S)
outputs wax and wane alternately, an indication of the 90-minute REM and non-REM sleep cycle, with a
progressive lightening of sleep as the night advances. When P(W ) is high (subject awake), P(S) is low,
since it is not physiologically possible that these two probabilities can exhibit a similar value, except when
both are near zero. In such a case the value P(R) must be high since the sum of the three probabilities
1The Kohonen map's main disadvantage relative to the Sammon map is that the image points are constrained to lie on a rectangular grid
must be equal to one, indicating that the subject is in REM/light sleep. Hence, the Wakefulness output
P(W ) and the deep sleep output P(S) were combined in a measure of “sleep depth” P(W )-P(S), in which
the values of 1, 0 and -1 indicate wakefulness, REM/light sleep and deep sleep respectively. The trace
P(W )-P(S), shown at the bottom of Fig. 6.1, is similar to the R & K hypnogram, but with a continuous
resolution in amplitude and a ×30 time resolution.
Figure 6.1: The neural network's wakefulness P(W ), REM/light sleep P(R) and deep sleep P(S) outputs, and the measure of sleep depth P(W )-P(S) (from Pardey et al. [123])
The above work established the feasibility of using AR coefficients at the inputs to a neural network with
3 outputs to describe the sleep-wake continuum. A previous investigation [176] compared the use of
5th-order AR parameters with the power in five EEG bands when these were used as inputs to a neural
network, and showed that they contribute the same information to the analysis of normal and disturbed
sleep. In this chapter, we build on these results in order to detect μ-arousals in the sleep of OSA subjects.
A model order of 10 is used, as Pardey et al. [123] found that some EEG segments corresponding to
wakefulness may be under-fitted with a lower model order. Since the effect of OSA on the EEG is only
to change the sleep structure rather than the EEG itself, we can use a neural network trained on normal
subjects to analyse the sleep of OSA patients. We re-visit the choice of algorithm for extracting coefficients
and aim to minimise variance whilst ensuring stationarity. We go beyond the work described in [123]
by carrying out a thorough investigation of network architecture and free parameters (including weight
decay coefficients) in order to identify the optimal network. This involves training more than 2,000
networks. Finally, we use the optimal network in order to detect μ-arousals in sleep EEG recordings
acquired from seven subjects with severe OSA.
6.1.2 Data Extraction
The normal sleep EEG from nine healthy female adults, with no history of sleep disorders, aged between
21 and 36 years (average 27.4), was recorded with electrode pair C4/A1, and digitised with an 8-bit
A/D converter and a sampling rate of 128 Hz. Prior to digitisation, the analogue EEG was filtered with a
bandpass filter (0.5–40 Hz with a −40 dB/dec slope in the transition band). Two EOG channels and the
submental EMG were also recorded for the purpose of generating R & K hypnograms.
The length of every record was approximately 8 hours. Each record was divided into 30s-segments, and
classified separately by three human experts, trained in the same laboratory, according to the R&K rules.
The number of 30s-segments for which the three experts were in agreement in their classification varied
from 137 for stage 1, which is a transitional stage that lasts only a few minutes (see section 3.2.1), to 2,665
for stage 2, the most abundant and easiest to score of all sleep stages. These segments are referred to as
being consensus-scored.
6.1.3 Feature extraction
Pre-processing
The sampled EEG signal was also digitally filtered with a low-pass linear phase filter, with a cutoff fre-
quency at 30 Hz, a bandpass gain of 1.00 ± 0.01 and −50 dB attenuation at 50 Hz, using a zero-phase-
distortion filtering technique. The mean of each EEG recording (calculated over the whole record) was
removed.
Autoregressive Analysis
To apply AR modelling to the EEG segments, an investigation of the algorithms described in chapter 4
was undertaken. The relationship between segment data length and the bias and variance of the AR
coefficients was also studied. 911 and 5,214 consensus-scored 4s-segments of Wakefulness and Sleep
Stage 4 respectively were selected from the database, and their reflection coefficients (for an AR model
order 10) were estimated using the Burg algorithm. The means of these estimates, μW and μS , were
then used to synthesise “typical” Wakefulness and Sleep Stage 4 EEG signals. The mean reflection coeffi-
cients were transformed to 10th-order mean feedback coefficients (for definition see §4.6.1) by using the
inverse Levinson-Durbin recursion (Eq. 4.102). Following the procedure described in section 4.3.3 for
AR synthesis, an ensemble of 500 time series with length N was generated using a white noise generator
with unit variance and the AR feedback coefficients, and four algorithms, namely Burg, Covariance (Cov),
Modified Burg (ModBurg) and Structure Covariance Matrices (SCM), were used in turn to estimate the
AR reflection coefficients for each time series. The Euclidean distance between the mean of the estimates
for the ensemble μ and the value used to generate the ensemble μ was calculated, as well as the trace of
the covariance matrix of the estimates, Tr S. The results for values of N from 16 to 512 are shown
in Tables 6.1 and 6.2, and show that there is little difference between the various algorithms, at least for
data lengths N ≥ 128.
The Burg algorithm has a lower computational cost than the others and so was chosen to estimate the
reflection coefficients of a 10th-order AR model of the EEG data.

Mean error (wakefulness)
N        16       32       64       128      256      384      512
Burg     0.3814   0.1714   0.0952   0.0517   0.0272   0.0204   0.0155
Cov      -(a)     0.6402   0.1260   0.0521   0.0270   0.0204   0.0154
ModBurg  3.9057   0.7073   0.0947   0.0520   0.0276   0.0204   0.0155
SCM      0.9083   0.1783   0.0990   0.0529   0.0277   0.0204   0.0156

Covariance matrix trace (wakefulness)
N        16       32       64       128      256      384      512
Burg     0.6267   0.2000   0.0929   0.0449   0.0207   0.0143   0.0105
Cov      -(a)     395      4.4268   0.0500   0.0219   0.0148   0.0107
ModBurg  7897     225      0.0958   0.0451   0.0209   0.0144   0.0106
SCM      1.5395   0.2175   0.0955   0.0455   0.0210   0.0144   0.0106

(a) The Covariance algorithm requires the data length to be at least twice the model order.

Table 6.1: Mean error and trace of covariance matrix for synthesised EEG reflection coefficients (wakefulness)

Mean error (stage 4)
N        16       32       64       128      256      384      512
Burg     0.3410   0.1407   0.0719   0.0405   0.0189   0.0128   0.0094
Cov      -(a)     7.2100   0.5122   0.0406   0.0187   0.0128   0.0093
ModBurg  2.1033   1.2879   0.0707   0.0399   0.0188   0.0128   0.0094
SCM      0.8055   0.1410   0.0729   0.0405   0.0189   0.0127   0.0094

Covariance matrix trace (stage 4)
N        16       32       64       128      256      384      512
Burg     0.5482   0.1719   0.0809   0.0385   0.0189   0.0127   0.0096
Cov      -(a)     27595    68.6     0.0428   0.0195   0.0129   0.0097
ModBurg  2620     839      0.0837   0.0390   0.0190   0.0127   0.0096
SCM      1.5370   0.1861   0.0829   0.0391   0.0190   0.0127   0.0096

(a) The Covariance algorithm requires the data length to be at least twice the model order.

Table 6.2: Mean error and trace of covariance matrix for synthesised EEG reflection coefficients (stage 4)

Figure 6.2 illustrates the results of the Burg algorithm on the synthesised EEG with N going from 16 to
1,024. It can be seen that the accuracy and the variance of the estimates improve significantly once N
reaches 256.
As was discussed in §4.7, stationarity concerns suggest that segments should not be greater than 1 second
(128 samples) when analysing the EEG. But the results here show that the variance of the estimates can
be reduced by increasing the length of the time series segment to 256 or greater. A compromise was found
by using a 384-sample sliding window (corresponding to 3 seconds), which is advanced in one-second
steps (128 samples). Each set of reflection coefficients is then taken to represent the middle second of
the 3-second window.
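A minimal sketch of this feature-extraction step, the Burg recursion for the reflection coefficients together with the 3-second sliding window, might look as follows. The function names and the plain-NumPy formulation are illustrative, not the code used in the thesis.

```python
import numpy as np

def burg_reflection(x, order):
    """Reflection coefficients of an AR model, estimated with the Burg
    algorithm (minimising the summed forward and backward prediction
    error power at each stage)."""
    f = np.asarray(x, float)[1:].copy()   # forward prediction errors
    b = np.asarray(x, float)[:-1].copy()  # backward prediction errors
    k = np.empty(order)
    for m in range(order):
        k[m] = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))
        # update both error sequences, then drop one sample at each end
        f, b = (f + k[m] * b)[1:], (b + k[m] * f)[:-1]
    return k

def sliding_reflection_coeffs(eeg, fs=128, win_s=3, step_s=1, order=10):
    """384-sample (3 s) window advanced in 128-sample (1 s) steps; each
    coefficient set is taken to represent the middle second of its window."""
    win, step = win_s * fs, step_s * fs
    return np.array([burg_reflection(eeg[s:s + win], order)
                     for s in range((0), len(eeg) - win + 1, step)])
```

Burg reflection coefficients are guaranteed to have magnitude at most one, which also makes them well-behaved inputs for the neural network.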
Figure 6.2: Mean error and covariance matrix trace for reflection coefficients computed with the Burg algorithm (wakefulness and sleep stage 4) vs data length N
6.1.4 Assembling a balanced database
The same number of segments for each of the categories (wakefulness (W), REM (R) and sleep stage 4
(S)) was randomly taken from the overall database to build a balanced data set using only consensus-
scored segments, the overall number being determined by the minimum number available in any one
class. In sleep studies, it is obviously the wakefulness set which is the smallest, with only 164 30-s
segments, yielding 4,920 one-second segments; hence, 4,920 sets of 10 reflection coefficients were
assembled for each class. From now on, this dataset will be referred to as the balanced sleep dataset.
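The balancing step can be sketched as follows, assuming a hypothetical helper that subsamples every class down to the size of the smallest one.

```python
import numpy as np

def balance_classes(features, labels, seed=0):
    """Randomly subsample every class down to the size of the smallest one,
    so that all classes carry the same relevance during training."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    n_min = min(int((labels == c).sum()) for c in classes)
    idx = np.concatenate([rng.choice(np.flatnonzero(labels == c), n_min,
                                     replace=False) for c in classes])
    rng.shuffle(idx)  # interleave the classes
    return features[idx], labels[idx]
```

Applied to the consensus-scored segments here, the smallest class (wakefulness, 4,920 one-second segments) fixes the per-class count.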
6.1.5 Data visualisation
The Sammon map and NEUROSCALE visualisation techniques (see section 5.5) were applied in order to
gain insight into the clustering present in the data. The reflection coefficients were normalised to give
a zero mean and unity standard deviation in each axis of the feature space. This gives each coefficient
equal importance a priori.
K-means algorithm
Given that the amount of data (14,760 data points) is too large to be handled by the visualisation algo-
rithm, per-class clustering using the K-means algorithm was applied with 60 means per class and an η
factor (see Eq. 5.85) of 0.02.
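The per-class clustering can be sketched as a competitive (online) K-means update in which the winning centre moves a fraction η towards each presented pattern. The epoch count and the initialisation from random data points are assumptions of this sketch.

```python
import numpy as np

def online_kmeans(X, K=60, eta=0.02, epochs=5, seed=0):
    """Competitive (online) K-means: for each presented pattern, the
    nearest centre moves a fraction eta towards it (an Eq. 5.85-style
    update)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):                     # shuffled presentation
            w = np.argmin(np.linalg.norm(mu - X[i], axis=1))  # winning centre
            mu[w] += eta * (X[i] - mu[w])
    return mu
```

Running this once per class with K=60 and eta=0.02 would reduce the 14,760 points to the 180 representative vectors used by the visualisation algorithms.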
Sammon Map
A 2D Sammon map algorithm was applied to the 180 mean vectors generated by the K-means algorithm.
The gradient proportionality factor α (see Eq. 5.98) was adjusted to a value of 0.06. The Sammon map
for the three classes and for each class separately is shown in Fig. 6.3, with a circle around each centre
whose radius indicates the relative size of the cluster represented by the centre.
It can be seen from the map that the classes form well defined clusters with some overlap between them.
The Wakefulness cluster is the most sparse, whilst the REM/light sleep cluster lies between wakefulness
and deep sleep, as expected.
(a) classes W, R and S (b) W class
(c) R class (d) S class
Figure 6.3: Sammon map for the balanced sleep dataset; classes W, R and S
NeuroScale
A NEUROSCALE neural network with 50 basis functions was trained with the same reduced data set for
comparison and also to explore the overall distribution of the data points, as the advantage introduced
by this visualisation technique is that data not seen before, but belonging to the training data distribu-
tion, can be mapped onto the trained visualisation map (see §5.5.2). The map of the centres and the
subsequent projection of all the data points in the balanced feature set are shown in Fig. 6.4.
6.1.6 Training a Multi-Layer Perceptron neural network
A multi-layer perceptron was chosen over a radial basis function neural network because MLPs tend to
perform slightly better than RBF networks, when the latter are trained using the two-phase process de-
scribed in §5.4. The balanced dataset was divided into 3 balanced subsets of 4,920 data points each
(1,640 per class), namely the training set, validation set and test set (see introductory section and sec-
tion 5.3 in chapter 5). Although all the inputs xi have an absolute value equal to or lower than 1, they have
Figure 6.4: NEUROSCALE map for the balanced sleep dataset, classes W, R and S (60 means per class, 50 basis functions, 500 iterations): (a) means only; (b) all patterns
different dynamic ranges. In theory, this does not affect the MLP training, since the weights are capable of
correcting the differences in dynamic ranges, but certain optimisation procedures, such as regularisation
(see Eq. 5.78), require equal range of variations at the inputs to the neural network. Hence, zero-mean
and unit-variance normalisation was performed on the three subsets, using the training set statistics (μ,
σ). Normalisation also helps to reduce neural network training time [159, pp.84]. The 3 outputs of the
MLPs represent the classes W, R or S (1-of-K coding with softmax activation post-processing). Cross-
entropy error was selected as the cost function for the scaled conjugate gradients optimisation algorithm
for network training.
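The pre-processing and cost function described above can be sketched in NumPy. The function names and the small numerical-stability constants are choices of this sketch.

```python
import numpy as np

def zscore_fit_apply(train, *others):
    """Zero-mean, unit-variance normalisation using training-set statistics;
    the same (mu, sigma) are then applied to the validation and test sets."""
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    return tuple((s - mu) / sigma for s in (train,) + others)

def softmax(a):
    """1-of-K post-processing: the K outputs are positive and sum to one,
    so they can be read as posterior probabilities of class membership."""
    e = np.exp(a - a.max(axis=1, keepdims=True))  # stabilised exponentials
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(p, t):
    """Cross-entropy error for 1-of-K coded targets t; this is the cost
    minimised by the network optimiser (scaled conjugate gradients here)."""
    return -(t * np.log(p + 1e-12)).sum()
```

Note that normalising the validation and test sets with the training-set statistics, rather than their own, is what keeps the three subsets consistent.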
Optimising the network architecture
As was explained in chapter 5 there is no analytical means of determining the optimal value of MLP
parameters such as the number of hidden units J or the weight decay factors νz and νy. Although we
know that the regularising parameters νz and νy penalise an “excessive” number of hidden units, the
limits for this number are unknown. Therefore, we evaluate the performance on the validation set of a
number of MLP’s trained with values of these parameters varying over a given range in order to find the
“optimal” MLP architecture.
Equation 5.72 suggests that the number of hidden units J should be approximately the geometric mean
of the number of inputs times the number of outputs, i.e. √(10 × 3) ≈ 5.5 here. J was therefore varied from
4 to 10. No guideline is available for the regularisation parameters, and so these were varied between
10−6 to 10−2 in powers of ten. To avoid being trapped in local minima, a stochastic optimum search
was performed by shuffling the patterns using five random seeds when allocating them to the training,
validation and test sets. In addition, three different random weight initialisations were employed. This
yields the following total number of networks:

5 shuffling seeds (training/validation/test partitions) × 3 weight initialisations × 7 values of J × 5 values of νz × 5 values of νy = 2,625 networks
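The enumeration of this search space can be sketched directly (the function name is hypothetical; only the counts come from the text):

```python
import itertools

def architecture_search_grid():
    """Enumerate every (shuffling seed, weight seed, J, nu_z, nu_y)
    combination trained during the architecture search."""
    shuffle_seeds = range(5)                     # data partitioning seeds
    weight_seeds = range(3)                      # weight initialisations
    hidden_units = range(4, 11)                  # J = 4, ..., 10
    decays = [10.0 ** -e for e in range(2, 7)]   # 1e-2 ... 1e-6
    return list(itertools.product(shuffle_seeds, weight_seeds,
                                  hidden_units, decays, decays))
```

The length of the returned list confirms the count: 5 × 3 × 7 × 5 × 5 = 2,625.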
The results show little performance variation with respect to weight initialisation, no more than 0.9% in
the difference between the best and the worst classification error. The variation in the classification error
for the data shuffling into the three datasets is not greater than 1.5% for the training and test sets, and
less than 2% for the validation set. Figure 6.5 shows the relationship between the number of hidden units
and the average performance of the networks (averaged over all data partitioning seeds, weight seeds and
regularisation terms).
Figure 6.5: Average performance of the MLPs vs number of hidden units (mean classification error for the training, validation and test sets)
It can be seen from the plot that the performance for the training set improves monotonically with an
increase in the number of hidden units. But the percentage of misclassifications in the validation set
(and also the test set) reaches a minimum for J = 6 and then shows a slight increasing trend for larger
numbers of hidden units. The three neural networks which produce the best classification performance
on the validation set were all generated using the same shuffling seed, but have different initial values of
weights. The smallest of the three has a 10-6-3 architecture (see Table 6.3). Therefore, the 10-6-3 MLP
with νz = 10−4 and νy = 10−5 was chosen as the optimal network. Incidentally, this 10-6-3 MLP has the
best performance on the test set of the three optimal MLPs.
J    νz       νy       training   validation   test
8    10^-3    10^-6    6.63%      5.75%        6.83%
7    10^-3    10^-6    6.54%      5.75%        6.79%
6    10^-4    10^-5    6.63%      5.75%        6.28%
Table 6.3: Misclassification error (expressed as a percentage) for the best three MLPs
Figure 6.6 shows the performance of the 10-6-3 MLP for the whole range of (νz , νy) parameters. The
training set performance increases as the values of (νz, νy) decrease. However, the validation set per-
formance (and also that of the test set) shows an increase in the percentage of misclassifications as the
values of (νz , νy) are simultaneously decreased, the minimum being located at the (10−4, 10−5) point in
the (νz, νy) plane. A similar trend is found for the rest of the trained MLPs, with the exception of three
10-8-3 MLPs (sharing the same set-shuffling seed and weight initialisation seed); out of all 2,625 MLPs,
these were the only ones to get stuck in a local minimum, with a misclassification percentage of 66.3%.
It is clear that MLP performance on the training set tends to improve as parameters are moved towards
their extreme values. But the validation set performance also reveals that the MLP is being over-trained
as the number of hidden units is increased, or as the amount of regularisation is decreased. These trends
are all related, as the regularisation parameters penalise the non-relevant weights, compensating for an
excessive number of hidden units.
6.1.7 Sleep analysis using the trained neural networks
The misclassification error on the test set, in Table 6.3, only shows how well an MLP trained using
“well-defined”, consensus-scored EEG segments from the three main stages of the sleep-wakefulness con-
Figure 6.6: Performance of the 10-6-3 MLP (% misclassification on the training, validation and test sets) vs regularisation parameters log10(νz) and log10(νy)
tinuum, performs on data with the same “well-defined” characteristics. In order to test the performance
of the MLP with more general and “noisy” data (still drawn from the same distribution), the optimal
MLP was used to process an overnight record from one of the subjects in the sleep database (subject ID
9). The 10 reflection coefficients extracted from the EEG were presented to the MLP consecutively, on a
second-by-second basis (using a 3-second window with a 2-second overlap). The results for the 3 outputs,
the probability estimates P(W ), P(R) and P(S) are shown in Fig. 6.7. As expected, the night starts with a
high value for P(W ), and then this value decreases progressively, while the P(S) value increases. When
the P(R) output rises, the P(S) value decreases, suggesting that the subject has a REM or light sleep
period2.
Figure 6.7: MLP outputs P(W ), P(R) and P(S) for subject 9's all-night record: (a) all-night time courses; (b) zoom-in on a 12-minute segment
Using the representation of the sleep-wake continuum described in [123] and in section 6.1.1, we com-
pare the “depth of sleep” [P(W )-P(S)] with the hypnogram generated by a human expert in Fig. 6.8(a)
and (c). The extreme values (-1,+1) indicate the deep sleep and fully awake states respectively, and the
middle value (0) indicates REM/light sleep.
The spikes in the [P(W )-P(S)] output have two different causes. In the first instance, the MLP output
is generated on a second-by-second basis, while experts score sleep on a 30-s basis, and so some of the
spikes in the MLP output show the short-time variations of the sleep-wake process. The second cause of
the spikes is the variability of the AR estimates and the possible overlap between the classes as shown by
the 2D projections in §6.1.5. To minimise the first of these effects when comparing with the 30-s epoch
2It is not possible to distinguish between REM and light sleep on the basis of the EEG alone
6.1 Using neural networks with normal sleep data: benchmark experiments 129
hypnogram, a 31-point median filter is applied to [P(W )-P(S)] for comparison with the hypnogram (see
Fig. 6.8(b)3).
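The median-filtering step can be sketched as a running median; the reflective edge handling is an assumption of this sketch, not stated in the text.

```python
import numpy as np

def median_filter(x, width=31):
    """Running median of odd width applied sample-by-sample; edges are
    handled by reflecting the signal (an assumption of this sketch)."""
    half = width // 2
    xp = np.pad(x, half, mode="reflect")
    return np.array([np.median(xp[i:i + width]) for i in range(len(x))])
```

Applied to the second-by-second [P(W )-P(S)] trace, a 31-point width spans roughly one 30-s scoring epoch, so isolated one-second spikes are suppressed while sustained stage changes survive.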
Figure 6.8: Sleep database subject 9: P(W )-P(S), raw (a) and 31-pt median filtered (b), compared to the hypnogram scored by a human expert (c)
The correlation between these two plots is excellent. The [P(W)-P(S)] output shows an initial value of 1 for the first 20-minute interval, in agreement with the human expert, who scored it as wakefulness. Then, the [P(W)-P(S)] output shows three slow oscillations between −1 and 0, which match the hypnogram's transitions from deep sleep (stage 4) to REM/light sleep (stage 1) and back. During the intervals in which the [P(W)-P(S)] output has a well-defined mean at a value of −1, the hypnogram indicates sleep stage 4. Also, the intervals scored by the human expert as REM/sleep stage 1 (or light sleep) correspond closely to those in which the [P(W)-P(S)] output has a near-zero mean. It is interesting to note that some of the remaining spikes in the filtered [P(W)-P(S)] correspond to periods of movement. Movement generally induces high frequencies in the EEG, which can be indistinguishable from β rhythm once the EEG has been low-pass filtered (see §3.1.4), and such segments are hence categorised by the MLP as wakefulness. The intervals corresponding to intermediate stages 2 and 3 in the hypnogram are not very stable, nor is the [P(W)-P(S)] output, which shows its most pronounced local oscillations during these intervals.

³Label "M" in the hypnogram stands for movement.
6.2 Using the neural networks with OSA sleep data
6.2.1 Data description, pre-processing and feature extraction
Sleep EEG recordings from seven subjects with severe OSA (provided by the Osler Chest Unit, Churchill
Hospital, Oxford), with apnoea/hypopnoea index (AHI) higher than 30/h, were analysed in order to
detect the occurrence and length of the micro-arousals. The Fp1/A2 or Fp2/A1 electrode pair was used
instead of the C4/A1 montage to facilitate the recognition of the arousals by the human experts [156].
Other electrophysiological measures, such as the EOG, chin EMG, nose and mouth airflow, ribcage and abdominal movements, and oxygen saturation, were also taken to aid the experts in the identification of the breathing events. The length of the records varies from 32 to 61 minutes, but in each of them only 20 consecutive minutes were scored according to the standard American Sleep Disorders Association (ASDA) rules [11] (see §3.3.2).
The OSA sleep EEG was sampled and pre-processed in the same way as the normal sleep data (see §6.1.3).
Autoregressive analysis with model order 10 was applied to the EEG recordings using the Burg algorithm
and a sliding window as described in §6.1.3. The patterns consisting of 10 reflection coefficients for each
second were stored as an OSA test set, with each recording being processed as a continuous sequence of
patterns.
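The feature-extraction step can be sketched as follows. This is a generic textbook implementation of the Burg recursion plus a sliding 3-s window with 2-s overlap, not the thesis code; the function names and the explicit 128 Hz sampling rate (consistent with the down-sampling described in §7.1.1) are our own choices.

```python
import numpy as np

def burg_reflection(x, order=10):
    """Burg-algorithm reflection coefficients for one analysis window.
    The `order` coefficients are the per-window EEG features."""
    x = np.asarray(x, dtype=float)
    f = x.copy()                         # forward prediction errors
    b = x.copy()                         # backward prediction errors
    k = np.zeros(order)
    for m in range(order):
        ff = f[m + 1:]                   # forward errors over the valid range
        bb = b[m:-1]                     # backward errors, one-sample lag
        den = np.dot(ff, ff) + np.dot(bb, bb)
        k[m] = -2.0 * np.dot(ff, bb) / den
        f_new = ff + k[m] * bb           # lattice update: compute both
        b_new = bb + k[m] * ff           # before writing back
        f[m + 1:] = f_new
        b[m + 1:] = b_new
    return k

def eeg_features(eeg, fs=128, win_s=3, hop_s=1, order=10):
    """One 10-coefficient feature vector per second: a 3-s analysis
    window advanced by 1 s, i.e. a 2-s overlap between windows."""
    win, hop = win_s * fs, hop_s * fs
    return np.array([burg_reflection(eeg[i:i + win], order)
                     for i in range(0, len(eeg) - win + 1, hop)])
```

By construction the Burg reflection coefficients are bounded by 1 in magnitude, which is one reason they make well-behaved neural-network inputs.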
6.2.2 MLP analysis
Normalisation was carried out on the OSA patterns using the normal sleep training set statistics. The normalised OSA test set of patterns was then presented to the 10-6-3 MLP selected in §6.1.6, which had been trained with the normal sleep data. Figure 6.9 shows the MLP outputs for 20 minutes of processed EEG from two representative subjects in the OSA database, with ID numbers 3 and 8. The [P(W)-P(S)] output is shown in Fig. 6.10 for each subject (upper and middle traces). Twenty minutes of [P(W)-P(S)] for subject 9 from the normal sleep database, chosen from her second hour of sleep during the transition from deep sleep to REM sleep, are shown at the bottom of Fig. 6.10 for reference. None of the outputs shown in Figs. 6.9, 6.10 or in the subsequent figures in this chapter has been median-filtered.
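The normalisation step is a z-scoring of the test patterns with statistics stored from the training set; a minimal sketch (the function name is our own):

```python
import numpy as np

def normalise_with_training_stats(test_patterns, train_patterns):
    """Zero-mean, unit-variance scaling of test patterns using the
    per-feature mean and standard deviation of the *training* set,
    so the test data are scaled exactly as the MLP saw in training."""
    mu = train_patterns.mean(axis=0)
    sigma = train_patterns.std(axis=0)
    return (test_patterns - mu) / sigma
```

The key design point is that the OSA patterns are never normalised with their own statistics; reusing the normal-sleep statistics keeps the trained network's input distribution fixed.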
Figure 6.9: OSA sleep MLP outputs, P(W), P(R) and P(S), for subjects 3 and 8 (20 minutes).
The oscillating nature of the [P(W )-P(S)] output shown in Fig. 6.10 compared with its counterpart in
normal sleep suggests that the sleep cycle in the OSA database is severely disrupted, with frequent (more
than 1 per minute) transitions from deep sleep to wakefulness for brief periods of time.
6.2.3 Detection of μ-arousals
According to the ASDA rules, a non-REM⁴ sleep μ-arousal is defined as an EEG shift in frequency lasting 3 seconds or more [11]. Given that the MLP has been trained to detect the changes in the EEG frequency associated with sleep, μ-arousals can be detected from the [P(W)-P(S)] output by applying a threshold and discarding transitions which last for less than 3 s. The ASDA rules also treat two consecutive μ-arousals separated by less than 10 seconds as a single event. Therefore, we can automate the μ-arousal scoring process according to the ASDA rules by removing pulses (in the thresholded [P(W)-P(S)] output) whose duration is less than 3 s and by merging two pulses which are separated by less than 10 s.

⁴Submental (chin) EMG is necessary to score a μ-arousal in REM sleep. Given that we are only using the EEG, this study is restricted to non-REM sleep events.

Figure 6.10: [P(W)-P(S)] output for OSA sleep subjects 3 (top) and 8 (middle), and for normal sleep subject 9 (bottom).

The automated μ-arousal detection procedure applied to 3 minutes of [P(W)-P(S)] output from OSA subject 3 is shown
in Fig. 6.11, with a threshold of 0.5.
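The post-processing can be sketched as follows. This is an illustrative reconstruction, not the thesis code; the function names are our own, and the ordering chosen (merge pulses separated by short gaps first, then discard short pulses) is one plausible reading of the ASDA criteria.

```python
import numpy as np

def runs(binary):
    """Return (start, end) pairs (end exclusive) for each run of 1s."""
    edges = np.diff(np.concatenate(([0], np.asarray(binary, dtype=int), [0])))
    return list(zip(np.flatnonzero(edges == 1), np.flatnonzero(edges == -1)))

def asda_postprocess(depth, threshold=0.5, min_len=3, merge_gap=10):
    """Threshold the per-second [P(W)-P(S)] output, merge pulses whose
    separating gap is shorter than merge_gap seconds, then discard
    pulses shorter than min_len seconds (the ASDA timing criteria)."""
    binary = (np.asarray(depth, dtype=float) > threshold).astype(int)
    merged = []
    for s, e in runs(binary):
        if merged and s - merged[-1][1] < merge_gap:
            merged[-1][1] = e          # merge with the previous pulse
        else:
            merged.append([s, e])
    out = np.zeros_like(binary)
    for s, e in merged:
        if e - s >= min_len:           # keep only pulses of >= 3 s
            out[s:e] = 1
    return out
```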
Events marked as “A” in Fig. 6.11 (middle trace) are pulses shorter than 3 s, while those marked “B” are negative-going transitions also shorter than 3 s. These two types of events have been removed from the final output (lower trace). An event denoted by the letter “C” corresponds to two pulses separated by less than 10 s; these are considered to be the same event according to the ASDA rules, and they therefore appear merged in the final μ-arousal output on the lower trace.

Figure 6.11: μ-arousal detection procedure. Upper trace: [P(W)-P(S)] and a 0.5 threshold; middle trace: thresholding result; lower trace: automatic μ-arousal score with the ASDA timing criteria.
To evaluate the performance of this μ-arousal detector, the final output was compared with the μ-arousals
scored by the human expert (visual scoring). A true positive is found when both the visual and the
automatic scores agree on the occurrence of an event (logical AND equal to 1) as is shown in Fig. 6.12
for OSA subject 2. A false positive is an event only scored by the automated system (post-processed
[P(W )-P(S)] output). False negatives are the events missed by the automated system, scored only by the
expert using the visual method.
In the case of multiple detections of a single event, only one true positive is counted, as can be seen in the middle trace of Fig. 6.12 for the 3rd and 4th pulses. These two automatically scored events match the second visually scored μ-arousal but are considered as a single true positive. The dip between the two pulses is not counted as a false negative. This is an arbitrary decision, introducing a bias in favour of the automated system, but it is taken to facilitate comparison between different thresholds (see section 6.2.4).

Figure 6.12: μ-arousal validation. Upper trace: automated score for a 0.7 threshold; middle trace: automated score for a 0.8 threshold; lower trace: visually scored signal.
The performance of the automated μ-arousal detector was assessed by estimating the ratios known as sensitivity (Se) and positive predictive accuracy (PPA) [138], given by:

Se = P( an event has been detected | an event has occurred ) ≈ TP / (TP + FN)    (6.1)

PPA = P( an event has occurred | an event has been detected ) ≈ TP / (TP + FP)    (6.2)
where TP is the number of true positives, FP the number of false positives and FN the number of false negatives.
Se indicates the ability of the method under test to detect events, while PPA represents the selectivity of
the method, i.e. the ability to pin-point only the true events. A low value for the PPA indicates a large
number of false detections. The ideal detector would have Se and PPA values equal to 1.0, since neither
false negatives nor false positives would occur.
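Counting TP, FP and FN with the single-true-positive convention described above can be sketched as follows; this is our own illustrative implementation, with events represented as (start, end) intervals in seconds.

```python
def overlaps(a, b):
    """True if half-open intervals a = (s, e) and b = (s, e) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def se_ppa(auto_events, expert_events):
    """Sensitivity Se = TP/(TP+FN) and positive predictive accuracy
    PPA = TP/(TP+FP); several automated pulses matching one expert
    event count as a single true positive."""
    tp = sum(any(overlaps(x, a) for a in auto_events) for x in expert_events)
    fn = len(expert_events) - tp
    fp = sum(not any(overlaps(a, x) for x in expert_events) for a in auto_events)
    se = tp / (tp + fn) if expert_events else 1.0
    ppa = tp / (tp + fp) if auto_events else 1.0
    return se, ppa
```

Note that TP is counted over expert events, so two automated pulses covering one visually scored μ-arousal contribute one TP and no FP, matching the convention in Fig. 6.12.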
Although performance measures such as Se and PPA can give an idea of how many events are identified
by the automated system, they do not provide any indication of the relative timing between the events
scored by the automated system and those scored by the human expert. This is illustrated in Fig. 6.12,
where two sets of scores generated from the [P(W )-P(S)] output (thresholds 0.7 and 0.8) with the same
number of true positives (TP ), false positives (FP ) and false negatives (FN), and hence the same Se
and PPA, are compared with the human expert scores. The gray thick lines under each signal indicate
the segments for which there is an exact match between the automated and the human expert scores.
The first true positive found by the automated system with a threshold of 0.7 (upper trace on Fig. 6.12)
has a similar duration and starting time as the visually scored event (lower trace). This is no longer
true when the threshold is given a value of 0.8 (middle trace). Other examples can be found later: see
the second, fourth and fifth events. For this reason, the correlation measure given below is used as an
additional indicator of the performance of the automated μ-arousal detector.
Corr = 1 − (1/N) Σ_{i=1}^{N} ( y_nn(i) ⊕ y_hs(i) )    (6.3)

where y_nn(i) represents the [P(W)-P(S)] output at time i seconds, thresholded and with pulses shorter than 3 s filtered out, y_hs(i) represents the human scores, the ⊕ sign denotes the binary “exclusive OR” operation and N is the duration in seconds of the two sequences.
For the two sequences shown in Fig. 6.12 (thresholds of 0.7 and 0.8) the correlation indices have values
of 0.83 and 0.75 respectively.
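The XOR-based correlation index is a one-liner over two binary second-by-second sequences; a sketch (the function name is ours):

```python
import numpy as np

def corr_index(auto_score, expert_score):
    """Corr = 1 - (1/N) * sum(auto XOR expert) for two binary
    scoring sequences of equal length N (one sample per second)."""
    a = np.asarray(auto_score, dtype=bool)
    h = np.asarray(expert_score, dtype=bool)
    return 1.0 - np.mean(a ^ h)
```

Identical sequences give Corr = 1, complementary sequences give Corr = 0, and each second of disagreement (in either direction) costs 1/N, which is why this index is sensitive to event timing and duration as well as event counts.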
6.2.4 The choice of threshold
The shift in frequency that defines a μ-arousal can occur from any sleep stage to a lighter stage (sleep or wake). This poses a problem in the setting of the threshold, illustrated in Fig. 6.10, which shows subject 3's sleep-wake continuum moving from a value near 0 (REM or light sleep) to a value near 1 (wakefulness), while subject 8's sleep is disrupted at a deeper level, going from near −1 (deep sleep or sleep stage 4) to near 1 (wakefulness). Several threshold values in the range [0, 0.9] were investigated, and the values of Se, PPA and Corr were calculated for each of these. The results are shown in Table 6.4 and Fig. 6.13.
                                      Threshold
Subject        0     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
2   Se       1.00  1.00  1.00  1.00  0.97  0.97  0.97  0.94  0.77  0.39
    PPA      1.00  1.00  1.00  1.00  0.97  0.94  0.94  0.91  0.89  0.92
    Corr     0.44  0.67  0.72  0.79  0.82  0.83  0.81  0.81  0.74  0.65
3   Se       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
    PPA      1.00  1.00  1.00  0.96  0.96  0.90  0.90  0.87  0.93  1.00
    Corr     0.37  0.45  0.57  0.68  0.74  0.81  0.85  0.89  0.93  0.90
4   Se       1.00  1.00  1.00  1.00  1.00  0.96  0.96  0.96  0.85  0.50
    PPA      1.00  1.00  0.96  1.00  1.00  1.00  1.00  1.00  1.00  1.00
    Corr     0.40  0.76  0.90  0.92  0.90  0.87  0.85  0.83  0.77  0.68
5   Se       0.88  0.62  0.47  0.41  0.32  0.26  0.26  0.18  0.15  0.06
    PPA      0.86  0.75  0.76  0.88  0.92  0.90  0.90  0.86  1.00  1.00
    Corr     0.40  0.65  0.73  0.75  0.74  0.73  0.72  0.71  0.70  0.69
6   Se       1.00  0.93  0.83  0.76  0.72  0.72  0.66  0.66  0.55  0.45
    PPA      1.00  0.96  0.92  0.92  0.95  0.95  0.95  0.95  0.94  1.00
    Corr     0.46  0.76  0.84  0.83  0.82  0.82  0.80  0.76  0.72  0.64
7   Se       1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.95  0.68  0.50
    PPA      1.00  1.00  0.96  0.96  0.92  0.88  0.88  0.91  1.00  1.00
    Corr     0.38  0.60  0.71  0.77  0.79  0.83  0.83  0.83  0.74  0.71
8   Se       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.91  0.56
    PPA      1.00  1.00  1.00  0.97  0.97  1.00  0.97  0.97  0.88  0.95
    Corr     0.50  0.51  0.52  0.54  0.55  0.56  0.59  0.66  0.72  0.70

Table 6.4: Se, PPA and Corr per subject for various threshold values
In the light of our discussion of Fig. 6.12, we take correlation to be the most relevant index for assessing performance. The plots in Fig. 6.13 show that the Se and PPA indices are greater than 0.83 for all subjects (except subject 5) at the point of maximum correlation. Table 6.5 shows the optimal threshold from the results in Table 6.4, using the degree of correlation Corr as the criterion.
Figure 6.13: Se, PPA and Corr vs threshold for OSA subjects 2 to 8.
Equi-distance to means (EDM) threshold
Two methods of finding the optimal threshold are considered, although this can only ever be done retrospectively. The first is to find the centres of the two main clusters of data points by running the K-means algorithm with K = 2 on the [P(W)-P(S)] output, and to set the threshold at the point x where the distances to the two centres, m1 and m2, become equal. The distance to each mean, d1 and d2, is normalised with respect to the standard deviation, s1 and s2, of the corresponding cluster, to allow for the possibility of different data densities around each mean or cluster centre, as shown below:

d1 = (1/s1) ‖x − m1‖    (6.4)

d2 = (1/s2) ‖x − m2‖    (6.5)
To find the threshold (x in Eq. 6.6 below) these two distances are made equal. Fig. 6.14 illustrates the
procedure to find the EDM threshold for OSA subject 2.
d1 = d2

⇒ (1/s1) ‖x − m1‖ = (1/s2) ‖x − m2‖

⇒ (1/s1) √((x − m1)²) = (1/s2) √((x − m2)²)

⇒ (1/s1²)(x − m1)² = (1/s2²)(x − m2)²    (6.6)
Expanding the squared binomials on both sides of Eq. 6.6 and collecting terms gives a quadratic in x:

(s2² − s1²)x² − 2(m1s2² − m2s1²)x + (m1²s2² − m2²s1²) = 0    (6.7)
Solving the quadratic for x yields two possible solutions:

x1 = (m1s2 − m2s1) / (s2 − s1)    (6.8)

x2 = (m1s2 + m2s1) / (s2 + s1)    (6.9)

one of which lies outside the range [m1, m2] (taking m1 ≤ m2) and is therefore discarded, while the other sets the EDM threshold.
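The whole EDM procedure can be sketched as below. This is our own illustrative implementation, not the thesis code: a simple 1-D two-means iteration followed by the root of Eq. 6.9, which, being a convex combination of the two means, always lies inside [m1, m2].

```python
import numpy as np

def edm_threshold(depth, iters=100):
    """Equi-distance-to-means threshold: run 2-means on the 1-D
    [P(W)-P(S)] values, then take the point whose distances to the
    two centres, normalised by each cluster's standard deviation,
    are equal (the admissible root of the EDM quadratic)."""
    x = np.asarray(depth, dtype=float)
    m1, m2 = x.min(), x.max()                    # initial centres
    for _ in range(iters):                       # 1-D K-means, K = 2
        in2 = np.abs(x - m2) < np.abs(x - m1)
        m1, m2 = x[~in2].mean(), x[in2].mean()
    s1, s2 = x[~in2].std(), x[in2].std()
    if m1 > m2:
        m1, m2, s1, s2 = m2, m1, s2, s1
    # (m1*s2 + m2*s1)/(s2 + s1) always lies between m1 and m2
    return (m1 * s2 + m2 * s1) / (s2 + s1)
```

For two clusters of equal spread this reduces to the midpoint of the two means, which is consistent with most subjects' EDM thresholds falling close to 0.5 in Table 6.6.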
The new results for the automated system using the equi-distance to means threshold are presented in
Table 6.6.
Figure 6.14: [P(W)-P(S)] output for OSA sleep subject 2 (top) and the amplitude histogram of [P(W)-P(S)] showing the two main clusters, each surrounded by a circle of one standard deviation, and the EDM threshold (bottom).
Subject   EDM threshold    Se     PPA    Corr
2             0.47        0.97   0.94   0.83
3             0.55        1.00   0.90   0.83
4             0.47        1.00   1.00   0.88
5            -0.27        0.97   1.00   0.34
6             0.46        0.72   0.95   0.82
7             0.47        1.00   0.88   0.82
8             0.49        1.00   1.00   0.56

Table 6.6: Equi-distance to means (EDM) threshold
Table 6.6 shows that, for the majority of the subjects, the EDM threshold lies within 0.5 ± 0.05. Therefore, the simple approach of setting the threshold half way between REM/light sleep and wakefulness (i.e. [P(W)-P(S)] = 0.5) was also tested; the results are shown in Table 6.7. Fig. 6.15 compares the results obtained with the two methods for setting the threshold against those obtained with the optimal threshold. Except for subject 5, the two methods can be seen to give very similar results.
Subject   Threshold    Se     PPA    Corr
2            0.5      0.97   0.94   0.83
3            0.5      1.00   0.90   0.81
4            0.5      0.96   1.00   0.87
5            0.5      0.26   0.90   0.73
6            0.5      0.72   0.95   0.82
7            0.5      1.00   0.88   0.83
8            0.5      1.00   1.00   0.56

Table 6.7: Fixed (0.5) threshold
Figure 6.15: Se, PPA and Corr per subject for the best threshold (blue), the EDM threshold (red), and a 0.5 fixed threshold (green).
6.2.5 Discussion
From the results obtained with the automated scoring system (shown in Fig. 6.15), two OSA subjects
stand out, subject 5 and subject 8, because of their low correlation values in relation to the rest of the
subjects. In order to investigate this, we examined the EEG and its power spectral density (PSD) for
these two subjects during the intervals which were scored as a μ-arousal by the human expert. The EEG
revealed that OSA subject 5 falls into a much deeper sleep than the other subjects before the onset of
a μ-arousal. Some of the deep sleep EEG is usually scored by the human expert as being part of the μ-
arousal. Thus, this subject’s μ-arousals are characterised by an increase in magnitude both for the lower
frequencies (which is unusual) and the higher frequencies (which is the expected EEG change during a
μ-arousal) during the first few seconds of the event. For the rest of the μ-arousal, the EEG is generally
dominated by α activity (see §3.4.3), which is often interpreted as light sleep by the neural network. This is illustrated in Fig. 6.16, which shows 24 seconds of EEG and the corresponding [P(W)-P(S)] output
during a μ-arousal event for subject 5. The start and end of the event, as determined by the expert scorer,
are shown by the broken vertical lines. Fig. 6.17 shows the 1s resolution spectrogram (PSD vs time)
of the EEG segment shown in Fig. 6.16. Note the increase in magnitude of both the δ and α rhythms
during the first few seconds of the μ-arousal and also the prevalence of the peak at 10Hz, indicating the
presence of α rhythm throughout the event. The relatively high power in the lower frequency bands of subject 5's EEG may be the reason why some events are missed entirely by the automated system, as is shown in Fig. 6.18.
Figure 6.16: OSA subject 5: EEG and [P(W)-P(S)] output during a typical μ-arousal for this subject (24 s).
The other subject with a low correlation value (Corr = 0.56 for both the EDM threshold and the 0.5 threshold) is subject 8, whose EEG has a high frequency content and shows a reduction in the higher frequencies prior to the onset of the μ-arousals. The [P(W)-P(S)] output is near 1 (wakefulness) most of the time, falling to low negative levels in the few seconds prior to the start of the μ-arousal, with the result that the μ-arousals identified by the automated system are longer than those scored by the expert. Fig. 6.19 shows a 2-minute section of the [P(W)-P(S)] output and the corresponding scores from the human expert.
Figure 6.17: Spectrogram of the EEG segment shown in Fig. 6.16, calculated with 1 s resolution using 10th-order AR modelling (δ, θ, α and β bands marked).
Comparing results with those using a 1-second analysis window
Previous work in the Neural Networks Research Group [123][175][176] used a 1-second window with no overlap for the EEG feature extraction, but we found that, with such a window length, the misclassification error of the MLP on the validation set for normal sleep is greater than 10%, compared with the 5.75% obtained with the 3-s window. Fig. 6.20 shows the [P(W)-P(S)] output using the 1-s window for normal subject 9, compared with the [P(W)-P(S)] output using the 3-s window, together with the corresponding expert scores. The “noisier” appearance of the output in relation to the 3-s case is likely to be due to the higher variance of the AR estimates.
Also, the sensitivity (median 0.77) and correlation (median 0.76) of the μ-arousal detection are lower using a 1-s window than using a 3-s window (median Se 0.97 and median Corr 0.82).
6.3 Summary
In this chapter two databases have been presented, corresponding to normal sleep and OSA sleep. The
normal sleep database consists of nine all-night EEG recordings using the central electrode montage. The
EEG is labelled independently, according to the R&K rules, by three human experts on a 30-second basis.
Figure 6.18: OSA subject 5: EEG and [P(W)-P(S)] output during a μ-arousal missed by the automated scoring system (24 s).
The OSA sleep database has seven 20-minute frontal EEG records corresponding to seven subjects with
severe OSA. The records have been scored for μ-arousals by a human expert using the ASDA rules.
An investigation was made to select the algorithm for the estimation of the reflection coefficients, used
to represent the frequency content of the EEG, and also to select the number of samples in the analysis
window. The Burg algorithm was selected for its low computational cost and competitive performance. A
3-second window with 2-second overlap was chosen as a compromise between minimising the variance
of the AR coefficient estimates and the requirement to ensure stationarity of the EEG.
Based on previous work [123], three classes were chosen to describe normal sleep, namely Wakefulness,
REM/light sleep and Sleep stage 4, and a balanced feature set was formed to train a neural network to
estimate the posterior probabilities of class membership.
A 2-layer MLP with the softmax function for the output units was used. The backpropagation algorithm
for multiple classes and the scaled conjugate gradient optimisation algorithm were used to train the
network (cross-entropy error function). Optimisation of the MLP parameters, number of hidden units
and weight decay terms, was achieved using cross-validation. The optimal network (performance on the
validation set) is a 10-6-3 MLP with weight decay parameters (νz, νy) values at 10−4 and 10−5 respectively.
Figure 6.19: OSA subject 8: [P(W)-P(S)] output and human expert scores (2 minutes).
The percentage of misclassification on the test set achieved with this network is 6.28%.
The optimal MLP was used to analyse the all-night EEG record of a subject from the normal sleep
database. The time courses of two of the three MLP outputs were combined to give a measure of sleep
depth [P(W )-P(S)], which shows a high correlation with the hypnogram generated by a human expert,
suggesting that the MLP is able to interpolate between classes for intermediate sleep stages 2 and 3.
The sleep EEG of OSA subjects was analysed using the optimal MLP. The time courses of the [P(W)-P(S)] output show severe disruption of the sleep. A method for automated μ-arousal detection, using thresholding of the [P(W)-P(S)] output, was introduced. The output of the automated scores was post-processed to follow the ASDA rules for μ-arousal scoring as closely as possible. Sensitivity, positive predictive accuracy and correlation were used to evaluate the performance of the automated detection system with respect to the human expert scores. The correlation measure was used to choose the optimal threshold value
per subject, and two methods for setting the threshold, one of them subject-adaptive, were applied retrospectively. The results for five of the seven subjects show a high correlation value (greater than 0.8), with values of Se and PPA mostly over 0.9. The lower correlation values (0.56 and 0.34-0.73) obtained for the other two subjects may be explained by these subjects having different types of μ-arousal.
Figure 6.20: Sleep database subject 9: raw P(W)-P(S) using a 1-s analysis window (a) and using a 3-s analysis window (b), compared to the human-expert-scored hypnogram (c).
6.4 Conclusions
The neural network, trained with normal sleep data, is capable of following the abrupt transitions in
the sleep EEG of OSA patients. The methods introduced for automated μ-arousal detection were able
to identify a high percentage of the events scored by the human expert, giving the beginning and the
end times for the μ-arousal with relatively high accuracy (as measured by a simple correlation index)
for most of the OSA subjects in the database. The study of the subjects with low correlation levels in
the automated μ-arousal detection showed different changes in the EEG frequency content prior to and
during the μ-arousal.
The 3-second analysis window with a 2-second overlap for the AR modelling has yielded better results in
terms of MLP performance and in the sensitivity and correlation of the μ-arousal detection.
Chapter 7
Visualisation of the alertness-drowsiness continuum
Daytime drowsiness or sleepiness is a common complaint in patients with OSA. A full assessment of an
OSA case may include a vigilance test after a night-time sleep recording has been performed. In any case,
it would be very useful for clinicians to have a method of assessing the day-time performance of OSA
patients in relation to the severity of their sleep disorder.
Drowsiness is a state in which a person will easily fall asleep in the absence of external stimuli. It is quite
different from exhaustion as a result of physical activity. While drowsiness is a mental state which occurs
prior to sleep, its opposite, alertness, is a physiologically activated state of the human brain, characterised
by consciousness and awareness. Human beings experience fluctuations in their levels of alertness during
the day because of the circadian rhythm. These fluctuations can be affected by sleep deprivation or low
quality of sleep as is the case with OSA.
In this chapter we investigate changes in the level of alertness which may be gradual rather than abrupt,
like the short events (arousals during sleep) of the previous chapter. Two databases are considered:
1. The “sleep database”, previously used for training neural networks to track the sleep-wake continuum and hence detect arousals in test data. This has the three previously defined categories of wakefulness, REM/light sleep and deep sleep.
2. The “vigilance database”, described below, in which eight sleep-deprived subjects perform vigilance tasks while having their EEG monitored. For reasons which are explained below, there are two broad categories in this database: alertness and drowsiness.
One important question is the inter-relationship and overlap between these five categories. For example,
wakefulness in the sleep database corresponds to a mental state in which the subjects lie in bed with their
eyes shut in a darkened room. On the other hand alertness in the vigilance database represents a state in
which the subjects are awake, with their eyes open, in a well-lit room in front of a computer screen. In
both instances, the subjects are awake but their EEG activity may be different.
In both the analysis of the sleep EEG and of the vigilance EEG [153][127][170][171][104][50], it is the frequency content of the signal which is used to characterise it. Although a 5th-order AR model has been used previously [50] in the analysis of vigilance EEG, we decided that, in order to be able to compare EEG signals from the two databases, the same parameterisation should be used in both cases, namely reflection coefficients from a 10th-order model. The inter-relationship between these coefficients for the different classes will be visualised in 2-D using both the Sammon map and the NEUROSCALE algorithm.
The rest of this chapter is organised as follows. Firstly, the vigilance database used both in previous work [50] and in subsequent chapters is introduced. Secondly, the Sammon map and the NEUROSCALE algorithm for visualising the high-dimensional data are applied to the vigilance database to investigate the separation (or overlap) between the two classes, alertness and drowsiness. Finally, the visualisation tools are used to study the inter-relationships between the EEG patterns of the five categories present in the two databases together.
7.1 The vigilance database
The Department of Psychology at the University of West England conducted a study in which eight healthy
young subjects performed various vigilance tests for approximately 2 hours (see Appendix C), after a
night of sleep deprivation and no stimulant consumption for 24 hours before or during the test. The
EEG was recorded from a number of sites on the scalp, but only the central (C4) site recordings, as in the sleep EEG studies, were used in the work described in this thesis. Expert scoring based on visual assessment of the EEG, EMG and EOG was undertaken on a 15-second basis, according to the Alford et al. sub-categories of Table 3.2 [8]. A brief summary of the database is given in Appendix C, and the table is reproduced in simpler format below:
Vigilance sub-category                      Description
Active Wakefulness (Active)                 active/alert, > 2 eye movements/epoch, definite body movement
Quiet Wakefulness Plus (QWP)                active/alert, > 2 eye movements/epoch, possibly body movement
Quiet Wakefulness (QW)                      alert, < 2 eye movements/epoch, no body movement
Wakefulness with Intermittent α (WIα)       bursts of α for less than half of an epoch
Wakefulness with Continuous α (WCα)         bursts of α for more than half of an epoch
Wakefulness with Intermittent θ (WIθ)       bursts of θ for less than half of an epoch
Wakefulness with Continuous θ (WCθ)         bursts of θ for more than half of an epoch

Table 7.1: Alford et al. vigilance sub-categories
In previous work in the Neural Networks Research Group, Duta [50] investigated the tracking of fluctuations in vigilance using both the central and mastoid (behind the ears) EEG sites. In that work the EEG was divided into one-second segments. Since the expert scoring of the (central) EEG was undertaken on a 15-second timescale, a large number of 1-s segments are wrongly labelled: for instance, a 1-s segment from a 15-s epoch of vigilance category WIα may consist predominantly of α-wave activity, whereas another segment in the same epoch may correspond to Quiet Wakefulness (QW). Duta re-labelled the data using a combination of the expert scoring and Kohonen feature maps to visualise the cluster to which each one-second segment belonged. As a result, she defined two categories:
1. Alertness: one-second segments which are labelled by the expert as Active, QWP or QW, and have
corresponding feature vectors which are mapped onto the area of the Kohonen map mostly visited
by the Active, QWP and QW sub-categories and not visited by WIα, WCα and WIθ.
2. Drowsiness: one-second segments labelled WIα, WCα or WIθ whose feature vectors visit the area
of the Kohonen map mostly visited by the WIα, WCα and WIθ sub-categories and not visited by
Active, QWP and QW.
In addition, an extra class of Uncertain was defined, containing the one-second segments whose feature vectors are mapped onto an area of the Kohonen map visited by feature vectors extracted from one-second segments from all vigilance sub-categories. There are approximately 8,000 and 20,000 one-second segments belonging to the Drowsiness and Alertness classes respectively, although the distribution is not uniform amongst the subjects. The distribution of patterns per subject per class is shown in Table 7.2.
Class \ Subject     1     2     3     4     5     6     7
Drowsiness        282  1541   413   804  1817  1218  2116
Intermediate     1220  2038  1978  2262  1181  2084  1749
Alertness        4802  1896  3280  3394  2591  2416  1368
Artefact         1151  2625  2204  2195  2286  4797  2717
Total            7455  8100  7875  8655  7875 10515  7950
Table 7.2: Number of patterns per subject per class in vigilance training database
7.1.1 Pre-processing
Although the data in this database was sampled at 256 Hz, it is down-sampled to 128 Hz in order to keep
the pre-processing filters and AR modelling consistent across all databases in this thesis. Ten reflection
coefficients per second are calculated using the Burg algorithm for each 3-s window with 2-s overlap, as
with the sleep database (see section 6.1.3).
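The extraction step can be sketched in Python; this is a minimal illustration of the Burg (lattice) recursion and the 3-s/1-s-step windowing, not the thesis code, and the decimation from 256 Hz to 128 Hz is assumed to have been done beforehand with appropriate anti-alias filtering:

```python
import numpy as np

def burg_reflection(x, order=10):
    """Reflection (lattice) coefficients of x via Burg's recursion."""
    f = np.asarray(x[1:], dtype=float)   # forward prediction errors
    b = np.asarray(x[:-1], dtype=float)  # backward prediction errors
    ks = []
    for _ in range(order):
        k = -2.0 * (f @ b) / ((f @ f) + (b @ b))
        ks.append(k)
        # lattice update, then shift so f(n) stays aligned with b(n-1)
        f, b = (f + k * b)[1:], (b + k * f)[:-1]
    return np.array(ks)

def eeg_features(x, fs=128, win_s=3, step_s=1, order=10):
    """One 10-coefficient feature vector per second: 3-s window, 2-s overlap."""
    win, step = win_s * fs, step_s * fs
    return np.array([burg_reflection(x[i:i + win], order)
                     for i in range(0, len(x) - win + 1, step)])
```

Each 3-s window advances by 1 s, which gives the 2-s overlap and one feature vector per second of EEG.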
7.1.2 Visualising the vigilance database
Ideally we would take an equal number of Alertness (A) and Drowsiness (D) patterns per subject in order
to have every subject equally represented when training the visualisation algorithm. Unfortunately, some
subjects in the database have a very small number of patterns for the Drowsiness class. If we take 800
patterns per class per subject, 5 out of the 7 subjects can provide this number. A training set is then built
by randomly selecting 800 patterns per class for each subject, or the maximum available when this is not
possible (see Table 7.3).
The visualisation algorithms used in this thesis, the Sammon map and NEUROSCALE, require a small
number of feature vectors for a reasonable convergence time. With approximately 5,000 patterns per
class, a reduction in the size of the training set is needed. Using the K-means clustering algorithm, the
Class \ Subject    1    2    3    4    5    6    7   Total
D                282  800  412  800  800  800  800    4694
A                800  800  800  800  800  800  800    5600
Note that only seven subjects are listed above. The eighth subject was discarded for reasons explained later.
Table 7.3: Number of patterns per subject per class in K-means training set
number of patterns in the training set is reduced to about 200 mean patterns per class (by choosing 14
means per subject per class).
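The reduction step can be sketched as a plain Lloyd's-algorithm K-means; the deterministic spread initialisation below is an assumption, since the thesis does not specify how the means are initialised:

```python
import numpy as np

def kmeans(X, k=14, iters=50):
    """Lloyd's algorithm with a deterministic spread initialisation (an assumption)."""
    X = np.asarray(X, dtype=float)
    centres = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # squared Euclidean distance from every point to every centre
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for j in range(k):
            members = X[assign == j]
            if len(members):              # keep the old centre if a cluster empties
                centres[j] = members.mean(0)
    return centres
```

Running this per subject per class with k = 14, as in the text, yields the reduced set of mean patterns used for the visualisation algorithms.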
The Sammon map and NEUROSCALE algorithms are run independently with the reduced dataset using
the same parameters as for the sleep database (for Sammon map, gradient proportionality factor = 0.06;
and for NEUROSCALE, number of basis functions = 50). The projections of the means produced by
both visualisation techniques, presented in Figs. 7.1 and 7.2, are very similar and show two partially
overlapping clusters, representing the A and D classes respectively. Of course, the overlap does not
necessarily occur in the 10-D space as it does in the 2-D projection, in the same way as the edges of a 3-D
cube may touch each other in a 2-D projection.
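A minimal sketch of the Sammon projection by gradient descent on the stress is given below; the step size plays the role of the gradient proportionality factor quoted above. The initialisation from the first two feature dimensions is an assumption for illustration, and Sammon's original method uses a pseudo-Newton update rather than the plain gradient step shown here:

```python
import numpy as np

def sammon_stress(D, Y):
    """Sammon stress between target distances D and 2-D configuration Y."""
    d = np.sqrt(((Y[:, None] - Y[None]) ** 2).sum(-1))
    iu = np.triu_indices(len(Y), 1)
    return ((D[iu] - d[iu]) ** 2 / D[iu]).sum() / D[iu].sum()

def sammon(X, iters=300, step=0.06, seed=0):
    """2-D Sammon projection by plain gradient descent on the stress."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    c = D[np.triu_indices(len(X), 1)].sum()
    # initialise from the first two feature dimensions (an assumption)
    Y = X[:, :2] + rng.normal(scale=1e-3, size=(len(X), 2))
    Dsafe = D.copy()
    np.fill_diagonal(Dsafe, 1.0)          # dummy value; the diagonal is never used
    for _ in range(iters):
        diff = Y[:, None] - Y[None]       # (n, n, 2) pairwise differences
        d = np.sqrt((diff ** 2).sum(-1))
        np.fill_diagonal(d, 1.0)
        w = (Dsafe - d) / (Dsafe * d)
        np.fill_diagonal(w, 0.0)
        grad = (-2.0 / c) * (w[:, :, None] * diff).sum(1)
        Y -= step * grad
    return Y
```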
Visualising the feature vectors for each subject
The cluster size in the Sammon maps shown in Fig. 7.1, represented by the radius of the circles around the
cluster mean, is calculated by counting the number of feature vectors in the training set which “belong”
to that mean (as defined by the Euclidean distance in 10-D between the feature vector and the cluster
mean). The distribution of patterns per subject can also be investigated by considering only the feature
vectors belonging to a specific subject. The results of using the Sammon and NEUROSCALE algorithms on
each subject individually are shown in Figs. 7.3 and 7.4.
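The circle radii described above amount to a nearest-mean count, which can be sketched as follows (illustrative only):

```python
import numpy as np

def cluster_sizes(features, means):
    """Number of feature vectors whose nearest mean (Euclidean, 10-D) is each mean."""
    d2 = ((features[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    return np.bincount(d2.argmin(1), minlength=len(means))
```

Restricting `features` to the vectors of a single subject gives the per-subject distributions plotted in the figures.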
7.1.3 Discussion
The maps showing the distribution of the patterns per subject reveal some differences between subjects.
One of the subjects in the database, subject 8 (not shown in the tables), was discarded because she was
(a) Both classes
(b) Drowsiness (c) Alertness
Figure 7.1: Vigilance Sammon map
identified by the expert who scored the records as belonging to the minority class α+ (see sections 3.4.1
and 3.4.3), a condition in which the subject’s EEG shows an α-rhythm during eyes-open wakefulness
[87]. Although alpha-plus people represent a significant fraction of the population, the lack of data
and subjects for this category in our database makes it difficult to include it in the rest of the analysis.
However, the data from subject 8 allows us to exploit the advantage that the NEUROSCALE algorithm
has over the Sammon map algorithm. The trained NEUROSCALE network can be used on previously
unseen data, provided that the new data is drawn from the same probability distribution as the training
data. Thus, the NEUROSCALE network trained with the 7-subject training set described in Table 7.3, can
be used with this α+ subject as input in order to visualise the A and D patterns of this subject with
respect to those from the rest of the subjects. Fig. 7.4h clearly shows that the D patterns for subject 8
(a) Means only: "Vigilance NeuroScale, 14 means per class, 50 basis functions, 500 iterations" (classes: Drowsy, Alert)
(b) All patterns: "Projection on Vigilance NeuroScale map (14 means per class)"
Figure 7.2: Vigilance NEUROSCALE map
lie mostly in the area where the A patterns from the other subjects are found. Given that this subject's
EEG differs from the EEG of most of the population, it is very likely that the NEUROSCALE neural network
is extrapolating when presented with this subject’s patterns as they are not represented in its training
set. Another NEUROSCALE neural network is therefore trained, this time with subject 8’s mean patterns
added to the training set. The resulting 2-D plot for this 8-subject training set is shown in Fig. 7.5 and
the projection of subject 8’s patterns using this neural network is shown in Fig. 7.6h.
This figure reveals an interesting phenomenon which could not be seen in the 7-subject NEUROSCALE 2-D
projection. On Fig. 7.6h, the D patterns from subject 8 lie in an area where there are no patterns from
any other subject. Also, subject 8's A patterns overlap completely with her D patterns. This can be traced
to the first five reflection coefficients for the D class which, for subject 8, have mean values different from
those of the other subjects (see Fig. 7.7).
7.2 Visualising vigilance and sleep data together
To explore the relationship in feature space between the sleep and vigilance classes, a NEUROSCALE
neural network is trained with means extracted both from the sleep database classes Wakefulness (W),
REM/Light-sleep (R) and Deep-sleep (S) and from the vigilance categories Alertness (A) and Drowsiness
(D). An equal number of means is extracted for each class from the databases giving a total of 210 means.
The resulting NEUROSCALE plot of the means is shown in Fig 7.8. The maps showing the projection of
the feature vectors for each of the five classes can be seen in Fig 7.9.
A Sammon map was also trained with the means from the combined sleep-vigilance databases. The
results shown in Fig. 7.10 are comparable to those obtained with NEUROSCALE (Figs. 7.8 and 7.9), but
the relation between pairs of classes may be seen more clearly on the Sammon map, as shown in
Fig. 7.11.
7.2.1 Discussion
It can be seen from Fig. 7.11b that the Wakefulness class from the sleep database is broader than the
Alertness category from the vigilance database. Although the Alertness patterns are mostly mapped
onto a region of the map covered by the Wakefulness class, it is not necessarily correct to say that the
Alertness category is a subset of the Wakefulness class. On the one hand, we have the Alert patterns of
sleep-deprived subjects performing a rather boring task (see Appendix C), fighting to remain awake. On
the other hand, we have the Wakefulness patterns from subjects lying in bed, ready to sleep, in a quiet,
dark and comfortable room. It is not known whether these subjects were relaxed or not, but it is very
likely that they were not concentrating their mind on anything in particular. The overlap between these
two classes is understandable but it was also expected that there would be a region for each class not
shared with the other one. It is possible that this region may be represented by three dense Alertness
clusters at the lower edge of this class on the Sammon map, a region not visited by any other class. The
same region is seen in the NEUROSCALE plot as the right-hand side of Alertness category in Fig. 7.9d. It
is also encouraging to find a small area where the Wakefulness patterns on the Sammon map overlap the
Drowsiness patterns but not the Alertness patterns (see Figs. 7.11b and 7.11c).
The spatial relationship between Alertness, Drowsiness and REM/Light Sleep is shown in Figs. 7.11d and
7.11e. There is a large area of overlap between REM/Light Sleep and Drowsiness, but the REM/Light
Sleep area only overlaps Alertness in the area where the latter overlaps Drowsiness. This is reasonable,
as the brain cortex, fully active when the subject is alert, is randomly stimulated during REM sleep. The
Drowsiness area extends onto the Wakefulness area towards the upper-centre border of the map, where
it becomes the dominant class. The centre-left region of the map is dominated by REM/Light Sleep.
Finally, Fig. 7.11f shows two well-defined, completely separated clusters representing the 2-D projections
for Drowsiness and Deep Sleep. This is expected as Drowsiness only includes short bursts of θ rhythm
and no δ rhythm, while Deep Sleep patterns consist mainly of δ waves with some occasional θ rhythm.
From the visualisation maps, the following hypotheses can be formulated:
• A transition from an alert state of mind to sleep may progress from the area exclusive to Alertness,
through the Drowsiness area shared by A, W and D, then the Light Sleep area shared by A, D and R,
and finally into Deep Sleep.
• Another transition, from a relaxed state of Wakefulness to sleep, starts from the region of Wakefulness
not shared with Alertness, moves towards the Drowsiness area shared by W and D, then into the Light
Sleep area shared by R and D only, eventually reaching Deep Sleep.
7.3 Conclusions
In this chapter, we have analysed the EEG recordings from the vigilance database, which consists of 2-
hour recordings from seven healthy sleep-deprived subjects performing vigilance tasks. Two vigilance
categories were defined, namely Alertness and Drowsiness, and used to label 1-s EEG segments based on
the scores from a human expert. The EEG was processed in the same way as for the sleep database. A
near-balanced training set was built from the vigilance database by randomly selecting an equal num-
ber of patterns per subject and per class. Visualisation of the data distribution in the feature space
revealed inter-subject variability in both the Alertness and Drowsiness classes. An interesting example
was discussed, namely an α+ subject whose Drowsiness patterns seem to be different from the rest of the
feature vectors in the training set.
A further visualisation study was carried out integrating the sleep and vigilance categories in one training
set. From this analysis we may draw the following conclusions:
1. The Alertness and Drowsiness patterns give rise to two well-defined but partially overlapping clus-
ters.
2. Wakefulness (from the sleep database) is a very broad class that includes some alert patterns as
well as some drowsy ones.
3. There is a small but relatively dense area beyond Wakefulness occupied by Alertness only.
4. The area shared only by Wakefulness and Drowsiness patterns may represent sleep onset not
included in the REM/Light Sleep region.
5. The REM/Light Sleep and Drowsiness classes overlap significantly but not totally with obvious areas
not represented by any other class.
6. Deep Sleep is a separate class, of relatively low importance for the study of vigilance.
It is obvious that the vigilance categories, Alertness and Drowsiness, are not fully represented by any of
the sleep classes and therefore require a separate neural network analysis.
(a) All subjects (b) Subject 1
(c) Subject 2 (d) Subject 3
(e) Subject 4 (f) Subject 5
(g) Subject 6 (h) Subject 7
Figure 7.3: Vigilance Sammon map showing each subject's distribution (Alertness in red and Drowsiness in blue)
(a) Subject 1   (b) Subject 2   (c) Subject 3   (d) Subject 4
(e) Subject 5   (f) Subject 6   (g) Subject 7   (h) Subject 8
Figure 7.4: Vigilance NEUROSCALE map projections for each subject (Alertness in magenta and Drowsiness in blue)
(a) Means only: "8-subject vigilance NeuroScale, 192 means per class, 50 basis functions, 500 iterations" (classes: Drowsy, Alert)
(b) All patterns: "Projection on the 8-subject Vigilance NeuroScale map"
Figure 7.5: Vigilance NEUROSCALE map trained with all subjects, including the α+ subject
(a) Subject 1   (b) Subject 2   (c) Subject 3   (d) Subject 4
(e) Subject 5   (f) Subject 6   (g) Subject 7   (h) Subject 8
Figure 7.6: Vigilance NEUROSCALE trained with all subjects, including the α+ subject (Alertness in magenta and Drowsiness in blue)
(Histograms of reflection coefficients 1 to 6 for the Drowsiness patterns)
Figure 7.7: Subject 8 reflection coefficient histogram (green) in relation to the rest of the subjects in the training set (magenta)
"NeuroScale with vigilance and sleep data, 42 means per class, 50 basis functions, 500 iterations"
(classes: Wakefulness, REM/light-sleep, Deep-sleep, Drowsiness, Alertness)
Figure 7.8: Vigilance and sleep NEUROSCALE map
(a) Wakefulness   (b) REM/light sleep   (c) Deep sleep   (d) Alertness   (e) Drowsiness
Figure 7.9: Vigilance and sleep NEUROSCALE projections for all the patterns in each class (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue)
(a) All classes (b) Wakefulness
(c) REM/light sleep (d) Deep Sleep
(e) Alertness (f) Drowsiness
Figure 7.10: Vigilance and Sleep Sammon map (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue)
(a) All classes (b) Alertness and wakefulness
(c) Wakefulness and drowsiness (d) REM/light sleep and drowsiness
(e) REM/light sleep and alertness (f) Deep Sleep and drowsiness
Figure 7.11: Vigilance and Sleep Sammon map (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue)
Chapter 8
Training a neural network to track the alertness-drowsiness continuum
At the end of the previous chapter, we showed that a neural network used to assess the level of drowsi-
ness in OSA patients should be trained using exclusively vigilance labelled patterns. In this chapter we
train and test a neural network to track the alertness-drowsiness continuum using single-channel EEG
recordings from control subjects performing vigilance tests.
8.1 Neural Network training
The visualisation techniques applied to vigilance data in section 7.1.2 showed a high degree of overlap
between the A and D classes in the 2-D projection of the vigilance database feature vectors. Despite this
overlap, a neural network may be able to resolve the differences using 10-D feature vectors as inputs. We
expect that a two-class neural network trained exclusively with patterns from the extreme conditions of
fully alert (A) and fully drowsy (D), will be able to interpolate when a pattern belonging to an interme-
diate stage is presented at the input. In this way the vigilance continuum may be tracked by an output
fluctuating between the full alertness and the full drowsiness levels.
8.1.1 The training database
The 7-subject vigilance database described in chapter 7 is used for the neural network training and
testing. The set of 10 reflection coefficients extracted from the A and D patterns used for visualisation in
§7.1.2 is now used in this chapter for the training process.
8.1.2 The neural network architecture
An MLP neural network is selected for the same reasons as in chapter 6. As with the sleep-wake contin-
uum network, the cross-entropy error function and the scaled conjugate gradient optimisation algorithm
are used during the training process. Given that only one output is required in a two-class problem1, the
configuration for the MLP is 10-J-1, the output representing the posterior probability of the input vector
belonging to the alertness class. The estimate of the number of hidden units J given by Eq. 5.72, i.e.
the geometric mean of the number of inputs and the number of outputs, is √(10 × 1) ≈ 3.16. Hence a
search for the optimum J is done by training 10-J-1 MLPs with values of J from 2 to 15. As before, the
problem of over-fitting the network is dealt with by introducing regularising terms νz and νy, one for
each weight layer. Based on results from a preliminary investigation, the values of the regularisation
parameters are varied between 10−3 and 1 for the input-to-hidden layer νz, and from 10−7 to 10−5 for
the hidden-to-output layer νy, increasing in powers of ten. To avoid being trapped in a local minimum,
three different random weight initialisations are used. Cross-validation is used, as before, to optimise the
MLP architecture and regularisation parameters.
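A minimal numpy sketch of such a 10-J-1 network is given below, trained on the cross-entropy error with the two weight-decay terms νz and νy. Plain batch gradient descent is used here in place of the scaled conjugate gradient algorithm, so this is an illustration of the architecture rather than the thesis implementation:

```python
import numpy as np

class VigilanceMLP:
    """10-J-1 MLP: tanh hidden units, single sigmoid output = P(Alertness | x)."""

    def __init__(self, n_in=10, J=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(n_in), (J, n_in))
        self.b1 = np.zeros(J)
        self.w2 = rng.normal(0.0, 1.0 / np.sqrt(J), J)
        self.b2 = 0.0

    def forward(self, X):
        self.Z = np.tanh(X @ self.W1.T + self.b1)
        return 1.0 / (1.0 + np.exp(-(self.Z @ self.w2 + self.b2)))

    def step(self, X, t, lr=0.2, nu_z=1e-3, nu_y=1e-6):
        """One gradient step on cross-entropy + weight decay; returns the loss."""
        y = self.forward(X)
        n = len(t)
        d2 = (y - t) / n                          # dE/da at the output unit
        g_w2 = self.Z.T @ d2 + nu_y * self.w2     # hidden-to-output decay nu_y
        g_b2 = d2.sum()
        d1 = np.outer(d2, self.w2) * (1.0 - self.Z ** 2)
        g_W1 = d1.T @ X + nu_z * self.W1          # input-to-hidden decay nu_z
        g_b1 = d1.sum(0)
        self.w2 -= lr * g_w2; self.b2 -= lr * g_b2
        self.W1 -= lr * g_W1; self.b1 -= lr * g_b1
        eps = 1e-12
        return -np.mean(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
```

Since the single sigmoid output estimates P(A | x), the drowsiness posterior follows directly as P(D | x) = 1 − P(A | x).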
8.1.3 Choosing training, validation and test sets
Ideally, balanced training and validation sets should be assembled for the cross-validation tests, assuming
equal prior probabilities for both classes. However, inter-subject differences were found when visualising
the vigilance database (§7.1.2). All the subjects should be equally represented in the training and valida-
tion sets, but as is shown in Table 7.2, the distribution of A and D patterns among the vigilance database
is very uneven. Using the same criterion as for the NEUROSCALE training set, 800 patterns (or fewer
when this is not possible) were drawn per class for each subject, yielding 5,494 patterns for Alertness
and 5,822 for Drowsiness.
1 P(D | x) = 1 − P(A | x)
Assigning these patterns to two equal-sized sets, we obtain approximately 2,800 patterns per class in
each set. With as few as 7 subjects in our database and the high degree of inter-subject variability seen
in the visualisation studies, the best strategy for training and testing the MLP will be the leave-one-out
method [159]. This involves leaving one subject out of the training and validation sets, so that it
can be used as a test subject, and repeating this for each subject in turn. This method leads to 7 different
partitions of the data, as shown in Table 8.1.
Partition   Training and validation subjects     A      D     Tr     Va   Test subject
1           2, 3, 4, 5, 6, 7                   4800   4412   4606   4606       1
2           1, 3, 4, 5, 6, 7                   4800   3894   4347   4347       2
3           1, 2, 4, 5, 6, 7                   4800   4282   4541   4541       3
4           1, 2, 3, 5, 6, 7                   4800   3894   4347   4347       4
5           1, 2, 3, 4, 6, 7                   4800   3894   4347   4347       5
6           1, 2, 3, 4, 5, 7                   4800   3894   4347   4347       6
7           1, 2, 3, 4, 5, 6                   4800   3894   4347   4347       7
Table 8.1: Partitions and distribution of patterns in training (Tr) and Validation (Va) sets
The MLP training and parameter optimising process can be summarised as follows:
1. Build training and validation sets on (n − 1) subjects using 800 patterns per subject per class (or
as many as are available). Repeat this for each subject2.
2. For each partition:
(a) Normalise training, validation and test sets with respect to the training set statistics.
(b) For each set of values of the network parameters (J, νz, νy) and weight initialisation seed,
train a 10-J-1 MLP using the cross-entropy error function and the scaled conjugate gradient
optimisation algorithm.
(c) Choose the optimal MLP based on the performance on the validation set.
(d) Test the optimal MLP on the nth subject. Compare with the expert assessment.
2Here n = 7
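Steps 1 and 2(a) above can be sketched as follows; the dictionary layout of the data (subject mapped to class mapped to an array of 10-D feature vectors) is a hypothetical arrangement assumed for illustration:

```python
import numpy as np

def loo_partitions(data, cap=800, seed=0):
    """Leave-one-subject-out partitions, up to `cap` patterns per class per subject.

    `data` maps subject -> {"A": (n, 10) array, "D": (m, 10) array}; this
    layout is a hypothetical arrangement assumed for illustration.
    """
    rng = np.random.default_rng(seed)
    subjects = sorted(data)
    for test_subject in subjects:
        X_parts, t_parts = [], []
        for s in subjects:
            if s == test_subject:
                continue
            for target, cls in ((1, "A"), (0, "D")):
                pats = data[s][cls]
                take = rng.permutation(len(pats))[:cap]
                X_parts.append(pats[take])
                t_parts.append(np.full(len(take), target))
        X = np.concatenate(X_parts)
        t = np.concatenate(t_parts)
        order = rng.permutation(len(X))
        half = len(X) // 2                       # two equal-sized halves
        tr, va = order[:half], order[half:2 * half]
        mu, sd = X[tr].mean(0), X[tr].std(0)     # training-set statistics
        yield (test_subject,
               ((X[tr] - mu) / sd, t[tr]),
               ((X[va] - mu) / sd, t[va]))
```

With the per-subject Drowsiness counts of Table 7.2, the training-set sizes produced by this sketch reproduce the Tr column of Table 8.1.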
Hence, the MLP parameter optimisation involves the training of the following number of networks: (7
partitions) × (3 weight initialisations) × (14 values of J) × (4 values of νz) × (3 values of νy) = 3,528
networks.
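This count can be checked by enumerating the search grid directly:

```python
from itertools import product

partitions = range(1, 8)                 # 7 leave-one-out partitions
seeds = range(3)                         # 3 random weight initialisations
J_values = range(2, 16)                  # J from 2 to 15 (14 values)
nu_z_values = (1e-3, 1e-2, 1e-1, 1.0)    # input-to-hidden decay, powers of ten
nu_y_values = (1e-7, 1e-6, 1e-5)         # hidden-to-output decay, powers of ten

grid = list(product(partitions, seeds, J_values, nu_z_values, nu_y_values))
# 7 * 3 * 14 * 4 * 3 = 3,528 configurations, one network trained per entry
```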
8.1.4 Optimal (n − 1)-subject MLP per partition
The optimisation of the MLP parameters yields the results shown in Table 8.2. Fig. 8.1 shows the average
variation in misclassification error for the validation set with respect to the number of hidden units J .
The optimum value for J is clearly between 3 and 4 for the majority of the partitions as estimated using
Eq. 5.72. Partition 5 is the only one which has a higher value for optimum J . The best 10-3-1 MLP for
partition 5 has a classification error of 18.79% on the validation set, but the optimum value of J = 13
was kept for this partition. Fig. 8.2 shows the average variation in misclassification error in the validation
set with respect to the regularisation parameters (νz,νy) for the 10-3-1 MLPs. Either from the plot or
from the table, it can be seen that the optimum regularisation parameters occur towards the end of the
ranges (10−3, 10−7) in many of the partitions, suggesting that the search could have been continued in
that direction. However, a previous investigation found that network performance on the validation set
drops significantly for smaller values of νz and νy. This is expected since the regularisation terms become
negligible with a consequent loss in generalisation.
partition J νz νy Tr error Va error
1 3 10−2 10−6 21.02 20.73
2 3 10−3 10−7 20.54 19.65
3 3 10−3 10−6 19.69 19.22
4 3 10−3 10−6 19.37 19.39
5 13 10−2 10−5 18.36 18.73
6 3 10−3 10−7 22.15 22.22
7 4 10−3 10−7 19.62 20.31
Table 8.2: Optimum MLP parameters per partition and percentage classification error for training (Tr) and validation (Va) sets
("Vigilance 6-subject MLP optimisation": average classification error [%] against the number of hidden units)
Figure 8.1: Average misclassification error for the validation set vs. number of hidden units J for the (n − 1)-subject MLP
("Vigilance 6-subject MLP optimisation (J=3)": average classification error [%] against log10(νz) and log10(νy))
Figure 8.2: Average misclassification error on the validation set with respect to regularisation parameters (νz, νy) for the (n − 1)-subject MLP with J = 3 (linear interpolation used between 12 values)
8.2 Testing on the nth subject
The optimal MLP for each partition is tested using the nth subject. Given that the main goal is not classi-
fication, but the tracking of the alertness-drowsiness continuum, the assessment of MLP performance on
test data is carried out on the time course of the MLP output instead of on a number of randomly selected
1-s segment feature vectors. The time course of the MLP output is compared with the expert assessment
of the subject’s vigilance according to the Alford et al. scale described in section 3.4.2. The time courses
of the MLP output and expert scores are shown in Figs. 8.3 to 8.9. Given that the expert scored the EEG
on a 15-s basis, the MLP output is filtered using a 15-pt median filter. This allows comparison with the
expert’s discretised representation of the alertness-drowsiness continuum.
8.2.1 Qualitative correlation with expert labels
A visual inspection of the time courses and corresponding expert labels reveals that, in all cases, the time
course of the MLP output follows the fluctuations in the vigilance scale fairly closely.
There is no difference between the MLP outputs corresponding to labels Active and QWP, for which it is
almost always 1.0. Subject 1’s time course shows that the MLP is not reaching the lower values associated
with the WIθ category. The 2-D projections of this subject’s feature vectors in Figs. 7.3 and 7.4 may give a
possible explanation. It can be seen in the figures that the D patterns of this subject lie in the overlapping
area between the A and D classes. The MLP is not always able to resolve the difference between the two
classes in this area, hence the posterior probabilities of belonging to either class are approximately equal
(MLP output ≈ 0.5). Subject 2 is not affected by this problem, the network performance being generally
as expected as the MLP output sweeps the [0-1] range in synchronism with the expert labels. Large
fluctuations remain, even after the filtering, but this is expected from an individual who goes from being
totally drowsy to being fully active several times during the recording. Subject 3 is similar to subject
1, as the MLP output does not reach the drowsiness levels. His A and D patterns in the 2-D feature
space projection also lie in an area of high overlap. The performance for subject 4 is poor, the output
remaining persistently high despite the multiple occurrences of the WIθ label. There are fewer problems
with label WIα, the MLP output reaching a value of around 0.5. This subject’s A patterns seem to be
divided into two clusters far apart in the 7-subject Sammon map (Fig. 7.3), and some of the corresponding means are not
visited by the A patterns of other subjects. In contrast, the MLP analysis yields good results in general
for the next three subjects, as with subject 2. Subject 5’s MLP output matches the expert labels with only
two major exceptions, around times 00:42 and 01:57 (42 and 117 minutes), in which the MLP output
is low when the expert labels are WIα-QWP. Similar errors can be found in the time course of the MLP
output for subject 6, when for brief periods of time (around 00:25, 00:47 and 01:18), the MLP output
is high when the subject labels are WIθ-WIα. Note that this subject’s A and D patterns are the furthest
away in the 2-D projection of the feature space, lying in areas of little or no overlap between classes. One
possible reason for the segments with the poor correlation in the MLP output time course is the presence
of artefacts, as occurs during the interval centered on 01:18. Subject 7’s MLP performance also shows
a good correlation with the expert labels, with just two segments at times 0:17 and 1:20 for which the
output fails to indicate an intermediate to high level of alertness.
8.2.2 Quantitative correlation with expert labels
To give a more objective measure of MLP performance on each test subject in turn, the 15-pt median fil-
tered MLP output range was divided into three sub-intervals. Values between 0.0 and 0.3 are considered
to match the drowsy labels WCα, WIθ and WCθ. The second interval, bounded between 0.3 and 0.7,
represents the intermediate state WIα, and values between 0.7 and 1.0 correspond to the alert
states Active, QWP and QW. Correlation of the median-filtered MLP output with the expert labels, on a
1-s basis, according to this assignment, reinforces the visual assessment (see Table 8.3). The gap between
the best and the worst values is as narrow as 16.4%, the worst correlation being found for subject 4, as
expected, and the best for subjects 1 and 6.
partition      1      2      3      4      5      6      7
correlation  60.93  53.04  50.82  44.56  47.45  58.38  49.13
Table 8.3: Percentage correlation between 1-s segments of the 15-pt median filtered MLP output and 15-s based expert labels
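The banded correlation measure can be sketched as follows; the grouping of expert labels into the three bands follows the text above, while the label strings themselves are assumed spellings:

```python
import numpy as np

# assumed grouping of the Alford et al. labels into the three output bands
BAND_OF_LABEL = {
    "WCα": 0, "WIθ": 0, "WCθ": 0,        # drowsy: output in [0.0, 0.3)
    "WIα": 1,                             # intermediate: [0.3, 0.7)
    "Active": 2, "QWP": 2, "QW": 2,       # alert: [0.7, 1.0]
}

def output_band(y):
    """Band index (0, 1 or 2) of each filtered MLP output value."""
    return np.digitize(np.asarray(y, dtype=float), [0.3, 0.7])

def percent_correlation(mlp_output, expert_labels):
    """Percentage of 1-s segments whose output band matches the expert label band."""
    pred = output_band(mlp_output)
    true = np.array([BAND_OF_LABEL[lab] for lab in expert_labels])
    return 100.0 * np.mean(pred == true)
```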
8.3 Training an MLP with n subjects
The results in the last section show that an MLP trained with the vigilance database is able to track the
alertness-drowsiness continuum. The set of optimal MLPs for all the partitions could be used to analyse
new data as a committee of networks (see §5.3.3). The new data would be presented to all the networks
and the average of the outputs used as an estimate of the alertness posterior probability P (A | x).
However, this average may conceal rapid changes in the alertness-drowsiness continuum which may be
important in the assessment of sleepiness in OSA patients. An alternative and easier approach is to train
an MLP using all seven subjects in the vigilance database in order to analyse subsequent test data. This
neural network will be referred to as the 7-subject MLP in the sections and chapters which follow.
The sequence of steps is as follows:
1. Build training and validation sets on n subjects using 800 patterns per subject per class (or as
many as are available). This yields a total of 2,347 D and 2,800 A patterns (see
Table 7.3) per set.
2. Normalise training and validation sets with respect to the training set statistics.
3. For each set of values of the network parameters (J, νz, νy) and weight initialisation seed, train a
10-J-1 MLP using the cross-entropy error function and the scaled conjugate gradient optimisation
algorithm. Use three different random initialisations for the weights to increase the chance of
finding a better minimum for the error function during the training process.
4. Choose the optimal MLP based on the performance on the validation set.
The range for the MLP regularisation parameters is the same as for the MLP trained with (n−1) subjects.
The number of hidden units J is varied from 2 to 10. Thus, the total number of 7-subject MLPs trained to
find the optimum parameters is 324. Fig. 8.10 shows the average misclassification error for the validation
set against the number of hidden units J . The optimal MLP is found at J = 3, with regularisation
parameters (νz, νy) optimal at (10−3, 10−6). The best classification error on the validation set is 20.24%,
with a corresponding error of 20.67% on the training set.
8.4 Summary and conclusions
In this chapter, the vigilance database has been used to train a single-output MLP in order to track the
alertness-drowsiness continuum. Wakefulness EEG is more susceptible to artefacts and rapid changes
than sleep EEG. When the high degree of overlap between the Alertness and Drowsiness classes is also
considered, this makes the analysis of the vigilance EEG a more difficult problem. As there are only 7 subjects
available and it was known from the visualisation studies that there existed a large amount of inter-subject
variability in the feature vectors, the leave-one-out method was used to train the MLP. For
a 7-subject database this method yields 7 data partitions, each with 6 subjects. Training and optimisation
of the MLP parameters was carried out for each partition, and the optimal network tested in each case
with the nth subject. The correlation between the MLP output and the expert labels varies from 44.6% to
60.9% across the subjects, showing that an optimal MLP trained with (n − 1) subjects from the vigilance
database is capable of tracking the variations in the level of alertness of the nth (test) subject. For further
use with unseen data, the MLP is re-trained using all the subjects in the 7-subject database. Its use in the
evaluation of test data acquired from other subjects is considered in the following chapter.
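The leave-one-subject-out partitioning used here is straightforward to express in code; a minimal sketch follows, where `subjects` stands for any list of per-subject data sets (an illustrative helper, not taken from the thesis):

```python
def leave_one_out_partitions(subjects):
    """Yield (train_subjects, test_subject) pairs.

    For a 7-subject database this gives 7 partitions, each training
    on the remaining 6 subjects and testing on the held-out one.
    """
    for i, test_subject in enumerate(subjects):
        train_subjects = subjects[:i] + subjects[i + 1:]
        yield train_subjects, test_subject
```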
Figure 8.3: Time course of the MLP output for vigilance subject 1 (raw output above, 15-pt median-filtered output with expert labels below; time in minutes)
Figure 8.4: Time course of the MLP output for vigilance subject 2
Figure 8.5: Time course of the MLP output for vigilance subject 3
Figure 8.6: Time course of the MLP output for vigilance subject 4
Figure 8.7: Time course of the MLP output for vigilance subject 5
Figure 8.8: Time course of the MLP output for vigilance subject 6
Figure 8.9: Time course of the MLP output for vigilance subject 7
Figure 8.10: Average misclassification error for the validation set vs. number of hidden units J for the 7-subject MLP
Chapter 9
Testing using the vigilance-trained network
The MLP trained with the vigilance database can now be used to track the vigilance continuum in new
OSA patients. This chapter presents the use of the 7-subject vigilance MLP with new data obtained during
a separate vigilance study in OSA patients.
9.1 Vigilance test database
A physiological vigilance study carried out by the Osler Chest Unit staff at the Churchill Hospital, Oxford, provides frontal EEG records from ten OSA subjects with varying degrees of severity of the sleep disorder. The EEG was recorded during a vigilance test which lasted for a maximum of 40 minutes, the duration depending on the degree of sleepiness of the subject during the test. The test, performed in a sleep-promoting environment, requires the subject to respond (by pushing a button) each time he sees a light-emitting diode (LED) flash for about 1 s. The LED flashes every 3 seconds and the test finishes after the subject misses 7 consecutive stimuli. More details about the test and clinical details of
the patients’ sleep disorders can be found in Appendix D. No expert scores are provided for this database,
just the button signal for every test. This can be used as a performance measure to validate the analysis
of the EEG. A summary of this database follows:
Number of subjects: 10
Condition: diagnosed with OSA
Description: 4 to 6 vigilance tests per subject, denoted with the letters A to F in chronological order
Electrode montage: frontal
Sampling frequency: 128 Hz
Number of expert scorers: none, but a performance measure is available
9.2 Running the 7-subject vigilance MLP with test data
9.2.1 Pre-processing
EEG signal: The EEG data was pre-processed with the 19-point low-pass FIR filter and the mean removed, as described in previous chapters. Feature extraction, using 10th-order reflection coefficients calculated with Burg's algorithm within a sliding 3s window with 2s overlap, yields a 10-D vector for each second of EEG. The complete set of these feature vectors will be referred to as the LED test set from now on.
Visual identification of artefacts in the EEG was performed to mark and discard from the analysis those
segments contaminated with saturation and artefacts caused by poor electrode contact. Subject 10’s tests
B and E were excluded from the analysis that follows, due to artefacts or the lack of regular response to
the stimuli.
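The framing of the EEG into overlapping analysis windows can be sketched as below. The AR fit itself (Burg's algorithm) is not shown, and the function name and signature are illustrative only:

```python
def sliding_windows(x, fs=128, win_s=3, step_s=1):
    """Slice an EEG signal into 3 s analysis windows advancing by 1 s
    (i.e. a 2 s overlap), yielding one window -- and hence one 10-D
    reflection-coefficient vector once the AR model is fitted -- per
    second of EEG.
    """
    win, step = win_s * fs, step_s * fs
    return [x[i:i + win] for i in range(0, len(x) - win + 1, step)]
```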
Button signal: The pulse signal from the button was filtered and used to extract a performance measure
related to the number of missed stimuli. No trigger signal was provided, hence the start of each test was
set to be the second at which the subject starts pressing the button with regularity, every 3s, assuming that
the LED flashed every 3s from that moment on. A missed stimulus is then recorded as occurring when
the button is found not to have been pressed during the three seconds between flashes. The number of
consecutive missed stimuli is calculated on a 3s basis, synchronised with the stimuli, i.e. if the subject has missed n consecutive hits at time ta seconds, then 1 missed hit was recorded at (ta − 3(n−1)) seconds, 2 missed hits at (ta − 3(n−2)) seconds, . . . , (n−1) missed hits at (ta − 3) seconds, and finally n missed hits at ta.
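The reconstruction of the performance measure from the button signal can be sketched as follows, under the simplifying assumption that the filtered pulse signal has already been reduced to a list of press times in seconds (the thesis works from the raw pulse train; the function and its name are illustrative):

```python
def consecutive_missed_hits(press_times, test_len_s, period_s=3):
    """Per-stimulus count of consecutive missed hits.

    The test is divided into 3 s bins, one stimulus per bin; a bin is
    a 'hit' if the button was pressed within it.  The counter grows by
    one for each consecutive miss and resets to zero on a hit, matching
    the (ta - 3(n-1)), ..., (ta - 3), ta accumulation described above.
    """
    pressed_bins = {int(t // period_s) for t in press_times}
    counts, run = [], 0
    for b in range(test_len_s // period_s):
        run = 0 if b in pressed_bins else run + 1
        counts.append(run)
    return counts
```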
9.2.2 MLP analysis
Normalisation of the reflection coefficients extracted from the EEG signals acquired during the LED tests
was performed, using the 7-subject training set statistics, and the normalised patterns were presented
to the 7-subject 10-3-1 vigilance MLP (see §8.3). Figures 9.1 to 9.22 show the MLP output time courses
along with the missed stimuli performance measure for each test for each patient. None of the MLP
outputs shown have been median-filtered. Note also the different time scales for each figure, depending
on the length of each test.
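A sketch of the normalisation step, assuming per-coefficient z-scoring with the training-set mean and standard deviation (the thesis states that the 7-subject training statistics are reused; the exact scheme is assumed here):

```python
import numpy as np

def normalise_with_training_stats(test_vectors, train_vectors):
    """Zero-mean, unit-variance normalisation of the LED feature
    vectors using the *training* set statistics, so that test data
    are scaled exactly as the data the MLP was trained on."""
    mu = train_vectors.mean(axis=0)
    sigma = train_vectors.std(axis=0)
    return (test_vectors - mu) / sigma
```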
Visual inspection of the time courses does not reveal a consistent pattern of correlations between the
MLP output and the performance measure across patients, and not even between different tests for the
same patient. For instance, test A for subject 1 shows a paradoxically low value of the MLP output for the
first 3 minutes of the test, when the subject missed no more than 3 stimuli, and an increase in the MLP
output towards the second half of the test as the subject starts to miss more and more button hits. The
MLP output for the other two tests (C and D) suggests a drowsy subject struggling to keep himself awake
throughout the test, with little or no correlation with the actual performance, with the exception of the
last few seconds of test D, at which point the output goes close to zero while the missed stimuli measure
shows a severe decrease in the subject’s performance.
The next example, subject 2 test A, shows very good correlation between the MLP output and the perfor-
mance measure. The MLP output, generally high during the first half of the test, when the performance is
good, suddenly decreases, just before the performance starts to deteriorate, and remains close to drowsi-
ness levels towards the end of the test. However, the MLP output in subsequent tests for the same
subject suggests a drowsier subject, remaining under 0.5 even when the performance is good. Some iso-
lated peaks and other oscillations close to intermediate values may indicate the subject’s struggle against
drowsiness. Subject 3 is another example of good correlation in the first and fourth tests but not in the
other three tests, for which the MLP output is highly oscillatory with no apparent connection with the
button hits. The fourth test is similar to the first one, showing a decreasing trend as the performance
worsened.
Subject 4 endured the four tests with excellent performance, and the MLP output was correspondingly high throughout. Unfortunately, this subject provides no data for the “drowsy stages”, since he never missed more than two hits. The flat output during the 33rd minute of test C for subject 4 is an artefact due to a loose electrode connection. Subject 5 has a medium-to-low MLP output, with a trend that tends
to match the decrease in the performance in all the tests.
The MLP output for subject 6 shows dramatic oscillations between the extreme values of drowsiness and
alertness (0 and 1) as the performance decreases. The period of the oscillations is of the order of minutes,
and for most of the cases when the number of missed hits rises above a value of two, a dip in the MLP
output which is approximately 20s-long precedes the increase in the number of missed stimuli. Although
this indicates a better correlation than for the subjects discussed up to now, a low value in the MLP output
at the beginning of tests A, B, C and E, when the performance is very good, spoils the overall correlation
between the MLP output and the performance measure for this subject.
Subject 7 starts with a very short test, during which the MLP output is constantly low despite a first minute of perfect responses to the stimuli. His second test shows the expected correspondence between
the MLP output and the number of missed hits. His third test is similar to the second one, but longer,
suggesting that the subject was drowsier than in the previous tests, perhaps performing reasonably well
because he had learned to cope with the test in spite of his increasing drowsiness. His fourth test differs
from the rest in that he did not fall asleep. The MLP output suggests that he was drowsy at the start of
the test, progressively gaining a better degree of vigilance, until he starts missing several LED flashes, and
then struggles between drowsiness and alertness throughout the rest of the test, although maintaining
good performance until the end.
The 8th subject shows an MLP output close to 1.0 throughout, with a few dips, most of them matching
the loss of the ability to respond to the stimuli. The one exception to this pattern is the 3-minute period at the
beginning of test C, where the MLP output sweeps through a much wider range. It is worthwhile noting
that subject 8 has the lowest value for the subjective measure of sleepiness (ESS) of any patient in this
database (see Table D.2), and also one of the lowest oxygen saturation dip rates during the overnight
sleep study (see Table D.3), values comparable to those of subject 4, whose performance was the best
as he did not fall asleep during any of the tests. These two subjects could be considered to be at the
lower end of the spectrum of OSA severity, and the MLP output seems to corroborate the clinical results.
However, we shall see later that subject 8's EEG differs greatly from the rest of the EEGs in the LED
database and in the original vigilance database of chapter 8, which explains why the output is almost
constant at the upper end of the scale.
The next subject, number 9 in the database, has the worst ESS value and a relatively high index of
night-time O2 de-saturations, and fell asleep very quickly in every test. The MLP output is generally well
correlated with the performance measure, in trend and locally, as for example, when the subject recovers
from a peak in the number of missed stimuli. The last subject in the database, subject 10, performed
very short tests, and there is a degree of correlation between the MLP output and the performance
measure in the first two tests. His third test shows a notch in the MLP output preceding a 9s-long lapse in
performance. His fourth test, however, does not show any significant decrease in the MLP output when
the performance drops towards the end.
To corroborate these comments, the MLP output values were plotted against the number of consecutive
missed stimuli for each LED subject. Only the values at the times at which the stimuli occur have been
considered, as we cannot make any assumption about the subject’s vigilance state in the absence of the
stimulus.
The scatter plots in Figs. 9.23 and 9.24 show that the MLP output tends to take on values below 0.5 as
the number of missed hits increases, especially in subjects 1, 2, 5 and, to some extent, in subjects 7 and
9. This cannot be said of subjects 3, 6, 8, and 10, however, although results for subject 8 can be discarded
Figure 9.1: LED subject 1 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.2: LED subject 1 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.3: LED subject 2 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.4: LED subject 2 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.5: LED subject 3 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.6: LED subject 3 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.7: LED subject 3 MLP output and missed hits time courses (test E)
Figure 9.8: LED subject 4 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.9: LED subject 4 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.10: LED subject 5 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.11: LED subject 5 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.12: LED subject 6 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.13: LED subject 6 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.14: LED subject 6 MLP output and missed hits time courses (test E)
Figure 9.15: LED subject 7 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.16: LED subject 7 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.17: LED subject 8 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.18: LED subject 8 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.19: LED subject 9 MLP output and missed hits time courses ((a) test A; (b) test B)
Figure 9.20: LED subject 9 MLP output and missed hits time courses ((a) test C; (b) test D)
Figure 9.21: LED subject 10 MLP output and missed hits time courses ((a) test A; (b) test C)
Figure 9.22: LED subject 10 MLP output and missed hits time courses ((a) test D; (b) test F)
as we will see in the next section. Subject 4 lacks any data for more than 2 missed hits. It is important to
note that the reliability of the results for high values of missed hits is low, as not enough data points are
available, given that the test finishes whenever the subject fails to respond to 7 consecutive LED flashes.
Figure 9.23: LED subjects MLP output vs missed hits scatter plots (subjects 1 to 5)
It is also clear that, when the subject responds to the stimuli (i.e. no missed hits), the MLP output can
take on any value over the whole range, with a distribution that varies from unimodal to bimodal to
uniform, as is shown in Fig. 9.25. This figure is particularly puzzling for subjects 1 and 2, as well as
subject 5, since all these have a unimodal distribution around zero or a very low value of MLP output.
Subjects 3, 7 and 9 on the other hand, present a very uniform distribution. This suggests that severe OSA
patients may perform reasonably well for some time when their brain is in a ”drowsy” state. However,
they cannot maintain this level of performance indefinitely, and it drops off sooner or later depending on
the severity of the disorder.
It can also be said, when reviewing the MLP outputs for all the subjects, that the transition to Drowsiness
can happen in a progressive manner as well as in sudden dips. Two examples of the former are shown
Figure 9.24: LED subjects MLP output vs missed hits scatter plots (subjects 6 to 10)
in Figs. 9.11 and 9.3, and two examples of the latter can be found in Figs. 9.12 and 9.6. It is not clear
why this should be, but it is probably dependent on the subject rather than on the condition under which
the test is performed (e.g. the time at which the test takes place), as no subject was found to exhibit
both types of behaviour in the LED tests. The time courses of the MLP outputs from the normal sleep-
deprived subjects (Figs. 8.3 to 8.9 in the previous chapter) showed predominantly a gradual transition
to drowsiness, with a few occasional dips (for example, subject 2 at times 55 minutes and 100 minutes)
but the lack of a suitable performance measure for these records prevents us from being able to draw a
definite conclusion.
9.3 Visualisation analysis
In order to get a deeper insight into the EEG data corresponding to good performance (i.e. no missed hits
as for the histograms of Fig. 9.25), the distribution of the vectors in feature space is investigated using
the 7-subject vigilance NEUROSCALE map of section 7.1.2.
Figure 9.25: LED subjects no-missed-hits MLP output histograms
9.3.1 Projection onto the 7-subject vigilance NEUROSCALE map
The LED feature vectors are normalised using the mean and variance of the 7-subject vigilance database,
and then presented to the NEUROSCALE map previously trained on the same vigilance database. Figs. 9.26
and 9.27 show the projection of the LED patterns (thick dots, grey for data points with 0 and 1 missed
hits, yellow for data points with 2, 3 or 4 missed hits, and red for data points corresponding to 5, 6 or 7
missed hits) in the 7-subject vigilance 2-D map (A means in magenta ×’s and D means in blue o’s). The
scale is the same for Figs. 9.26 to 9.28, slightly expanded on Fig. 9.29 and completely different for subject
8 on Fig. 9.30. In each case, the percentage of outliers not shown on the map is indicated in brackets.
The 2-D projections show that for more than half of the LED subjects, the patterns lie in the same area as
the vigilance A and D means, with a few patterns lying outside, mainly in the middle-lower section of the
plots and in the outer area around the A means. The LED database, in contrast with the vigilance database
which used the central electrode montage, was acquired with frontal electrodes and it was found that
Figure 9.26: Patterns from LED subjects 1 and 2 projected onto the 7-subject vigilance NEUROSCALE map ((a) LED subject 1, outliers 3.4%; (b) LED subject 2, outliers 0%)
EEG contaminated by blinking artefacts or movement artefacts produces patterns that lie on the periphery
of the A means. EEG recorded with frontal electrodes is prone to these two kinds of artefacts, usually
absent in central EEG, and this could explain the outliers in the 2-D projections of the patterns for LED
subjects 4, 5, 6, 7 and 10. In contrast, subjects 1, 2, 3 and 9 produced EEG patterns which are very
likely to come from the same distribution as the vigilance database, given their projections in the 2-D
NEUROSCALE map. Subject 8’s patterns represent the other end of the range, his 2-D projection showing
a high percentage of patterns lying far away from the A and D means, towards the lower-right corner of
the map.
Qualitative correlation with the MLP results
Subject 1 Most of his 0-1 missed hits patterns lie in the region of overlap between alertness and
drowsiness. There is a tendency towards the area in the map which represents drowsiness, so that it can
Figure 9.27: Patterns from LED subjects 3 and 5 projected onto the 7-subject vigilance NEUROSCALE map ((a) LED subject 3, outliers 1.6%; (b) LED subject 5, outliers 0.62%)
be said that this subject was mainly drowsy during the tests. The histogram in Fig. 9.25 for the 0-1 missed
hits MLP output is unimodal, with a peak near zero, indicating that this subject performs well while
drowsy, although he is the second most severe case of OSA in the database according to the overnight
sleep study, the severity of the disorder being mostly assessed here according to the number of dips in
oxygen saturation per hour (see Appendix D).
Subject 2 This subject has a similar distribution of data points over the map, with more vectors in the
alertness region than subject 1, corresponding to the second peak in the bimodal distribution of the 0-1
missed hits MLP output values in Fig. 9.25. From this, it can be said that this subject is generally drowsy
but sometimes alert. The MLP output time courses reveal that the subject only presents these “alert”
patterns during the first test, then he seems to have trained himself to perform in “automatic mode”1, i.e.
while being deeply drowsy. His overnight study suggests that he also has a severe case of OSA.
Subject 3 This subject’s patterns are evenly distributed over the region of overlap on the map, corre-
lating with the uniform distribution of the 0-1 missed hits MLP outputs in Fig. 9.25. These results and
the MLP output time courses suggest that the level of vigilance for this subject varies between drowsiness
and alertness. He had an average number of oxygen de-saturations the previous night.
Subject 4 This subject’s patterns in general, and especially those for 0-1 missed hits, lie mostly within the alertness region and the region of overlap (suggesting that he was alert during the tests),
with a significant number of outliers. The histogram for the MLP output in Fig. 9.25 shows a unimodal
distribution with a peak at 1.0. This peak is due not only to the patterns within the region of alertness
but also to the outliers. This subject is the second mildest case of OSA in the database.
Subject 5 His patterns are distributed in the region of overlap on the map, with a tendency
towards alertness, so that it can be said that he was slightly more alert than drowsy. The histogram in
Fig. 9.25 shows a unimodal distribution with a mean at around 0.3. This subject represents the mildest
degree of OSA in the database.
Subject 6 The patterns for this subject are unusual in that they lie mostly in the lower centre area
of the map (a region of overlap) with a large number of outliers. The distribution for the MLP output for
0-1 missed hits in Fig. 9.25 shows a peak which could be due to outliers, the distribution being uniform
otherwise (and suggesting a level of vigilance between drowsy and alert). Indeed, the MLP output time
courses show continuous fluctuations between drowsiness and alertness. The sleep study categorised this
subject as having a serious case of OSA.
1Automatic behaviour is a phenomenon reported by the sleep-deprived in which they perform relatively routine behaviour without having any memory of doing so [31].
Subject 7 Although they lie mostly in the region of overlap between drowsiness and alertness,
a proportion of the patterns for 0-1 missed hits lies in the alertness area, explaining the second peak in
the 0.9-1.0 bin of the histogram in Fig. 9.25. The first peak occurs around 0.25. This suggests that this
subject was more alert than drowsy, but is also able to perform well when drowsy, as his first and third
test reveal in the MLP time courses.
Subject 8 Note the completely different scale used for this subject: a large proportion of his
patterns are outliers, with a few lying in the alertness-dominated region. The impulse-like histogram in
Fig. 9.25 probably owes its peak at 1.0 to the outliers.
A visual inspection of subject 8’s EEG reveals a signal rich in high frequencies. The raw signal was strongly
contaminated with mains interference, removed by the filtering process prior to analysis. Nevertheless,
the filtered signal still shows frequencies in the upper β band, i.e. as high as 25-30 Hz, characteristic of
a very alert state. The vigilance database was obtained from normal subjects who were sleep-deprived,
and who were probably drowsy enough that the higher β frequencies did not appear in their EEG. These
frequencies are therefore absent in the training database, and subject 8's patterns appear as outliers in
the NEUROSCALE map of Fig. 9.30.
Subject 9 The few patterns available from this subject lie in the region of overlap on the map, with some
of them spreading towards the alertness region. This correlates well with the nearly uniform distribution
for the histogram of MLP outputs for 0-1 missed hits in Fig. 9.25, and suggests that this subject’s vigilance
was somewhere between alertness and drowsiness. This subject, who fell asleep very quickly in every
test, rated his level of sleepiness as the worst possible (ESS in Table D.2); the table also shows a large
number of oxygen de-saturations during the night (severe OSA).
Subject 10 Almost all the patterns for this subject lie in the alertness area of the map, including those
for 5-7 missed hits. This suggests that while subject 10 was mostly alert during the tests, he failed to
respond to 5, 6 and 7 LED flashes when alert! Although his histogram of MLP outputs for 0-1 missed
hits is as expected, the scatter plot in Fig. 9.24 shows no correlation between the MLP output and the
performance measure. This corresponds to what is shown in the NEUROSCALE map. It is important to
note that this subject fell asleep very quickly each time and is the most severe case of OSA in the database,
with a rate of oxygen de-saturations more than double the second most severe case, and with the highest
number of movement arousals during the night.
9.4 Discussion
The results of the visualisation analysis have shown that for some subjects in the LED database, such as
subject 8 and, to a lesser extent, subjects 4 and 6, the EEG does not have the same characteristics as
the EEG in the training database, and hence the results from the MLP are not reliable. Also, the low
average value of the MLP output for most of the subjects may be an influential factor in the decreasing
exponential trend in the scatter plots of subjects 1, 2, 5, 7 and 9, as the statistical significance of the
plot decreases (fewer data points) with an increase in the number of consecutive missed hits. Except for
subject 6, the projection in the NEUROSCALE map shows little difference in the distribution of the feature
vectors for 0–1, 2–4 and 5–7 missed hits. Although this does not necessarily imply the same overlap in
the 10-D feature space, it is another factor to bear in mind in the interpretation of the MLP results and
when considering the correlation between the MLP output and the performance measure for the subjects
in the LED database.
9.5 Summary and conclusions
The MLP trained with the 7-subject vigilance database has been used to analyse new data from a vigilance
study in OSA patients. The study, consisting of 4 to 6 vigilance tests, provided frontal EEG recordings
and a performance measure at regular intervals. The MLP output, representing a continuum between
drowsiness (0) and alertness (1), was calculated for each test.
Visual inspection of the MLP output time courses does not show consistent correlation with the performance
measure. Scatter plots of the MLP output against the performance measure reveal that the MLP
output is generally low when the performance measure indicates deep drowsiness, as expected, but takes
any value between 0 and 1 when the performance measure suggests alertness. Similar results have been
found by other researchers in a random visual stimulus response test [88][32]. OSA patients seem to
perform relatively well even when their electrophysiological signals indicate drowsiness or even light
sleep. Kecklund and Akerstedt have also found that lorry drivers seem to be able to drive in spite of the
appearance of alpha activity in their EEG [83]. However, the reduction in the statistical significance of
the correlation between MLP output and performance measure as the performance deteriorates prevents
us from making any strong statement about the “drowsy” EEG of OSA subjects.
The NEUROSCALE visualisation technique was also applied to the database tested in this chapter in order
to validate the MLP results. The analysis strongly suggests that the EEG of one of the subjects in the
database, subject 8, is very different from that in the vigilance database and therefore the MLP results
for this subject should be discarded, as the MLP produces no reliable results when it extrapolates. Of
the remaining subjects in the database, 7 out of 9 seem to have EEG patterns belonging to the same
distribution as that found in the vigilance database, and hence the MLP trained with normal subjects can
be used in the study of these OSA patients.
Figure 9.28: Patterns from LED subjects 7, 9 and 10 projected onto the 7-subject vigilance NEUROSCALE map. Each subject's patterns are shown in three panels (0-1, 2-4 and 5-7 missed hits) on Drowsy-Alert axes; outliers: subject 7, 1.1%; subject 9, 0.12%; subject 10, 2.1%.
Figure 9.29: Patterns from LED subjects 4 and 6 projected onto the 7-subject vigilance NEUROSCALE map. Each subject's patterns are shown in three panels (0-1, 2-4 and 5-7 missed hits) on Drowsy-Alert axes; outliers: subject 4, 8.9%; subject 6, 0.61%.
Figure 9.30: Patterns from LED subject 8 projected onto the 7-subject vigilance NEUROSCALE map, in three panels (0-1, 2-4 and 5-7 missed hits) on Drowsy-Alert axes; outliers: 0.54%.
Chapter 10
Conclusions and future work
10.1 Overview of the thesis
The main objective of the research described in this thesis has been to develop neural network methods
to study the sleep and wake EEG of subjects with the severe breathing disorder OSA. In chapter 6, which
describes the analysis of sleep studies, an MLP neural network was trained with AR model reflection
coefficients as inputs. These were extracted from a single channel of EEG recorded during the sleep of
normal subjects and the network was trained to track the sleep-wakefulness continuum. An automated
system based on this MLP output and a set of logical rules was developed and tested with OSA sleep EEG.
The results, validated against scores from a human expert, show that the automated system is able to
detect most of the μ-arousals in the EEG of these patients with accuracy, not only in occurrence (with a
median sensitivity of 0.97, and a median positive predictive accuracy of 0.94), but also in starting time
and duration (with a median correlation index of 0.82). In chapter 7 visualisation algorithms applied
to the sleep EEG database and the wake EEG database (acquired from sleep-deprived normal subjects),
showed the need for another MLP network to analyse the wake EEG, as its characteristics differ from
those of the EEG in the sleep database. The vigilance analysis, covered in chapters 8 and 9, was again
carried out by training MLP neural networks with AR model reflection coefficients at the input. This time,
these were extracted from a single channel of EEG recorded from normal sleep-deprived subjects and the
network was trained to track the alertness-drowsiness continuum. The trained MLPs were tested with
data from normal sleep-deprived subjects as well as from OSA patients performing a visual vigilance task.
The test on normal subjects was correlated with a human expert assessment of the EEG, and the mean
correlation was found to be 52.0% (sd 5.9%). A performance measure was used to evaluate the MLP
output on the EEG from OSA subjects. The results of this analysis, although not totally conclusive, have
raised important questions about the effect of OSA on the EEG.
10.2 Discussion of results
While the sleep studies yielded very good results in general, the correlation between the MLP output
and the performance measure in OSA subjects was highly variable. It is well known, however, that the
effectiveness of performance measures in the assessment of sleepiness depends largely on the task char-
acteristics. There is no perfect task to evaluate the decrease of vigilance. The physiological-behavioural
link is not straightforward, the task itself being intrusive in the natural process of drowsiness. Many fac-
tors such as motivation, circadian rhythm and habituation can make a very drowsy subject perform well
or better than otherwise. Pivik [130] stressed the relevance of long-practice effects, which can improve
the performance on a given task without an improvement in the physiological condition. Also, not all
investigations of sleep loss have shown adverse effects on performance [139]. The effects of sleep loss are
similar to those of OSA, as the latter fragments sleep and diminishes its total time. Dinges et al. review
the literature in the area [42] and conclude that performance variance increases with sleep loss, that
habituation to a repetitive task is augmented in a sleepy brain, that performance depends non-linearly
on sleep loss and time of day (related to the circadian rhythm), and that motivation or “willingness
to perform” may have a distinct effect on the capacity to perform. The attentional task used to build
the LED database is repetitive and could have caused habituation to the task after a few minutes of the
first test or in subsequent tests, as is the case for subjects 1 and 2. Variance in performance as the subject
gets drowsy might explain the poor correlation found for subjects 3, 6 and 10 (see Figs. 9.23 and 9.24).
Circadian rhythm and/or motivation may explain why subject 7 fell asleep after two and a half minutes
in his first test at midmorning, and performed without falling asleep for 40 minutes in the last test, early
in the afternoon.
EEG inter-subject variability was a problem encountered many times in this thesis. The μ-arousal
automated system results were very satisfactory for 5 out of 7 subjects, but disappointing for the other
two. One of these two subjects presents mixed-frequency EEG at the time of the μ-arousal and the
other shows an EEG with much higher content in the upper frequency bands than the rest of the
database. In a recent paper [46] Drinnan et al. have found that μ-arousal inter-scorer agreement tends
to be poor when the μ-arousal occurs embedded in high-frequency EEG. Also, in the vigilance EEG study
(chapter 7), the case of an α+ subject highlighted the variation of wake EEG patterns in the general
population, and showed the need for special consideration of these subjects, who represent a
non-negligible fraction of the total population.
10.3 Main research results
As mentioned in section 1.1, before the research described in this thesis there had been no computerised
analysis, within a single framework, of both sleep disturbance and vigilance from the EEG. In the course
of this research, several findings have been made. Amongst these are:
1. A compromise was found between the stationarity requirements of AR modelling and the variance
of the AR reflection coefficients by using a 3-second analysis window with a 2-second overlap.
The AR model is still able to follow rapid changes whose duration is 3 seconds or more. To our
knowledge, this is the first time that AR modelling has been used in μ-arousal detection.
2. A μ-arousal may cause an increase in the δ rhythms of the EEG at the same time as it causes an
increase in the amplitude of the higher frequencies (α and/or β bands). Hence a μ-arousal is
not necessarily just a “shift” in frequency, as often described in the literature related to μ-arousal
detection.
3. The visualisation analysis described in chapter 7 revealed that Alertness and Drowsiness in vigilance
tests are not the same as Wakefulness and REM/Light Sleep in a sleep-promoting environment.
4. MLP analysis and visualisation techniques applied to wake EEG in OSA subjects show that these
patients can present “drowsy” EEG while performing well during a visual vigilance test.
5. The MLP analysis of the wake EEG in OSA patients has shown that the transition to Drowsiness may
occur progressively as well as in sudden dips.
6. The Alertness EEG patterns of OSA subjects may not be the same as the Alertness patterns of normal
sleep-deprived subjects. Instead they seem to resemble more closely the Drowsiness patterns of these
normal subjects.
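The windowing compromise in finding 1 can be sketched as follows. This is an illustrative reconstruction, not the code used in the thesis: Burg's method is one standard way of estimating AR reflection coefficients, and the 128 Hz sampling rate and order-10 model below are assumptions for the example.

```python
import numpy as np

def burg_reflection_coeffs(x, order):
    """Reflection (lattice) coefficients of an AR model via Burg's method."""
    f = x.astype(float).copy()    # forward prediction errors
    b = x.astype(float).copy()    # backward prediction errors
    k = np.zeros(order)
    for m in range(order):
        fm, bm = f[1:], b[:-1]    # align forward/backward error sequences
        k[m] = -2.0 * np.dot(fm, bm) / (np.dot(fm, fm) + np.dot(bm, bm))
        f, b = fm + k[m] * bm, bm + k[m] * fm
    return k

def sliding_features(eeg, fs, win_s=3.0, step_s=1.0, order=10):
    """3-second analysis windows with a 2-second overlap (1-second step)."""
    win, step = int(win_s * fs), int(step_s * fs)
    return np.array([burg_reflection_coeffs(eeg[i:i + win], order)
                     for i in range(0, len(eeg) - win + 1, step)])

fs = 128                                # assumed sampling rate (Hz)
rng = np.random.default_rng(0)
eeg = rng.standard_normal(10 * fs)      # 10 s of surrogate "EEG"
feats = sliding_features(eeg, fs)       # one 10-D feature vector per second
print(feats.shape)                      # (8, 10)
```

Burg's recursion guarantees that every reflection coefficient has magnitude at most 1, which keeps the feature vectors in a bounded range for the MLP inputs.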
10.4 Conclusions
From the sleep study results we conclude that the automatic scoring system, based upon a neural network
trained with normal sleep EEG, can be reliably used as a supporting diagnostic tool in the detection of
μ-arousals in OSA patients.
As for the vigilance study, the neural network proved useful in the assessment of the EEG of these patients
in relation to their performance during the task. More work needs to be done to improve the statistical
significance of these results and to validate them against the scores from a human expert, as it is well
known that correlation of EEG with performance measures is not consistent, the task performance being
influenced by many factors other than physiological sleepiness.
In summary, the use of neural network methods, namely the NEUROSCALE algorithm for EEG data
visualisation and the MLP network for the description of the EEG state in sleep and in vigilance, has led
to a better understanding of the effects of the OSA disorder on the sleep EEG, as we have obtained more
insight into the changes during a μ-arousal. Also, we have found that the EEG alertness patterns of OSA
subjects may present similar characteristics to those of the drowsiness patterns in normal subjects (after
minor sleep deprivation).
10.5 Future work
Several issues are left open, which should be considered if further work is to be carried out in the
analysis of the EEG in OSA patients. The most important of these are:
1. There is a missing link between alertness in sleep-deprived normal subjects, alertness in OSA sub-
jects and wakefulness in normal subjects prior to sleep onset. To fill this gap, a study should be
carried out to acquire EEG data from normal fully alert subjects performing vigilance tasks.
2. The various databases used in the work described in this thesis come from three different sleep
laboratories. This has some repercussions on the results as the databases differ in several aspects.
For instance, the wake EEG signals acquired with the frontal electrode montage show variations
in the α and θ content of the signal with respect to those acquired with the central electrode
montage. The vigilance training database was acquired using the central electrode montage while
the test database was recorded using frontal electrodes. This could have a significant effect on the
assessment of alertness by the neural network. Human expert scoring based on the same scale
as used for the training database is desirable on the LED database, in order to validate the MLP
results, given the controversy surrounding the reliability of performance measures in the evaluation
of drowsiness. Also, the task should be redesigned in order to control the habituation factor, and to
increase the amount of data for low performance.
3. The wake EEG from some OSA subjects presented patterns which differ to some degree from
normality. Is the EEG of OSA patients when they are awake different from that of normal subjects? In
other words, is their alertness EEG more like the drowsiness EEG for normal individuals? Are they
constantly drowsy, behaving as if they were alert as a result of habituation? Little or no attention
has so far been given to this issue which has a major effect on the quality of life for many people.
4. There is a good deal of controversy surrounding the treatment of OSA [173]. Some evidence has
been found to support the use of nasal continuous positive airway pressure (nCPAP) therapy [59][32] as a
means of keeping the airways open during sleep. Pre- and post-treatment analysis of the EEG and
its relation to performance is suggested to validate nCPAP as an effective therapy for OSA. Once
this is done, the results could be used to find out if the EEG recovers after treatment (so that both
alertness and drowsiness patterns become similar to those of normal subjects) or whether there is
an irreversible long-term effect on the EEG.
5. The α content of the wake EEG has been described as distractive by clinicians [154], as it largely
differs across the population, and shows inconsistent variations from alertness to drowsiness. Some
of the work done in vigilance assessment [170] (and reviewed in section 3.4) uses α power with
eyes open and eyes closed, as a reference. We suggest that this procedure should be considered
(making sure that the subject is alert during this “calibration”), in order to pave the way for subject-
adaptive vigilance analysis. This would imply re-training the neural network using eyes-open and
eyes-closed α power as reference in order to adjust the results for inter-subject α differences.
6. The work in this thesis has taken the application of linear models as far as possible, but the
assumption of stationarity must break down regularly for wake EEG. Therefore, non-linear features, such
as complexity as in the work of Rezek et al. [137], time-delay embedding and ICA as in the work
of Lowe [98], should be investigated, as they have been reported as giving better discrimination
than linear methods in preliminary studies on the changes in the EEG of subjects either asleep or
performing vigilance tasks.
7. A more generalised approach in learning theory, called Support Vector Machine (SVM), has been
used for regression and classification [169], reporting better generalisation than neural networks
[144]. SVMs have been discarded for the work presented in this thesis as they lack probabilistic
outputs. However, a Bayesian framework has recently been developed for SVM, introducing the
Relevance Vector Machine (RVM) which does not suffer from the above disadvantage, and demon-
strates comparable generalisation performance to SVM [162]. Future work should explore RVM as
an alternative to the use of neural networks in posterior probability estimation for the sleep and the
vigilance problems.
Appendix A
Discrete-time stochastic processes
A.1 Definitions
The definitions presented in this section have been taken from [63] and [62].
Stochastic Process: A statistical phenomenon that evolves in time according to probabilistic laws. From the definition of a stochastic process one may be tempted to interpret it as a function of the discrete time variable n¹, when in fact it represents an infinite number of different realisations ξ of the process u(n, ξ). An ensemble represents a set of realisations of the same process.

Time Series: A realisation ξ₀ of a discrete-time stochastic process is called a time series, u(n), consisting of a set of observations generated sequentially in time. A time series of interest is a sequence of observations u(n), u(n − 1), ..., u(n − M) generated at discrete and uniformly spaced instants of time n, n − 1, ..., n − M.

Statistical description of a discrete-time stochastic process: Consider a stochastic process represented by the ensemble shown in Fig. A.1. Each time series uᵢ(n) represents a random variable along the time axis, but the set of observations at a specific time n₁ represents a random variable as well, in this case across the ensemble.

First- and second-order moments may be defined across the process (ensemble). The mean-value function of the process is defined as:
μ(n) = E[u(n)] (A.1)
The autocorrelation function of the process may be defined as:
r(n, n − k) = E[u(n)u(n − k)], k = 0,±1,±2, ... (A.2)
Another second order moment, the autocovariance function is defined as:
c(n, n − k) = E[(u(n) − μ(n))(u(n − k) − μ(n − k))], (A.3)
for k = 0,±1,±2, ...
The autocorrelation and the autocovariance functions are related by:
c(n, n − k) = r(n, n − k) − μ(n)μ(n − k) (A.4)
¹For convenience, time is normalised with respect to the sampling period.
Figure A.1: Stochastic process ensemble (schematic: realisations u₁(n), ..., u₅(n) of the process, with the samples uᵢ(n₁) taken across the ensemble at time n₁)
So, for a partial characterisation of a stochastic process through its first and second moments, it is sufficient to specify the mean value and either the autocorrelation or the autocovariance function.
Stationary process: A stochastic process is stationary in the strict sense if all of its moments are constant in time. For example, the mean value becomes:
μ(n) = μ (A.5)
For such a process the autocorrelation and autocovariance functions depend only on the lag k:

r(n, n − k) = r(k),  c(n, n − k) = c(k)   (A.6)
Note that for a stationary process the autocorrelation function at k = 0 equals the mean-square value:

r(0) = E[ |u(n)|² ]   (A.7)

and the autocovariance at k = 0 equals the variance:

c(0) = E[ |u(n) − μ|² ] = σ²ᵤ   (A.8)
If the first and second moments of a process satisfy the conditions described above, the process is at least stationary to second order, and if the variance is finite, the conditions for wide-sense stationarity are satisfied.
Ergodicity: Consider a wide-sense stationary process whose time averages are constant as well, and equal to their equivalents across the process. This is very convenient because it allows us to characterise the process with suitable measurements of a single one of its time series.

We may estimate the mean of the process by computing the time average of one of its realisations:

μ(N) = (1/N) Σ_{n=0}^{N−1} u(n)   (A.9)
where N is the number of observations or samples of the time series u(n). We expect this time average to converge to the ensemble mean as N increases. The mean-square error defines a criterion for this convergence:

lim_{N→∞} (μ − μ(N))² = 0   (A.10)

If we repeat the estimation for some more realisations and find the expected value of the square error, we may find that:

lim_{N→∞} E[ |μ − μ(N)|² ] = 0   (A.11)

In this case it can be said that the process is mean ergodic. In other words, a wide-sense stationary process is mean ergodic in the mean-square error sense if the mean-square value of the error between the ensemble mean μ and the time average μ(N) approaches zero as the number of samples N approaches infinity.
This criterion may be extended to other time averages of the process. The estimate used for the autocorrelation function is:

r(k, N) = (1/N) Σ_{n=0}^{N−1} u(n)u(n − k),  for 0 ≤ k ≤ N − 1   (A.12)

In this case, the process is correlation ergodic in the mean-square error sense if the mean-square value of the difference between the ensemble autocorrelation r(k) and the time estimate r(k, N) approaches zero as the number of samples approaches infinity.
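The ergodic estimates of Eqs. A.9 and A.12 can be illustrated numerically. The sketch below (an illustration, not part of the original text) checks the time averages of a single long AR(1) realisation against its known ensemble mean and autocorrelation:

```python
import numpy as np

def time_mean(u):
    """Time-average estimate of the ensemble mean, Eq. A.9."""
    return u.mean()

def time_autocorr(u, k):
    """Biased time-average estimate of r(k), in the spirit of Eq. A.12."""
    N = len(u)
    return np.dot(u[k:], u[:N - k]) / N if k else np.dot(u, u) / N

# AR(1) process u(n) = a*u(n-1) + w(n), with w(n) unit-variance white noise:
# stationary and ergodic, with mean 0 and r(k) = a**k / (1 - a**2), k >= 0.
a, N = 0.6, 200_000
rng = np.random.default_rng(1)
w = rng.standard_normal(N)
u = np.zeros(N)
for n in range(1, N):
    u[n] = a * u[n - 1] + w[n]

r0 = time_autocorr(u, 0)
print(abs(time_mean(u)) < 0.05)                  # time mean ~ ensemble mean 0
print(abs(r0 - 1 / (1 - a * a)) < 0.1)           # r(0) ~ 1.5625
print(abs(time_autocorr(u, 1) / r0 - a) < 0.05)  # r(1)/r(0) ~ a
```

All three checks hold because, for this process, the time averages of one realisation converge to the ensemble values as N grows, which is exactly the mean- and correlation-ergodicity described above.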
Transmission of a discrete-time stationary process through a linear filter: Let the time series y(n) be the output of a discrete-time, shift-invariant linear filter with unit-sample response h(n) and input u(n). Assume that u(n) represents a single realisation of a wide-sense stationary discrete-time process. Then y(n) also represents a single realisation of a wide-sense stationary discrete-time process, with autocorrelation r_y(k) given by:

r_y(k) = Σ_{i=−∞}^{∞} Σ_{ℓ=−∞}^{∞} h(i) h(ℓ) r_u(ℓ − i + k)   (A.13)
Correlation matrix: Let the M × 1 vector u(n) represent the time series as:

u(n) = [ u(n), u(n − 1), ..., u(n − M + 1) ]ᵀ   (A.14)

where the superscript T denotes transposition. The M × M correlation matrix R may then be defined as:

R = E[ u(n)uᵀ(n) ]   (A.15)

Expanding this expression:

R(n) =
  [ E[u(n)u(n)]        E[u(n)u(n−1)]        …  E[u(n)u(n−M+1)]     ]
  [ E[u(n−1)u(n)]      E[u(n−1)u(n−1)]      …  E[u(n−1)u(n−M+1)]   ]
  [ ⋮                   ⋮                    ⋱  ⋮                   ]
  [ E[u(n−M+1)u(n)]    E[u(n−M+1)u(n−1)]    …  E[u(n−M+1)u(n−M+1)] ]   (A.16)
If the process is wide-sense stationary:

R =
  [ r(0)       r(1)       …  r(M−1) ]
  [ r(−1)      r(0)       …  r(M−2) ]
  [ ⋮          ⋮          ⋱  ⋮      ]
  [ r(−M+1)    r(−M+2)    …  r(0)   ]   (A.17)

From the property of the autocorrelation function of a wide-sense stationary process:

r(−k) = r(k)   (A.18)

we find that the matrix R is symmetric. Accordingly, only M values of the autocorrelation function r(k) are needed to calculate the correlation matrix R:

R =
  [ r(0)       r(1)       …  r(M−1) ]
  [ r(1)       r(0)       …  r(M−2) ]
  [ ⋮          ⋮          ⋱  ⋮      ]
  [ r(M−1)     r(M−2)     …  r(0)   ]   (A.19)

As can be seen from Eq. A.19, the correlation matrix of a wide-sense stationary process is Toeplitz, i.e. all the elements along each diagonal are equal. Conversely, a Toeplitz correlation matrix guarantees wide-sense stationarity.
A general property, valid for all stochastic processes, is that the correlation matrix is always nonnegative definite and almost always positive definite. If it is positive definite it is also nonsingular. The rare condition of a singular correlation matrix represents linear dependency between the elements of the time series, which arises only when the process u(n) consists of a sum of K ≤ M sinusoids. Although this situation is almost impossible in practice, the correlation matrix may be ill-conditioned if its determinant is very close to zero.
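A minimal sketch of Eq. A.19 (illustrative only): building the Toeplitz correlation matrix from the M autocorrelation values and confirming the symmetry and positive-definiteness properties just described, here for an assumed AR(1) autocorrelation sequence:

```python
import numpy as np

def correlation_matrix(r):
    """Build the M x M Toeplitz correlation matrix of Eq. A.19 from r(0..M-1)."""
    M = len(r)
    idx = np.abs(np.arange(M)[:, None] - np.arange(M)[None, :])
    return np.asarray(r)[idx]                 # R[i, j] = r(|i - j|)

# Autocorrelation of an AR(1) process with a = 0.5: r(k) = a**k / (1 - a**2)
a = 0.5
r = a ** np.arange(4) / (1 - a * a)
R = correlation_matrix(r)

print(np.allclose(R, R.T))                    # symmetric, by Eq. A.18
print(np.all(np.linalg.eigvalsh(R) > 0))      # positive definite, hence nonsingular
```

Since this autocorrelation sequence comes from a genuine stationary process, the resulting Toeplitz matrix is positive definite and therefore nonsingular, as the text states.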
Gaussian processes: A particular strictly stationary stochastic process, common in the physical sciences, is the Gaussian process, which has the property that it can be fully statistically characterised by only its first and second moments. We may call a process u(n) Gaussian if any linear functional of u(n) is a Gaussian-distributed random variable. A linear functional is defined by Eq. A.20 as:

Y = ∫₀ᵀ g(t)u(t) dt   (A.20)

where g(t) is a weighting function such that the mean-square value of the random variable Y is finite.

For a discrete-time stochastic process, the linear functional becomes a linear function of all the samples of u(n) up to time n, Y = Σ_{i=0}^{n} gᵢ u(i). The Gaussian probability density function f_Y(y) is shown in Eq. A.21:

f_Y(y) = (1/√(2πσ²_Y)) exp( −(y − μ_Y)² / (2σ²_Y) )   (A.21)
where μ_Y is the mean and σ²_Y the variance of the random variable Y. A Gaussian process is usually denoted N(μ, R). As the mean is a constant value that can be subtracted from the time series, we will consider only zero-mean Gaussian processes, N(0, R).

The joint probability density function of N samples of a Gaussian process is described by:

f_U(u) = ( 1 / ((2π)^{N/2} det(R)^{1/2}) ) exp( −½ uᵀR⁻¹u )   (A.22)
Note that f_U(u) is N-dimensional for a real-valued process. For the case N = 1, the matrix R becomes the variance of the process, σ². One particularly interesting property of a Gaussian process, derived from its definition, is that if a Gaussian process u(n) is applied to a stable linear filter, the output of the filter is a Gaussian process as well.
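As a quick numerical check of Eq. A.22 (a sketch, not from the thesis), the joint density can be evaluated directly and, for N = 1, compared with the scalar Gaussian density of Eq. A.21 with zero mean:

```python
import numpy as np

def gaussian_joint_pdf(u, R):
    """Joint density of N samples of a zero-mean Gaussian process, Eq. A.22."""
    N = len(u)
    norm = (2 * np.pi) ** (N / 2) * np.sqrt(np.linalg.det(R))
    return float(np.exp(-0.5 * u @ np.linalg.solve(R, u)) / norm)

# For N = 1 the matrix R reduces to the variance sigma^2 and Eq. A.22
# reduces to the scalar Gaussian density of Eq. A.21 (with mu_Y = 0).
sigma2, y = 2.0, 0.7
joint = gaussian_joint_pdf(np.array([y]), np.array([[sigma2]]))
scalar = np.exp(-y * y / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
print(np.isclose(joint, scalar))              # True
```

The agreement confirms that the multivariate expression collapses to the familiar univariate Gaussian when only one sample is considered.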
Appendix B
Conjugate gradient optimisation algorithms
The description of the algorithms in this Appendix can be found in [19]. For further details the reader is directed to [53] and [113].
B.1 The conjugate gradient directions
Let us assume that line searching takes place along the direction d(τ). At the minimum, the derivative of E in the direction d(τ) vanishes:

(d/dλ) E(w(τ) + λ d(τ)) = 0   (B.1)
Let us set the new weight vector w(τ+1) at this minimum along d(τ). Eq. B.1 implies that the gradient vector at w(τ+1) is orthogonal to the search direction. Adopting g ≡ ∇E as a short-hand notation for the gradient of the error function, we can write the orthogonality property as:

g(τ+1)ᵀ d(τ) = 0   (B.2)
We would like to find a new search direction d(τ+1) such that the property described in Eq. B.2 holds at all points along this new direction:

g(w(τ+1) + λ d(τ+1))ᵀ d(τ) = 0   (B.3)
By using the first-order expansion of g around w(τ+1):

(g(τ+1) + λ g′(τ+1) d(τ+1))ᵀ d(τ) = 0  ⇒  g(τ+1)ᵀ d(τ) + λ d(τ+1)ᵀ g′(τ+1) d(τ) = 0   (B.4)
The first term on the left-hand side vanishes by the property given in Eq. B.2, and g′ is none other than the Hessian matrix, so we can write Eq. B.4 as:

d(τ+1)ᵀ H d(τ) = 0   (B.5)
The directions d(τ+1) and d(τ) are said to be non-interfering or conjugate. Suppose that we can find a set of W vectors which are mutually conjugate with respect to H, so that:

dⱼᵀ H dᵢ = 0,  i ≠ j   (B.6)
It can be shown [19, pp. 277] that these vectors are linearly independent if H is positive definite, and that they form a complete, non-orthogonal basis set in W. Starting at w₁, we can write the difference between the minimum w∗ in W and the point w₁ as:

w∗ − w₁ = Σ_{i=1}^{W} αᵢ dᵢ   (B.7)
If we define wⱼ as:

wⱼ = w₁ + Σ_{i=1}^{j−1} αᵢ dᵢ   (B.8)
an iterative equation can be written in the form:

wⱼ₊₁ = wⱼ + αⱼ dⱼ   (B.9)
Eq. B.9 represents a succession of line-search steps along the conjugate directions, with the jth step length controlled by the parameter αⱼ. To find the parameters αⱼ, let us assume the quadratic form for the error function:

E(w) = E_Q(w) = c + bᵀw + ½ wᵀHw   (B.10)
with constant parameters c, b and H, where the latter is a positive definite matrix, and whose gradient g(w) is given by:

g(w) = b + Hw   (B.11)
which vanishes at the minimum w∗. For this error function, let us pre-multiply Eq. B.7 by dⱼᵀH:

dⱼᵀHw∗ − dⱼᵀHw₁ = Σ_{i=1}^{W} αᵢ dⱼᵀHdᵢ   (B.12)
Given that b + Hw∗ = 0, and by using the orthogonality property described in Eq. B.6, we can write Eq. B.12 as:

−dⱼᵀ(b + Hw₁) = αⱼ dⱼᵀHdⱼ   (B.13)
from which we can express αⱼ as:

αⱼ = − dⱼᵀ(b + Hw₁) / (dⱼᵀHdⱼ)   (B.14)
By proceeding in a similar way with Eq. B.8 we find the relationship:

dⱼᵀHwⱼ = dⱼᵀHw₁   (B.15)
which can be used in the numerator of the expression for αⱼ to yield:

αⱼ = − dⱼᵀ(b + Hwⱼ) / (dⱼᵀHdⱼ) = − dⱼᵀ g(wⱼ) / (dⱼᵀHdⱼ)   (B.16)
By noting that:

gⱼ₊₁ − gⱼ = H(wⱼ₊₁ − wⱼ) = αⱼ Hdⱼ   (B.17)
and substituting the value of αⱼ found in Eq. B.16 into Eq. B.17 and pre-multiplying by dⱼᵀ, we find that:

dⱼᵀ gⱼ₊₁ = 0   (B.18)
Similarly, if we pre-multiply Eq. B.17 by dₖᵀ, with k < j ≤ W, we get:

dₖᵀ gⱼ₊₁ = dₖᵀ gⱼ,  for k < j ≤ W   (B.19)
It can easily be shown by induction that:

dₖᵀ gⱼ = 0,  for k < j ≤ W   (B.20)

Eq. B.20 shows that, for a quadratic error function, at every step the gradient at wⱼ is orthogonal to all the previous conjugate directions dₖ, and the minimum is reached in W steps.
Using the relationships found for this quadratic error function, we can build a set of mutually conjugate directions by choosing the first one as the negative gradient:

d₁ = −g₁   (B.21)
Once the minimum w_2 along d_1 is found, the next direction can be chosen as a linear combination of the previous one and the gradient at the new point; in general:

d_{j+1} = −g_{j+1} + β_j d_j     (B.22)
The parameters β_j can be found by pre-multiplying Eq. B.22 by d_j^T H and imposing the conjugacy condition d_j^T H d_{j+1} = 0:

β_j = g_{j+1}^T H d_j / (d_j^T H d_j)     (B.23)
To avoid the computation of the Hessian, we can use Eq. B.17 in the equation for β_j, getting:

β_j = g_{j+1}^T (g_{j+1} − g_j) / (d_j^T (g_{j+1} − g_j))     (B.24)
This expression can be simplified further by using Eq. B.22 and the orthogonality properties in Eqs. B.18 and B.20:

β_j = g_{j+1}^T (g_{j+1} − g_j) / (g_j^T g_j)     (B.25)
This last formula, known as the Polak-Ribiere form, gives better results than the other forms because it tends to reset the conjugate direction to the direction of the negative gradient when the algorithm is making little progress (i.e. g_{j+1} ≈ g_j), restarting in this way the conjugate gradient procedure. A caveat of this algorithm is that, for a general non-linear error surface, the Hessian matrix can be negative definite in some regions of weight space. In this case, a robust procedure should make sure that the error does not increase at any step.
B.1.1 The conjugate gradient algorithm
A description of the algorithm follows:
1. Choose an initial set of weights w1
2. Evaluate the gradient g1
3. Set d1 =−g1
4. Initialise j=1
5. Find the minimum of the error function along dj and call this point wj+1
6. If E(w_{j+1}) < ε, stop the procedure and set the neural network weights to w_{j+1}; otherwise continue.
7. Evaluate the gradient gj+1
8. If j is a multiple of W then reset the procedure by setting d_{j+1} = −g_{j+1} and go to step 11.
9. Compute βj using the Polak-Ribiere formula (Eq. B.25)
10. Calculate the new direction as dj+1 =−gj+1 + βjdj
11. Increment j by one and go back to step 5.
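As an illustration, the steps above can be sketched in Python. The sketch is not part of the thesis: the exact line minimisation on a quadratic and the stopping test on the squared gradient norm (used in place of the E(w_{j+1}) < ε test of step 6) are simplifying assumptions.

```python
import numpy as np

def conjugate_gradient(grad, w, n_dims, line_min, tol=1e-12, max_iter=200):
    """Sketch of the conjugate gradient procedure of Section B.1.1.

    grad     -- function returning the gradient g(w)
    line_min -- function returning the minimum of E along d, starting at w
    """
    g = grad(w)
    d = -g                                    # step 3: d_1 = -g_1
    for j in range(1, max_iter + 1):
        w = line_min(w, d)                    # step 5: line minimisation along d_j
        g_new = grad(w)
        if np.dot(g_new, g_new) < tol:        # step 6 (gradient-norm variant)
            break
        if j % n_dims == 0:                   # step 8: periodic restart
            d = -g_new
        else:                                 # steps 9-10: Polak-Ribiere update (Eq. B.25)
            beta = np.dot(g_new, g_new - g) / np.dot(g, g)
            d = -g_new + beta * d
        g = g_new
    return w

# Quadratic test case E(w) = b^T w + (1/2) w^T H w, with the exact step of Eq. B.16
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 1.0])
grad = lambda w: b + H @ w
line_min = lambda w, d: w - (d @ grad(w)) / (d @ H @ d) * d
w_min = conjugate_gradient(grad, np.zeros(2), 2, line_min)
# For a quadratic, the minimum (where b + H w = 0) is reached in W = 2 steps
```

For this two-dimensional quadratic the procedure terminates after two line searches, as predicted by Eq. B.20.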
B.2 Scaled conjugate gradients
In Eq. B.14, the product of the vector d_j with the Hessian matrix (defined as H ≡ ∇(∇E)) can be approximated by substituting v for d_j in the following equation:

v^T H = v^T ∇(∇E) = [∇E(w + εv) − ∇E(w)] / ε + O(ε)     (B.26)

where O(ε) is a residual term of the order of ε. This residual term can be reduced by one order by using central differences:

v^T H = v^T ∇(∇E) = [∇E(w + εv) − ∇E(w − εv)] / (2ε) + O(ε^2)     (B.27)
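The central-difference product of Eq. B.27 needs only two gradient evaluations and never forms the Hessian explicitly. A minimal sketch (the function name is illustrative, not from the thesis):

```python
import numpy as np

def hessian_vector(grad, w, v, eps=1e-6):
    """Approximate the product H v by central differences of the gradient
    (Eq. B.27), without computing the Hessian explicitly."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2.0 * eps)

# Check against a known quadratic, where grad(w) = b + H w and H v is exact
H = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
grad = lambda w: b + H @ w
v = np.array([0.3, -0.7])
hv = hessian_vector(grad, np.zeros(2), v)
# hv agrees with H @ v (for symmetric H, v^T H is the transpose of H v)
```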
However, in the case of a non-quadratic error function, the conjugate gradient approach can lead to an increase in the error if the Hessian matrix is not positive definite. In such a case, the product v^T H v will not be positive. To make sure that the denominator of Eq. B.14 remains positive even for a negative definite Hessian, the matrix H can be replaced by:
Hmod = H + λI (B.28)
where I is the identity matrix and λ is a scaling factor. The condition on λ is that it make the denominator in Eq. B.14 positive:

d_j^T H d_j + λ ||d_j||^2 > 0     (B.29)
Since the size of the step α_j depends inversely on the scaling factor, λ also controls the step size. If the value of λ is too small, the search region is large. This can be a problem if the error function is far from quadratic in the search region: if the quadratic approximation is not valid, the conjugate gradient formulae may not be effective in the search for the minimum, and the step size should be reduced. Conversely, if the approximation is good the step size can be safely increased. Hence, the scaling factor has two functions: to make sure that the error decreases when the Hessian is negative definite, and to control the search region based on a measure of the goodness of the quadratic approximation. Its value is adjusted at each iteration j.
Starting with λ_1 = 0, the denominator of Eq. B.14 can be written as:

DEN_j = d_j^T H_j d_j + λ_j ||d_j||^2     (B.30)
Note that the Hessian now carries the subindex j, indicating that in general it is not constant: its value may change at each step.
If DEN_j < 0 the value of λ_j should be increased. Denoting the new values of λ_j and of the denominator with an overbar, we have:

DEN̄_j = d_j^T H_j d_j + λ̄_j ||d_j||^2 = DEN_j + (λ̄_j − λ_j) ||d_j||^2     (B.31)
To make DEN̄_j > 0 the new scaling factor should satisfy:

λ̄_j > λ_j − DEN_j / ||d_j||^2     (B.32)
By choosing λ̄_j as double the value of the right-hand side of inequality B.32 we get:

DEN̄_j = −d_j^T H_j d_j     (B.33)
Then, replacing the denominator in Eq. B.14 by the right-hand side of Eq. B.33, we can calculate the value of the step α_j. To check whether the quadratic approximation is valid, the following index has been proposed [53]:
Δ_j = [E(w_j) − E(w_j + α_j d_j)] / [E(w_j) − E_Q(w_j + α_j d_j)]     (B.34)
where E_Q(w) is the local quadratic approximation of the error function in the neighbourhood of w_j, given by:
E_Q(w_j + α_j d_j) = E(w_j) + α_j d_j^T g_j + (1/2) α_j^2 d_j^T H_j d_j     (B.35)
It is clear from the above equation that Δ_j will be close to 1 if the approximation is good, and close to zero if the error function differs greatly from the quadratic assumption made. If the approximation is good then the value of the scaling factor can be decreased for the next iteration. On the contrary, if the value of Δ_j is very small, then the value of λ for the next iteration should be increased. A negative index Δ_j indicates that the step would move the weights to a point where the Hessian matrix is negative definite; therefore the weights should not be updated, and the value of λ should be increased accordingly¹ before re-calculating the step α_j.
Recalling the definition of α_j (Eq. B.14), the expression for E_Q(w_j + α_j d_j) can be written as:

E_Q(w_j + α_j d_j) = E(w_j) + (1/2) α_j d_j^T g_j     (B.36)
Substituting this expression in Eq. B.34 yields:

Δ_j = 2{E(w_j) − E(w_j + α_j d_j)} / (−α_j d_j^T g_j)     (B.37)
¹An increase given by λ̄_j = λ_j + DEN_j (1 − Δ_j) / ||d_j||^2 has been suggested [113].
Lower and upper thresholds for Δ_j can be 0.25 and 0.75 respectively, for example [53]. The factor by which λ is increased or decreased is also chosen arbitrarily. An example of the quadratic approximation quality check could be:
• If Δj >0.75, the approximation is good, decrease the scaling factor, λj+1 = λj/4
• If Δj <0.25, the approximation is poor, increase the scaling factor, λj+1 = 4λj
• If 0.25≤Δj ≤0.75, leave the scaling factor as it is, λj+1 = λj
• If Δ_j < 0, the Hessian has become negative definite with the step α_j; increase the scaling factor as shown in footnote (1), then recalculate the modified Hessian and α_j and check Δ_j again.
The scaling technique has been called the model trust region method because the model, in this case the quadratic, is only trusted within a region defined by the scaling factor.
B.2.1 The scaled conjugate gradient algorithm
The scaled conjugate gradient algorithm can be summarised as follows:
1. Choose an initial set of weights w1
2. Set λ1 = 0
3. Choose a very small value for ε
4. Evaluate the gradient g1
5. Set d1 =−g1
6. Initialise j=1
7. Estimate dTj H by central differences
8. Evaluate the denominator DEN_j; if it is negative, increase λ_j to yield DEN̄_j
9. Calculate αj
10. Check the quality of the quadratic approximation and modify λ_j correspondingly. If Δ_j < 0 go back to step 8, otherwise continue.
11. If E(wj+1) < ε, stop the procedure and set the neural network weights to wj+1,otherwise continue.
12. Evaluate the gradient gj+1
13. If j is a multiple of W then reset the procedure by setting d_{j+1} = −g_{j+1} and go to step 16.
14. Compute βj using the Polak-Ribiere formula (Eq. B.25)
15. Calculate the new direction as dj+1 =−gj+1 + βjdj
16. Increment j by one and go back to step 7.
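The procedure above can be sketched compactly in Python. Again this is not from the thesis: the stopping test uses the squared gradient norm, the λ update constants follow the example thresholds given earlier, and ε for the central differences is scaled by ||d_j|| as a practical assumption.

```python
import numpy as np

def scg(E, grad, w, max_iter=200, tol=1e-12):
    """Sketch of the scaled conjugate gradient procedure of Section B.2.1."""
    n, lam = w.size, 0.0                      # step 2: lambda_1 = 0
    g = grad(w)
    d = -g                                    # step 5: d_1 = -g_1
    for j in range(1, max_iter + 1):
        # step 7: estimate H d by central differences (Eq. B.27)
        eps = 1e-6 / max(np.linalg.norm(d), 1e-12)
        Hd = (grad(w + eps * d) - grad(w - eps * d)) / (2.0 * eps)
        den = d @ Hd + lam * (d @ d)          # DEN_j (Eq. B.30)
        if den <= 0.0:                        # step 8: force a positive denominator
            lam = 2.0 * (lam - den / (d @ d))
            den = -(d @ Hd)                   # Eq. B.33
        alpha = -(d @ g) / den                # step 9 (Eq. B.16)
        # step 10: quality of the quadratic approximation (Eq. B.37)
        Delta = 2.0 * (E(w) - E(w + alpha * d)) / (-alpha * (d @ g))
        if Delta > 0.75:
            lam /= 4.0                        # good approximation: widen trust region
        elif Delta < 0.25:
            lam += den * (1.0 - Delta) / (d @ d)   # poor approximation: shrink it
        if Delta <= 0.0:
            continue                          # reject the step; retry with larger lambda
        w = w + alpha * d                     # accept the step
        g_new = grad(w)
        if g_new @ g_new < tol:               # step 11 (gradient-norm variant)
            break
        if j % n == 0:                        # step 13: periodic restart
            d = -g_new
        else:                                 # steps 14-15: Polak-Ribiere (Eq. B.25)
            beta = g_new @ (g_new - g) / (g @ g)
            d = -g_new + beta * d
        g = g_new
    return w

# Quadratic test case: the minimum satisfies b + H w = 0, and Delta stays close to 1
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 1.0])
E = lambda w: b @ w + 0.5 * w @ H @ w
w_min = scg(E, lambda w: b + H @ w, np.zeros(2))
```

On a quadratic error surface the scaling factor stays at zero and the iteration reduces to the plain conjugate gradient procedure of Section B.1.1.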
Appendix C
Vigilance Database
The central-channel EEG of eight healthy young adults, performing various vigilance tasks for more than 2 hours, was recorded and digitised with 12-bit precision at a 256 Hz sampling rate. The subjects were asked to stay awake the night before and to abstain from caffeine or any other stimulatory substances for 24 hours before and during the tests. Subject ages and genders are shown in Table C.1. The recording montage consisted of electrode pairs C4−A1 (central right), C3−A2 (central left) and A1−A2 (mastoid), EOG left, EOG right and submental EMG.
Subject   ID number   Gender   Age [years]
   1          3       female       20
   2          4       female       19
   3          6       male         24
   4          7       male         18
   5          9       female       23
   6         10       female       21
   7         11       male         20
   8         12       female       21

Table C.1: Bristol subjects
The test consisted of three different vigilance tasks: a tracking task, a reaction time task and a serial attention task. In the tracking task the subject is asked to follow a rectangle on a computer screen by moving a pointing device; the rectangle moves randomly. In the reaction time task, the subject has to press the space bar of the computer keyboard every time a 3x3 mm red square appears on the screen. The square appears at random intervals at an average rate of 18 times per minute. The serial attention task consists of a digit display with values within the [−9,+9] interval. The value decreases or increases at random times and the subject has to hit the left or right button of the mouse to keep it at zero value. Performance indices taken for these tasks are:
• Tracking error for the tracking task, or deviation of the position indicator from the rectangle.
• Reaction time, or time interval in milliseconds between the appearance of the red square and the pressing of the space bar in the reaction time task.
• Missed stimuli, the number of times the subject did not react when the red square showed up in the reaction time task.
• Serial attention task error, the absolute value of the display in the serial attention task.
A previous study [50] found very little or no correlation between these performance indices and the expert scoring of the EEG. For instance, the reaction time remains almost constant across all the vigilance sub-categories, while the increase in the tracking error is not significant as the subject gets drowsy. Although it is well known that lapses of alertness due to sleepiness or fatigue lead to decreased performance, quantifying the loss of performance and correlating it with physiological measures of sleepiness have proved to be a difficult task [7][157]. Unrelated factors like motivation and distractions may affect the results [121][130][35][42]. Therefore the performance indices in the vigilance database are not used in this thesis.
Appendix D
LED Database
D.1 Method
A frontal-channel EEG¹ was recorded from ten OSA patients performing a behavioural version of the Maintenance of Wakefulness Test (MWT), and digitised with 12-bit precision at a sampling rate of 128 Hz. Each subject performed at least four tests on the same day, at 9:00, 11:00, 13:00 and 15:00 hours, in a darkened room with the subject lying on a couch at 45 degrees. The subjects were asked to stay awake for as long as possible. Each test lasts for a maximum of 40 minutes. A light-emitting diode flashes a red light which is displayed for approximately one second every three seconds throughout the test. The subject is asked to touch a button on a hand-piece every time the light flashes. Each flash that a subject fails to respond to is recorded. When seven flashes in succession are not responded to (a total time of 21 s) the test is terminated automatically and the subject is considered to have fallen asleep. The subject wears headphones through which white noise is played from a pre-recorded tape, to reduce any interference due to background noise. All subjects were asked to abstain from alcohol for 24 hours, and from coffee and tea for 12 hours, prior to the study. Also, subjects were asked not to sleep during the day of testing. Results from the tests are shown in Table D.1.
D.2 Demographic data
The patients have a mean age of 50.4 years (standard deviation, sd, of 11.3 years) and an average body mass index (BMI)² of 41.1 (sd 8.6). All 10 subjects had been diagnosed with OSA, with an Epworth Sleepiness Scale (ESS) score greater than 10 indicating subjective daytime sleepiness (mean ESS 16.7, sd 4.6), and a positive overnight sleep study (performed the night before the vigilance tests) with a number of oxygen saturation (SaO2) dips of greater than 4% per hour (mean SaO2 dips/hour of 30.4, sd 19.7) and a number of movements per hour of sleep with a mean of 76.9 (sd 40.4). Full data for each subject can be found in Tables D.2 and D.3.

¹Other channels recorded are the right and left mastoid, with the reference on either mastoid.
²BMI is determined by dividing the weight in kilograms by the square of the height in metres.
                                    Test
Subject   09:00       11:00       13:00       15:00       comments
   1      06:21 (A)   29:12 (B)   16:57 (C)   26:00 (D)
   2      25:57 (A)   17:39 (B)   15:00 (C)   12:09 (D)
   3      03:06 (A)   10:00 (B)   07:12 (D)   05:12 (E)   Repeat as the patient said
                                  08:27 (C)               he didn't fall asleep
   4      40:00 (A)   40:00 (B)   40:00 (C)   40:00 (D)
   5      20:03 (A)   13:00 (B)   17:12 (C)   19:21 (D)
   6      13:27 (A)   10:24 (B)   11:39 (C)         (D)
                                              10:45 (E)   Repeat as the patient fell asleep at start
   7      02:33 (A)   08:45 (B)   07:33 (C)   40:00 (D)
   8      21:30 (A)   32:30 (B)   09:03 (C)   31:30 (D)
   9      07:51 (A)    -(a) (B)   00:21 (C)   00:51 (D)   Falling asleep all the time
  10      05:18 (A)   00:21 (B)   02:57 (D)   05:21 (F)   Repeat as the patient said
                                  06:03 (C)   01:33 (E)   he didn't fall asleep

(a) too short

Table D.1: Time of falling asleep (in mm:ss), measured by the clinician from the start of the MWT test. The letter used in this thesis to refer to a given test is shown in brackets.
Subject   Age [years]   Height [m]   Weight [kg]   BMI [kg/m²]   ESS
   1          57           1.75        152.40          50         15
   2          33           1.85        112.94          33         20
   3          56           1.78        118.40          37         12
   4          55           1.83        159.00          48         11
   5          41           1.73        111.00          37         16
   6          52           1.78        115.20          36         20
   7          53           1.83        150.32          45         19
   8          72           1.75         87.90          29         10
   9          37           1.70        127.50          44         24
  10          48           1.75        110.00          36         20

Table D.2: Subject demographic details
Subject   O2 dip rate [hr⁻¹]   Movement [hr⁻¹]
   1             55ance              77
   2             30.5                12
   3             16.3               107
   4             14.3                61
   5             13.7                12
   6             30.1               100
   7             17.4                58
   8             18.0               107
   9             36.4               116
  10             72.7               119

Table D.3: Overnight sleep study results
Bibliography
[1] H.D.I. Abarbanel, T.W. Frison, and L.Sh. Tsimring. Obtaining order in a world of chaos. IEEE Signal Processing Magazine, pages 49–65, May 1998.
[2] P. Achermann, R. Hartmann, A. Gunzinger, W. Guggenbuhl, and A.A. Borbely. All-night sleep EEG and artificial stochastic control signals have similar correlation dimensions. Electroencephalogr. Clin. Neurophysiol., 90(5):384–7, May 1994.
[3] L.A. Aguirre, V.C. Barros, and A.V. Souza. Nonlinear multivariable modeling and analysis of sleep apnea time series. Comput Biol Med, 29(3):207–28, 1999. Abstract available at: http://www.websciences.org/cftemplate/NAPS/indiv.cfm?ID=19992088.
[4] T. Akerstedt. Work hours, sleepiness and the underlying mechanisms. J. Sleep Res., 4(Suppl. 2):15–22, Apr 1995.
[5] T. Akerstedt and M. Gillberg. Subjective and objective sleepiness in the active individual. Int. J. Neurosci., 52(1-2):29–37, May 1990.
[6] C. Alford. EEG, performance and subjective sleep measures are not the same: implications for assessment of daytime sleepiness. In Abstracts: British Sleep Society 4th Annual Meeting, page 10. British Sleep Society, 1992.
[7] C. Alford, C. Idzikowski, and I. Hindmarch. Are electrophysiological measures of sleep tendency related to subjective state and performance? In Abstracts: British Sleep Society 3rd Annual Meeting, page 31. British Sleep Society, 1991.
[8] C. Alford, N. Rombaut, J. Jones, S. Foley, and C. Idzikowski. Acute effects of hydroxyzine on nocturnal sleep and sleep tendency the following day: a C-EEG study. Human Psychopharmacology, 7, 1992.
[9] P. Anderer, S. Roberts, A. Schlogl, G. Gruber, G. Klosch, W. Herrmann, P. Rappelsberger, O. Filz, M.J. Barbanoj, G. Dorffner, and B. Saletu. Artifact processing in computerized analysis of sleep EEG - a review. Neuropsychobiology, 40(3):150–7, Sep 1999.
[10] N.O. Andersen. On the calculation of filter coefficients for maximum entropy spectral analysis. Geophysics, 19(1):69–72, 1970.
[11] Atlas Task Force of the American Sleep Disorders Association. EEG arousals: scoring rules and examples. Sleep, 15(2):174–184, 1992.
[12] B. Kemp. A proposal for computer-based sleep/wake analysis. J. Sleep Res., 2(3):179–85, 1993. Consensus Report.
[13] I.N. Bankman, V.G. Sigillito, R.A. Wise, and P.L. Smith. Feature-based detection of the K-complex wave in the human electroencephalogram using neural networks. IEEE Transactions on Biomedical Engineering, 39(12):1305–10, Dec 1992.
[14] J. S. Barlow. Methods of analysis of nonstationary EEGs, with emphasis on segmentation techniques: A comparative review. Journal of Clinical Neurophysiology, 2(3):267–304, 1985.
[15] R. Baumgart-Schmitt, W.M. Herrmann, and R. Eilers. On the use of neural network techniques to analyze sleep EEG data. Third communication: robustification of the classificator by applying an algorithm obtained from 9 different networks. Neuropsychobiology, 37(1):49–58, 1998.
[16] R. Baumgart-Schmitt, W.M. Herrmann, R. Eilers, and F. Bes. On the use of neural network techniques to analyse sleep EEG data. First communication: application of evolutionary and genetic algorithms to reduce the feature space and to develop classification rules. Neuropsychobiology, 36(4):194–210, 1997.
[17] M.A. Bedard, J. Montplaisir, F. Richer, and J. Malo. Nocturnal hypoxemia as a determinant of vigilance impairment in sleep apnea syndrome. Chest, 100(2):367–70, Aug 1991.
[18] L.S. Bennett, B.A. Langford, J.R. Stradling, and R.J.O. Davies. Sleep fragmentation indices as predictors of daytime sleepiness and NCPAP response in OSA. The Osler Chest Unit, Churchill Hospital, Headington, Oxford, England.
[19] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
[20] M.H. Bonnet and D.L. Arand. We are chronically sleep deprived. Sleep, 18(10):908–11, Dec 1995.
[21] G. E. P. Box and G. M. Jenkins. Time series analysis: forecasting and control. Holden-Day series in time series analysis. Holden-Day, San Francisco, rev. edition, 1976.
[22] G. Bremer, J.R. Smith, and I. Karacan. Automatic detection of the K-complex in sleep electroencephalograms. IEEE Transactions on Biomedical Engineering, 17(4):314–23, Oct 1970.
[23] D.M. Brittenham. Artifacts: Activities not arising from the brain. In Daly and Pedley [37].
[24] P. Brown and C.D. Marsden. What do the basal ganglia do? The Lancet, 351:1801–4, June 1998.
[25] J. P. Burg. Maximum entropy spectral analysis. PhD thesis, Stanford University, Stanford, California, 1975.
[26] J. P. Burg, D. G. Luenberger, and D. L. Wenger. Estimation of structured covariance matrices. Proceedings of the IEEE, 70(9):963–974, Sep 1982.
[27] M.A. Carskadon and W.C. Dement. Daytime sleepiness: quantification of a behavioral state. Neurosci. Biobehav. Rev., 11(3):307–17, 1987.
[28] R. Caton. The electric currents of the brain. British Medical Journal, (2):278, 1875.
[29] K. Cheshire, H. Engleman, I. Deary, C. Shapiro, and N.J. Douglas. Factors impairing daytime performance in patients with sleep apnea/hypopnea syndrome. Arch. Intern. Med., 152(3):538–41, Mar 1992.
[30] S. Chokroverty, editor. Sleep disorders medicine: basic science, technical considerations, and clinical aspects. Butterworth-Heinemann, Oxford, 2nd edition, 1999.
[31] Circadian Technologies Inc. Alertness Technologies, 2000. Available at: http://www.circadian.com.
[32] R. Conradt, U. Brandenburg, T. Penzel, J. Hasan, A. Varri, and J.H. Peter. Vigilance transitions in reaction time test: a method of describing the state of alertness more objectively. Clin. Neurophysiol., 110(9):1499–509, Sep 1999.
[33] R. Conradt, T. Penzel, U. Brandenburg, and J.H. Peter. Description of vigilance in the EEG during reaction time test in patients with sleep apnea. In Proceedings of the European Medical and Biological Engineering Conference EMBEC'99, volume 1, pages 414–15, Vienna, Austria, Nov 1999. International Federation for Medical and Biological Engineering.
[34] J.W. Cooley and J.W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, Apr 1965.
[35] M. Corsi-Cabrera, J. Ramos, C. Arce, M.A. Guevara, M. Ponce de Leon, and I. Lorenzo. Changes in the waking EEG as a consequence of sleep and sleep deprivation. Sleep, 15(6):550–5, Dec 1992.
[36] A.C. da Rosa, A. L. N. Fred, and J. M. N. Leitao. Stochastic model of awake and sleep EEG. In M. Holt, C. Cowan, P. Grant, and W. Sandham, editors, Signal Processing VII: Theories and Applications. European Association for Signal Processing, 1994.
[37] D.D. Daly and T.A. Pedley, editors. Current practice of clinical electroencephalography. Raven Press, 1990.
[38] R.S. Daniel. Alpha and theta EEG in vigilance. Perceptual and Motor Skills, 25:697–703, 1967.
[39] R.J. Davies, P.J. Belt, S.J. Roberts, N.J. Ali, and J.R. Stradling. Arterial blood pressure responses to graded transient arousal from sleep in normal humans. J. Appl. Physiol., 74(3):1123–30, Mar 1993.
[40] F. De Carli, L. Nobili, P. Gelcich, and F. Ferrillo. A method for the automatic detection of arousals during sleep. Sleep, 22(5):561–72, Aug 1999.
[41] D. F. Dinges. An overview of sleepiness and accidents. J. Sleep Res., 4(Suppl. 2):4–14, 1995.
[42] D.F. Dinges and N. Barone-Kribbs. Performing while sleepy: effects of experimentally-induced sleepiness. In Monk [114], pages 97–128.
[43] K. Doghramji. Maintenance of wakefulness test. In Chokroverty [30].
[44] N. J. Douglas. The sleep apnoea/hypopnoea syndrome and snoring. In C. M. Shapiro, editor, ABC of Sleep Disorders. BMJ, 1993.
[45] N. J. Douglas. The sleep apnoea/hypopnoea syndrome. In R. Cooper, editor, Sleep. Chapman and Hall Medical, 1994.
[46] M.J. Drinnan, A. Murray, G.J. Gibson, and C.J. Griffiths. Interobserver variability in recognizing arousal in respiratory sleep disorders. Am. J. Respir. Crit. Care Med., 158(2):358–62, 1998.
[47] M.J. Drinnan, A. Murray, J.E. White, A.J. Smithson, G.J. Gibson, and C.J. Griffiths. Evaluation of activity-based techniques to identify transient arousal in respiratory sleep disorders. J. Sleep Res., 5:173–180, 1996.
[48] M.J. Drinnan, A. Murray, J.E. White, A.J. Smithson, C.J. Griffiths, and G.J. Gibson. Automated recognition of EEG changes accompanying arousal in respiratory sleep disorders. Sleep, 19(4):296–303, 1996.
[49] J. Durbin. The fitting of time series models. Revue de l'Institut international de statistique, 28:233–44, 1960.
[50] M. Duta. The Study of Vigilance using Neural Networks Analysis of EEG. PhD thesis, University of Oxford, 1998.
[51] Nervous system. In Encyclopædia Britannica Online, page <http://search.eb.com/bol/topic?eu=119939&sctn=1>. Encyclopædia Britannica, Inc., 1994-2000. [Accessed 23 June 2000].
[52] J. Fell, J. Roschke, K. Mann, and C. Schaffner. Discrimination of sleep stages: a comparison between spectral and nonlinear EEG measures. Electroencephalogr. Clin. Neurophysiol., 98(5):401–10, May 1996.
[53] R. Fletcher. Practical methods of optimization. Wiley, Chichester, 2nd edition, 1987.
[54] J.M. Gaillard, M. Krassoievitch, and R. Tissot. Automatic analysis of sleep by a hybrid system: new results. Electroencephalography and Clinical Neurophysiology, 33(4):403–10, Oct 1972.
[55] I. Gath and E. Bar-On. Computerized method for scoring of polygraphic sleep recordings. Comput. Programs Biomed., 11(3):217–23, Jun 1980.
[56] C. F. George and A. Smiley. Sleep apnea and automobile crashes. Sleep, 22(6):790–5, 1999.
[57] C.J. Goeller and C.M. Sinton. A microcomputer-based sleep stage analyzer. Computer Methods and Programs in Biomedicine, 29(1):31–6, May 1989.
[58] C. Guilleminault, M. Partinen, M.A. Quera-Salva, B. Hayes, W.C. Dement, and G. Nino-Murcia. Determinants of daytime sleepiness in obstructive sleep apnea. Chest, 94(1):32–7, Jul 1988.
[59] M. Hack, R.J. Davies, R. Mullins, S.J. Choi, S. Ramdassingh-Dow, C. Jenkinson, and J.R. Stradling. Randomised prospective parallel trial of therapeutic versus subtherapeutic nasal continuous positive airway pressure on simulated steering performance in patients with obstructive sleep apnoea. Thorax, 55(3):224–31, 2000.
[60] P. Halasz, O. Kundra, P. Rajna, I. Pal, and M. Vargha. Micro-arousals during nocturnal sleep. Acta Physiologica Academiae Scientiarum Hungaricae, 54(1):1–12, 1979.
[61] J. Hasan, K. Hirvonen, A. Varri, V. Hakkinen, and P. Loula. Validation of computer analysed polygraphic patterns during drowsiness and sleep onset. Electroencephalogr. Clin. Neurophysiol., 87(3):117–27, Sep 1993.
[62] S. S. Haykin. Communication Systems. Wiley, New York, 3rd edition, 1994.
[63] S. S. Haykin. Adaptive Filter Theory. Information and systems sciences series. Prentice-Hall, New Jersey, 3rd edition, 1996.
[64] H. Head. The conception of nervous and mental energy II. Vigilance: a physiological state of the nervous system. Br. J. Psychol., 14:125–147, 1923.
[65] R. Hess. The electroencephalogram in sleep. Electroenceph. clin. Neurophysiol., 16:44–55, 1964.
[66] S.L. Himanen and J. Hasan. Limitations of the Rechtschaffen and Kales. Sleep Medicine Reviews, 4(2):149–67, Apr 2000.
[67] B. Hjorth. EEG analysis based on time domain properties. Electroencephalography and Clinical Neurophysiology, 29:306–310, 1970.
[68] C.A. Holzmann, C.A. Perez, C.M. Held, M. San Martin, F. Pizarro, J.P. Perez, M. Garrido, and P. Peirano. Expert-system classification of sleep/waking states in infants. Medical and Biological Engineering and Computing, 37(4):466–76, 1999.
[69] J. Horne. Why we sleep: the functions of sleep in humans and other mammals. Oxford University Press, Oxford, 1988.
[70] J.A. Horne. Dimensions to sleepiness. In Monk [114], pages 169–96.
[71] E. Huupponen, A. Varri, J. Hasan, J. Saarinen, and K. Kaski. Sleep arousal detection with neural network. Medical & Biological Engineering & Computing, 34(suppl. 1):219–20, 1996.
[72] K. Inoue, K. Kumamaru, S. Sagara, and S. Matsuoka. Pattern recognition approach to human sleep EEG analysis and determination of sleep stages. Memoirs of the Faculty of Engineering, Kyushu University, 42(3):177–95, Sep 1982.
[73] J. Wu, E.C. Ifeachor, E.M. Allen, and N.R. Hudson. A neural network based artefact detection system for EEG signal processing. In Proceedings of the International Conference on Neural Networks and Expert Systems in Medicine and Healthcare, pages 257–66, Plymouth, UK, 1994. Univ. Plymouth.
[74] B. H. Jansen. Time series analysis by means of linear modelling. In R. Weitkunat, editor, Digital Biosignal Processing. Elsevier Science Publishers, 1991.
[75] B.H. Jansen, A. Hasman, and R. Lenten. Piecewise analysis of EEGs using AR-modeling and clustering. Comput. Biomed. Res., 14(2):168–78, Apr 1981.
[76] H.H. Jasper. The 10-20 system of the international federation. Electroencephalography and Clinical Neurophysiology, 10:371–5, 1958.
[77] G. M. Jenkins and D. G. Watts. Spectral analysis and its applications. Holden-Day series in time series analysis. Holden-Day, San Francisco, 1968.
[78] M. Jobert, H. Escola, E. Poiseau, and P. Gaillard. Automatic analysis of sleep using two parameters based on principal component analysis of electroencephalography spectral data. Biological Cybernetics, 71(3):197–207, 1994.
[79] T. Jokinen, T. Salmi, A. Ylikoski, and M. Partinen. Use of computerized visual performance test in assessing day-time vigilance in patients with sleep apneas and restless sleep. Int. J. Clin. Monit. Comput., 12(4):225–30, 1995.
[80] T.P. Jung, S. Makeig, M. Stensmo, and T.J. Sejnowski. Estimating alertness from the EEG power spectrum. IEEE Transactions on Biomedical Engineering, 44(1):60–69, 1997.
[81] S. M. Kay and S. L. Marple. Spectrum analysis - a modern perspective. Proceedings of the IEEE, 69(11):1380–1419, November 1981.
[82] S.M. Kay. Recursive maximum likelihood estimation of autoregressive processes. IEEE Transactions on Acoustics, Speech, and Signal Processing, 31(1):56–65, Feb 1983.
[83] G. Kecklund and T. Akerstedt. Sleepiness in long distance truck driving: an ambulatory EEG study of night driving. Ergonomics, 36(9):1007–17, Sep 1993.
[84] S.A. Keenan. Polysomnographic technique: An overview. In Chokroverty [30].
[85] P. Kellaway. An orderly approach to visual analysis: characteristics of the normal EEG of adults and children. In Daly and Pedley [37].
[86] B. Kemp, E. W. Groneveld, A. J. M. W. Jansen, and J. M. Franzen. A model-based monitor of human sleep stages. Biological Cybernetics, 57:365–378, 1987.
[87] L. G. Kiloh, A. G. McComas, and J. W. Osselton. Clinical Electroencephalography. Butterworths, fourth edition, 1981.
[88] K. Kinnari, J.H. Peter, A. Pietarinen, L. Groete, T. Penzel, A. Varri, P. Laippala, A. Saastamoinen, W. Cassel, and J. Hasan. Vigilance stages and performance in OSAS patients in a monotonous reaction time task. Clinical Neurophysiology, 111(6):1130–6, 2000.
[89] J.R. Knott, F.A. Gibbs, and C.E. Henry. Fourier transform of the electroencephalogram during sleep. J. Exp. Psychol., 31:465–77, 1942.
[90] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59–69, 1982.
[91] M.H. Kryger and P.J. Hanly. Cheyne-Stokes respiration in cardiac failure. In Sleep and Respiration, pages 215–26. Wiley-Liss, Inc., 1990.
[92] M. Kubat, G. Pfurtscheller, and D. Flotzinger. AI-based approach to automatic sleep classification. Biol. Cybern., 70(5):443–8, 1994.
[93] St. Kubicki, W.M. Herrmann, and L. Holler. Critical comments on the rules by Rechtschaffen and Kales concerning the visual evaluation of EEG sleep records. In St. Kubicki and W.M. Herrmann, editors, Methods of sleep research, pages 19–35. Gustav Fischer Verlag, Stuttgart, 1985.
[94] A. Kumar. A real-time system for pattern recognition of human sleep stages by fuzzy system analysis. Pattern Recognition, 9(1):43–6, Jan 1977.
[95] N. Levinson. The Wiener RMS (root-mean-square) error criterion in filter design and prediction. Journal of Mathematics and Physics, 25:261–278, 1947.
[96] A.L. Loomis, E.N. Harvey, and G.A. Hobart III. Cerebral stages during sleep, as studied by human brain potentials. J. exp. Psychol., 21:127–144, 1937.
[97] I. Lorenzo, J. Ramos, C. Arce, M.A. Guevara, and M. Corsi-Cabrera. Effect of total sleep deprivation on reaction time and waking EEG activity in man. Sleep, 18(5):346–54, Jun 1995.
[98] D. Lowe. Feature space embeddings for extracting structure from single channel wake EEG using RBF networks. In Neural Networks for Signal Processing VIII. Proceedings of the 1998 IEEE Signal Processing Society Workshop, pages 428–37, New York, 1998. IEEE.
Bibliography 247
[99] R. Luthringer, R. Minot, M. Toussaint, F. Calvi-Gries, N. Schaltenbrand, and J.P. Macher. All-nightEEG spectral analysis as a tool for the prediction of clinical response to antidepressant treatment.Biol. Psychiatry, 38(2):98–104, Jul 1995.
[100] P.M. Macey, J.S. Li, and R.P. Ford. Deterministic properties of apnoeas in an abdominal breathingsignal. Med. Biol. Eng. Comput., 37(3):335–43, May 1999.
[101] P.M. Macey, J.S.J. Li, and R.P.K. Ford. Expert system for the detection of apnoea. EngineeringApplications of Artificial Intelligence, 11(3):425–38, Jun 1998.
[102] D.J.C. MacKay. The evidence framework applied to classification networks. Neural Computation,4(5):720–36, Sep 1992.
[103] D.J.C. MacKay. A practical bayesian framework for backpropagation networks. Neural Computa-tion, 4(3):448–72, May 1992.
[104] S. Makeig and T.P. Jung. Changes in alertness is principal component of variance in the EEGspectrum. NeuroReport, 7:213–216, 1995.
[105] J. Makhoul. Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4):561–580, 1975.
[106] M. Matsuura, K. Yamamoto, H. Fukuzawa, Y. Okubo, H. Uesugi, T. Kojima M. Moriiwa, and Y. Shi-mazono. Age development and sex differences of various EEG elements in healthy children andadults, quantification by a computerized wave form recognition method. Electroencephalogr. Clin.Neurophysiol., 60(5):394–406, May 1985.
[107] W.T. McNicholas. Sleep apnoea and driving risk. european respiratory society task force on ”publichealth and medicolegal implications of sleep apnoea” [editorial]. Eur. Respir. J., 13(6):1225–7,Jun 1999.
[108] L. T. McWhorter and L. L. Scharf. Nonlinear maximum likelihood estimation of autoregressivetime series. IEEE Transactions on Signal Processing, 43(12):2909–2919, 1995.
[109] R.G. Miller. The jackknife, a review. Biometrika, 61(1):1–15, Apr 1974.
[110] A. Mitchell. Liquid genius. New Scientist, 13 March 1999.
[111] M.M. Mitler, K.S. Gujavarty, and C.P. Browman. Maintenance of wakefulness test: a polysomno-graphic technique for evaluation treatment efficacy in patients with excessive somnolence. Elec-troencephalogr. Clin. Neurophysiol., 53(6):658–61, 1982.
[112] M.M. Mitler, J.S. Poceta, and B.G. Bigby. Sleep scoring technique. In Chokroverty [30].
[113] M. Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4):525–33, 1993.
[114] T.H. Monk, editor. Sleep, sleepiness and performance. Human performance and cognition. John Wiley & Sons, Chichester, England, 1991.
[115] M. Moore-Ede. We have ways of keeping you alert. New Scientist, pages 30–5, Nov. 13th 1993.
[116] MTI Research’s Alertness Technology. Alertness Monitor Technical summary. Available at:http://www.mti.com.
[117] S.S. Narayan and J.P. Burg. Spectral estimation of quasi-periodic data. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(3):512–518, March 1990.
[118] R.D. Ogilvie, D.M. McDonagh, S.N. Stone, and R.T. Wilkinson. Eye movements and the detection of sleep onset. Psychophysiology, 25(1):81–91, Jan 1988.
[119] M.M. Ohayon and C. Guilleminault. Epidemiology of sleep disorders. In Chokroverty [30].
[120] B.S. Oken and K.H. Chiappa. Short-term variability in EEG frequency analysis. Electroencephalogr.Clin. Neurophysiol., 69(3):191–8, Mar 1988.
[121] J.P. Howe on behalf of the Council of Scientific Affairs. Fatigue, sleep disorders, and motor vehicle crashes. Technical Report CSA Report 1-A-96, American Sleep Disorders Association, 1996.
[122] A.V. Oppenheim and R.W. Schafer. Digital Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.,1975.
[123] J. Pardey, S.J. Roberts, L. Tarassenko, and J. Stradling. A new approach to the analysis of the human sleep-wakefulness continuum. Journal of Sleep Research, pages 201–210, 1996.
[124] B. Parks, M. Olsen, and P. Resnik. WordNet: A machine-readable lexical database organized by meanings. Available at: http://work.ucsd.edu:5141/cgi-bin/http webster, 1991-98.
[125] T.A. Pedley and R.D. Traub. Physiological basis of the EEG. In Daly and Pedley [37].
[126] T. Penzel and R. Conradt. Computer based sleep recording and analysis. Sleep Medicine Reviews,4(2):131–48, Apr 2000.
[127] T. Penzel and J. Petzold. A new method for the classification of subvigil stages, using the Fourier transform, and its application to sleep apnea. Comput. Biol. Med., 19(1):7–34, 1989.
[128] P. Philip, J. Taillard, C. Guilleminault, M.A. Quera-Salva, B. Bioulac, and M. Ohayon. Long distance driving and self-induced sleep deprivation among automobile drivers. Sleep, 22(4):475–80, Jun 1999.
[129] D. Pitson, N. Chhina, S. Knijn, M. van Herwaaden, and J. Stradling. Changes in pulse transit time and pulse rate as markers of arousal from sleep in normal subjects. Clin. Sci. Colch., 87(2):269–73, Aug 1994.
[130] R.T. Pivik. The several qualities of sleepiness: psychophysiological considerations. In Monk [114],pages 3–37.
[131] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge, 2nd edition, 1994.
[132] J.C. Principe, S.K. Gala, and T.G. Chang. Sleep staging automaton based on the theory of evidence. IEEE Transactions on Biomedical Engineering, 36(5):503–9, May 1989.
[133] J.C. Principe and J.R. Smith. SAMICOS – a sleep analyzing microcomputer system. IEEE Transactions on Biomedical Engineering, 33(10):935–41, Oct 1986.
[134] P.F. Prior and D.E. Maynard. Monitoring cerebral function: long-term monitoring of EEG and evoked potentials. Elsevier, 1986.
[135] R. Cooper, C.D. Binnie, and C.J. Fowler. Origins and technique. In C.D. Binnie and J.W. Osselton, editors, Clinical Neurophysiology: EMG, nerve conduction and evoked potentials / EEG technology. Butterworth-Heinemann Ltd, Oxford, 1995.
[136] A. Rechtschaffen and A. Kales. A Manual of Standardized Terminology, Techniques and Scoring System for Sleep Stages of Human Subjects. Public Health Service, U.S. Government Printing Office, Washington D.C., 1968.
[137] I.A. Rezek and S.J. Roberts. Stochastic complexity measures for physiological signal analysis. IEEE Transactions on Biomedical Engineering, 45(9):1186–91, 1998. Available at: http://www.robots.ox.ac.uk/ sjrob/pubs.h.
[138] B.D. Ripley. Statistical theories of model fitting. In NATO Advanced Study Institute on Generalization in Neural Networks and Machine Learning, volume 168 of NATO ASI Series F: Computer and Systems Sciences. Springer, Cambridge, U.K., August 1998.
[139] S. Roberts, I. Rezek, R. Everson, H. Stone, S. Wilson, and C. Alford. Automated assessment of vigilance using committees of radial basis function analysers. IEE Proceedings – Science, Measurement and Technology, 147(6):333–338, 2000.
[140] T. Roth, T.A. Roehrs, and L. Rosenthal. Measurement of sleepiness and alertness: Multiple sleep latency test. In Chokroverty [30].
[141] J. W. Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers,C-18(5):401–409, 1969.
[142] J. Santamaria and K.H. Chiappa. The EEG of drowsiness in normal adults. J. Clin. Neurophysiol.,4(4):327–82, Oct 1987.
[143] N. Schaltenbrand, R. Lengelle, and J.P. Macher. Neural network model: application to automatic analysis of human sleep. Comput. Biomed. Res., 26(2):157–71, Apr 1993.
[144] B. Scholkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans. Signal Processing, 45:2758–65, 1997. Available at http://www.kernel-machines.org/papers/AIM-1599.ps.
[145] F.W. Sharbrough. Electrical fields and recording techniques. In Daly and Pedley [37].
[146] F.Z. Shaw, R.F. Chen, H.W. Tsao, and C.T. Yen. Algorithmic complexity as an index of cortical function in awake and pentobarbital-anesthetized rats. J. Neurosci. Methods, 93(2):101–10, Nov 1999.
[147] D.K. Siegwart, L. Tarassenko, S.J. Roberts, J.R. Stradling, and J. Partlett. Sleep apnoea analysis from neural network post-processing. In Proceedings of the Fourth International Conference on Artificial Neural Networks, pages 427–32, London, UK, 1995. IEE.
[148] D.W. Skagen. Estimation of running frequency spectra using a Kalman filter algorithm. Journal of Biomedical Engineering, 10(3):275–9, May 1988.
[149] J.R. Smith. Automated analysis of sleep EEG data. In F.H. Lopes da Silva, W. Storm van Leeuwen, and A. Remond, editors, Handbook of Electroencephalography and Clinical Neurophysiology, volume 2. Elsevier Science Publishers, 1986.
[150] J.R. Smith and I. Karacan. EEG sleep stage scoring by an automatic hybrid system. Electroencephalography and Clinical Neurophysiology, 31(3):231–7, Sep 1971.
[151] J.R. Smith, I. Karacan, and M. Yang. Automated analysis of the human sleep EEG. Waking andSleeping, 2:75–82, 1978.
[152] E. Stanus, B. Lacroix, M. Kerkhofs, and J. Mendlewicz. Automated sleep scoring: a comparative reliability study of two algorithms. Electroencephalogr. Clin. Neurophysiol., 66(4):448–56, Apr 1987.
[153] M.B. Sterman, G.J. Schummer, T.W. Dushenko, and J.C. Smith. Electroencephalographic correlates of pilot performance: simulation and in-flight studies. In Electric and Magnetic Activity of the Central Nervous System: Research and Clinical Applications in Aerospace Medicine, pages 31/1–16, Neuilly sur Seine, France, Feb 1988. AGARD.
[154] J.R. Stradling. Personal communication.
[155] J.R. Stradling. Handbook of Sleep-Related Breathing Disorders. Oxford University Press, Oxford,1993.
[156] J.R. Stradling, D.J. Pitson, L. Bennett, C. Barbour, and R.J.O. Davies. Variation in the arousal pattern after obstructive events in obstructive sleep apnea. Am. J. Respir. Crit. Care. Med., 159(1):130–6, Jan 1999.
[157] K. Swingler and L.S. Smith. Producing a neural network for monitoring driver awareness. Neural Computing and Applications, 4:96–104, 1996.
[158] T. Shimada, T. Shiina, and Y. Saito. Detection of characteristic waves of sleep EEG by neural network analysis. IEEE Transactions on Biomedical Engineering, 47(3):369–79, 2000.
[159] L. Tarassenko. A Guide to Neural Computing Applications. Arnold, London, 1998.
[160] L. Tarassenko, J. Pardey, S. Roberts, H. Chia, and M. Laister. Neural network analysis of sleep disorders. In Proceedings of ICANN'95, Paris, Oct 1995. European Neural Network Society.
[161] J. Teran-Santos, A. Jimenez-Gomez, and J. Cordero-Guevara. The association between sleep apnea and the risk of traffic accidents. Cooperative Group Burgos-Santander. New England Journal of Medicine, 340(11):847–51, 1999.
[162] M.E. Tipping. The relevance vector machine. In S.A. Solla, T.K. Leen, and K-R. Muller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, Cambridge, Mass, 2000. Available at http://www.kernel-machines.org/papers/upload 10444 rvm nips.ps.
[163] M.E. Tipping and D. Lowe. Shadow targets: a novel algorithm for topographic projections by radial basis functions. Neurocomputing, 19(1-3):211–22, Mar 1998.
[164] L. Torsvall and T. Akerstedt. Extreme sleepiness: Quantification of EOG and spectral EEG parameters. Intern. J. Neuroscience, 38:435–441, 1988.
[165] N. Townsend and L. Tarassenko. Micro-arousals in human sleep: An initial evaluation of automatic detection. Robotics Research Group, Department of Engineering Science, Oxford University, Oxford, 1996.
[166] U. Trutschel, R. Guttkuhn, C. Ramsthaler, M. Golz, and M. Moore-Ede. Automatic detection of microsleep events using a neuro-fuzzy hybrid system. In 6th European Congress on Intelligent Techniques and Soft Computing, EUFIT'98, volume 3, pages 1762–6, Verlag Mainz, Aachen, Germany, 1998.
[167] S. Uchida, I. Feinberg, J.D. March, Y. Atsumi, and T. Maloney. A comparison of period amplitude analysis and FFT power spectral analysis of all-night human sleep EEG. Physiol. Behav., 67(1):121–31, Aug 1999.
[168] S. Uchida, M. Matsuura, S. Ogata, T. Yamamoto, and N. Aikawa. Computerization of Fujimori's method of waveform recognition. A review and methodological considerations for its application to all-night sleep EEG. J. Neurosci. Methods, 64(1):1–12, Jan 1996.
[169] V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9, pages 281–287. MIT Press, Cambridge, Mass, 1997. Available at http://www.kernel-machines.org/papers/vapgolsmo96.ps.
[170] A. Varri, K. Hirvonen, J. Hasan, P. Loula, and V. Hakkinen. A computerized analysis system for vigilance studies. Comput. Methods Programs Biomed., 39(1-2):113–24, Sep-Oct 1992.
[171] R. Venturini, W.W. Lytton, and T.J. Sejnowski. Neural network analysis of event related potentials and electroencephalogram predicts vigilance. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 651–658. Morgan Kaufmann Publishers, San Mateo, CA, 1992.
[172] M.L. Vis and L.L. Scharf. A note on recursive maximum likelihood for autoregressive modeling.IEEE Transactions on Signal Processing, 42(10):2881–3, Oct 1994.
[173] J. Wright, R. Johns, I. Watt, A. Melville, and T. Sheldon. Health effects of obstructive sleep apnoea and the effectiveness of continuous positive airways pressure: a systematic review of the research evidence. British Medical Journal, 314:851–60, Mar 1997.
[174] G.U. Yule. On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers. Philosophical Transactions of the Royal Society of London, A226:267–98, 1927.
[175] M. Zamora. How disturbed is your sleep? The study of arousals using neural networks. In Neural Computing Application Forum Meeting, Oxford, England, Sep 1998. NCAF.
[176] M. Zamora and L. Tarassenko. The study of micro arousals using neural network analysis of the EEG. In IEE Ninth International Conference on Artificial Neural Networks, volume 2, pages 625–30, Edinburgh, Scotland, Sep 1999. IEE.