On the Variability of TEOAE Human Identification and Verification System
by
Jin Sung Kang
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
The Edward S. Rogers Sr. Department of Electrical & Computer Engineering, University of Toronto
© Copyright 2018 by Jin Sung Kang
Abstract
On the Variability of TEOAE Human Identification and Verification System
Jin Sung Kang
Master of Applied Science
The Edward S. Rogers Sr. Department of Electrical & Computer Engineering.
University of Toronto
2018
This study presents a deep neural network architecture that achieves state-of-the-art multi-session
verification and identification performance for a Transient Evoked Otoacoustic Emission
(TEOAE) biometric system. The TEOAE is a roughly 20 ms response generated by the ear that is
naturally resistant to falsification and replay attacks, and it can be measured using a device
with a speaker and multiple microphones. Previous TEOAE authentication methods focused
on single-session or mixed-session performance; our method focuses on multi-session authentication
performance. We train a neural network model that generates a TEOAE embedding that
is separable in Euclidean space using the triplet loss function. These embeddings are used
to create identity templates, which in turn are used to authenticate the user. Averaged across all
tests, our method yields a 7.56% performance increase in identification scenarios and a 13.3%
performance increase in verification scenarios over previous methods.
Acknowledgements
I would like to thank Professors Dimitrios Hatzinakos and Yuri Lawryshyn for guiding me with
their mentorship and support. Their expertise has helped me become a better researcher and
a better person.
I would like to thank my family: my mom, my dad, and my sister, without whose love and
support I would not be here. They have stuck by me every step of the way, through all
my highs and lows, and were ever more patient. Mom and dad, "you da real MVP".
I really want to thank Yanshuai Cao for his expertise and encouragement. I do not know
how I would have finished this work without your help. Every time I struggled, you were there
to bail me out. I would like to thank Foteini Agrafioti for her constant support and motivation
to help me finish this work. To Joey Bose for accompanying me to the much-needed caffeine
walks and for his positivity during the stressful times. I would also like to thank everyone at
Borealis AI for their support. Every time I was stuck and needed to vent, they complained
about how hard their schooling was in a genuine effort to make me feel better. I would like to
thank my labmates Mahjid Komeli, Sherif Seha, and Umang Yadav who helped me get through
the courses and provided me with their expertise in biometric security research.
I would like to thank Andrew Persaud, Sohaib Qureshi, Kevin Lee, Bhavik Vyas, Alan Li,
Andre Yang, Dominic Cheng, Fortunato Guanlao, George Jose, Henry Liu, Jawad Ateeq, Rahjee
Martanda, Rahul Udasi, Sam Haruna, SeungWan Choi, Shehzad Akbar, and Van Nguyen for
their support. There were definitely times when I was losing my mind, and you were there to
check up on me and helped me stay sane. Thank you all for sending your positive vibes my
way.
I would like to thank Vivosonic and the Royal Bank of Canada for their funding.
To my parents, who will always be my heroes
Contents
Acknowledgements iii
Dedication iv
List of Tables viii
List of Figures ix
List of Acronyms xi
1 Introduction 1
1.1 Research Goals and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Literature Review 6
2.1 TEOAE Biometrics Literature Review . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Biometrics using Neural Networks . . . . . . . . . . . . . . . . . . . . . . 9
3 Methodology 11
3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Components of Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.1 1D Convolutional Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.2 Rectified Linear Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.3 Layer Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.4 Max pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.5 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.6 Triplet Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.7 Triplet Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.1 Convolution Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.2 Embedding Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.3 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.4 Full Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.5 Fusion Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5.1 Graphics Processing Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5.2 Hyperparameter Explanation . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5.3 Hyperparameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5.4 Weight Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5.5 Mini-batch Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5.6 Training Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 Templates and Comparison Metrics 41
4.1 Identity Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.1 Mean Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.2 SVM Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1.3 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Comparison Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1 Euclidean Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Cosine Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.3 Pearson Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.1 Enrollment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.2 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.3 Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5 Experiments 49
5.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.1.1 54 subject test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.1.2 24 subject test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.1.3 10 subject test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.1.4 Dataset Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.1.5 Neural Network Generalization . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1.6 Tested Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2.1 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2.2 Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3 Single Ear Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.1 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.2 Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.4 Both ears . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4.1 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4.2 Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.5 Training Time Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.6 Inference Time Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6 Conclusions 66
Bibliography 67
A Performance 75
A.1 File Size Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
B Failed Experiments 77
B.1 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
B.2 Simple Convnet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
B.3 One Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
B.4 Auto Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
B.5 CWT and Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
B.6 Quadruplet Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
C Data Splits 82
List of Tables
3.1 TEOAE recording protocol [17] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Network hyperparameters. The input and output sizes are shown as features ×
channels. The kernel for a convolution block is shown as kernel size × channels,
max pooling is shown with kernel size k and stride s, and an embedding block is
described as kernel size × channels, embedding size. . . . . . . . . . . . . . . . . 37
3.3 List of Training Hyperparameters and their values . . . . . . . . . . . . . . . . . 37
5.1 Number of responses in the training data set averaged across 20 different data
splits. The training set of the 24 subject test includes responses from 30 subjects,
and the training set of the 10 subject test includes responses from 40 subjects.
The data splits and the number of responses in the training set for each data
split are shown in Table C.2 and Table C.1. . . . . . . . . . . . . . . . . . . . . . 51
5.2 Verification performance of different methods for 54 subject test . . . . . . . . . 55
5.3 Verification performance of different methods for 24 subject test . . . . . . . . . 55
5.4 Verification performance of different methods for 10 subject test . . . . . . . . . 56
5.5 Identification performance of different methods for 54 subject test . . . . . . . . 59
5.6 Identification performance of different methods for 24 subject test . . . . . . . . 59
5.7 Identification performance of different methods for 10 subject test . . . . . . . . 59
5.8 Verification performance of different methods for fusion of ear scenario . . . . . . 62
5.9 Identification performance of different methods for fusion of ear scenario . . . . . 62
5.10 Training time for neural networks with different data sizes . . . . . . . . . . . . . 64
5.11 Training time for CWT/LDA method with different data sizes . . . . . . . . . . 64
5.12 Inference time comparison between neural network and CWT/LDA . . . . . . . . 65
A.1 PyTorch model file sizes for different hyperparameters . . . . . . . . . . . . . . . 76
C.1 Random seed and Test Subject distribution for 10 subject test . . . . . . . . . . 82
C.2 Random seed and Test Subject distribution for 24 subject test . . . . . . . . . . 83
List of Figures
1.1 Diagram showing how the OAE response acquisition device operates. The speaker
and microphones are on the earpiece. [36] . . . . . . . . . . . . . . . . . . . . . . 2
3.1 The distributions of the data in TEOAE database . . . . . . . . . . . . . . . . . 13
3.2 Location of the first and last 10 responses relative to the recording session. The
length of each recording session differs across subjects, ranging from 23 to 336
responses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 First 10 TEOAE responses for subject 1 . . . . . . . . . . . . . . . . . . . . . . . 15
3.4 Last 10 TEOAE responses for subject 1 . . . . . . . . . . . . . . . . . . . . . . . 16
3.5 Last 10 TEOAE averaged responses for multiple subjects . . . . . . . . . . . . . 17
3.6 Proposed Network architecture for single ear authentication. The number beside
the convolution block denotes the dimension of the convolutional filter, K is the
kernel size, and S is the stride of the max pooling layer. . . . . . . . . . . . . . . 19
3.7 The components of the convolution and the embedding block. K represents the
Kernel size, and P represents the padding size. /2 for 1D-Convolution means
that the number of channels is reduced by half. . . . . . . . . . . . . . . . . . . . 20
3.8 Example of a 1D-convolution. This diagram shows a 1-D convolution layer with
a kernel size of 3, a stride of 1, and padding of 1. The input to the layer is
shown in grey. The original input is shown in dark grey, and the added padding
is shown in light grey. The kernel slides over the input, and the inner product
between the input and the kernel is the output to the layer. The equation for
calculating the output y2 is y2 = w1 ∗ x1 + w2 ∗ x2 + w3 ∗ x3 . . . . . . . . . . . 21
3.9 Graph of the Rectified Linear Unit(ReLU) function . . . . . . . . . . . . . . . . . 22
3.10 Example of a Layer Normalization layer. Layer norm calculates mean and the
variance along the feature dimension. The total number of features learned is
2 × mini-batch size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.11 Example of a max pooling layer. This diagram shows a max pooling layer with
a kernel size of 3, a stride of 1, and a padding of 1. The input to the layer is
shown in grey. The original input is shown in dark grey, and the added padding
is shown in light grey. The kernel slides over the input, and the
maximum value of the input is produced as the output . . . . . . . . . . . . . . . 24
ix
3.12 How a standard neural network differs from a neural network with dropout [59] . . 25
3.13 Diagram for Triplet loss. Error is calculated when the negative is closer to anchor
than the positive. The end goal for training is the diagram on the right. . . . . . 27
3.14 Proposed Network architecture for fusion of both ears. The number beside the
conv block denotes the dimension of the convolutional filter, K is the kernel size,
and S is the stride of the max pooling layer. . . . . . . . . . . . . . . . . . . . . . 33
4.1 System diagram for enrolling a new user. SVM Template Enrollment is shown
on the top and Enrollment using Mean Template is shown on the bottom. . . . . 47
4.2 System diagram for Verification of a probe sample is shown. Verifying using the
SVM template is shown at the top, and Mean template is shown at the bottom . 48
4.3 System diagram for identifying a probe-sample is shown. Identification using
SVM template is shown at the top, and Mean Template is shown at the bottom. 48
5.1 Verification scenario EER graphs for subject 0 . . . . . . . . . . . . . . . . . . . 57
5.2 CMC curve for single ear identification scenario . . . . . . . . . . . . . . . . . . . 60
5.3 CMC curve for fusion ear identification scenario . . . . . . . . . . . . . . . . . . . 63
B.1 Simple Architecture used to perform classification . . . . . . . . . . . . . . . . . . 78
B.2 Architecture diagram for training the convolutional neural network without shared
parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
B.3 The final objective of quadruplet loss . . . . . . . . . . . . . . . . . . . . . . . . . 81
List of Acronyms
BEMD Modified Bivariate Empirical Mode Decomposition. 6
BioSec Biometrics Security Lab. 11
CMC Cumulative Match Characteristic. 53
CPU Central Processing Unit. 9
CWT Continuous Wavelet Transform. 4
EER Equal Error Rate. 7
FAR False Acceptance Rate. 52
FRR False Rejection Rate. 52
GB Gigabyte. 34
GPU Graphics Processing Unit. 9
HBM High Bandwidth Memory. 34
IoT Internet of Things. 1
LDA Linear Discriminant Analysis. 7
MLE Maximum Likelihood Estimation. 6
PCIe Peripheral Component Interconnect Express. 34
PDF Probability Density Function. 6
ReLU Rectified Linear Unit. 21
ROC Receiver Operating Characteristic. 46
SGD Stochastic Gradient Descent. 8
STFT Short-Time Fourier Transform. 52
SVM Support Vector Machine. 42
TEOAE Transient Evoked Otoacoustic Emission. 1
WWR Whole Wave Reproducibility. 11
Chapter 1
Introduction
Smartphones store ever more of our personal information: our communication with
one another, photos we capture, locations we visit, our banking information and other sensitive
data can all be accessed through a smartphone. With more devices connecting to the internet
due to the increase in the Internet of Things (IoT) devices and continual improvements to 5G
networks, the risk of security failure has never been higher. The latest smartphones implement
biometric authentication systems using fingerprint, face, and iris. These technologies have seen
significant improvements over the years, largely owing to machine learning and deep neural
networks. As these security systems advance, forgery and spoofing techniques also evolve. For
example, generating a realistic video using limited photos is becoming simpler [62], images
can be manipulated in a manner unnoticeable to classification [20, 41, 53], and Bose et al. [7]
demonstrated that advanced face detectors could be fooled. With the advent of social media,
images of faces can be easily obtained and misused to attack facial biometric authentication
systems.
Additionally, fingerprints can be left on glass surfaces to create moulds to bypass security.
In addition to these weaknesses, there is a significant problem with these biometric modalities:
the biometric information is compromised forever when the information is stolen. Fingerprints,
face, and iris information are very difficult to replace.
Transient Evoked Otoacoustic Emission (TEOAE) is a response generated by the ear after
applying a low-level transient click stimulus, and it is present in most individuals' ears (99%+)
[23]. It is mainly used to diagnose hearing loss in infants and the elderly. TEOAE responses are
dependent on factors such as ear structure and genetics [29]. The cochlea generates a response
that quickly dissipates after the stimulus is applied. The average length of the response is about
20 ms and is measured using an earphone-like device inserted into the ear. The measuring device
contains a speaker and multiple microphones. The speaker produces a stimulus signal, and the
microphones record the responses in quick succession. The diagram for the device operation is
shown in Figure 1.1.
Figure 1.1: Diagram showing how the OAE response acquisition device operates. The speaker and microphones are on the earpiece. [36]
Unlike the biometric modality available on smartphones, TEOAE is naturally strong against
replay and falsification attacks because it requires the attacker to have a complete copy of the
inner ear. Everything in the ear, including the membranes and the fluid consistency, has
to be matched to generate the same response. TEOAE is known to produce responses with
different stimulus signals [29] and further studies are required to prove that the underlying
structure of TEOAE responses produced with different stimulus signals is also different. If
the studies prove that authentication can be done using a response from a different stimulus,
TEOAE may mitigate some of the risks associated when biometric information is compromised.
When TEOAE biometric information is stolen, compromised individuals can re-register using
a TEOAE produced with a different stimulus signal to restore their credentials.
Currently, specialized equipment is needed to capture TEOAE responses. The smallest
TEOAE measurement device available is approximately the size of a smartphone [64]. Although
there are current technological limitations to integrating TEOAE measurement into smartphones,
it is not hard to imagine integrating a TEOAE capturing device into headphones. There are
more than 350 million earphones sold every year [60]. An audio device with TEOAE can contin-
uously authenticate users to provide a different dimension to human-computer interaction and
enable new forms of entertainment. Tracking immunization and hospital records for babies
without the proper form of identification is a complicated process in developing countries. Newborns
already receive hearing tests using a TEOAE device, and TEOAE biometric authentication can
be integrated using these devices to keep track of health records. Search and rescue missions
that require rescuers to be on the move and to retain hand dexterity can use TEOAE authentication,
because fingerprint and face recognition would be impractical. Any other applications where
the face has to be covered due to protection or privacy can apply TEOAE authentication.
1.1 Research Goals and Contributions
The primary focus of this work is to test the viability of multi-session identification and verifi-
cation. Data measured in one sitting is defined as one session, and a multi-session problem uses
multiple sessions. Previous methods have not focused on the viability of TEOAE biometrics
on multiple session data. We work on increasing the authentication performance compared to
previous methods, which have focused on single or mixed sessions. Our work contributes the
following:
• We designed a deep neural network model that can learn the characteristics of a TEOAE
response. This model generalizes across multiple sessions and outperforms previous meth-
ods in identification and verification scenarios.
• We designed a neural network architecture that does not require retraining when a new
subject is registered. Neural networks are challenging to train, and conventional neural
network identification methods require retraining a model to incorporate the newly reg-
istered identities. We present an architecture that can register new identities without
retraining the network.
• We propose neural network training techniques to train the network faster and generalize
better by incorporating the latest advancements in neural networks. Architecture
designs from multi-task learning were used to reduce the number of parameters, which
enables faster processing and better generalization.
• We simplified the feature extraction step required in previous methods. The neural net-
work was designed to extract features directly from a normalized response, while previous
methods have used Continuous Wavelet Transform (CWT) to extract the features. Two
hyperparameters are required to extract features using CWT: the mother wavelet, and
the scale. These hyper-parameters had to be tuned separately for the left ear, the right
ear, and the dataset to get the best result. The combination of multiple CWT scales
results in worse performance than a system that uses a tuned scale. We designed a model
that works without specific hyper-parameters for different ears and datasets.
• We tested the effectiveness of our method using the TEOAE biometric database. Perfor-
mance of identification and verification scenarios for a single ear TEOAE were tested, and
extended to a fusion of left and right ear TEOAE. The TEOAE database was collected by
Biometrics Security Laboratory at the University of Toronto under the protocol reference
# 23018.
1.2 Thesis Organization
The remainder of the thesis is organized as follows:
Chapter 2 presents the background for TEOAE biometric authentication using neural net-
works. We discuss previous research that shows TEOAE response as a viable biometric authen-
tication modality. We provide background information on neural networks and deep learning
and examine the application of deep neural networks in other biometric systems that contribute
to our work.
Chapter 3 describes the methodology. First, the TEOAE dataset and the TEOAE responses
from different subjects are presented. The collection process and the protocol used for collecting
the dataset are discussed. It also provides background information for neural network layers
and discusses the architecture of our proposed neural network model. It includes the reasoning
behind choosing each neural network layer, and how the layers fit into the final architecture.
Finally, the training methods and algorithms that are implemented to train the neural network
are presented.
Chapter 4 introduces various identity templating methods and discusses the pros and cons of
each method. Distance metrics used to compare a probe sample to the template are examined.
This chapter also explains the system architecture required to build an authentication system.
Chapter 5 presents the experimental settings and various tests along with metrics used to
compare these tests. The results and discussion of the tests for both verification and identification
scenarios are presented to show the effectiveness of our approach. The computational efficiency
of previous methods and of our neural network model is also compared.
Chapter 6 concludes our work and provides directions for future studies.
Chapter 2
Literature Review
The TEOAE has been investigated before to assess its effectiveness as a biometric modality
for authentication. The TEOAE biometric modality is not heavily researched, and the University
of Toronto Biometric Security group is one of the few institutions continuing the research in
TEOAE. In this section, previous research on TEOAE biometric authentication systems is pre-
sented, along with a short background on neural networks and a discussion of the advancements
in other areas of machine learning that contribute to our work.
2.1 TEOAE Biometrics Literature Review
The original study by Swabey et al. [63] showed that the TEOAE response was a viable biometric
modality for an authentication system. The paper visually investigated the inter-class and
intra-class differences in TEOAE responses using a dataset with hundreds of subjects spanning
a six-month period. The methods in this study mathematically modelled the responses in
the time domain without any transformation. The inter-class and the intra-class distances were
estimated by Maximum Likelihood Estimation (MLE) to approximate the Probability Den-
sity Function (PDF), and the distance between two responses was calculated using Euclidean
distances. The study concluded that TEOAE responses were not only different among individ-
uals but were repeatable with a high degree of reliability, making them suited for a biometric
authentication system.
Agrafioti et al. [18] worked on a model that applied a Modified Bivariate Empirical Mode
Decomposition (BEMD) to build an identification system. BEMD with an auditory model was
applied to decompose a response to generate multi-level local oscillation components. These
components were then used to calculate matching scores. Their work first explored the possi-
bility of fusing results from both ears for identification to increase identification performance.
Liu and Hatzinakos continued the work by applying a neural network autoencoder using
CWT features [37]. Their work reduced the dimensions of a TEOAE response using a neural
network to generate an embedding. These embeddings were compared using the Euclidean
distance metric. Equal Error Rate (EER) is a metric used to compare different biometric
systems. It measures the accuracy at which the proportion of the false matches and the false
non-matches are the same for a given test set. This method had a high system-level EER, but a
low individual-level EER. The results showed that the embeddings were separable individually,
but the variability among the individuals was very large. This work did not explore identification
scenarios.
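As an illustrative sketch of how EER is computed (not the evaluation code of the cited works), one can sweep an acceptance threshold over genuine and impostor distance scores until the false acceptance and false rejection rates meet, assuming lower distances indicate a better match:

```python
import numpy as np

def compute_eer(genuine, impostor):
    """Estimate the Equal Error Rate from distance scores.

    A probe is accepted when its distance to the template falls at or
    below the threshold, so lower distances mean a better match.
    """
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        frr = np.mean(genuine > t)    # genuine pairs wrongly rejected
        far = np.mean(impostor <= t)  # impostor pairs wrongly accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy, perfectly separable scores: the EER is 0.
genuine = np.array([0.10, 0.20, 0.25, 0.30])
impostor = np.array([0.60, 0.70, 0.80, 0.90])
print(compute_eer(genuine, impostor))  # 0.0
```

A lower EER indicates a better-separated system; real TEOAE scores overlap, so the EER is strictly positive in practice.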
Another work by Liu and Hatzinakos used CWT and Linear Discriminant Analysis (LDA)
[36] to achieve state-of-the-art performance on verification and identification scenarios. This
paper used CWT as a feature extractor and trained an LDA model to reduce the dimensionality
of the features. Pearson correlation distance was used to determine the similarity between two
responses. This work showed great promise for single session authentication scenarios, but it
did not explore multi-session identification scenarios. The method for identification presented
in this work could only be applied to a test set where all individuals are registered on the
system. Even when the response of an unregistered individual was presented to the system,
the model always predicted a registered individual. The identification method was difficult to
apply in real-world systems.
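The Pearson correlation distance mentioned above is commonly defined as one minus the Pearson correlation coefficient of the two feature vectors; a minimal sketch follows (the exact formulation used in [36] may differ):

```python
import numpy as np

def pearson_distance(x, y):
    """Pearson correlation distance: 0 for perfectly correlated vectors,
    1 for uncorrelated ones, and 2 for perfectly anti-correlated ones."""
    xc = x - x.mean()
    yc = y - y.mean()
    r = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
    return 1.0 - r

a = np.array([1.0, 2.0, 3.0, 4.0])
print(pearson_distance(a, 2 * a + 1))  # ~0.0: identical up to an affine map
print(pearson_distance(a, -a))         # ~2.0: perfectly anti-correlated
```

Unlike Euclidean distance, this metric is insensitive to scale and offset, which helps when response amplitude varies between recordings.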
Komeili et al. [30] presented their work to reduce TEOAE acquisition time, generalize across
multiple sessions, and reduce computational complexity for verification scenarios. Ten responses
were randomly picked from the first quarter of a recording session to reduce acquisition time. It
was an improvement over the previous methods that picked the last ten samples from a recording
session. This work also aimed to generalize verification scenarios to multi-session data. The
algorithm proposed in this work learns to select a subset of features that maximizes verification
performance from a list of pre-determined features. The pre-processing step required to generate
the list of features in this work was time-consuming and computationally burdensome. This
paper did not explore identification scenarios.
For our work, we attempt to overcome the shortcomings of previous methods. Most works
have not tested their methods against both verification and identification scenarios, did not
test their algorithms on multi-session authentication, had a feature extractor which limited
real-world application, and had to be retrained for every new registrant. In this thesis, we
focus on building a multi-session authentication system, building an effective TEOAE feature
extractor, and reducing the number of retraining steps when registering a new identity.
2.2 Neural Networks
Neural networks were originally designed as an attempt to mimic the brain. A neural network
contains multiple neurons that are connected with each other. Connections between neurons
have weights and biases that are learned from various examples and data. Neural network
models are trained using the backpropagation algorithm [52]. The objective function calculates
the error of a neural network when it produces the wrong output. Gradients of the error with
respect to the model weights are used by the backpropagation algorithm to update neural
network weights. Since gradients have to be calculated for backpropagation to work, it is
important that neural networks be differentiable.
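A single training step of this forward/backward cycle can be sketched in PyTorch as follows. This is a generic illustration, not the thesis model: the layer sizes, loss function, and learning rate are placeholders.

```python
import torch
import torch.nn as nn

# A toy fully connected network; all sizes are placeholders.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()                        # objective function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 10)              # a mini-batch of 8 examples
target = torch.randint(0, 2, (8,))  # toy labels

output = model(x)                   # forward propagation
loss = criterion(output, target)    # error of the network
optimizer.zero_grad()               # clear gradients from the previous step
loss.backward()                     # backpropagation: d(loss)/d(weights)
optimizer.step()                    # update the weights using the gradients
```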
Neural networks require a large training set to be useful, but because of computational
constraints, the full training set often cannot fit into memory. A training set is divided into
smaller sets known as mini-batches. Neural networks cannot compute full gradient information
from a mini-batch because a mini-batch is only a subset of the training set. Instead, an approx-
imated gradient from a mini-batch is used to train a network. The optimization algorithm to
find the optimal neural network weights is called Stochastic Gradient Descent (SGD). SGD is
known to converge to a global minimum under relaxed constraints when the objective function
is convex, and to a local minimum otherwise [8]. Neural networks are in general non-convex
and non-linear, but SGD works well in practice.
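As a toy illustration of the mini-batch SGD procedure described above (not from the thesis; the data, learning rate, and batch size are made-up values), the following sketch fits a convex least-squares model, where SGD is known to converge:

```python
import numpy as np

# Toy convex problem: recover w_true from noiseless linear data.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))            # training set
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)                          # model weights
lr, batch_size = 0.1, 32                 # arbitrary hyperparameters
for epoch in range(200):
    idx = rng.permutation(len(X))        # sample without replacement
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        # Approximate gradient computed from the mini-batch only.
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad                   # SGD update

print(np.round(w, 3))                    # close to w_true
```

Each update uses only a 32-sample approximation of the gradient, yet the iterates still converge on this convex objective.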
Mini-batches are often randomly sampled without replacement [21, 38]. Smaller mini-batches can be computationally efficient but may fail to converge due to the high gradient variance per mini-batch. Bigger mini-batches are computationally inefficient and may get stuck in a local minimum due to low variance. Tuning the mini-batch size is therefore an essential part of the neural network training process.
Neural networks have two operation modes: the training mode and the inference mode. In the training mode, network weights are updated using the backpropagation algorithm. In the inference mode, the network makes predictions based on input data. Depending on the type of layer used in an architecture, a neural network model is computed differently in the two modes [26]. Both modes perform forward propagation to calculate the output of the network, but the training mode has an additional backpropagation step to update the network [52]. For efficient training, a Graphics Processing Unit (GPU) is required [31]. The inference mode is faster because it does not require the backpropagation step, and depending on the size of the network, inferences can be made using a Central Processing Unit (CPU) in a reasonable time.
The configuration of a network depends on the type of problem and the amount of training data. Neural networks contain many hyperparameters; some guidelines exist to determine their ranges, but the ranges vary with the problem and the dataset. Bergstra and Bengio [5] have shown that random search is exceptionally effective for neural network hyperparameter search.
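A hypothetical sketch of the random-search idea: sample each hyperparameter independently from a chosen range and keep the configuration with the best validation score. The `validation_score` function below is a stand-in for a real training-plus-validation run (an assumption for illustration only):

```python
import random

random.seed(42)

def validation_score(lr, batch_size):
    # Stand-in for training + validation (assumed: lower is better).
    return (lr - 0.01) ** 2 + (batch_size - 64) ** 2 / 1e4

best = None
for _ in range(50):                                   # 50 random trials
    cfg = {
        "lr": 10 ** random.uniform(-4, -1),           # log-uniform range
        "batch_size": random.choice([16, 32, 64, 128]),
    }
    score = validation_score(cfg["lr"], cfg["batch_size"])
    if best is None or score < best[0]:
        best = (score, cfg)

print(best[1])
```

Sampling the learning rate on a log scale is a common choice because its useful values span several orders of magnitude.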
2.2.1 Deep Neural Networks
The current boom in artificial intelligence is due to deep neural networks [31]. With sufficient data and a deeper neural network, problems that were difficult for traditional machine learning algorithms can be solved. Deep neural networks are constructed by stacking multiple layers of neural network components on top of each other.
2.2.2 Biometrics using Neural Networks
Deep neural networks have been applied to vision-based authentication systems and have been shown to increase their performance. Fingerprint [35] and signature recognition [10] systems have applied neural networks since the early 1990s.
In face re-identification tasks, the Siamese network architecture proposed by Chopra et al. [14] has been commonly used. A Siamese network produces embeddings that are closer together when two samples are from the same identity, and further apart when they are from different identities. The contrastive loss function [22] is used to train a Siamese network. Further work was done by Koch et al. [28] to enable one-shot learning for face re-identification tasks. One-shot learning focuses on training a new class using one or a small number of samples, which allows neural networks to be trained even when training data is not plentiful. One-shot learning is useful in a biometric authentication system because registration needs to be quick and the amount of data is limited. In this thesis, we apply the techniques of one-shot learning and parameter sharing to allow registration of new users with a limited number of samples.
Parameter sharing has been used extensively for multi-task learning [51]. This technique is used when multiple tasks are related to each other. An example would be classifying types of clothes: a T-shirt, a dress, and a buttoned-up shirt share similar attributes, but classifying a specific type of T-shirt or dress may be too difficult for a neural network. Instead of learning the different types using multiple networks, an architecture can be designed to learn common attributes using shared parameters and distinct attributes using individual parameters. Li et al. [65] applied parameter sharing to help their network model learn both the global and the local perspectives for person re-identification tasks. We apply a similar concept to our network architecture to guide our model to learn the attributes shared between the left and right ear TEOAE, while using individual weights to learn the differences.
Schroff et al. [55] used triplet loss networks [25] to achieve state-of-the-art results for face re-identification. Triplet loss networks are similar to Siamese networks: the network learns to keep data points with the same label close together and data points with different labels far apart. The paper also explored the best method for choosing triplet pairs to reduce training time and increase accuracy, proposing the selection of hard triplets that maximize the loss function. Further discussion of triplet loss networks is presented in the next chapter.
Outside of image-based biometric systems, time-signal-based biometric systems have also adopted neural networks. Biometric modalities such as ECG [46] and EEG [3, 54, 56, 68] are modelled using neural networks for identification and verification. Continuing this trend, we use neural networks to learn TEOAE response characteristics.
Chapter 3
Methodology
3.1 Dataset
The TEOAE dataset was collected at the University of Toronto by the Biometrics Security Lab (BioSec) [6]. The Vivosonic Integrity V500 system was used to record the dataset in an office environment. During the collection process, ambient noise and conversations were not controlled, in an effort to mimic real-life scenarios. The protocol used for collecting TEOAE responses is noted in Table 3.1. There were no post-processing steps after the data was collected. Whole Wave Reproducibility (WWR) was used as the metric to ensure the quality of the TEOAE dataset. WWR measures how strongly two responses are correlated. The number of responses differs across ears, sessions, and subjects because each recording was stopped when the response reached a steady state (WWR > 90%). As shown in Figure 3.1, the length of each session varies from 23 to 336 responses. Medical studies [29] show that the time taken to reach steady state is highly dependent on the individual. The last ten responses are the steadiest because the response has reached a steady state.
Two sessions per individual were collected in total, and the time difference between the two
was a minimum of one week. Signals from both the left and right ear were collected for all
participants. TEOAE signals were measured with two microphones per ear in short succession,
and these values were saved into two buffers. In total, TEOAE responses for 54 individuals
were collected in the database. The BioSec TEOAE database is the only TEOAE database
collected for biometric security. The distribution for the data is presented in Figure 3.1.
Chapter 3. Methodology 12
The locations of the first ten and the last ten responses in a single recording session are shown in Figure 3.2. Figure 3.3 shows the first ten responses and Figure 3.4 shows the last ten responses. As can be seen in the figures, the first ten responses are very noisy compared to the last ten. Comparing Figure 3.3.a and Figure 3.3.c, the values of the same response in Buffers A and B are slightly different due to different sensor positions in the ear and the physical properties of the microphones. To mitigate the positioning issue and to reduce noise, we average the two buffers and use the mean response for training and testing. The averaged responses of the A and B buffers are shown in Figure 3.4.e and Figure 3.4.f. Figure 3.3.a and Figure 3.3.b show that the responses of the left ear and the right ear are different. Responses from different subjects can be seen in Figure 3.5.
Table 3.1: TEOAE recording protocol [17]

Stimulus Parameters
    STI-Mode: Non-Linear
    Click Interval: 21.12 ms
    Click Duration: 80 µs
    Sound Level: 80 dB peSPL

Test Control
    Record Window: 20 ms
    Low Pass Cut-Off: 6000 Hz
    High Pass Cut-Off: 750 Hz
    Artifact Rejection Threshold: 55 dB SPL
Figure 3.1: The distributions of the data in the TEOAE database
Figure 3.2: Location of the first and last 10 responses relative to the recording session. The length of each recording session differs between subjects and ranges from 23 to 336 responses.
(a) Left Ear Buffer A (b) Right Ear Buffer A
(c) Left Ear Buffer B (d) Right Ear Buffer B
(e) Left Ear Averaged (f) Right Ear Averaged
Figure 3.3: First 10 TEAOE responses for subject 1
(a) Left Ear Buffer A (b) Right Ear Buffer A
(c) Left Ear Buffer B (d) Right Ear Buffer B
(e) Left Ear Averaged (f) Right Ear Averaged
Figure 3.4: Last 10 TEAOE responses for subject 1
(a) Subject 40 Left Ear (b) Subject 40 Right Ear
(c) Subject 24 Left Ear (d) Subject 24 Right Ear
(e) Subject 26 Left Ear (f) Subject 26 Right Ear
Figure 3.5: Last 10 TEAOE averaged responses for multiple subjects
3.2 Pre-processing
The majority of signal pre-processing is done on the Vivosonic Integrity V500 sensor as it captures the data. The sensor removes the noise and the stimulus signal from the output. The TEOAE is recorded using two microphones and saved into two separate buffers, and the sensor outputs a vector of size 660. The responses in the two buffers are averaged to reduce noise, and we then pre-process by normalizing the averaged response. The normalization is done as follows:
x_i ∈ X
μ = (1/D) Σ_{i=1}^{D} x_i
σ = √( (1/D) Σ_{i=1}^{D} (x_i − μ)² )
Y = (X − μ) / σ                                        (3.1)
where X ∈ R^{660} is a raw TEOAE response, and D is the size of the input vector.
Normalization removes inconsistencies in the data, stabilizes neural networks, and speeds
up training. From a feature perspective, it prevents a model from learning the strength of a
response. The strength changes over time [29], and removing it helps a model generalize better
for multi-session data. Also, a TEOAE response is different between the left and right ear.
Without the normalization step, we would not be able to train a model using data from both
ears because the data distribution would be vastly different.
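The steps of Eq. 3.1 can be sketched as follows (a random vector stands in for a raw 660-sample TEOAE response):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=3.0, scale=5.0, size=660)  # stand-in raw response

mu = X.mean()                  # mean over the D = 660 samples
sigma = X.std()                # population standard deviation
Y = (X - mu) / sigma           # normalized response

# The normalized response has zero mean and unit variance,
# so the overall strength of the response is removed.
print(round(Y.mean(), 6), round(Y.std(), 6))
```

Because each response is normalized individually, responses from both ears end up on a comparable scale.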
3.3 Components of Neural Network
We present our architecture diagram for single ear authentication in Figure 3.6, and the building blocks for our architecture are presented in Figure 3.7. We first present the components of our neural network and the reasoning behind using them.
Figure 3.6: Proposed network architecture for single ear authentication. The number beside the convolution block denotes the dimension of the convolutional filter, K is the kernel size, and S is the stride of the max pooling layer.
Figure 3.7: The components of the convolution and the embedding block. K represents the kernel size, and P represents the padding size. /2 for the 1D-convolution means that the number of channel dimensions is reduced by half.
3.3.1 1D Convolutional Layer
A TEOAE response is a one-dimensional time signal, so our network uses one-dimensional convolutional layers. LeCun and Bengio were first to use a convolutional neural network for image, speech, and time-series data [32].
A convolutional layer imposes a structure of local connectivity and assumes that a local group of neurons matters more than neurons further away. This assumption allows us to structure neural networks in a way that reduces the number of trainable parameters. For time-series data, it forces networks to attend to data closer together in time rather than data further apart. The convolutional neural network learns location-invariant features. The convolutional layer operation is as follows:
h^k_{ij} = (W^k ∗ x)_{ij} + b^k
x_{ij} ∈ X_j
k = {1…K},  j = {1…J}                                  (3.2)
where X_j is the input vector, J is the number of training examples, and K is the output channel dimension. W ∈ R^{K×F_w} holds the convolutional filter weights, where F_w is the filter size, and b ∈ R^K is the bias vector. The output of the convolutional layer is H ∈ R^{J×D×K}. Figure 3.8 illustrates the convolution operation.
Figure 3.8: Example of a 1D-convolution. This diagram shows a 1-D convolution layer with a kernel size of 3, a stride of 1, and a padding of 1. The input to the layer is shown in grey: the original input in dark grey and the added padding in light grey. The kernel slides over the input, and the inner product between the input and the kernel is the output of the layer. The equation for calculating the output y₂ is y₂ = w₁x₁ + w₂x₂ + w₃x₃.
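The setting of Figure 3.8 can be reproduced with a short sketch (input and kernel values are made up; kernel size 3, stride 1, and zero padding 1, as in the figure):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # input
w = np.array([0.5, 1.0, -1.0])       # kernel weights w1, w2, w3
b = 0.0                              # bias

xp = np.pad(x, 1)                    # one zero of padding on each side
# Slide the kernel over the padded input (stride 1).
y = np.array([xp[i:i + 3] @ w + b for i in range(len(x))])

# y2 (y[1] with 0-indexing) matches the figure: w1*x1 + w2*x2 + w3*x3.
print(y[1] == w[0] * x[0] + w[1] * x[1] + w[2] * x[2])
```

With padding 1, kernel size 3, and stride 1, the output has the same length as the input, which is the configuration used throughout the convolution blocks.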
3.3.2 Rectified Linear Unit
Rectified Linear Unit (ReLU) [19] is the most commonly used non-linear activation function in neural network design [33, 49]. A non-linear activation function allows neural networks to learn a non-linear input-output relationship. The ReLU function is better than the tanh or sigmoid functions at reducing the vanishing gradient problem [40], which can be problematic when training large networks. The ReLU activation function is as follows:
h^l_{ij} = max(h^{l−1}_{ij}, 0)
h^{l−1}_{ij} ∈ H^{l−1}_j                               (3.3)
where H^{l−1}_j is the output from the previous layer, and h^l_{ij} and h^{l−1}_{ij} are scalar values. l denotes the current layer, and l − 1 denotes the previous layer. The graph in Figure 3.9 shows the output of the ReLU function for an input x.
Figure 3.9: Graph of the Rectified Linear Unit (ReLU) function
3.3.3 Layer Normalization
Similar to the normalization done in the pre-processing step, the output of each neural network layer is also normalized. As neural networks train, the distribution of outputs from each layer shifts considerably, causing a problem known as covariate shift [57]. Layer normalization [34] is a technique to normalize the output before the next layer; it calculates the mean and the variance on a per-sample basis. The normalization helps networks converge faster and achieve better performance. The mean and the variance of the layer outputs are calculated as follows:
μ_j = (1/D) Σ_{i=1}^{D} h^{l−1}_{ij}                   (3.4)
σ_j = √( (1/D) Σ_{i=1}^{D} (h^{l−1}_{ij} − μ_j)² )
j = {1…J},  i = {1…I}                                  (3.5)
where h^{l−1}_{ij} is the output from the previous layer, D is the input vector size, J is the size of a mini-batch, and I is the number of features. The output of the layer is calculated by:
h^l_{ij} = γ_j · (h^{l−1}_{ij} − μ_j) / √(σ_j² + ε) + β_j,   j = {1…J}        (3.6)
where γ_j and β_j are parameters learned through training that scale and shift the normalization, μ_j is calculated using Eq. 3.4, and σ_j is calculated using Eq. 3.5. The layer normalization step is illustrated in Figure 3.10.
Figure 3.10: Example of a layer normalization layer. Layer normalization calculates the mean and the variance along the feature dimension. The total number of features learned is 2 × mini-batch size.
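Eqs. 3.4-3.6 can be sketched per sample over the feature dimension (γ and β are learned in practice; they are fixed here for illustration, and the batch values are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)
H = rng.normal(size=(4, 660))       # J = 4 samples of a layer's output
gamma, beta, eps = 1.0, 0.0, 1e-5   # learned scale/shift; fixed here

mu = H.mean(axis=1, keepdims=True)        # per-sample mean (Eq. 3.4)
sigma = H.std(axis=1, keepdims=True)      # per-sample std (Eq. 3.5)
H_norm = gamma * (H - mu) / np.sqrt(sigma ** 2 + eps) + beta  # Eq. 3.6

# Each row now has (approximately) zero mean and unit variance.
print(np.allclose(H_norm.mean(axis=1), 0.0, atol=1e-9))
```

Note that the statistics are computed per sample, not per mini-batch, which is what distinguishes layer normalization from batch normalization.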
3.3.4 Max pooling
The number of parameters in neural networks needs to be reduced for faster computation, faster convergence, and smaller network size. A max pooling layer reduces the number of parameters by selecting the most important features. A filter of window size w is moved along the input by shifting its index by a stride s, and the maximum input value within the shifting window is chosen as the output. Max pooling reduces the total output dimension of a layer when s > 1 and makes neural networks locally scale invariant [39]. The equation for a max pooling layer is given by:
Y = max(h^{l−1}_{s·k}, …, h^{l−1}_{s·k+w})
k = {1…K}                                              (3.7)
K = ⌊(L − w)/s⌋ + 1                                    (3.8)
where L is the output dimension of h^{l−1}. An example of a max pooling layer is shown in Figure 3.11.
Figure 3.11: Example of a max pooling layer. This diagram shows a max pooling layer with a kernel size of 3, a stride of 1, and a padding of 1. The input to the layer is shown in grey: the original input in dark grey and the added padding in light grey. The kernel slides over the input, and the maximum value within the window is produced as the output.
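Eqs. 3.7-3.8 translate into a short sketch (window size w and stride s; the input values are made up):

```python
import numpy as np

def max_pool_1d(h, w, s):
    K = (len(h) - w) // s + 1                 # Eq. 3.8 output length
    # Take the maximum inside each shifted window (Eq. 3.7).
    return np.array([h[k * s:k * s + w].max() for k in range(K)])

h = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
pooled = max_pool_1d(h, w=2, s=2)             # s > 1 halves the size
print(pooled)
```

With w = 2 and s = 2 the output dimension is halved, which is how the embedding block reduces its feature count.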
3.3.5 Dropout
Dropout is a technique pioneered by Srivastava et al. [59] to prevent overfitting of a neural network. A dropout layer randomly sets some of the neurons to zero with probability p. When neural networks train, each neuron becomes highly dependent on other neurons to generate useful features. Dropout randomly breaks these dependencies and forces each neuron to become a better feature extractor itself. Dropping random neurons can also be seen as training the neural network in multiple configurations. To get the best inference performance, the outputs from all configurations should be averaged, but keeping track of all possible configurations is infeasible and computationally costly. The solution is to approximate this average by running the trained network without dropout and scaling its activations by the keep probability 1 − p.
Figure 3.12: How a standard neural network differs from a neural network with dropout [59]
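The two dropout modes can be sketched as follows (the activation values, size, and drop probability are arbitrary; the inference-time scaling by the keep probability follows the original Srivastava et al. formulation):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.5                                  # drop probability
h = rng.random(100_000)                  # stand-in layer activations

# Training mode: zero each neuron independently with probability p.
mask = rng.random(h.shape) >= p
h_train = h * mask

# Inference mode: keep all neurons, scale by the keep probability
# so the expected activation matches the training mode.
h_infer = h * (1 - p)

print(np.isclose(h_train.mean(), h_infer.mean(), atol=1e-2))
```

The scaled inference output matches the training output in expectation, which is the approximation to averaging all dropout configurations.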
3.3.6 Triplet Loss
The triplet loss [66] is very commonly used in face and person recognition tasks [13, 55] and
speaker identification tasks [9, 67]. The triplet loss [66] is used to train a neural network with weights w as an embedding function f_w : R^{D_in} → R^{D_embed}, where D_in is the input vector dimension and D_embed is the embedding vector dimension. A triplet consists of three samples x^a, x^p, and x^n, called the anchor, the positive, and the negative. The anchor and the positive are drawn from the same class, while the negative is drawn from a different class, so a minimum of two separate classes is needed to generate a triplet. The triplet loss trains a model to optimize the embedding space such that the Euclidean distance between the anchor embedding e^a and the positive embedding e^p is smaller than the distance between e^a and the negative embedding e^n. The embeddings e^a, e^p, e^n are computed by e = f_w(x), where e is the embedding and x is the data sample. The Euclidean distance function is denoted g. The loss is calculated by:
L_triplet = Σ_{i=1}^{N} [ g(e^a_i, e^p_i)² − g(e^a_i, e^n_i)² + α ]_+
y^a_i = y^p_i,   y^a_i ≠ y^n_i                         (3.9)
where y is the label for each embedding e, and α is the distance margin.
Figure 3.13 illustrates the triplet loss objective. The triplet configuration on the left shows the negative closer to the anchor than the positive; the loss of this configuration is a positive value. The configuration on the right shows the positive closer than the negative. As long as g(e^a, e^p) + α is smaller than g(e^a, e^n), the loss will be zero, as the goal of increasing the inter-class distance has been achieved.
One of the problems with the triplet loss objective is large intra-class variance. The loss function requires the inter-class distance to be maximized, but it does not explicitly set requirements on intra-class distances. Intra-class distance requirements are somewhat implicitly enforced, as an embedding with a large intra-class distance will also fail to maximize inter-class distances; however, the triplet loss will ignore intra-class distances as long as it can maximize inter-class distances. For a biometric security system, a small intra-class distance is essential, as it helps a network generalize better to data points it has not seen before. Since our model will operate on data that it has not been trained on, generalization is essential. To reduce the intra-class variance, we tried using a quadruplet loss [12] objective, but it resulted in lower authentication performance. Further discussion of the quadruplet loss is given in Appendix B.6.
Figure 3.13: Diagram for the triplet loss. A loss is incurred when the negative is closer to the anchor than the positive. The end goal of training is the configuration on the right.
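Eq. 3.9 for a single triplet can be sketched as follows (the embedding values are made up; g is the Euclidean distance):

```python
import numpy as np

def triplet_loss(ea, ep, en, alpha):
    d_ap = np.sum((ea - ep) ** 2)         # g(e^a, e^p)^2
    d_an = np.sum((ea - en) ** 2)         # g(e^a, e^n)^2
    return max(d_ap - d_an + alpha, 0.0)  # the [.]_+ hinge

ea = np.array([0.0, 0.0])                 # anchor embedding
ep = np.array([0.1, 0.0])                 # positive: same identity, close
en = np.array([1.0, 0.0])                 # negative: different identity, far

easy = triplet_loss(ea, ep, en, alpha=0.2)  # margin satisfied, zero loss
hard = triplet_loss(ea, ep, en, alpha=1.5)  # margin violated, positive loss
print(easy, hard)
```

The same triplet can be easy or hard depending on the margin α, which is why α is a hyperparameter worth tuning.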
3.3.7 Triplet Mining
Like most data sampling techniques used in machine learning practice, triplets can be picked at random. However, the number of triplet combinations grows rapidly with the number of samples, and computing all triplet pairs is computationally infeasible. As training progresses, more triplet pairs become perfectly separable and result in zero loss. These pairs are not helpful for neural network training, as there is nothing new to learn. Samples that do not maximize Eq. 3.9 or that result in zero loss are called easy triplets, and the opposite are called hard triplets. Training on easy triplet pairs results in a network with slow convergence and lower accuracy, as it cannot learn distinguishing features. To maximize training speed and achieve better results, picking hard triplets is essential.
There are two strategies for picking ideal triplets: the online and the offline strategy. The offline strategy uses the current state of the network to pick the next hard samples; it selects hard triplets by calculating the Euclidean distances of the embeddings across the whole dataset, and these triplets are used to calculate the loss and update the network weights. The online strategy randomly samples mini-batches from the dataset ahead of training and constructs hard triplets from each mini-batch during training. The offline strategy can pick harder negatives than the online strategy because it looks at a larger pool of data. However, picking the hardest global triplet may have adverse effects on training due to the possibility of mislabeled data; picking triplets inside a mini-batch mitigates this problem.
In this thesis, the online batch hard mining strategy for selecting triplet pairs is used. This selection method has been used to improve the convergence rate and accuracy in face re-identification tasks [55]. A mini-batch is constructed by sampling N responses from each of M identities. Additionally, N samples from identities that are not part of the M identities are randomly sampled and added to the mini-batch, so the final size of the mini-batch is (M + 1) × N samples. Firstly, the embeddings for all samples in a mini-batch are computed. Secondly, the Euclidean distances between all anchor-positive pairs and all anchor-negative pairs are computed. Thirdly, all combinations of (e^a, e^n) and (e^a, e^p) pairs that result in zero loss are removed, and the loss is calculated. The algorithm for choosing hard negatives is given in Algorithm 1.
We also experimented with a selection method that picked the hardest (e^a, e^p) pair in a mini-batch and combined it with the hardest (e^a, e^n) pair, but it did not yield any improvement. Enforcing another rule for picking the triplet pair limits the number of samples used to calculate the loss, and the limited training samples could explain the lack of improvement for this selection method.
3.4 Network Architecture
We build a convolutional block with the components discussed in the previous sections and stack these blocks to build our neural network architecture.
3.4.1 Convolution Block
The convolution block is designed by stacking a 1D-convolution, a ReLU, and a layer normalization layer. Figure 3.7 shows the convolution block architecture. The output from each layer is passed on to the next layer. The input to the block is of size R^{660×C_i} and the output of the
Algorithm 1: Online batch hard triplet selection

Input:  TEOAE embeddings e, labels Y, margin α
Output: triplets (e^a, e^p, e^n)

triplets = []
for label ∈ set(Y) do
    pos_pairs = all (anchor, positive) pair combinations for label
    neg_pairs = all (anchor, negative) pair combinations for label
    dist_pos  = anchor-positive pair distances
    dist_neg  = anchor-negative pair distances
    for pos_idx ∈ pos_pairs do
        losses = []
        for neg_idx ∈ neg_pairs do
            losses.append(dist_pos[pos_idx] − dist_neg[neg_idx] + α)
        end
        index = argmax(losses)
        if losses[index] > 0 then
            anchor   = pos_pairs[pos_idx].anchor
            positive = pos_pairs[pos_idx].positive
            negative = neg_pairs[index].negative
            triplets.append((anchor, positive, negative))
        end
    end
end
return triplets
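A minimal numpy sketch of the batch-hard idea in Algorithm 1 (illustrative one-dimensional embeddings; the pair bookkeeping is simplified to sample indices):

```python
import numpy as np

def batch_hard_triplets(emb, labels, alpha):
    # Pairwise Euclidean distances between all embeddings in the batch.
    dist = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    triplets = []
    for a in range(len(emb)):
        pos = [i for i in range(len(emb)) if labels[i] == labels[a] and i != a]
        neg = [i for i in range(len(emb)) if labels[i] != labels[a]]
        for p in pos:
            losses = [dist[a, p] - dist[a, n] + alpha for n in neg]
            best = int(np.argmax(losses))     # hardest negative for this pair
            if losses[best] > 0:              # discard easy (zero-loss) triplets
                triplets.append((a, p, neg[best]))
    return triplets

emb = np.array([[0.0], [0.1], [0.9], [1.0]])  # two well-separated identities
labels = [0, 0, 1, 1]
hard = batch_hard_triplets(emb, labels, alpha=1.0)
print(hard)
```

With a small margin these clusters yield no triplets at all; a larger margin (here α = 1.0) keeps the hardest negative for every anchor-positive pair.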
block is R^{660×C_o}, where C_i is the number of input channels, and C_o is the number of filters used in the 1D-convolutional layer.
The order of the layers inside the block was chosen based on experiments. The input is passed through a convolutional layer and a non-linearity. The non-linearity modifies the output values and causes the mean and variance to shift. Intuitively, it makes sense to apply normalization as the last step before passing the output to the next layer, but based on our experiments, the order of the non-linearity and the normalization layer does not seem to matter.
3.4.2 Embedding Block
The embedding block is designed to generate an embedding of size R^{D_embed} from an input of size R^{660×C_in}. A max pooling layer is first used to reduce the number of features by half, and a 1D-convolution layer is used to reduce the number of channels by half. Reducing the number of neurons in the layer before the fully connected layer reduces the size of the model, because a fully connected layer has weights connecting every input to every output. The output of the 1D-convolution layer is then passed through a ReLU, a layer normalization layer, and a fully connected layer to generate an embedding. Figure 3.7 shows the structure of an embedding block. The dimension of the embedding block output, D_embed, is a hyperparameter that requires tuning.
3.4.3 Parameter Sharing
The TEOAE dataset contains responses from both the left and right ears. Medical studies
show that the responses from both ears are different even when measured within the same
session [29]. However, a structural similarity of the ears suggests that there may be similarities
between responses. These similarities allow neural network architectures to be optimized.
Previous methods were effective because CWT was an excellent feature extractor for TEOAE responses, so a newly designed feature extractor needs to be more powerful than CWT to increase the performance of the biometric system. For feature extraction using CWT, a CWT scale and a mother wavelet had to be tuned. Instead of choosing a single CWT scale, multiple scales can be combined to eliminate tuning, but our experiments show that combining multiple scales reduces performance due to extra noise in the features. To increase generalization, a TEOAE response feature extractor that does not need to be tuned to the dataset is required. In theory, a neural network feature extractor should outperform CWT because the neural network is optimized to extract features from the TEOAE responses.
A TEOAE feature extractor can be trained using an encoder-decoder scheme. However, this
scheme is not ideal as it requires multiple training steps. Firstly, an encoder-decoder network
needs to be trained. Secondly, the encoder needs to be separated and placed on top of another
network that can generate TEOAE embeddings. Because a single network cannot effectively
learn the TEOAE structure of both ears, two embedding networks need to be trained from
the encoder output. In total, this scheme would require training three different networks. To
reduce the number of training steps, we design our architecture so it can be trained in one step.
Following the multi-task learning strategies discussed in the literature review, the proposed network architecture has two sections: the common feature section and the individual feature section. The common feature section is a TEOAE feature extractor; its weights are shared between both ears, it learns features common to both ears, and it replaces the CWT feature extractor used in previous methods. The individual feature section has separate sets of weights and is designed to learn features unique to each ear.
The common feature section reduces the number of parameters in the neural network by sharing them, and it allows the network to be trained using data from both ears. Training with more data increases accuracy and generalization, as shown by Sun et al. [61]. In total, the number of parameters is reduced from 2·N_common + 2·N_ind to N_common + 2·N_ind, where N_common is the number of trainable parameters in the common feature section and N_ind is the number of trainable parameters in the individual ear section.
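With made-up block sizes (not the thesis's actual counts), the saving from sharing the common section works out as follows:

```python
# Hypothetical parameter counts, chosen only to illustrate the formula.
n_common = 120_000   # trainable parameters in the common feature section
n_ind = 40_000       # trainable parameters in one individual ear section

separate_networks = 2 * n_common + 2 * n_ind  # one full network per ear
shared_network = n_common + 2 * n_ind         # common section shared

print(separate_networks - shared_network)     # saving equals n_common
```

The saving is exactly one copy of the common section, and it grows with however much of the network the two ears can share.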
3.4.4 Full Architecture
The diagram for the architecture is given in Figure 3.6. The arrows show the direction of the
output. A normalized TEOAE response is used as an input to the network. The first three
blocks are part of the common feature section and the two blocks after the division are part of
the individual feature section.
After the common feature section processes the input, the network has two pathways to compute the embeddings. The pathway is chosen depending on which ear a TEOAE response was collected from. This double-pathway architecture forces the network to learn the separate distributions for each ear. The two convolutional blocks that come after the split are part of the individual feature section. After the individual feature section, the embedding block is used to compute the final embedding. The output of the convolution layer in the embedding block is of dimension R^{330×C_o}. This output is flattened into a one-dimensional vector of size 330 × C_o and passed to a fully connected layer.
3.4.5 Fusion Architecture
The diagram for the fusion architecture is shown in Figure 3.14. The fusion architecture combines the left and right ear pathway outputs of the neural network by concatenating them into a one-dimensional vector. Each embedding block produces a vector of size R^{D_embed}, so the concatenated output is of size R^{2·D_embed}. We choose this architecture because using one neural network trained for both ears is adverse to the authentication performance. The concatenated one-dimensional vector is reduced to the final output of dimension R^{D_embed} using a fully connected layer.
Figure 3.14: Proposed network architecture for fusion of both ears. The number beside the conv block denotes the dimension of the convolutional filter, K is the kernel size, and S is the stride of the max pooling layer.
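The fusion step can be sketched with random stand-in values (D_embed and the fully connected weights here are assumptions for illustration, not the trained model's):

```python
import numpy as np

rng = np.random.default_rng(4)
d_embed = 64
e_left = rng.normal(size=d_embed)     # left-ear pathway embedding
e_right = rng.normal(size=d_embed)    # right-ear pathway embedding

concat = np.concatenate([e_left, e_right])           # size 2 * D_embed
# Fully connected layer projecting back down to D_embed.
W = rng.normal(size=(d_embed, 2 * d_embed)) / np.sqrt(2 * d_embed)
b = np.zeros(d_embed)
e_fused = W @ concat + b                             # final fused embedding

print(concat.shape, e_fused.shape)
```

Keeping the fused output at D_embed means the same template-matching machinery can be used for single-ear and fused embeddings.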
3.5 Training
This section discusses the training process of our neural network model. We discuss the charac-
teristics of a GPU, the hyperparameters of the model, the neural network weight initialization
technique, and the mini-batch sampling process.
3.5.1 Graphics Processing Unit
Training neural networks requires a GPU to perform the computation efficiently. The rise of deep neural networks is due to the growth in computing power brought by GPUs. A single computation core on a GPU is slower than one on a CPU, but the number of cores on a GPU is much larger. The increased number of cores allows parallelization, which speeds up neural network training.
The biggest bottleneck in using a GPU is the data transfer time between the CPU and the GPU. To reduce this transfer time, GPUs have separate onboard memory. The amount of memory on a GPU is far smaller than the amount available to a CPU: a typical GPU has between 8 and 32 Gigabytes (GB) of internal memory, whereas CPU memory can be in the hundreds of GB. The GPU memory and its cores are connected with a High Bandwidth Memory (HBM) bus to achieve maximum throughput, while the data connection between the CPU and the GPU uses a Peripheral Component Interconnect Express (PCIe) bus. The current-generation PCIe bus is about 30 times slower than the current-generation HBM2 memory bus, so tuning the data transfer between CPU memory and GPU memory is essential for maximum performance.
When the computational complexity is low, transferring data in small amounts can starve a GPU of data and reduce its throughput. The main hyperparameter that controls the data transfer between the GPU and the CPU is the mini-batch size. Transferring larger chunks of data can be more efficient than transferring smaller ones due to the reduced overhead required to facilitate each transfer. Multiple mini-batches can be combined into a bigger batch and transferred to the GPU to increase performance.
When training neural networks, GPU memory is predominantly used by data, models, and gradients. Gradient information for every operation of a neural network is saved and used by the backpropagation algorithm to update the weights. The memory footprint grows when the size of a mini-batch increases because more data is stored on the GPU. A large mini-batch also increases the amount of computation, and as a by-product increases the amount of gradient information saved on the GPU. The size of a model similarly increases memory consumption: firstly, the model itself needs to be stored on the GPU; secondly, a bigger model requires more operations, which produce more gradient information.
3.5.2 Hyperparameter Explanation
There are numerous hyperparameters in our neural network, as discussed in the sections above. The dataset hyperparameters and the network hyperparameters have to be tuned to achieve good generalization and authentication performance.
Dataset Hyperparameters
The size of a mini-batch is an important dataset hyperparameter, as it determines the number of triplet pairs the network processes at once. The values of M identities and N samples are chosen to optimize training time and authentication performance. M and N have to be large enough that sufficient hard triplet pairs can be selected, but small enough to fit into GPU memory. When choosing these values, the amount of memory available on the GPU determines the upper bound, and the quality of hard triplet pairs in a mini-batch determines the lower bound.
Convolution Block Hyperparameters
The majority of the hyperparameters in the convolution block tune the performance of the 1D-convolution layer. The kernel size, stride, padding, and number of output channels are the hyperparameters to be optimized. These values are chosen to reduce the effort required to stack multiple blocks. As discussed above, deep neural networks gain performance by stacking more blocks, and keeping the input and output channels constant makes stacking easier. The padding is calculated by:

P = (K - 1) / 2    (3.10)
where P is the amount of padding, and K is the kernel size. We keep the stride at one so that the convolution filter slides over the whole input vector. With these constraints, only the kernel size and the output channel size are left to be tuned. These constraints are common in recent deep neural network architectures such as the VGG network [58] and ResNet [24]. As discussed in the literature review, a random hyperparameter search is used to find the kernel size and the output channel size.
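The padding rule of Eq. 3.10 can be illustrated with a minimal NumPy sketch (not the thesis's implementation, which uses a deep learning framework); the function name and the 3-tap filter are illustrative:

```python
import numpy as np

def same_pad_conv1d(x, kernel):
    """1-D convolution with P = (K - 1)/2 zero padding (Eq. 3.10) and stride 1,
    so the output length equals the input length."""
    k = len(kernel)
    p = (k - 1) // 2
    xp = np.pad(x, p)                      # zero-pad both ends with P samples
    return np.array([np.dot(xp[i:i + k], kernel) for i in range(len(x))])

x = np.random.default_rng(0).normal(size=660)   # one TEOAE response vector
out = same_pad_conv1d(x, np.array([0.25, 0.5, 0.25]))
print(out.shape)                           # (660,)
```

With the padding tied to the kernel size and the stride fixed at one, blocks can be stacked without recomputing feature lengths, which is the stacking convenience described above.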
Embedding Block Hyperparameters
The embedding block is tuned similarly to the convolution block but has the embedding size as an additional hyperparameter. The neural network embeds a TEOAE response vector of size 660 into a smaller-dimensional space. The embedding space needs to be large enough to separate the classes, but small enough to allow efficient computation. The embedding size is the most crucial hyperparameter for the network because it impacts the class separability in the embedding space [55]. The 1D-convolution layer in the embedding block is tuned similarly to the convolution block: the kernel size is chosen by random search, but the output channel size is set to half the number of input channels to reduce the number of trainable parameters.
Training Hyperparameters
The number of blocks for the common feature section and the individual feature section needs to be tuned to determine the final architecture. Choosing too few blocks for each section results in a model that does not learn, and choosing too many results in a model that overfits.
The number of epochs to train the network is another hyperparameter that needs to be tuned along with the learning rate. A learning rate that is too small will cause the network to get stuck in local minima, while one that is too large may cause the network to bounce around the loss surface without converging.
3.5.3 Hyperparameter Selection
Each hyperparameter configuration was tested at least five times to compute the mean and the variance of the results. The number of blocks in the common feature section and the individual feature section, the number of output channels of the one-dimensional convolutional layer, and the embedding size of the network were explored in depth. The parameters were found by random search and fine-tuned by hand. Table 3.2 shows the number of trainable parameters for each block of the network and the hyperparameters chosen for each layer. Table 3.3 shows the chosen training hyperparameters.
Table 3.2: Network hyperparameters. Size in and size out are shown as features × channels. The kernel for a convolution block is shown as kernel size × channels; max pooling is shown with kernel size k and stride s; the embedding block is described as kernel size × channels, embedding size.
Layer Name     Section             Size In    Size Out   Kernel         # of Params
Conv Block 1   Common Feature      660 × 1    660 × 64   3 × 64         8,448
Conv Block 2   Common Feature      660 × 64   660 × 64   3 × 64         20,544
Conv Block 3   Common Feature      660 × 64   660 × 64   3 × 64         20,544
Conv Block 4   Common Feature      660 × 64   660 × 64   3 × 64         20,544
Conv Block 5   Individual Feature  660 × 64   660 × 32   3 × 32         10,272
Conv Block 6   Individual Feature  660 × 32   660 × 32   3 × 32         7,200
Conv Block 7   Individual Feature  660 × 32   660 × 32   3 × 32         7,200
Max Pooling    Individual Feature  660 × 32   329 × 32   k = 3, s = 2   0
Embed Block    Embedding           329 × 32   128        3 × 16, 128    677,392
Total                                                                   1,484,208
Table 3.3: List of training hyperparameters and their values

Hyperparameter                   Value
Optimizer                        AMSGrad/ADAM
Learning Rate                    0.001
MiniBatch Size                   125
MiniBatch (Number of classes)    25
MiniBatch (Samples per class)    5
Epochs                           30
3.5.4 Weight Initialization
The filter weights are randomly initialized because the optimal filter weights are not known. Training problems caused by bad weight initialization tend to fall into two categories: vanishing or exploding gradients [4]. The vanishing gradient problem occurs when a network does not have enough gradient information to update its weights; it happens when the initial weights are close to zero. The exploding gradient problem occurs when the gradients are too large to make meaningful updates to the network; it happens when the initial weights are large. Weights are randomly sampled from a normal distribution with zero mean and a standard deviation calculated by:

σ = √(2 / Nparam)    (3.11)

where Nparam is the number of trainable parameters in a layer. This initialization step stabilizes network training and allows the network to generalize across different random seeds and different dataset splits. The same initialization scheme has been used by Schroff et al. to initialize a neural network [24].
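A minimal NumPy sketch of the initialization described above, reading Eq. 3.11 as σ = √(2/Nparam); the function name is illustrative, and note that the classic He initialization divides by the layer fan-in rather than the total parameter count:

```python
import numpy as np

def init_layer(shape, rng):
    """Draw weights from N(0, sigma) with sigma = sqrt(2 / N_param) (Eq. 3.11)."""
    n_param = int(np.prod(shape))          # trainable parameters in the layer
    sigma = np.sqrt(2.0 / n_param)
    return rng.normal(0.0, sigma, size=shape)

rng = np.random.default_rng(0)
w = init_layer((64, 3), rng)               # e.g. a bank of 64 filters of length 3
print(w.shape)                             # (64, 3)
```

Scaling the spread inversely with the layer size keeps the initial weights neither near zero (vanishing gradients) nor large (exploding gradients).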
3.5.5 Mini-batch Sampling
In this section, we discuss the implementation details for selecting the mini-batches in the single-ear and fusion-of-both-ears scenarios.
Single Ear
For the single-ear scenario, the training dataset is first divided into a left and a right dataset, and the mini-batches for each ear are constructed from the corresponding dataset. For every mini-batch, 5 TEOAE responses from each of 25 individuals are randomly sampled. An additional 5 responses are randomly sampled from individuals outside the chosen 25 to increase variability in the mini-batch. Since the left and right mini-batches are constructed independently with random sampling, there is no guarantee that all identities in the left mini-batches also appear in the right mini-batches.
As mentioned in Chapter 3.1, the number of samples per identity differs due to the collection process, so the training dataset is unbalanced, with each class having a different number of responses. Two schemes for constructing mini-batches were tested. The first scheme samples the mini-batch without replacement: we disregard the class-size difference and pick every sample in a class until no samples are left. The second scheme samples every identity with replacement until its number of samples matches that of the identity with the most samples. The results of the two schemes were similar, but the first scheme had lower computational overhead, so it was chosen.
The total numbers of samples in the left and right datasets were also different, which resulted in a small difference between the numbers of left and right mini-batches. The performance impact of this difference is minimal, as all samples are used in the training process given a sufficient number of iterations and random permutations.
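The 25-identity × 5-response sampling step can be sketched as follows (a simplified, hypothetical layout; the extra 5 negatives from outside the chosen identities are omitted for brevity):

```python
import random

def sample_minibatch(dataset, m=25, n=5, rng=None):
    """Draw N=5 responses from each of M=25 identities, without replacement.
    `dataset` maps identity -> list of responses (an illustrative layout)."""
    rng = rng or random.Random(0)
    identities = rng.sample(sorted(dataset), m)
    return [(ident, resp)
            for ident in identities
            for resp in rng.sample(dataset[ident], n)]

# Toy dataset: 30 identities with 12 responses each.
toy = {f"id{k:02d}": [f"resp_{k:02d}_{j}" for j in range(12)] for k in range(30)}
batch = sample_minibatch(toy)
print(len(batch))                          # 125 = 25 identities x 5 responses
```

The resulting 125-sample mini-batch matches the mini-batch size listed in Table 3.3.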
Fusion of Both Ears
Fusion of both ears requires a different method for generating mini-batches because a single training sample requires both the left- and right-ear TEOAE responses. Firstly, the training datasets for the left and right ears are combined into a joint dataset. Secondly, 5 responses from each of 25 individuals are randomly sampled from the joint dataset; the sampled set contains a mix of left and right TEOAE responses. Lastly, for each sampled response, a response from the opposite ear of the same individual is randomly sampled to create a mini-batch with responses from both ears.
The test dataset was constructed by concatenating the last ten responses from the second
session for both ears. Since the left and right ear responses were collected at different times,
there is no time overlap between the two ears. We test the fusion scenario by combining the
steadiest ten samples from both ears.
3.5.6 Training Procedure
The network is trained in two stages. Firstly, the loss is computed using the left-channel mini-batches. Secondly, the right-channel loss is computed using the right-channel mini-batches. The losses from the two channels are summed and backpropagated to update the weights of the network. Reversing the order of the two stages makes no difference because the backpropagation happens at the same time for both channels.
We also tested training the left channel for a full iteration and then training the right channel for a full iteration. The network suffered from what is known as catastrophic forgetting [50]: after the left iteration was complete, the common feature section forgot what it had learned for the right channel. A general solution to this problem is still being investigated [27].
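The summed-loss update described above can be illustrated with a toy example; the quadratic "losses" below are stand-ins for the real triplet losses, and all names are illustrative:

```python
import numpy as np

def loss_and_grad(w, target):
    """Stand-in quadratic loss playing the role of a channel's triplet loss."""
    diff = w - target
    return float(diff @ diff), 2.0 * diff   # L2 loss and its gradient w.r.t. w

w = np.array([1.0, 1.0])                    # shared (common-feature) weights

l_left,  g_left  = loss_and_grad(w, np.array([0.0, 1.0]))  # stage 1: left mini-batch
l_right, g_right = loss_and_grad(w, np.array([1.0, 0.0]))  # stage 2: right mini-batch
total_loss = l_left + l_right               # losses are summed ...
w = w - 0.1 * (g_left + g_right)            # ... and backpropagated in ONE update
print(total_loss, w)                        # 2.0 [0.8 0.8]
```

Because both channel gradients enter the same update, the shared weights move toward a compromise between the two ears, which is why the order of the two stages does not matter.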
In this chapter, we discussed the dataset and the neural network implementation details. In
the next chapter, we present the various templating methods, comparison metrics, and biometric
security system architecture.
Chapter 4
Templates and Comparison Metrics
In this chapter, we discuss different methods to create identity templates and the distance metrics used to accept or reject an identity. We also discuss how the verification and identification systems are implemented.
4.1 Identity Templates
Identity templates are the user information created when registering a user on a biometric security system. They are used to determine the identity of a probe-sample, i.e., a sample presented to the biometric system for authentication. The distance or probability between a probe-sample and a template is measured to decide whether to accept or reject the user. In this section, we discuss the pros and cons of two templating methods: the Mean template and the SVM template.
4.1.1 Mean Template
The Mean template is the easiest templating method to implement. It is created by averaging the TEOAE embeddings generated by fw, as discussed in Section 3.3.6. By calculating the average, we locate the centroid of the embeddings. The mean template is generated by:

T = (1/N) Σ_{i=1}^{N} e_i    (4.1)
where N is the number of samples used for the enrollment session and e_i is the embedding of response i generated by fw. This templating method only requires data from one identity; no information from other identities or extra training is required.
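A minimal sketch of Eq. 4.1 in NumPy; the enrollment data and function name are illustrative:

```python
import numpy as np

def mean_template(embeddings):
    """Eq. 4.1: the centroid T = (1/N) * sum_i e_i of the enrollment embeddings."""
    return np.mean(np.asarray(embeddings), axis=0)

# Toy enrollment: N = 4 hypothetical 128-dimensional embeddings from fw.
enroll = np.random.default_rng(0).normal(size=(4, 128))
T = mean_template(enroll)
print(T.shape)                              # (128,)
```

The template has the same dimensionality as a single embedding, so the comparison metrics of Section 4.2 apply to it directly.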
4.1.2 SVM Template
The support vector machine (SVM) [15] is a machine learning technique commonly used for classification or regression. The goal of the method is to find a linear boundary that maximizes the margin between two classes of multi-dimensional data. The separating hyperplane is described by wx + b = 0, where x is the input, b is the bias, and w is the learned weight vector. When wx + b ≥ 0 the sample is labeled as the +1 class, and when wx + b < 0 the model labels it as the −1 class. The hyperplane is found by solving the following constrained optimization problem:

minimize_{w,b,ξ}  ‖w‖² + C Σ_{i=1}^{N} ξ_i
subject to  y_i(w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

where ξ_i is a slack variable that relaxes the constraint because the data may not be completely separable by the hyperplane, x_i is the i-th training sample (here, the embedding generated by fw for response i), y_i is the training label, N is the number of samples, and the parameter C controls the trade-off between maximizing the margin and minimizing the training loss.
Since an SVM separates classes with a hyperplane, it is not useful on its own for data that is not linearly separable. A specialized SVM kernel can be used to project the data into a higher-dimensional space, where it may become separable by a hyperplane. SVMs are binary classifiers but can be extended to multi-class classification by training multiple one-vs-one or one-vs-all classifiers.
The SVM template is created by training a one-vs-all SVM classifier. Linear SVM classifiers are known to improve verification performance in biometric systems [16, 30, 47, 48]. We train an SVM classifier for every identity against all other identities. When registering a new individual, data for all previously registered individuals is required for training; as the number of registered identities grows, so does the amount of data required to train a template.
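A minimal scikit-learn sketch of the one-vs-all SVM templating scheme (not the thesis's exact implementation; the toy data, identity names, and separating shifts are all illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 8))              # toy embeddings for 3 enrolled identities
ids = np.repeat(np.array(["alice", "bob", "carol"]), 20)
emb[ids == "alice"] += 3.0                  # shift the toy classes apart
emb[ids == "bob"] -= 3.0

# One linear SVM per identity, trained against ALL other identities; every new
# enrollment therefore needs the stored embeddings of everyone already registered.
templates = {}
for person in np.unique(ids):
    clf = LinearSVC(C=1.0)
    clf.fit(emb, (ids == person).astype(int))   # one-vs-all labels
    templates[person] = clf

probe = emb[0]                              # a probe-sample from "alice"
scores = {p: float(c.decision_function(probe[None, :])[0])
          for p, c in templates.items()}
print(max(scores, key=scores.get))          # alice
```

The retraining cost is visible in the loop: every classifier is refit over the full enrollment database, which is exactly the scaling problem discussed in the comparison below.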
4.1.3 Comparison
The SVM template has the potential to produce better authentication results than the Mean template because all the identities are known to the system, which lets the SVM template increase the inter-class distance between registered individuals. However, as the number of registered individuals grows, the authentication performance of the SVM template might suffer because the data points may no longer be linearly separable.
The main problem with the SVM template is that it needs to be retrained whenever a new identity is added. When the number of registered identities is low, the SVM template can be useful: retraining templates is quick, and an SVM model may separate the data efficiently. As the number of identities grows, dataset size and training time become significant problems. The Mean template does not require retraining every time a new identity is registered.
An SVM template cannot be trained with only one identity because SVM models are designed to separate two classes with a maximum margin; without negative identities, the boundary cannot exist. A predefined negative dataset is needed to train a model for a single-user verification system. The Mean template, on the other hand, can simply set a distance threshold to reject probe-samples that are far from the registered template.
The Mean template method depends on the embedding function fw to generate embeddings that are separable by Euclidean distance. The database does not need to store previous registration data because training a Mean template does not require data points from other identities. An embedding function that accounts for various intra-class differences is a prerequisite for the Mean template method. Training a better embedding function requires more data, so when the training dataset is small, the Mean template method might not be an option.
4.2 Comparison Metrics
This section describes the distance functions used to calculate the similarity between a template and a probe-sample. These functions are used to calculate distances for the CWT/LDA method and the Mean template method. The closer a probe-sample is to a template, the higher the probability that the probe-sample comes from the same identity as the template.
4.2.1 Euclidean Distance

Euclidean distance is the most common distance metric, and it is used by the triplet loss to optimize the embedding space. It calculates the straight-line distance between two points:

d(X, Y) = √( Σ_{i=1}^{n} (x_i − y_i)² )    (4.2)
where X and Y are the vectors being compared and n is the dimension of the comparison
vectors.
4.2.2 Cosine Distance
Cosine similarity is commonly used to compare embeddings. It is the cosine of the angle between two vectors: when the two vectors have the same orientation the value is 1, and when they are orthogonal it is 0. We calculate the cosine similarity by:

similarity = (X · Y) / (‖X‖ ‖Y‖)    (4.3)

where X and Y are the two vectors being compared. We then convert the similarity into a distance by computing:

d(X, Y) = 1 − similarity    (4.4)
4.2.3 Pearson Distance
The Pearson correlation distance measures the linear correlation between two vectors. The correlation coefficient is 0 when the two vectors are uncorrelated, −1 when they are negatively correlated, and 1 when they are positively correlated. It is calculated by:

ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y)    (4.5)

where X and Y are the two vectors being compared, and σ_X and σ_Y are their standard deviations. The Pearson distance is then:

d(X, Y) = 1 − ρ_{X,Y}    (4.6)

where ρ_{X,Y} is the Pearson correlation coefficient.
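The three metrics above can be sketched directly in NumPy (function names are illustrative):

```python
import numpy as np

def euclidean_distance(x, y):               # Eq. 4.2: straight-line distance
    return float(np.sqrt(np.sum((x - y) ** 2)))

def cosine_distance(x, y):                  # Eqs. 4.3-4.4: 1 - cos(angle)
    similarity = float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    return 1.0 - similarity

def pearson_distance(x, y):                 # Eqs. 4.5-4.6: 1 - correlation
    return 1.0 - float(np.corrcoef(x, y)[0, 1])

a = np.array([1.0, 0.0, 1.0])
print(euclidean_distance(a, a))             # 0.0
d_cos = cosine_distance(a, a)               # ~0: identical orientation
d_pear = pearson_distance(a, 2 * a)         # ~0: perfectly correlated
```

All three return 0 for a perfect match and grow as the probe-sample moves away from the template, so a single acceptance threshold can be applied to any of them.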
4.3 System Architecture
There are three main modes in a biometric security system: enrollment, identification, and verification. In this section, we discuss each of the three modes.
4.3.1 Enrollment
The enrollment mode registers an individual into the system. The goal of the system is to recognize the individuals who are enrolled and reject imposters. The enrollment process requires the TEOAE responses and the identity of the user registering with the system. Multiple responses are collected during registration to create an accurate template.
Figure 4.1 shows the enrollment process for both the SVM template and the Mean template.
For both methods, responses are first acquired using the sensor and pre-processed by normaliz-
ing them. The embedding e is computed from the normalized response using the neural network
which acts as the embedding function fw.
The SVM template uses both the embeddings from the new responses and the embeddings of the identities stored in the enrollment database. The trained template is saved in the template database with the identity of the individual as the primary key, and the new responses used during registration are saved in the enrollment database. For the Mean template, the average of the embeddings is computed and then saved to the template database.
4.3.2 Verification
The verification mode is a one-to-one matching mode. An individual first claims a certain identity, and the system then checks the presented probe-sample against the template saved for that identity. This mode is commonly used on smartphones: when someone presents a fingerprint, the system assumes that the owner of the phone is attempting to gain access, checks the probe-sample against the owner's fingerprint template, and accepts or rejects it based on a decision threshold. The decision threshold can be set per identity or for the overall system.
Determining the threshold requires careful inspection of the EER graph and the Receiver Operating Characteristic (ROC) curve. The two metrics are explained in Section 5.2.
Figure 4.2 shows the operation of the verification system. First, a claimed identity and TEOAE responses are collected from an individual. The identity template is retrieved from the database, and the embedding generated from the response is compared against the template. The system decides to accept or reject the identity by using the comparison metrics discussed in the sections above or the SVM class probability.
4.3.3 Identification
The identification mode is a classification mode. The system is given a probe-sample with an unknown identity and has to find the identity if it is registered in the system; it also has to reject a probe-sample that is not registered. Testing a system on a dataset containing only individuals known to the system is called a closed-set problem; testing on a dataset containing individuals unknown to the system is called an open-set problem. Biometric identification is commonly seen in crime scene investigation, where investigators match a fingerprint found at a crime scene to one registered in a database. The system checks a probe-sample against every registered template to find the best match, so identification is more time-consuming than verification: it has to make N comparisons, where N is the number of templates registered in the system.
Figure 4.3 shows the identification system. A TEOAE response with an unknown identity is
presented to the system. The system pre-processes the response and computes an embedding.
The template that is the closest to the embedding is chosen as the identity for the given response.
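The nearest-template decision, together with the open-set rejection described above, can be sketched as follows (Mean templates with Euclidean distance; names and threshold are illustrative):

```python
import numpy as np

def identify(probe, templates, threshold=None):
    """Return the identity of the nearest template; in the open-set case,
    reject the probe when even the best match is farther than `threshold`."""
    dists = {pid: float(np.linalg.norm(probe - t)) for pid, t in templates.items()}
    best = min(dists, key=dists.get)
    if threshold is not None and dists[best] > threshold:
        return None                         # unknown identity: reject
    return best

templates = {"alice": np.array([0.0, 0.0]), "bob": np.array([4.0, 4.0])}
print(identify(np.array([0.5, 0.2]), templates))                   # alice
print(identify(np.array([10.0, 10.0]), templates, threshold=3.0))  # None
```

The loop over `templates` is the N-comparison cost that makes identification slower than verification.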
Figure 4.1: System diagram for enrolling a new user. SVM template enrollment is shown on the top, and enrollment using the Mean template on the bottom.
Figure 4.2: System diagram for verification of a probe-sample. Verification using the SVM template is shown at the top, and the Mean template at the bottom.
Figure 4.3: System diagram for identifying a probe-sample. Identification using the SVM template is shown at the top, and the Mean template at the bottom.
Chapter 5
Experiments
In this chapter, we discuss the tests used to compare the different methods and present identification and verification results for the single ear and for the fusion of both ears.
5.1 Experimental Setting
There are a total of 54 subjects in the TEOAE dataset, with two sessions per subject. The dataset is divided into three parts: a training set, a template set, and a test set. The training set is used to train the neural network; it is not used in the CWT-based system because the CWT feature extractor does not require training. The template set is used to build a template for each subject, and the test set is used as probe-samples. We test the system in verification and identification scenarios. Within each scenario, three different training, template, and test sets are used to test different aspects of our neural network model: the 54 subject test, the 24 subject test, and the 10 subject test. This section describes each test.
5.1.1 54 subject test
Liu and Hatzinakos [36] used the 54 subject test to evaluate their biometric authentication system. The training set was constructed from the first-session responses of all subjects after removing the last ten responses of each subject. The last ten responses were removed to ensure that responses used in the training set are not reused in the template set. The template set was created from the last ten responses of each subject in the first session, and the last ten responses in the second session were used as the test set. This test includes the same subjects in both the training set and the test set; we use it to see how well our model can separate classes it has seen before.
5.1.2 24 subject test
The 24 subject test has been used by Liu et al. [37] and Majid et al. [30]. This test divides the subjects, rather than the sessions, into a training set and a test set. The training set includes responses from both sessions of 30 subjects, and the template and test sets are created from the remaining 24 subjects. The template set includes the last ten responses from the first session, and the test set includes the last ten responses from the second session. The number of samples for this test and the 54 subject test are roughly equal, as shown in Table 5.1. We use the 24 subject test to see how the methods perform for subjects that are not in the training set.
5.1.3 10 subject test
The 10 subject test is similar to the 24 subject test but with more subjects in the training set. The training set includes responses from both sessions of 44 subjects. The template set and the test set include the last ten responses for the remaining 10 subjects: the last ten responses from the first session are used as the template set, and the last ten responses from the second session as the test set. This test has more training data and subjects than the 24 subject test; by comparing the two, we can evaluate how each method performs when trained with more data. We also compare the results of the CWT and Mean template methods on this test to determine which method is better at extracting TEOAE features for authentication.
5.1.4 Dataset Generalization
The 24 subject test and the 10 subject test can have different training and testing data depending on how the subjects are split. A single train-test split cannot capture the variability of our neural network model's performance, so our neural network is tested across multiple dataset splits.
Table 5.1: Number of responses in the training set, averaged across 20 different data splits. The training set of the 24 subject test includes responses from 30 subjects, and the training set of the 10 subject test includes responses from 44 subjects. The data splits and the number of responses in the training set for each split are shown in Table C.2 and Table C.1.
Test               Dataset Size
54 Subject test    25,602
24 Subject test    27,528.7
10 Subject test    40,793.7
We test the methods using 20 different dataset splits. Firstly, 20 integers between one and one hundred were picked to be used as random seeds for initializing the neural network. Secondly, the dataset is split into a training set and a test set based on values produced by the random number generator seeded with each random seed. The same splits are used to test all other methods. The dataset splits and random seeds are shown in Table C.2 for the 24 subject test and in Table C.1 for the 10 subject test. The 54 subject test is only tested with one split because its dataset does not vary.
5.1.5 Neural Network Generalization
The weights of the neural network are randomly sampled, as described in the sections above. The stochastic nature of neural networks needs to be tested to demonstrate generalization. We test neural network generalization by initializing the network with multiple random seeds. The CWT and CWT/LDA methods are not stochastic, so we test them only once for the 54 subject test.
5.1.6 Tested Methods
We test our method against three other methods. Liu and Hatzinakos [36] developed two of them: CWT and CWT/LDA. In verification mode, both methods use CWT to generate TEOAE features from the responses; for the CWT/LDA method, an LDA model is trained on top of the CWT features. Both methods use the Pearson correlation distance to accept or reject an identity. In identification mode, instead of the Pearson correlation distance, a multinomial logistic regression model is trained to identify individuals. The logistic regression model outputs a probability for each identity registered in the system, and the identity with the highest probability is chosen as the output of the model.
The third method, called the No-FS method, is used as a baseline for verification scenarios. It combines features that have been shown to work for time-signal-based biometric modalities and trains a linear SVM model in a one-vs-all configuration. The features used are CWT, Short-Time Fourier Transform (STFT), autocorrelation, maximum standard deviation, kurtosis, skewness, and cepstrum features, as recommended by [1, 2, 11, 30, 42, 44]. Majid et al. [30] used this method as a baseline. The size of the resulting feature vector is 7522.
5.2 Metrics
In this section, we discuss the performance metrics used to evaluate the verification and identification modes.
5.2.1 Verification
We use the EER as the metric for the verification mode. The EER is the rate at which the False Acceptance Rate (FAR) and the False Rejection Rate (FRR) are equal. The FAR is calculated by:

FAR = False Acceptances / (False Acceptances + True Rejections)    (5.1)

where a false acceptance is a negative sample classified as positive, and a true rejection is a negative sample classified as negative. The FAR is therefore the ratio of falsely accepted negative samples to the total number of negative samples. The FRR is calculated by:

FRR = False Rejections / (False Rejections + True Acceptances)    (5.2)

where a false rejection is a positive sample classified as negative, and a true acceptance is a positive sample classified as positive. The FRR is the fraction of positive samples that are falsely rejected.
The system accepts or rejects a sample based on a probability threshold: samples with probability above the threshold are accepted, and samples below it are rejected. When the threshold is zero, all samples are accepted; the FAR is 100% because every negative sample is accepted, and the FRR is 0% because no positive samples are rejected. As the threshold increases, the FAR decreases while the FRR increases. The FAR approaches 0% as the threshold grows because no samples are accepted, and the FRR approaches 100% because all positive samples are rejected. The FAR and FRR meet at some threshold, and their common value is the EER. We show the EER graphs for the experiments in Figure 5.1. The system with the lower EER is considered better.
The EER test set contains P × N responses from the second session, where P is the number of identities and N is the number of responses per identity. For each of the P identities, we calculate distances against (P − 1) × N negative responses and N positive responses. We combine these P × P × N distances and calculate the EER.
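The threshold sweep described above can be sketched as a small function over genuine and imposter score arrays (the toy score distributions are illustrative):

```python
import numpy as np

def equal_error_rate(genuine, imposter):
    """Sweep a decision threshold over all observed scores and return the point
    where FAR (imposters accepted) and FRR (genuine samples rejected) are closest."""
    best_far, best_frr = 1.0, 0.0
    for t in np.sort(np.concatenate([genuine, imposter])):
        far = float(np.mean(imposter >= t))  # accepted negative samples
        frr = float(np.mean(genuine < t))    # rejected positive samples
        if abs(far - frr) < abs(best_far - best_frr):
            best_far, best_frr = far, frr
    return (best_far + best_frr) / 2.0

rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 500)          # similarity scores, same identity
imposter = rng.normal(0.0, 1.0, 500)         # similarity scores, different identity
print(round(equal_error_rate(genuine, imposter), 3))
```

Perfectly separated score distributions give an EER of 0, while fully overlapping ones give 50%, which matches the threshold behavior described above.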
5.2.2 Identification
For identification, accuracy is the most commonly used metric for comparing authentication performance [45]. We also present the Cumulative Match Characteristic (CMC) curve to evaluate the methods based on multi-rank prediction. The accuracy test set contains P × N responses. The template with the smallest distance to a probe response is taken as the identity of the response, and accuracy is calculated from how many of these identities are correctly identified. The system with the higher accuracy is considered better.
5.3 Single Ear Results
We present the results for verification mode and identification mode for the single ear.
5.3.1 Verification
We present our results for the verification scenarios in Tables 5.2–5.4. For the CWT/LDA [36]
and CWT [36] methods, we chose two CWT scales based on their performance: the scale that
achieves the minimum left-ear EER and the scale that achieves the minimum right-ear EER.
When a single scale achieves the minimum EER for both ears, only that scale is presented.
A combination of CWT scales from one to ten (Multi-scale) is also presented for the CWT and
CWT/LDA methods. The No-FS method is presented as a baseline. A subset of the graphs used
for the EER calculation is shown in Figure 5.1.
The results show that our method outperforms previous methods in some tests, especially
for the left ear. Table 5.2 presents the results for the 54 subject test. For the left ear,
the Mean template method performs better than previous methods, with an EER of 4.28%. For
the right ear, the CWT/LDA method performed the best. Results for the 24 subject test are
presented in Table 5.3. The CWT/LDA (Multi-scale) method outperformed all other methods for
both ears. Finally, Table 5.4 shows the results for the 10 subject test. Our SVM template
method outperformed the other methods for both the left and right ears. With a smaller number
of registered subjects the SVM template outperforms the Mean template, but as the number of
registered subjects grows, the trend reverses.
In previous studies [30, 36], the EER of the left ear was shown to be much lower than that
of the right ear. Our neural-network-based method shows a smaller performance difference
between the two ears. This is due to the optimization criterion of our model: during
training, we stop when the sum of the left-ear and right-ear training losses reaches a
minimum. Since the model is optimized for both ears jointly rather than individually, the
performance difference between them is smaller.
The results of the Mean template and CWT methods for the 24 and 10 subject tests show that
the neural-network TEOAE feature extractor performs better than the generic CWT feature
extractor. The results also expose a limitation of CWT as a feature extractor: the scale
that produces the best result changes from test to test, and combining multiple CWT scales
performs worse than a single optimized scale.
Table 5.2: Verification performance of different methods for 54 subject test

Method                       Left EER (mean ± std)   Right EER (mean ± std)
Ours (SVM template)          5.63% ± 1.21%           2.96% ± 1.03%
Ours (Mean template)         4.28% ± 1.19%           2.38% ± 0.796%
CWT/LDA (Multi-scale) [36]   7.47%                   3.70%
CWT/LDA (scale=7) [36]       4.44%                   1.85%
CWT (Multi-scale) [36]       9.92%                   8.89%
CWT (scale=8) [36]           9.26%                   9.44%
CWT (scale=9) [36]           10.30%                  6.85%
No-FS                        38.3% ± 1.04%           41.7% ± 0.78%
Table 5.3: Verification performance of different methods for 24 subject test

Method                       Left EER (mean ± std)   Right EER (mean ± std)
Ours (SVM template)          5.60% ± 3.04%           4.16% ± 2.16%
Ours (Mean template)         7.44% ± 3.28%           4.82% ± 1.96%
CWT/LDA (Multi-scale) [36]   4.20% ± 2.95%           2.92% ± 1.84%
CWT/LDA (scale=6) [36]       4.89% ± 2.47%           4.42% ± 2.49%
CWT (Multi-scale) [36]       9.47% ± 3.26%           8.57% ± 2.32%
CWT (scale=8) [36]           8.70% ± 2.74%           8.84% ± 2.48%
No-FS                        38.7% ± 2.91%           41.2% ± 3.22%
Table 5.4: Verification performance of different methods for 10 subject test

Method                       Left EER (mean ± std)   Right EER (mean ± std)
Ours (SVM template)          2.89% ± 4.87%           3.63% ± 4.41%
Ours (Mean template)         5.85% ± 5.38%           4.87% ± 3.84%
CWT/LDA (Multi-scale) [36]   8.81% ± 6.49%           4.61% ± 4.38%
CWT/LDA (scale=8) [36]       6.08% ± 4.74%           4.37% ± 3.97%
CWT (Multi-scale) [36]       10.5% ± 8.03%           10.8% ± 7.50%
CWT (scale=8) [36]           8.72% ± 5.55%           8.89% ± 5.44%
CWT (scale=9) [36]           10.5% ± 8.03%           7.88% ± 5.44%
No-FS                        39.16% ± 5.21%          39.6% ± 6.38%
(a) Left Ear 54 subject test (b) Right Ear 54 subject test
(c) Left Ear 24 subject test (d) Right Ear 24 subject test
(e) Left Ear 10 subject test (f) Right Ear 10 subject test
Figure 5.1: Verification scenario EER graphs for subject 0
5.3.2 Identification
We show our experimental results for all identification tests in Tables 5.5–5.7. As in the
verification results, we present the two CWT scales for CWT/LDA [36] and CWT [36] that
achieve the highest left-ear and right-ear accuracy. Only one scale is shown if the same
scale achieved the highest accuracy for both ears. CWT/LDA (Multi-scale) is also added for
comparison.
Our proposed methods outperform the CWT/LDA and CWT methods in all identification scenarios.
For the left ear, the accuracy was 91.4% for the 54 subject test, 89.7% for the 24 subject
test, and 94.7% for the 10 subject test. For the right ear, the accuracy was 96.3%, 93.2%,
and 96.6%, respectively. The CWT/LDA method had accuracies in the low-to-mid 80% range. The
CMC curves are presented in Figure 5.2. Our methods achieve better rank-1 results than the
CWT/LDA method; CWT/LDA has low rank-1 accuracy but reaches similar performance by rank 3.
The CWT method does not perform well.
Comparing the Mean template and CWT results for the 24 and 10 subject tests, we can draw
the same conclusion as in the verification scenario: the feature extractor trained as a
neural network is better than CWT. We again observe that the best-performing scale changes
from test to test. Our neural network method generalizes well across different training-set
sizes and shows smaller variance than previous methods, suggesting that it is more robust
to dataset splits than CWT-based feature extraction.
Table 5.5: Identification performance of different methods for 54 subject test

Method                       Left Accuracy (mean ± std)   Right Accuracy (mean ± std)
Ours (SVM)                   91.4% ± 2.22%                96.3% ± 1.36%
Ours (Mean)                  89.8% ± 2.81%                94.5% ± 1.61%
CWT/LDA (Multi-scale) [36]   78.5%                        80.7%
CWT/LDA (scale=6) [36]       85.2%                        81.5%
CWT/LDA (scale=10) [36]      83.3%                        88.5%
CWT (Multi-scale) [36]       58.3%                        53.0%
CWT (scale=10) [36]          58.3%                        59.6%
Table 5.6: Identification performance of different methods for 24 subject test

Method                       Left Accuracy (mean ± std)   Right Accuracy (mean ± std)
Ours (SVM)                   89.7% ± 6.84%                92.3% ± 3.80%
Ours (Mean)                  87.6% ± 7.99%                93.2% ± 4.51%
CWT/LDA (Multi-scale) [36]   86.6% ± 5.84%                92.0% ± 5.19%
CWT/LDA (scale=6) [36]       84.3% ± 6.67%                84.3% ± 7.07%
CWT/LDA (scale=7) [36]       83.5% ± 5.64%                85.6% ± 6.82%
CWT (Multi-scale) [36]       83.5% ± 5.64%                85.6% ± 6.82%
CWT (scale=10) [36]          68.8% ± 5.69%                75.4% ± 5.24%
Table 5.7: Identification performance of different methods for 10 subject test

Method                       Left Accuracy (mean ± std)   Right Accuracy (mean ± std)
Ours (SVM)                   93.8% ± 11.4%                94.7% ± 10.1%
Ours (Mean)                  94.7% ± 10.1%                96.6% ± 4.44%
CWT/LDA (Multi-scale) [36]   83.9% ± 11.7%                88.2% ± 8.55%
CWT/LDA (scale=9) [36]       84.6% ± 11.5%                89.9% ± 8.14%
CWT (Multi-scale) [36]       78.25% ± 14.9%               82.7% ± 12.2%
CWT (scale=10) [36]          51.8% ± 17.0%                60.3% ± 14.8%
(a) Left Ear 54 subject test (b) Right Ear 54 subject test
(c) Left Ear 24 subject test (d) Right Ear 24 subject test
(e) Left Ear 10 subject test (f) Right Ear 10 subject test
Figure 5.2: CMC curve for single ear identification scenario
5.4 Both ears
We present the results for verification and identification using both the left- and right-ear
TEOAE responses. We test our method against the CWT/LDA (Mul-score) and CWT (Mul-score)
methods proposed by Liu and Hatzinakos [36]. For both methods, the CWT scales proposed by
Liu and Hatzinakos were used.
5.4.1 Verification
We present our verification results in Table 5.8. Both of our methods performed better than
the CWT/LDA method in the 54 subject test and the 10 subject test. The Mean template method
achieved an EER of 0.187% for the 54 subject test, and the SVM template method achieved an
EER of 1.18% for the 10 subject test. The CWT/LDA method outperformed our method in the
24 subject test.
As with the single-ear results, the performance in the ear-fusion scenario increases with
the number of subjects and the number of data points used for training; the number of
subjects appears to be the more significant factor. The same relationship between the number
of subjects and performance is seen in the CWT/LDA and CWT methods proposed by Liu and
Hatzinakos [36].
The same trend between the SVM template and the Mean template holds as in the single-ear
verification results: the SVM template outperforms the Mean template with a small number of
registered subjects but is worse with a larger number.
5.4.2 Identification
We present our identification results in Table 5.9. Our method performs better for the 54
and 24 subject tests, achieving 99.3% for the 54 subject test and 94.9% for the 24 subject
test, with higher accuracy and lower variance. For the 10 subject test, our method has
considerably lower variance, showing that it is more robust to changes in the TEOAE dataset
and its distribution. The CMC graphs are shown in Figure 5.3.
Table 5.8: Verification performance of different methods for fusion of ear scenario

Method         Subjects   EER (mean ± std)
Ours (SVM)     54         0.414% ± 0.564%
Ours (Mean)    54         0.187% ± 0.146%
CWT/LDA [36]   54         0.604%
CWT [36]       54         7.22%
Ours (SVM)     24         2.64% ± 1.67%
Ours (Mean)    24         3.99% ± 1.71%
CWT/LDA [36]   24         1.27% ± 1.07%
CWT [36]       24         6.11% ± 3.18%
Ours (SVM)     10         1.18% ± 2.50%
Ours (Mean)    10         3.71% ± 3.91%
CWT/LDA [36]   10         2.15% ± 2.15%
CWT [36]       10         6.42% ± 6.09%
Table 5.9: Identification performance of different methods for fusion of ear scenario

Method         Subjects   Accuracy (mean ± std)
Ours (SVM)     54         99.3% ± 1.04%
Ours (Mean)    54         98.6% ± 1.07%
CWT/LDA [36]   54         92.8%
CWT [36]       54         63.1%
Ours (SVM)     24         94.6% ± 3.34%
Ours (Mean)    24         95.0% ± 4.40%
CWT/LDA [36]   24         91.1% ± 4.88%
CWT [36]       24         69.4% ± 4.88%
Ours (SVM)     10         95.4% ± 6.81%
Ours (Mean)    10         95.2% ± 5.97%
CWT/LDA [36]   10         97.7% ± 10.2%
CWT [36]       10         78.2% ± 10.3%
(a) 54 subject test
(b) 24 subject test
(c) 10 subject test
Figure 5.3: CMC curve for fusion ear identification scenario
5.5 Training Time Comparison
We compare the training time of our neural network method with that of the CWT/LDA method.
For the neural network, we measured the time taken to train the model for 30 epochs. The
lowest training error occurs before the 30th epoch, but the specific epoch is not known
before training; in most experiments, 30 epochs were enough for the model to reach its
lowest error. The computation was done on a machine with an IBM Power8 CPU and an NVIDIA
P100 GPU. The training time for each test and the training data size are shown in Table 5.10.
For the CWT/LDA method, only the time to train the LDA models was measured; the training
time of the logistic regression used in the identification scenario was negligible. The
CWT/LDA method requires multiple retraining steps because a new model must be trained every
time a new subject is registered to the system. Registering N subjects therefore requires
N − 1 training steps. The time to train these N − 1 LDA models was measured on a Power8 CPU
with 96 threads. The training times are presented in Table 5.11.
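The incremental-registration cost can be illustrated with a short sketch. The helper below is hypothetical (the actual features and pipeline in [36] differ); it only demonstrates why registering N subjects triggers N − 1 full LDA training runs, since discriminant analysis needs at least two classes.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def register_all(features_by_subject):
    """Refit LDA every time a new subject is enrolled.

    features_by_subject : list of (n_samples, n_features) arrays, one per subject.
    Returns the final fitted model and the number of training runs performed.
    """
    models = []
    X_parts, y_parts = [], []
    for subj_id, feats in enumerate(features_by_subject):
        X_parts.append(feats)
        y_parts.append(np.full(len(feats), subj_id))
        if subj_id >= 1:  # LDA needs at least two classes, so training starts at subject 2
            X = np.vstack(X_parts)
            y = np.concatenate(y_parts)
            models.append(LinearDiscriminantAnalysis().fit(X, y))
    return models[-1], len(models)
```

Each iteration refits on the full enrolled set, which is why the measured CWT/LDA training time in Table 5.11 grows quickly with the number of subjects.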
We can see that the CWT/LDA method trains much faster than the neural network, due to its
smaller training set and lower computational cost. For both methods, the training time
increases with the size of the data.
Test Name         Data Size   Training Time (s)
10 Subject Test   40,793.7    1126.67
24 Subject Test   27,528.7    762.26
54 Subject Test   25,602      834.41

Table 5.10: Training time for neural networks with different data sizes

Test Name         Data Size   Training Time (s)
10 Subject Test   100         0.30
24 Subject Test   240         2.75
54 Subject Test   540         16.24

Table 5.11: Training time for the CWT/LDA method with different data sizes
5.6 Inference Time Comparison
Inference computation time was measured on a machine with IBM PowerPC 8 CPU with
NVidia P100 GPU. The CWT/LDA approach is dependent on the CPU processing power, and
the neural network approach is mainly dependent on the GPU processing power. For both
methods, the time was measured from pre-processing to producing the probability or distance
metric.
For the neural network, we used a combination of the NumPy and PyTorch libraries to perform
pre-processing and inference. For the CWT/LDA method, the measured time covers the CWT, the
LDA transform, and the logistic regression inference for one CWT scale. We used the pywt
Python library for the continuous wavelet transform and scikit-learn for LDA and logistic
regression. The pywt library does not provide the Daubechies 5 mother wavelet for the CWT,
so the Gauss 3 mother wavelet was used to approximate the CWT computing time. The results
are shown in Table 5.12.
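As a rough illustration of the feature-extraction step being timed, the CWT with the gaus3 wavelet over scales one to ten can be computed with pywt as follows. The signal length and variable names are our assumptions, not the thesis implementation.

```python
import time
import numpy as np
import pywt

# Hypothetical stand-in for a short TEOAE recording (e.g. 750 samples).
response = np.random.randn(750)
scales = np.arange(1, 11)  # scales 1 to 10, as in the multi-scale tests

start = time.perf_counter()
# pywt's CWT does not offer the Daubechies 5 wavelet, so 'gaus3'
# approximates the feature-extraction cost here.
coeffs, _ = pywt.cwt(response, scales, 'gaus3')
elapsed = time.perf_counter() - start

features = coeffs.ravel()  # flatten scales x time into one feature vector
```

`coeffs` has shape (number of scales, signal length); stacking many scales this way is also how the "CWT image" inputs mentioned in Appendix B were formed.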
The inference time for the neural network is longer than that of the CWT/LDA method. The
biggest bottleneck for the neural network method is the inference itself, while for CWT/LDA
it is the pre-processing and feature extraction.
Method    Pre-Processing (µs)   Inference (µs)   Total (µs)
Ours      211.2                 1312.6           1523.8
CWT/LDA   657.0                 4.169            661.2

Table 5.12: Inference time comparison between the neural network and CWT/LDA
Chapter 6
Conclusions
In this thesis, we focused on building a multi-session TEOAE biometric identification and
verification system. We first introduced the data collection process and the attributes of
the TEOAE response, discussed previous research in TEOAE biometric security, and presented
the current state of the art in other biometric systems using neural networks.
Previous methods used CWT to extract features from the data and built their models on these
features. The hyperparameters required for CWT feature extraction are a mother wavelet and a
scale. The CWT method overfits to the dataset, and the authors of previous works have
recommended research into a method better than CWT. This thesis focused on removing the
dependency on CWT feature extraction.
We also focused on reducing the number of parameters to keep the model small. Past work on
one-shot learning, siamese networks, and multi-task learning was applied to design an
efficient neural network. The implementation took advantage of the common structure between
the TEOAE responses of the left and right ears: parts of the network share parameters and
learn this commonality, while other parts learn the distributions specific to each ear.
For a biometric system, it is essential that new identities can be registered with ease.
Registration can be challenging because only a limited number of biometric samples can be
collected, and a difficult registration process frustrates users. Neural networks designed
for classification must be retrained for each new identity, which is slow and difficult,
making them a poor fit for biometric systems. Instead, we designed a neural network architecture
that produces an embedding of a TEOAE response. The similarity, or distance, between two
embeddings was used to identify and verify individuals. To train the model we used the
triplet loss objective, which penalizes the model when a negative-class sample is closer to
the anchor than a positive-class sample in the embedding space. Training a triplet-loss
network requires an algorithm that picks hard triplets for better performance and faster
convergence; we used an online batch-hard negative mining strategy to select only the
triplets that maximize the loss.
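The batch-hard mining rule can be sketched as follows. This is a minimal NumPy illustration in the spirit of Hermans et al. [25]; the thesis implementation uses PyTorch and may differ in detail.

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Batch-hard triplet loss on a batch of embeddings.

    For every anchor, the hardest positive (farthest same-label sample) and
    hardest negative (closest different-label sample) in the batch are mined,
    and the hinge-style triplet loss is averaged over all anchors.
    """
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)  # pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(same, np.inf, dist).min(axis=1)
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()
```

When every identity's embeddings sit at least `margin` closer to each other than to any other identity, the loss is zero, which is exactly the separability in Euclidean space that the templates rely on.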
We also discussed different templating methods for registering a user, presenting the pros
and cons of each strategy along with different distance functions that measure the similarity
between templates and probe samples.
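For instance, the Mean template strategy and two common distance functions can be sketched as below. This is illustrative only; the SVM template strategy instead fits a per-subject classifier on the enrollment embeddings, which is not shown here.

```python
import numpy as np

def mean_template(enroll_embeddings):
    """Mean template: the average of a subject's enrollment embeddings,
    later matched against probe embeddings by a distance function."""
    return enroll_embeddings.mean(axis=0)

def euclidean(a, b):
    """Euclidean distance between two embeddings."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """Cosine distance: 0 for parallel embeddings, up to 2 for opposite ones."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A probe is then verified by thresholding its distance to the claimed identity's template, or identified by picking the template with the smallest distance.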
Combining all of these ideas, we built the neural network model, trained it, and tested it
on the TEOAE responses collected at the University of Toronto by the Biometric Security Lab.
We presented results for the verification and identification scenarios, with three different
tests for each scenario, designed to evaluate model generalization. We also compared results
between using a response from one ear and using responses from both ears.
Our method outperforms previous results in both single-ear and fusion-ear identification
scenarios, and produces comparable or slightly better results in verification scenarios.
The test results indicate that there is potential for further improvement given more training
samples per identity and more identities.
Future work should focus on increasing the size of the dataset and reducing the TEOAE
response acquisition time. Increasing generalization through more data, together with a
shorter acquisition time, would be a major step towards making the system more practical.
Testing the TEOAE authentication method under different stimulus signals would help verify
that it works under different conditions and make the system more robust. Designing a system
that works across different stimulus signals, to mitigate the risk of stolen biometric
information, would also be significant. Further work should also be done on reducing the
size of the neural network model so that it is computationally more efficient.
Bibliography
[1] F. Agrafioti and D. Hatzinakos. Ecg based recognition using second order statistics. In 6th
Annual Communication Networks and Services Research Conference (cnsr 2008), pages
82–87, May 2008.
[2] N. Armanfard, M. Komeili, J. P. Reilly, and L. Pino. Vigilance lapse identification us-
ing sparse eeg electrode arrays. In 2016 IEEE Canadian Conference on Electrical and
Computer Engineering (CCECE), pages 1–4, May 2016.
[3] Pouya Bashivan, Irina Rish, Mohammed Yeasin, and Noel Codella. Learning rep-
resentations from EEG with deep recurrent-convolutional neural networks. CoRR,
abs/1511.06448, 2015.
[4] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient
descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.
[5] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. J.
Mach. Learn. Res., 13:281–305, February 2012.
[6] BioSec. Medical biometric databases, 2015.
[7] Avishek Joey Bose and Parham Aarabi. Adversarial attacks on face detectors using neural
net based constrained optimization. CoRR, abs/1805.12302, 2018.
[8] Léon Bottou. Online algorithms and stochastic approximations. In David Saad, editor,
Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998.
revised, oct 2012.
[9] H. Bredin. Tristounet: Triplet loss for speaker turn embedding. In 2017 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5430–5434, March
2017.
[10] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Proceedings of the
6th International Conference on Neural Information Processing Systems, NIPS’93, pages
737–744, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc.
[11] Paul Chambers, Neil J. Grabham, Matthew A. Swabey, Mark Lutman, Neil White, John
Chad, and Stephen Beeby. A comparison of verification in the temporal and cepstrum-
transformed domains of transient evoked otoacoustic emissions for biometric identification.
3:246 – 264, 06 2011.
[12] W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep quadruplet
network for person re-identification. ArXiv e-prints, April 2017.
[13] De Cheng, Yihong Gong, Sanping Zhou, Jinjun Wang, and Nanning Zheng. Person re-
identification by multi-channel parts-based cnn with improved triplet loss function. In The
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[14] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with
application to face verification. In 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’05), volume 1, pages 539–546 vol. 1, June 2005.
[15] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Mach. Learn., 20(3):273–
297, September 1995.
[16] H. P. da Silva, A. Fred, A. Lourenço, and A. K. Jain. Finger ecg signal for user authen-
tication: Usability and performance. In 2013 IEEE Sixth International Conference on
Biometrics: Theory, Applications and Systems (BTAS), pages 1–8, Sept 2013.
[17] J. Gao, F. Agrafioti, S. Wang, and D. Hatzinakos. Transient otoacoustic emissions for
biometric recognition. In 2012 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pages 2249–2252, March 2012.
[18] Jiexin Gao. Towards a unified signal representation via empirical mode decomposition,
Nov 2012.
[19] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural net-
works. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the
Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15
of Proceedings of Machine Learning Research, pages 315–323, Fort Lauderdale, FL, USA,
11–13 Apr 2011. PMLR.
[20] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Ex-
amples. ArXiv e-prints, December 2014.
[21] P. Goyal, P. Dollar, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch,
Y. Jia, and K. He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. ArXiv
e-prints, June 2017.
[22] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant
mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’06), volume 2, pages 1735–1742, June 2006.
[23] J.W. Hall. Handbook of Otoacoustic Emissions. Singular Thomson Learning, 2000.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. CoRR, abs/1512.03385, 2015.
[25] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for
person re-identification. CoRR, abs/1703.07737, 2017.
[26] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift. ArXiv e-prints, February 2015.
[27] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan,
J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran,
and R. Hadsell. Overcoming catastrophic forgetting in neural networks. ArXiv e-prints,
December 2016.
[28] G. R. Koch. Siamese neural networks for one-shot image recognition. 2015.
[29] Krzysztof M. Kochanek, Lech K. Śliwa, Klaudia Puchacz, and Adam Piłka, 2015.
[30] M. Komeili, W. Louis, N. Armanfard, and D. Hatzinakos. Feature selection for nonstation-
ary data: Application to human recognition using medical biometrics. IEEE Transactions
on Cybernetics, 48(5):1446–1459, May 2018.
[31] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q.
Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–
1105. Curran Associates, Inc., 2012.
[32] Yann Lecun and Yoshua Bengio. Convolutional networks for images, speech, and time-
series. MIT Press, 1995.
[33] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature,
521(7553):436–444, 2015.
[34] J. Lei Ba, J. R. Kiros, and G. E. Hinton. Layer Normalization. ArXiv e-prints, July 2016.
[35] W. F. Leung, S. H. Leung, W. H. Lau, and A. Luk. Fingerprint recognition using neural
network. In Neural Networks for Signal Processing Proceedings of the 1991 IEEE Workshop,
pages 226–235, Sep 1991.
[36] Y. Liu and D. Hatzinakos. Earprint: Transient evoked otoacoustic emission for biometrics.
IEEE Transactions on Information Forensics and Security, 9(12):2291–2301, Dec 2014.
[37] Y. Liu and D. Hatzinakos. Human acoustic fingerprints: A novel biometric modality for
mobile security. In 2014 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 3784–3788, May 2014.
[38] D. Masters and C. Luschi. Revisiting Small Batch Training for Deep Neural Networks.
ArXiv e-prints, April 2018.
[39] J. Nagi, F. Ducatelle, G. A. Di Caro, D. Cireșan, U. Meier, A. Giusti, F. Nagi, J. Schmidhuber, and L. M. Gambardella. Max-pooling convolutional neural networks for vision-based
hand gesture recognition. In 2011 IEEE International Conference on Signal and Image
Processing Applications (ICSIPA), pages 342–347, Nov 2011.
[40] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann
machines. In Proceedings of the 27th International Conference on International Conference
on Machine Learning, ICML’10, pages 807–814, USA, 2010. Omnipress.
[41] Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled:
High confidence predictions for unrecognizable images. CoRR, abs/1412.1897, 2014.
[42] I. Odinaka, P. H. Lai, A. D. Kaplan, J. A. O’Sullivan, E. J. Sirevaag, S. D. Kristjansson,
A. K. Sheffield, and J. W. Rohrbaugh. Ecg biometrics: A robust short-time frequency
analysis. In 2010 IEEE International Workshop on Information Forensics and Security,
pages 1–6, Dec 2010.
[43] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary
DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differ-
entiation in pytorch. 2017.
[44] S. Pathoumvanh, S. Airphaiboon, B. Prapochanung, and T. Leauhatong. Ecg analysis for
person identification. In The 6th 2013 Biomedical Engineering International Conference,
pages 1–4, Oct 2013.
[45] P. Jonathon Phillips, Alvin F. Martin, Charles L. Wilson, and Mark A. Przybocki. An
introduction to evaluating biometric systems. IEEE Computer, 33:56–63, 2000.
[46] M.M. Al Rahhal, Yakoub Bazi, Haikel AlHichri, Naif Alajlan, Farid Melgani, and R.R.
Yager. Deep learning approach for active classification of electrocardiogram signals. Infor-
mation Sciences, 345:340 – 354, 2016.
[47] P. S. Raj and D. Hatzinakos. Feasibility of single-arm single-lead ecg biometrics. In 2014
22nd European Signal Processing Conference (EUSIPCO), pages 2525–2529, Sept 2014.
[48] P. S. Raj, S. Sonowal, and D. Hatzinakos. Non-negative sparse coding based scalable access
control using fingertip ecg. In IEEE International Joint Conference on Biometrics, pages
1–6, Sept 2014.
[49] P. Ramachandran, B. Zoph, and Q. V. Le. Searching for Activation Functions. ArXiv
e-prints, October 2017.
[50] Roger Ratcliff. Connectionist models of recognition memory: constraints imposed by learn-
ing and forgetting functions. Psychological review, 97 2:285–308, 1990.
[51] Sebastian Ruder. An overview of multi-task learning in deep neural networks. CoRR,
abs/1706.05098, 2017.
[52] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal repre-
sentations by error propagation. Technical report, California Univ San Diego La Jolla Inst
for Cognitive Science, 1985.
[53] Sara Sabour, Yanshuai Cao, Fartash Faghri, and David J Fleet. Adversarial manipulation
of deep representations. In International Conference on Learning Representations (ICLR),
2016.
[54] Robin Tibor Schirrmeister, Jost Tobias Springenberg, Lukas Dominique Josef Fiederer,
Martin Glasstetter, Katharina Eggensperger, Michael Tangermann, Frank Hutter, Wolfram
Burgard, and Tonio Ball. Deep learning with convolutional neural networks for brain
mapping and decoding of movement-related information from the human EEG. CoRR,
abs/1703.05051, 2017.
[55] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding
for face recognition and clustering. CoRR, abs/1503.03832, 2015.
[56] S. N. A. Seha and D. Hatzinakos. Human recognition using transient auditory evoked
potentials: a preliminary study. IET Biometrics, 7(3):242–250, 2018.
[57] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90:227–244, October 2000.
[58] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. CoRR, abs/1409.1556, 2014.
[59] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut-
dinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of
Machine Learning Research, 15:1929–1958, 2014.
[60] Statista. Global unit sales of headphones and headsets from 2013 to 2017 (in millions),
2018.
[61] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unrea-
sonable effectiveness of data in deep learning era. CoRR, abs/1707.02968, 2017.
[62] Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing
obama: Learning lip sync from audio. ACM Trans. Graph., 36(4):95:1–95:13, July 2017.
[63] Matthew A. Swabey, Paul Chambers, Mark E. Lutman, Neil M. White, John E. Chad,
Andrew D. Brown, and Stephen P. Beeby. The biometric potential of transient otoacoustic
emissions. Int. J. Biometrics, 1(3):349–364, March 2009.
[64] MedLife Technologies. ascreen tiny oae device, 2018.
[65] F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang. Joint learning of single-image and cross-
image representations for person re-identification. In 2016 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 1288–1296, June 2016.
[66] Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance metric learning for
large margin nearest neighbor classification. In In NIPS. MIT Press, 2006.
[67] Chunlei Zhang and Kazuhito Koishida. End-to-end text-independent speaker verification
with triplet loss on short utterances. In Proc. Interspeech 2017, pages 1487–1491, 2017.
[68] W. L. Zheng, J. Y. Zhu, Y. Peng, and B. L. Lu. Eeg-based emotion classification using
deep belief networks. In 2014 IEEE International Conference on Multimedia and Expo
(ICME), pages 1–6, July 2014.
Appendix A
Performance
A.1 File Size Comparison
A small model size is essential for running the neural network on a mobile phone. We used
the PyTorch neural network framework [43] to compute and save the model; file sizes are
based on PyTorch's save format. The file sizes of networks trained with different
hyperparameters are shown in Table A.1. As more blocks and channels are used, the size of
the network grows. The recommended size for mobile phone models, according to iOS
guidelines, is around 10 MB.
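The file sizes in Table A.1 are dominated by the float32 weights, at roughly four bytes per parameter. A back-of-the-envelope estimate (the helper names below are ours, and serialization overhead is ignored):

```python
def conv1d_params(in_ch, out_ch, kernel, bias=True):
    """Parameter count of a 1-D convolution layer."""
    return out_ch * in_ch * kernel + (out_ch if bias else 0)

def approx_file_kb(n_params, bytes_per_param=4):
    """Rough float32 checkpoint size in KB, ignoring serialization overhead."""
    return n_params * bytes_per_param / 1024

# e.g. a network with roughly 234k parameters serializes to about 913 KB,
# on the order of the smallest entries in Table A.1.
```

This is why doubling the channel counts, which roughly quadruples the parameter count of the convolutional layers, roughly quadruples the file size in the table.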
Table A.1: PyTorch model file sizes for different hyperparameters

Common Section   Individual Section   Common Section   Individual Section   Embedding   File
# of Channels    # of Channels        # of Blocks      # of Blocks          Size        Size
8                4                    3                3                    128         913K
8                4                    2                2                    128         913K
8                4                    1                1                    128         826K
8                4                    4                3                    64          712K
16               8                    4                3                    64          1.4M
32               16                   2                2                    128         3.5M
32               16                   4                3                    64          2.8M
32               16                   2                2                    256         6.1M
64               32                   4                3                    64          5.6M
64               32                   2                2                    64          4.1M
64               32                   2                2                    256         13M
64               32                   2                2                    128         7.1M
256              128                  3                4                    64          33M
Appendix B
Failed Experiments
In this section, we discuss some failed experiments that made sense at the outset but did
not produce state-of-the-art results. We leave brief notes on the reasoning behind each
approach and on why it might have failed. Research is an iterative process, and this section
does not contain complete results because some ideas were abandoned halfway.
B.1 Random Forest
We first tried training a random forest model using CWT features. The random forest model performed somewhat better, but not significantly so. It still had the problem of having to choose a CWT scale and a mother wavelet. The gaus3 mother wavelet was tested, but it did not work as well as Daubechies 5.
In testing, the random forest did not perform well: the accuracy for the left ear on the best scale was 60.3%. It did not outperform the previous state of the art and was abandoned.
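The pipeline can be sketched as follows. Everything here is a stand-in: a Ricker wavelet replaces the Daubechies 5 / gaus3 wavelets (which need a wavelet library), and synthetic two-class signals replace the TEOAE data, which is not public.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def ricker(points, a):
    """Ricker (Mexican-hat) wavelet; a stand-in for Daubechies 5."""
    t = np.arange(points) - (points - 1) / 2
    return (1 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

def cwt_row(signal, scale):
    """One CWT scale as a feature vector: convolve with the scaled wavelet."""
    return np.convolve(signal, ricker(32, scale), mode="same")

# Synthetic stand-in for TEOAE responses: two "subjects" with different
# dominant frequencies over a 20 ms window.
t = np.linspace(0, 0.02, 400)
X, y = [], []
for label, freq in [(0, 1500), (1, 2400)]:
    for _ in range(40):
        sig = np.sin(2 * np.pi * freq * t) + 0.3 * rng.standard_normal(t.size)
        X.append(cwt_row(sig, scale=4.0))
        y.append(label)
X, y = np.array(X), np.array(y)

# Fit a random forest on the single-scale CWT features.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print("train accuracy:", clf.score(X, y))
```

Note that the scale (here 4.0) is still a hand-picked hyperparameter, which is exactly the drawback discussed above.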
B.2 Simple Convnet
We first designed an encoder, either a convolutional neural network or a fully connected network, that kept reducing the number of parameters until it reached the desired encoding dimension. Stacking was limited because each layer above the final encoding had to have more units than the encoding itself, which restricted the number of layers we could stack. The results were better than the random forest. The response was processed using CWT, multiple scales were stacked into an image, and this image was processed using 2D convolutions. The network is presented in Figure B.1.
Figure B.1: Simple Architecture used to perform classification
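A minimal PyTorch sketch of this design is shown below. The channel counts, strides, and input size are illustrative assumptions, not the settings of Figure B.1.

```python
import torch
import torch.nn as nn

class SimpleConvEncoder(nn.Module):
    """2-D convolutions over a stacked-CWT 'image', shrinking the
    feature map until it reaches the encoding dimension."""

    def __init__(self, encoding_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # Every layer above the head must keep more units than the
        # final encoding, which is what limits the stack depth.
        self.head = nn.Linear(32 * 4 * 4, encoding_dim)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

# A batch of eight 100-scale x 400-sample CWT "images".
x = torch.randn(8, 1, 100, 400)
emb = SimpleConvEncoder()(x)
print(emb.shape)  # torch.Size([8, 128])
```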
B.3 One Network
When one network was trained to predict both the left and the right ears, the accuracy of the left ear suffered greatly, while the accuracy of right-ear identification did not seem to be affected. This architecture is presented in Figure B.2.
Figure B.2: Architecture diagram for training the convolutional neural network without shared parameters
B.4 Autoencoder
We trained a four-layer autoencoder using the CWT as the feature, and the network trained well with low loss. We abandoned the idea due to the multiple training stages required to build a full system on top of this network.
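A minimal sketch of the abandoned approach follows: a small fully connected autoencoder over CWT feature vectors, with illustrative layer widths. A verifier would then have to be trained on top of the learned codes in a second stage, which is the extra step that led to the idea being dropped.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Four-layer fully connected autoencoder (widths are illustrative)."""

    def __init__(self, in_dim=400, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 400)  # stand-in for a batch of CWT feature vectors

# Stage 1: reconstruction training. A separate stage 2 (not shown)
# would train a classifier or verifier on model.encoder's codes.
for _ in range(5):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)
    loss.backward()
    opt.step()
print("reconstruction loss:", loss.item())
```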
B.5 CWT and Neural Networks
Originally, a 2D convolution was performed on an image formed by stacking the CWT scales. We used up to 100 scales to maximize performance. After testing the network with multiple random seeds, it was clear that the network was not generalizing: the variance across runs was far too large, even after weight initialization. We decided to remove CWT from the pre-processing.
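The scale-stacking step can be sketched as follows. For self-containment this uses a Ricker wavelet and plain convolution rather than the Daubechies 5 wavelet and a wavelet library, and the signal is synthetic.

```python
import numpy as np

def ricker(points, a):
    """Ricker wavelet at scale a (stand-in for the thesis wavelet)."""
    t = np.arange(points) - (points - 1) / 2
    return (1 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

def cwt_image(signal, n_scales=100):
    """Stack one CWT row per scale (1..n_scales) into a 2-D 'image',
    the representation that was fed to the 2-D convolutions."""
    rows = [np.convolve(signal, ricker(64, s), mode="same")
            for s in range(1, n_scales + 1)]
    return np.stack(rows)

# A 20 ms stand-in signal sampled at 400 points.
sig = np.sin(2 * np.pi * 2000 * np.linspace(0, 0.02, 400))
img = cwt_image(sig)
print(img.shape)  # (100, 400)
```

Each extra scale multiplies the input size, so a 400-sample response becomes a 100x400 image, which is one reason run-to-run variance was hard to control.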
B.6 Quadruplet Loss
We tried using the quadruplet loss for the network, but it did not perform better than the triplet loss. The quadruplet loss requires a second negative; this negative was picked at random rather than hard-mined like the triplet's negative, which might explain the lower performance. Even so, the accuracy and the EER were only a couple of percentage points worse. The quadruplet loss is visualized in Figure B.3.
Figure B.3: The final objective of quadruplet loss
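For reference, the two objectives can be sketched as below: the triplet term pulls the anchor toward the positive and away from one negative, and the quadruplet loss adds a second term involving a pair of negatives from different identities. The margins and the squared-Euclidean-distance convention are illustrative assumptions.

```python
import numpy as np

def d2(x, y):
    """Squared Euclidean distance between two embeddings."""
    return float(np.sum((x - y) ** 2))

def triplet_loss(a, p, n, margin=0.2):
    """Hinge on (anchor-positive) vs (anchor-negative) distance."""
    return max(0.0, d2(a, p) - d2(a, n) + margin)

def quadruplet_loss(a, p, n1, n2, margin1=0.2, margin2=0.1):
    """Triplet term plus a term pushing the positive pair closer than
    a pair of negatives from two other identities. In our experiments
    n2 was picked at random rather than hard-mined."""
    return (max(0.0, d2(a, p) - d2(a, n1) + margin1)
            + max(0.0, d2(a, p) - d2(n1, n2) + margin2))

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # close to the anchor
n1 = np.array([1.0, 0.0])  # far from the anchor
n2 = np.array([0.0, 1.0])  # a negative from another identity
print(triplet_loss(a, p, n1), quadruplet_loss(a, p, n1, n2))
```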
Appendix C
Data Splits
Table C.1: Random seed and test subject distribution for the 10-subject test
Random Seed | Test Subjects | Number of Training Responses
72 | 4,6,9,13,23,27,35,39,42,53 | 39,780
5 | 6,17,21,23,26,31,40,46,51,53 | 40,406
98 | 0,5,13,16,19,32,33,34,35,49 | 39,590
23 | 14,16,17,18,22,24,28,33,36,50 | 42,960
83 | 1,12,17,26,28,29,33,34,45,46 | 41,554
59 | 1,6,16,17,18,27,39,46,48,51 | 41,570
4 | 4,6,12,17,20,22,28,31,37,52 | 41,606
33 | 1,5,15,21,30,32,40,41,44,50 | 42,808
27 | 6,9,11,17,18,20,36,39,40,46 | 40,688
85 | 1,5,7,8,20,30,39,44,45,47 | 39,962
45 | 2,5,7,16,19,26,41,44,50,51 | 41,982
41 | 8,9,10,14,15,36,37,40,49,53 | 40,974
18 | 12,16,26,30,31,33,37,38,44,45 | 41,600
35 | 7,14,17,34,35,36,45,49,50,52 | 41,040
57 | 9,10,20,27,29,30,39,43,49,50 | 39,780
42 | 3,5,12,17,19,32,44,48,49,52 | 40,380
26 | 2,5,9,10,20,36,37,39,47,50 | 41,242
62 | 4,5,15,27,28,32,35,44,47,49 | 40,620
40 | 4,16,20,21,26,34,38,39,49,53 | 35,238
24 | 2,5,8,13,22,24,27,41,46,53 | 42,094
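A subject-disjoint split of this kind can be generated from a seed along the following lines. This sketch assumes 54 subjects indexed 0-53 and a simple uniform draw; it will not reproduce the exact subject lists above, which depend on the original sampling code.

```python
import random

def subject_split(seed, n_subjects=54, n_test=10):
    """Hold out n_test subjects for testing; train on the rest.
    Subjects, not responses, are split, so no test identity is
    ever seen during training."""
    rng = random.Random(seed)
    test = sorted(rng.sample(range(n_subjects), n_test))
    train = [s for s in range(n_subjects) if s not in test]
    return train, test

train, test = subject_split(72)
print(len(train), len(test), test)
```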
Table C.2: Random seed and test subject distribution for the 24-subject test
Random Seed | Test Subjects | Number of Training Responses
72 | 4,6,9,11,13,14,17,18,22,23,25,27,31,33,34,35,36,39,42,43,47,49,52,53 | 26,530
5 | 0,2,3,4,5,6,17,19,21,23,24,26,31,32,33,34,40,41,45,46,50,51,52,53 | 29,276
98 | 0,5,8,10,11,12,13,14,16,17,19,24,29,31,32,33,34,35,36,43,46,47,49,50 | 28,340
23 | 3,8,10,13,14,16,17,18,20,22,23,24,28,29,33,34,35,36,41,44,47,48,50,52 | 26,796
83 | 1,2,7,10,11,12,17,19,21,24,26,28,29,31,33,34,35,37,38,41,44,45,46,50 | 31,114
59 | 1,5,6,7,16,17,18,21,22,27,28,32,34,35,39,41,42,45,46,48,50,51,52,53 | 28,388
4 | 4,6,7,11,12,14,15,16,17,18,20,22,24,25,27,28,29,31,32,34,37,42,43,52 | 29,346
33 | 0,1,4,5,6,10,15,17,21,25,26,27,28,30,32,36,37,39,40,41,44,46,49,50 | 28,612
27 | 0,1,5,6,9,11,14,16,17,18,20,27,28,30,36,39,40,41,43,44,46,47,49,50 | 28,422
85 | 0,1,5,7,8,14,16,17,18,20,23,26,30,35,38,39,40,41,44,45,46,47,49,53 | 26,448
45 | 0,2,5,7,9,13,16,18,19,20,26,27,28,29,36,37,40,41,44,48,49,50,51,53 | 26,370
41 | 5,7,8,9,10,14,15,18,27,29,30,31,33,36,37,39,40,41,43,46,47,49,51,53 | 28,952
18 | 7,12,15,16,20,22,23,25,26,27,30,31,32,33,37,38,39,40,41,44,45,48,50,52 | 28,968
35 | 2,4,7,13,14,17,18,20,22,23,26,31,32,34,35,36,38,39,42,43,45,49,50,52 | 26,758
57 | 2,3,7,9,10,16,19,20,27,29,30,31,32,33,35,36,38,39,42,43,44,45,49,50 | 25,916
42 | 3,4,5,6,8,9,12,13,15,16,17,19,24,26,32,33,34,37,44,45,48,49,50,52 | 27,134
26 | 0,2,3,5,7,8,9,10,11,14,20,22,29,35,36,37,39,40,42,44,45,47,50,52 | 27,982
62 | 2,4,5,6,7,15,16,20,22,23,25,27,28,31,32,34,35,41,43,44,47,49,52,53 | 25,044
40 | 0,2,4,5,11,16,18,20,21,24,25,26,29,33,34,35,36,38,39,40,48,49,50,53 | 22,520
24 | 2,5,8,10,13,14,16,19,20,21,22,24,26,27,32,37,38,39,40,41,44,46,52,53 | 27,658