Deep Learning Bottle-Neck Features for Speaker Recognition



UNIVERSITAT POLITÈCNICA DE CATALUNYA

FINAL DEGREE PROJECT

Deep Learning Bottle-Neck Features for Speaker Recognition

Author: Eudald Cumalat Puig

Supervisor: Javier Hernando Pericas

A thesis submitted in fulfillment of the requirements for the degree of Enginyeria en Sistemes de Telecomunicació

in the

Grup de Tractament de la Parla, Departament de Teoria del Senyal i Comunicacions

October, 2018


UNIVERSITAT POLITÈCNICA DE CATALUNYA

Abstract

Escola Tècnica Superior d'Enginyeria de Telecomunicació de Barcelona

Departament de Teoria del Senyal i Comunicacions

Enginyeria en Sistemes de Telecomunicació

Deep Learning Bottle-Neck Features for Speaker Recognition

by Eudald Cumalat Puig

Speaker recognition is a very useful and powerful technology with many interesting security applications, which makes it a research field worth investing many efforts in. Recognizing whether a person is who he or she claims to be is a natural capability of humans, but we are not perfect, and deep learning has brought the possibility of optimizing this ability almost to perfection. Putting this together with the fact that each of us has a unique voice makes this technology a very important option to keep in mind when designing a security system for banks or mobile devices. In this project I test some evaluation methods for speaker recognition systems and also implement several optimizations for siamese neural networks. I optimized the baseline network from EER = 31,51% to EER = 13,73%, applied PLDA scoring techniques to the embeddings of the siamese neural network and to the embeddings of the pre-trained network, and then repeated all the experiments on a siamese model with reduced dimensions. This project has been developed using Python (Keras, TensorFlow), Matlab and the ALIZÈ toolkit for speaker recognition.


Acknowledgements

First of all, I want to thank my tutor, Javier Hernando Pericas, for guiding, teaching and supporting me throughout the development of this project. Second, I want to thank Miquel Angel India Massana for all the resources and ideas provided, and mainly for his patience, and also Umair Khan, who helped me with one of the parts of the project. Last but not least, I want to thank my family and friends, who were always there supporting me.


To my grandparents, always motivating me and being proud of what I do.


Contents

Abstract

Acknowledgements

Contents

List of Figures

List of Tables

List of Abbreviations

1 Problem Formulation
    1.1 Motivation
    1.2 Project goals
    1.3 Temporal Planification
        1.3.1 Gantt Diagram
        1.3.2 Problems
    1.4 Project Budget
        1.4.1 Material
        1.4.2 Software
        1.4.3 Personnel
        1.4.4 Estimated costs
    1.5 Methodology
        1.5.1 Research
        1.5.2 Implementation and project monitoring

2 Context
    2.1 Introduction and state of the art
        2.1.1 Speaker recognition
        2.1.2 Deep Learning
    2.2 Database used

3 Development: PLDA scoring techniques
    3.1 i-Vector extraction
    3.2 Scoring methods
        3.2.1 Cosine Distance
        3.2.2 G-PLDA
        3.2.3 HT-PLDA
        3.2.4 Fast Variational Bayes G-PLDA and HT-PLDA

4 Development: Siamese Neural Network
    4.1 Neural Network
        4.1.1 Structure
        4.1.2 Activation functions
        4.1.3 Bottle-Neck layer
        4.1.4 Learning process - Loss functions
        4.1.5 Siamese neural network
        4.1.6 Training process
        4.1.7 Validation process
        4.1.8 Parameters
    4.2 Baseline and optimization
        4.2.1 Baseline network shape
        4.2.2 Baseline layer configuration
        4.2.3 Baseline parameter configuration
        4.2.4 Model pre-training
        4.2.5 Embeddings extraction
        4.2.6 PLDA scoring from embeddings
        4.2.7 Shape optimization and parameter tuning

5 Results
    5.1 PLDA scoring techniques
    5.2 Siamese Neural Network

6 Impact
    6.1 Environmental
    6.2 Social

7 Conclusions and future work

Bibliography


List of Figures

1.1 Gantt diagram of the project
3.1 i-vector extraction process
4.1 Neural network simple structure example
4.2 ReLU graphic
4.3 Sigmoid graphic
4.4 Bottleneck layer diagram example
4.5 Siamese diagram
4.6 Siamese pre-train diagram
4.7 Siamese embeddings extraction diagram
4.8 Siamese embeddings extraction from pre-trained network diagram
4.9 Second siamese diagram
5.1 DET curves for the first siamese experiments
5.2 DET curves for the first PLDA experiments
5.3 DET curves for the second configuration experiments


List of Tables

1.1 Table of material costs
1.2 Table of software costs
1.3 Table of the costs of the personnel
4.1 Table of details of each of the layers from the features subnetworks
4.2 Table of details of the concatenation subnetwork
4.3 Table of details of each of the layers from one of the two subnetworks
4.4 Table of details of the concatenation subnetwork
4.5 Table of the parameter configuration of the second siamese neural network


List of Abbreviations

PLDA        Probabilistic Linear Discriminant Analysis
G-PLDA      Gaussian PLDA
HT-PLDA     Heavy-Tailed PLDA
VBG-PLDA    Variational Bayes G-PLDA
VBHT-PLDA   Variational Bayes HT-PLDA
GMM         Gaussian Mixture Model
UBM         Universal Background Model
EER         Equal Error Rate
FAR         False Acceptance Rate
FRR         False Rejection Rate
PCA         Principal Component Analysis


Chapter 1

Problem Formulation

1.1 Motivation

Being a chess master, for example, requires lots of effort, hours and dedication. Being able to translate from other languages also requires years of practice and experience. But nowadays, thanks to the internet and with very few notions of machine learning, you can create a system that becomes an unbeatable chess player in just two days of training or less, or you can make a system able to recognise precisely whether you are lying or telling the truth, whether you are the person you claim to be, or even what you are feeling at that moment. The fact that this can be achieved with only a few lines of code fascinated me and brought me to this field and, more specifically, to this project.

1.2 Project goals

This project is divided into two main parts: PLDA scoring techniques and Siamese Neural Networks. Each part has its own goals.

For the PLDA scoring techniques part, its main goals are:

1. Use the ALIZÈ toolkit (http://alize.univ-avignon.fr) to extract the i-Vectors from the feature vectors of the database used.

2. Using the i-vectors extracted, reproduce the baseline results in [4].

3. Test different PLDA systems on the i-vectors extracted.

For the Siamese Neural Networks part, its main goals are:

1. Learn how a Siamese Neural Network works.

2. Train a Siamese Neural Network.


3. Optimize the previous network tuning its parameters.

4. Try several methods to improve the performance of the network, such as pre-training, triplet loss, or using different embeddings as inputs.

1.3 Temporal Planification

1.3.1 Gantt Diagram

I started this project on the 6th of February 2018. Taking into account the extension I had to take, I have calculated the dedication approximately as follows:

25 hours/week × 35 weeks = 875 hours

Figure 1.1 shows the diagram of tasks that I followed during the development of this project. It has 4 main tasks: Introduction, PLDA scoring techniques, Siamese Neural Networks and the writing of this document.

FIGURE 1.1: Gantt diagram of the project, on a weekly view.

1.3.2 Problems

Several factors have caused this project to be delayed:

• Getting stuck on difficult software problems in the scripts I coded for long periods of time.

• Lack of knowledge: the time spent studying and learning about Deep Learning, speaker recognition and neural networks is a factor to take into account, because these are not simple concepts. The ability to program in each language needed (Python, Bash, Matlab...) also requires time to become familiar with.


• Calendar: by the time you have introduced yourself to the world of Deep Learning and have enough knowledge to start developing any of the objectives of the project, you may have already consumed one quarter of the time available.

1.4 Project Budget

1.4.1 Material

This project has been developed on my two personal computers (a laptop and a tower). Since the project is developed through a VPN connection to a remote server, you do not need a computer with many resources.

1.4.2 Software

• Ubuntu 16.04: Operating system of the laptop where I developed this project.

• Windows 10: Operating system of the tower where I also developed this project.

• Ubuntu 18.04.1 LTS: Operating system of the Calcula server.

• Overleaf: Website where I wrote the document of the project in LaTeX.

• Instagantt: Website to design Gantt diagrams.

• draw.io: Website to design block diagrams.

• Matlab: Program used to develop some of the PLDA scoring technique scripts and the evaluation system.

• Python: Programming language used to develop part of the scripts of this project.

• Bash: Scripting language based on Linux commands, used mainly to manipulate files or create execution routines for the Python/Matlab scripts.

• ALIZÈ toolkit: Software used to extract the i-vectors from the feature vectors. [13]

1.4.3 Personnel

In this project, the only personnel needed is the developer (in this case, the author of this document). As this is an individual project, I have assumed the roles of project leader, developer and tester. As described in section 1.3.1, the dedication to this project has been approximately 875 hours, and those hours are approximately divided into 150 hours as project leader, 550 hours as developer and 175 hours as tester.

1.4.4 Estimated costs

The following table summarizes the costs of the hardware I had to use during this project:

Product   Quantity   Cost
Laptop    1          800,00 €
Tower     1          1.200,00 €
TOTAL:               2.000,00 €

TABLE 1.1: Table of material costs.

In the following table, the costs for all the software licenses I had to use can be found:

Product              Quantity   Cost
Windows 10           1          0,00 €
Ubuntu 16.04         1          0,00 €
Ubuntu 18.04.1 LTS   1          0,00 €
Overleaf             1          0,00 €
Instagantt           1          0,00 €
draw.io              1          0,00 €
Matlab               1          0,00 €
Python               1          0,00 €
ALIZÈ toolkit        1          0,00 €
TOTAL:                          0,00 €

TABLE 1.2: Table of software costs.

The calculation of the costs of the personnel involved in this project is shown in the following table:

Role             Quantity   Cost / hour   Hours   Final cost
Project leader   1          35,00 €       150     5.250,00 €
Developer        1          15,00 €       550     8.250,00 €
Tester           1          15,00 €       175     2.625,00 €
TOTAL:                                            16.125,00 €

TABLE 1.3: Table of the costs of the personnel.

To sum up, I have calculated that the total cost of this project, taking into account the hardware, software and personnel resources, is 18.125,00 €.


1.5 Methodology

1.5.1 Research

The research process has been constant during the whole project, as I have been experimenting with software I did not know well enough and with complex concepts I had to understand well in order not to make mistakes in my scripts. I have been given lots of papers (listed in the bibliography) to navigate through in order to get a deeper knowledge of the state-of-the-art structures and scoring methods.

1.5.2 Implementation and project monitoring

In this project I have both taken others' scripts and adapted them to my particular case, and designed and written my own scripts based on ideas taken from others' papers.

After writing the scripts, I have run them on the Calcula server (if the script requires a lot of memory or time) or on my tower.

This project has been monitored via non-scheduled meetings (not scheduled from the beginning of the project; we met every time I had some results or doubts) and by email.


Chapter 2

Context

2.1 Introduction and state of the art

Voice signals contain information about identity, gender, emotions, age, language and many other things that, if we are able to read them, can be very useful. Being able to determine the identity of a person, for example, has a wide range of applications in the world of security. You can unlock your phone by simply saying hi, log into your bank account, or even enter your house, and by combining it with other biometric systems such as fingerprint, face, iris or signature, you can improve security a lot. Getting someone's password is as easy as seeing him or her type it, or hearing it, but biometrics cannot be easily reproduced. Due to all these facts, the ability to determine the identity of someone from the voice, or Speaker Recognition, has been a major target of research in the field of deep learning.

2.1.1 Speaker recognition

Speaker recognition is the art of identifying a person using his or her voice features (in the time and/or frequency domain). It is a complex process that takes a voice signal as input and outputs a decision, and deep learning can be applied in all its parts. The state-of-the-art approach in this field is to extract the features from the spectrogram of the voice signal and model the supervectors using GMMs adapted from an already trained UBM, as in [14] (explained in detail in section 3.1). The two main variants are speaker identification and verification, and they can be text dependent or independent.

• Identification: The objective is to identify the correct speaker from a set of N possible pre-enrolled speaker identities in the database. [7]

• Verification: The speaker claims an identity and the system has to be able to decide whether that identity corresponds to the claimer or not, based on a test speech signal and a pre-enrolled speaker record. [3]


And both types can, at the same time, be:

• Text-Dependent: These systems are trained to identify a person or identity through a specific word or sentence, for example OK Google on Android or Hey Siri on Apple devices. It is believed that speaker verification, for example, achieves better accuracy when used in a text-dependent system. [1]

• Text-Independent: Text-independent systems are able to identify a person or identity from anything the person wants to say; there are no constraints on the words that the speaker is allowed to use. Due to this, the enrollment and the test data can have completely different content, and the system must take this mismatch into account because, usually, phonetic variability represents an adverse factor for accuracy. [6]

2.1.2 Deep Learning

Deep Learning is a subfield of Machine Learning (briefly, a field of computer science that gives systems the ability to learn through statistical techniques) based on algorithms that try to imitate the structure and functioning of the brain. These algorithms are called artificial neural networks (ANN) and there are many types: deep neural networks (DNN), deep belief networks (DBN) and recurrent neural networks (RNN). All these algorithms are based on feature learning, which is a set of techniques that allows a system to automatically learn and distinguish, from raw data (voice recordings, photos, texts...), the characteristics needed for taking a decision or making a classification. There are two main types of learning:

• Supervised: In supervised feature learning, you feed the system with labeled data, so that it can train the model, check its accuracy and find an optimal training spot. Some applications are recommendation systems, disease diagnosis, crime prediction based on historical data and other classification problems.

• Unsupervised: In unsupervised feature learning, the system is fed with unlabeled data. These systems focus on grouping the unsorted data according to similarities and differences. Some applications are chatbots, self-driving cars and facial recognition systems.


2.2 Database used

The database used in this project, both for the PLDA part and for the siamese neural network, is the core test-common condition 5 of the NIST 2010 SRE [10]. It includes a number of trials of normal conversational telephone speech used for both training and testing (labeled). For the background data, the trials are collected from NIST SRE 2004-2008, which include 37.599 speech utterances. Also, of those 37.599 utterances, 18.140 signals are labeled and used for the PLDA training.

The PLDA part is developed directly from the features already extracted. The details of the feature extraction can be found in section 4.1 of [4].

To extract the i-vectors from the feature vectors of the speech utterances, the ALIZÈ open source software [13] has been used. Baseline cosine distance and PLDA scoring are also performed using the ALIZÈ toolkit. The i-vector dimension is 400 and the PLDA dimension is also 400.

For the siamese neural network part, the inputs of the system are the supervectors of the speech utterances, which have a dimension of 16896; for the PLDA experiments using the siamese embeddings, the input size is the number of nodes of the bottleneck layer of that model.

The performance of all the experiments is evaluated using the Equal Error Rate (EER).


Chapter 3

Development: PLDA scoring techniques

3.1 i-Vector extraction

The main parts of the i-vector extraction process from speech signals are described in this section:

• Feature Extraction – The voice signal is transformed into feature vectors, which are numerical representations of the discriminative speaker information, and can be processed by algorithms in order to do statistical analysis. The state-of-the-art approach in the field is to first extract the speech spectrum, and then extract the features from it, as in [14]. The most popular representation is the Mel-Frequency Cepstrum (MFC), which works very well in speech processing.

Several techniques can be applied to the feature vectors in order to compensate for external effects. The ones that are most commonly used are: subtracting the mean (Cepstral Mean Normalization), subtracting the mean and normalizing the variance to the unit (Cepstral Mean and Variance Normalization), and feature warping.

• Gaussian Mixture Model (GMM) Supervectors – A GMM is a weighted sum of Gaussian densities which are defined by three sets of parameters: the weights, the means and the covariance matrices. The GMM is adapted from a Universal Background Model (UBM), which is a GMM previously trained with a large number of speech segments, also known as the global GMM. The feature vectors are used to estimate the parameters of each adapted GMM with the Expectation-Maximization (EM) algorithm, as in [12]. Then, the mean vectors of the n adapted GMM components are stacked to build a supervector, which for each speaker i is:

s_i = (µ_{i1}, µ_{i2}, ..., µ_{in})^t


• i-vectors – The previous supervectors can be modeled as:

s_i = m + Tv

Where m is the speaker- and session-independent mean supervector, T is the total variability matrix and v is a vector of latent variables. An i-vector w is the mean of the posterior distribution of this vector v, which is conditioned by the Baum-Welch statistics of the given utterance of the speaker i. This i-vector w is computed as:

w = (I + T^t Σ^{-1} N(u) T)^{-1} T^t Σ^{-1} F̃(u)

Where N(u) is a diagonal matrix that contains the zeroth order Baum-Welch statistics, F̃(u) is a supervector of the centralized first order statistics and Σ is a diagonal covariance matrix initialized by Σ_ubm and updated during the factor analysis training. The training of the T matrix is performed using the EM algorithm given the Baum-Welch statistics from background speech utterances. So, an i-vector [9] is basically a low rank vector representation of a speech utterance that contains information about the speaker, gender, emotions, age, ... (a numerical sketch of this computation follows).
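The following is a minimal numpy sketch of the posterior-mean computation above, assuming the zeroth- and first-order statistics and the T matrix have already been estimated; all dimensions and variable names are illustrative and do not correspond to the ALIZÈ implementation.

```python
import numpy as np

def extract_ivector(T, Sigma, N_u, F_u):
    """Posterior mean of the latent vector for one utterance (i-vector).

    T     : (CF, R) total variability matrix
    Sigma : (CF,)   diagonal of the UBM covariance supervector
    N_u   : (CF,)   zeroth-order Baum-Welch statistics, expanded per dimension
    F_u   : (CF,)   centralized first-order Baum-Welch statistics
    """
    Sigma_inv = 1.0 / Sigma                        # inverse of a diagonal matrix
    # I + T' Sigma^-1 N(u) T
    precision = np.eye(T.shape[1]) + T.T @ (Sigma_inv[:, None] * N_u[:, None] * T)
    # (...)^-1 T' Sigma^-1 F~(u)
    return np.linalg.solve(precision, T.T @ (Sigma_inv * F_u))

# toy example with random statistics, just to show the shapes involved
rng = np.random.default_rng(0)
CF, R = 1024, 400                                  # supervector dim, i-vector dim
w = extract_ivector(rng.standard_normal((CF, R)), np.ones(CF),
                    rng.uniform(0.5, 2.0, CF), rng.standard_normal(CF))
print(w.shape)                                     # (400,)
```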

FIGURE 3.1: Graphic representation of the i-vector extraction process from raw data. From the "Deep Learning for Speech and Language" seminar of the UPC.


3.2 Scoring methods

3.2.1 Cosine Distance

A simple method to compare two utterances in order to decide whether the test utterance belongs to the claimed speaker or not is the cosine distance [8]:

score(w_target, w_test) = (w_target^t · w_test) / (‖w_target‖ · ‖w_test‖) ≷ θ

Where θ is the decision threshold, and w_target (from a known speaker) and w_test (from an unknown speaker) are respectively the target and test i-vectors. This method only considers the angle between the two i-vectors and not their magnitudes, but it is believed that session and channel information (non-speaker information) affect their magnitudes, so removing the magnitude from the scoring process may result in a more robust i-vector scoring system.
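A minimal numpy sketch of this scoring rule; the threshold value used here is illustrative only.

```python
import numpy as np

def cosine_score(w_target, w_test):
    """Cosine similarity between a target and a test i-vector."""
    return float(w_target @ w_test /
                 (np.linalg.norm(w_target) * np.linalg.norm(w_test)))

theta = 0.5                                        # illustrative decision threshold
rng = np.random.default_rng(0)
w_tgt, w_tst = rng.standard_normal(400), rng.standard_normal(400)
print(cosine_score(w_tgt, w_tst) >= theta)         # accept the trial if score >= theta
```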

3.2.2 G-PLDA

G-PLDA [11] is a method that performs the scoring with session-variability compensation, which, as explained in the previous section on cosine distance, is why it is a more effective method. It assumes the i-vectors can be decomposed as follows:

w = m + Φξ + ε

Where m is the global offset of the i-vector, the columns of Φ are eigenvoices, ξ is the latent vector (having a standard normal prior) and ε is a residual vector, normally distributed, with zero mean and a full covariance matrix.

G-PLDA assisted with a Gaussianization step (length normalization) is one of the most used methods for evaluating speaker recognizers that use i-vectors or x-vectors. It can be trained either with generative or discriminative methods.
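To make the scoring step concrete, the following is a minimal sketch of the verification log-likelihood ratio induced by the decomposition above, assuming centered i-vectors and already-trained Φ and residual covariance; the ALIZÈ and variational Bayes implementations used in this project are far more elaborate than this.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gplda_llr(w1, w2, Phi, Sigma_eps):
    """Log-likelihood ratio: same speaker vs. different speakers."""
    B = Phi @ Phi.T                                # between-speaker covariance
    T = B + Sigma_eps                              # total covariance of one i-vector
    same = np.block([[T, B], [B, T]])              # joint covariance if xi is shared
    diff = np.block([[T, np.zeros_like(B)], [np.zeros_like(B), T]])
    x = np.concatenate([w1, w2])
    mean = np.zeros(len(x))
    return (multivariate_normal.logpdf(x, mean=mean, cov=same)
            - multivariate_normal.logpdf(x, mean=mean, cov=diff))

# toy usage with random model parameters (illustrative dimensions)
rng = np.random.default_rng(0)
d, r = 20, 5
Phi, Sigma_eps = rng.standard_normal((d, r)), np.eye(d)
print(gplda_llr(rng.standard_normal(d), rng.standard_normal(d), Phi, Sigma_eps))
```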

For the scripts, the G-PLDA has many parameters that have to be tuned in order to get a good evaluation. Those parameters are:

• Size of speaker identity: In order to reproduce the baseline results, this parameter is 400.

• Iterations: In order to reproduce the baseline results, this parameter is 50.


3.2.3 HT-PLDA

HT-PLDA [5] is a scoring method known to give similar accuracy to G-PLDA without the need for length normalization, but at a considerable extra computational cost. It assumes that ξ has a Student's t prior instead of a Gaussian prior as in G-PLDA.

3.2.4 Fast Variational Bayes G-PLDA and HT-PLDA

Fast variational Bayes is an algorithm proposed in [2] which gives HT-PLDA approximately the same computational load as standard G-PLDA, also without the need to length normalize the vectors. The vectors are expected to have zero mean.

For the scripts (which can be found in [2]), this method has many parameters to be tuned in order to get the best accuracy. The configuration is different depending on whether you want to score using G-PLDA or HT-PLDA.

The parameters are:

• nu: This represents the degrees of freedom. For G-PLDA, nu has to be inf, and for HT-PLDA I set nu = 2.

• Size of speaker identity: For both cases, I set this parameter to 200.

• Iterations: In both cases, the best results were obtained with the number of iterations set to 10.

• F: Factor loading matrix, which is initialized to the identity if you do not specify it.

• W: Within-speaker precision, which is also initialized to the identity if not specified.


Chapter 4

Development: Siamese Neural Network

4.1 Neural Network

A neural network is basically a system that learns how to perform specific tasks through trial and error, using examples. There are neural networks that play online video games and, after having played against several players, become unbeatable. Other examples are systems that detect who a person is, or systems that learn to recognize voice. In this last example, you feed the network with voice samples and it learns characteristics that allow it to take decisions on, for example, whether a person is who he or she claims to be.

4.1.1 Structure

A neural network has many layers (depending on the design); the number of layers is called the depth of the network and, depending on it, the network will behave one way or another. Each layer has many nodes, which receive an input, change their internal state and produce an output (depending on the decision taken by the activation function, explained in detail in subsection 4.1.2). These nodes are called neurons, and they try to imitate the biological structure of the brain.

The neurons are connected with other neurons, creating the network this way. Each connection has a weight (which is the parameter to be tuned in the learning phase), and can be represented as:

n_0 = f( ∑_{i=1}^{N} w_{i0} n_i + b_0 )


Where f is the chosen activation function, n_0 is the neuron that receives the signal, N is the number of neurons of the previous layer that have a connection with this neuron, n_i is each of the N neurons of the previous layer, w_{i0} is the weight associated with that connection, and b_0 is a real number (bias) associated with the neuron n_0.

FIGURE 4.1: This figure illustrates the structure of a very simple neural network.

4.1.2 Activation functions

An activation function defines the output of a neuron given an input. There are many of them, but in this project I use the following ones (a short numerical sketch follows the list):

• Rectified linear unit (ReLU):

f(x) = 0 for x < 0,  f(x) = x for x ≥ 0


FIGURE 4.2: ReLU function.

• Sigmoid:

f(x) = σ(x) = 1 / (1 + e^{-x})

FIGURE 4.3: Sigmoid function.

• Softmax:

f_i(x) = e^{x_i} / ∑_{j=1}^{J} e^{x_j},  for i = 1, ..., J
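A short numerical sketch of the three activation functions, using numpy:

```python
import numpy as np

def relu(x):
    """Zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Maps any real input into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Normalizes a score vector into a probability distribution."""
    e = np.exp(x - np.max(x))          # shift for numerical stability
    return e / e.sum()

print(relu(np.array([-1.0, 2.0])), sigmoid(0.0), softmax(np.array([1.0, 2.0, 3.0])))
```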

4.1.3 Bottle-Neck layer

A bottleneck layer in a neural network is a layer with fewer neurons than the one below or above it. The purpose of putting a bottleneck layer in the network is to obtain a network that generalizes well to new inputs, because these layers reduce the number of parameters in the network while still allowing it to be deep and represent many feature maps.


FIGURE 4.4: This figure illustrates a bottleneck layer (in red) in a neural network.

4.1.4 Learning process - Loss functions

The training of a neural network consists of the optimization of a loss function. The learning process has the following phases:

• Forward propagation: First, we feed the network with an input and get the corresponding output.

• Error calculation: Then, using the loss function and the real and expected outputs, the error is calculated.

• Backward propagation: Finally, the calculated error goes back through the whole network and adjusts the weights. Each adjustment is proportional to how much that node has contributed to the error.

In this project I have used the following two loss functions:

• Binary cross entropy: Binary cross entropy has been used with the siamese neural network. This function scores how wrong a model is, and large values mean that the model is worse, so the objective is to minimize it. The function is the following (a numerical sketch is given after this list):

L(θ) = −[ y′ log(y) + (1 − y′) log(1 − y) ]

Where θ are the weights, y is the output of the network and y′ is the expected output. So, the objective here is to get an optimal set of weights that minimize the function L.

• Categorical cross entropy: The categorical cross entropy has been used to train the pre-trained model. It also outputs how wrong a model is, but the difference with binary cross entropy is that the categorical version uses as outputs (real and expected) the index of the speaker within an output vector of length equal to the number of speakers.
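A small numerical sketch of the binary cross entropy above, for a single verification trial:

```python
import numpy as np

def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    """Loss for one trial: y_true is 1 (same speaker) or 0 (different speakers)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)       # avoid log(0)
    return -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

print(binary_cross_entropy(0.9, 1))                # small loss: confident and correct
print(binary_cross_entropy(0.9, 0))                # large loss: confident and wrong
```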


4.1.5 Siamese neural network

A siamese neural network is an artificial neural network that has two branches (sub-networks) or more with identical weights and configuration. It also has additional layers above that concatenate the outputs of the two branches, and another layer that gives the decision. This type of neural network is useful when you want to know how similar two things are (signatures, texts, ...), and has the following advantages:

• It has to train a smaller number of parameters because the weights of the two branches are shared. This means less training time, less data required and also less tendency to over-fit.

• Each branch produces a feature vector of its input; if the inputs are similar, the feature vectors will be similar too and easier to compare.

4.1.6 Training process

In order to perform the training of the model, I used as training data the whole set of 1372 different speakers of the training set and, for the validation, a small selection of speakers from the test set. In order to stop the training at the optimal epoch, I have used the early stopping function from the Keras library, monitoring the test accuracy and saving the epoch where it is at its maximum value.
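A minimal sketch of this stopping strategy with the Keras callbacks, assuming a compiled siamese model and data generators; the monitored metric name and the checkpoint path are illustrative, not the exact project configuration.

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # stop after 50 epochs without improvement of the monitored accuracy
    EarlyStopping(monitor="val_accuracy", patience=50, restore_best_weights=True),
    # keep on disk the weights of the best epoch seen so far
    ModelCheckpoint("best_siamese.h5", monitor="val_accuracy", save_best_only=True),
]
# model.fit(train_generator, validation_data=val_generator,
#           steps_per_epoch=100, validation_steps=50,
#           epochs=1000, callbacks=callbacks)
```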

The training is performed by randomly generating positive and negative cases and feeding them to the network. This generation is repeated 50 times, and each time it generates 2 examples (one positive and one negative), forming a batch of size 100. To make each generation, the code follows these steps, sketched in code right after the list:

• Among all the different speakers in the set, two of them are randomly selected.

• For the first speaker, two of its files are randomly selected (positive example).

• For the second speaker, one of its files is randomly selected and the other file is selected from the first speaker (negative example).
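A minimal sketch of the pair generation described above, with a hypothetical mapping from speaker identifiers to their file lists:

```python
import random

def generate_batch(files_by_speaker, n_pairs=50):
    """One batch of positive and negative pairs following the steps above."""
    left, right, labels = [], [], []
    speakers = list(files_by_speaker)
    for _ in range(n_pairs):
        spk_a, spk_b = random.sample(speakers, 2)           # two distinct speakers
        f1, f2 = random.sample(files_by_speaker[spk_a], 2)   # positive pair
        left.append(f1); right.append(f2); labels.append(1)
        left.append(random.choice(files_by_speaker[spk_a]))  # negative pair
        right.append(random.choice(files_by_speaker[spk_b])); labels.append(0)
    return left, right, labels

# toy usage with two speakers and hypothetical file names
demo = {"spk1": ["a1", "a2", "a3"], "spk2": ["b1", "b2", "b3"]}
x1, x2, y = generate_batch(demo, n_pairs=2)
print(list(zip(x1, x2, y)))
```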

Once the batch is generated, the following process happens in the network:

• The first set of 50 examples is introduced to the network.

• At the same time, the second set of 50 examples is also introduced to the network, to the second branch (for clarification, see figure 4.5).

• The two sets of examples generate 50 representations each, and each representation from the first branch is unified with its analog from the second branch.


• The 50 unified representations are introduced to the concatenation subnetwork, producing an output.

• Using the loss function, the error between the obtained value and the expected value is calculated.

• The error propagates through the network updating the weights (backward propagation, explained in subsection 4.1.4).

In my case, this whole process happens 100 times for each epoch.

4.1.7 Validation process

For the validation, the batch generation process is the same as in the training subsection. The process after generating the batch differs from the moment when the representations output by the two branches get unified and used as input of the concatenation subnetwork. From this point, the process continues as follows:

• The concatenation subnetwork produces an output and, if it is greater than or equal to 0.5, it is considered a positive case. If not, it is considered a negative case.

• Using the labels of the examples, the script checks whether the prediction of a positive case or a negative case is correct.

This validation process is repeated 50 times for each epoch, and after these 50 times, the accuracy is calculated. If this accuracy is better than the accuracy of the previous epoch, the script takes this epoch as the best and moves on to the next one; if it is not better, it moves to the next epoch without taking this one as the best. If this happens 50 times (the number of patience epochs of the early stopping tool), the script stops and saves the model of the best epoch registered in terms of test accuracy.

4.1.8 Parameters

All the parameters I used to configure the network are described and detailed in this subsection. The parameters decide, for example, how fast the network is trained or how many connections are discarded. The weight of the connection between two neurons is updated as:

w(t + 1) = w(t) − λ ∂L/∂w − λ k w(t)

Where w is the weight of the connection, t is the moment of the training, λ is the learning rate parameter, k is the weight decay parameter and ∂L/∂w is the contribution of this weight to the total error (L is the loss function).


• Learning Rate: The learning rate λ determines the learning speed of the network. A bad choice of the learning rate could lead to over-fitting. The learning rate could also be defined as the importance given to the contribution of this weight to the total error.

• Weight Decay: The weight decay k is a regularization parameter used in order to avoid over-fitting by reducing the size of the weights (proportionally to their size). If a weight is big, the penalization is also big, but if the weight is small it can grow freely without much penalization.

• Dropout: The dropout is a regularization parameter which represents the percentage of connections that are randomly disabled in each epoch. This forces the network not to use all its weights, which also helps to avoid over-fitting.

• Batch Size: The batch size is basically the number of samples that is going to be propagated through the network.

• Epochs: One epoch is when the entire dataset has passed forward and backward through the neural network once. It can be seen as an iteration, because after each epoch the network weights get updated, and you look for the best epoch to stop training based on the accuracy or the loss. If you train your model for too many epochs (over-fitting), it can get too used to the problem and will not generalize well, which means that when you use that model to predict on another dataset, or simply on different data from the same dataset, its accuracy will be too low. But if you train your model for only a few epochs (under-fitting), your model will not be able to predict anything with good accuracy, not even on the same training set.


4.2 Baseline and optimization

4.2.1 Baseline network shape

As it is more visual and easy to understand, I have illustrated the shape of the network in the following figure:

FIGURE 4.5: This figure illustrates the shape of the siamese neural network.

The two identical branches are the features subnetworks and the part where their outputs are concatenated is the concatenation subnetwork.

4.2.2 Baseline layer configuration

In this subsection, the details of the layers of the two types of subnetworks, which are the features subnetworks and the concatenation subnetwork, are specified in the following two tables (a Keras sketch of this shape is given after the tables). In the first table, the activation function for all the layers is the ReLU function.


Layer              Neurons   Connections
Input              16.896    10.000
Feature layer 1    10.000    10.000
Feature layer 2    10.000    10.000
Feature layer 3    10.000    5.000
Bottleneck layer   5.000     5.000

TABLE 4.1: Table of details of each of the layers from the features subnetworks.

Layer                 Neurons   Connections   Activation function
Concatenation layer   10.000    5.000         ReLU
Prediction layer      5.000     1             Sigmoid

TABLE 4.2: Table of details of the concatenation subnetwork.
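A minimal Keras sketch of the baseline shape in Tables 4.1 and 4.2; the optimizer shown is a placeholder, the interpretation of the concatenation subnetwork sizes is my own, and the exact project code may differ.

```python
from tensorflow.keras import layers, models

# Shared feature subnetwork (Table 4.1): three 10.000-node ReLU layers and a
# 5.000-node bottleneck, fed with 16.896-dimensional supervectors.
branch = models.Sequential([
    layers.Input(shape=(16896,)),
    layers.Dense(10000, activation="relu"),
    layers.Dense(10000, activation="relu"),
    layers.Dense(10000, activation="relu"),
    layers.Dense(5000, activation="relu", name="bottleneck"),
])

in_a = layers.Input(shape=(16896,))
in_b = layers.Input(shape=(16896,))
emb_a, emb_b = branch(in_a), branch(in_b)          # same object -> shared weights

# Concatenation subnetwork (Table 4.2): concatenated embeddings, a ReLU layer
# and a single sigmoid unit giving the same/different-speaker decision.
merged = layers.Concatenate()([emb_a, emb_b])
hidden = layers.Dense(5000, activation="relu")(merged)
decision = layers.Dense(1, activation="sigmoid")(hidden)

siamese = models.Model(inputs=[in_a, in_b], outputs=decision)
siamese.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```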

4.2.3 Baseline parameter configuration

The parameters I used in order to train the baseline network are a learning rate equal to 0,0001, weight decay and dropout equal to 0, and a batch size of 100.

After training this network and evaluating it, I first tried to tune these parameters in order to obtain the best possible results before starting to experiment with other optimization techniques. After 13 models, the one that performed best, using the same network shape as detailed in 4.2.2, was trained by only changing the learning rate to 0,0005. I tried different values of the other parameters, but none of them performed better than only changing the learning rate.

The evaluation of both of these configurations is explained in chapter 5.

4.2.4 Model pre-training

The next technique I used was to pre-train a model identical to the branch subnetworks of the siamese network, with exactly the same configuration as the tuned version of the siamese. The network was the following:


FIGURE 4.6: This figure illustrates the shape of the pre-trained neural network.

In this network, instead of using the binary cross entropy loss function, I used the categorical cross entropy loss function, because the output was a vector of zeros of length equal to the number of speakers, with a one in the position of the corresponding speaker.

After having trained it, I loaded the weights from the first four layers (the three feature layers with 10.000 nodes and the bottleneck layer with 5.000 nodes) into each of the two subnetworks of the siamese network, and started training a siamese model like the one in figure 4.5 but starting with the weights already pre-trained. This gave the accuracy of the siamese neural network a 13,6% improvement over the tuned version. In the results chapter I also give these results in terms of EER.
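A sketch of the pre-training and weight-transfer step, reusing the hypothetical `branch` model from the sketch in section 4.2.2; the optimizer and the layer matching are illustrative, not the project's exact code.

```python
from tensorflow.keras import layers, models

# Pre-training classifier (Figure 4.6): the same feature layers plus a softmax
# over the training speakers, trained with categorical cross entropy.
n_speakers = 1372
pretrain_model = models.Sequential([
    layers.Input(shape=(16896,)),
    layers.Dense(10000, activation="relu"),
    layers.Dense(10000, activation="relu"),
    layers.Dense(10000, activation="relu"),
    layers.Dense(5000, activation="relu", name="bottleneck"),
    layers.Dense(n_speakers, activation="softmax"),
])
pretrain_model.compile(optimizer="adam", loss="categorical_crossentropy")
# ... fit pretrain_model on (supervector, one-hot speaker index) pairs ...

# Copy the four trained feature layers into the shared siamese branch.
src = [l for l in pretrain_model.layers if isinstance(l, layers.Dense)][:4]
dst = [l for l in branch.layers if isinstance(l, layers.Dense)]
for d, s in zip(dst, src):
    d.set_weights(s.get_weights())
```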

4.2.5 Embeddings extraction

For the following subsections, I have used different vectors to evaluate the performance of the network. These vectors are the embeddings extracted just at the output of the bottleneck layer of one of the two branches of the siamese neural network, as illustrated in the following figure:


FIGURE 4.7: This figure illustrates the place from where the siamese embeddings have been extracted, both from the tuned siamese and from the siamese with pre-trained weights.

Embeddings have also been extracted from the output of the bottleneck layer of the pre-trained network, as illustrated in the following figure:

FIGURE 4.8: This figure illustrates the place from where the embeddings have been extracted on the pre-trained network.

The embeddings extracted have a length of 5.000 and only contain information about the speaker.
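A sketch of how the bottleneck embeddings can be extracted with a Keras sub-model, again reusing the hypothetical `branch` model from the earlier sketch:

```python
from tensorflow.keras import models

# Cut the branch at the bottleneck output so that predict() returns the
# 5.000-dimensional embedding of each input supervector.
embedder = models.Model(inputs=branch.inputs,
                        outputs=branch.get_layer("bottleneck").output)
# embeddings = embedder.predict(supervectors)      # shape: (n_utterances, 5000)
```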


4.2.6 PLDA scoring from embeddings

Using the embeddings extracted as illustrated in figures 4.7 and 4.8, I trained G-PLDA models (with the code explained in section 3.2.4) in order to evaluate the performance of the network. For the PLDA training and testing, I had to reduce the dimension of the embeddings (5.000) because their size caused different errors during the process. To reduce them, I used Principal Component Analysis (PCA), whose configuration was (a PCA sketch is given after this list):

• For embeddings of the siamese with pre-trained weights loaded: Using PCA, the best result was obtained by reducing the embedding dimension from 5.000 to 200, and with 25 iterations of the PLDA.

• For embeddings of the pre-trained model: The best result was obtained by reducing the embedding dimension from 5.000 to 150, and with 20 iterations of the PLDA.
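A minimal scikit-learn sketch of the dimensionality reduction mentioned in the list above, with placeholder data; the 200-component case corresponds to the siamese embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train_emb = rng.standard_normal((1000, 5000))      # placeholder 5.000-dim embeddings
test_emb = rng.standard_normal((100, 5000))

pca = PCA(n_components=200).fit(train_emb)         # learn the projection on training data
train_red, test_red = pca.transform(train_emb), pca.transform(test_emb)
print(train_red.shape, test_red.shape)             # (1000, 200) (100, 200)
```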

4.2.7 Shape optimization and parameter tuning

After having applied all the optimization techniques above, I decided to change the shape of the network in order to try to get even better results than the ones obtained with the previous siamese network configuration.

The new shape of the network is illustrated in the following figure:

FIGURE 4.9: This figure illustrates the new shape of the siamese neural network.


The details of the layers for this new configuration, as well as the parameter configuration, are detailed in the tables below. In table 4.3, the activation function is the ReLU function in all layers.

Layer              Neurons   Connections
Input              16.896    5.000
Feature layer 1    5.000     5.000
Feature layer 2    5.000     5.000
Feature layer 3    5.000     1.000
Bottleneck layer   1.000     1.000

TABLE 4.3: Table of details of each of the layers from one of the two subnetworks.

Layer                 Neurons   Connections   Activation function
Concatenation layer   2.000     1.000         ReLU
Prediction layer      1.000     1             Sigmoid

TABLE 4.4: Table of details of the concatenation subnetwork.

And the parameter configuration that gave the best results among all the experiments with this new configuration, also changing the number of training and validation steps per epoch to 400 and 100 respectively, was the following:

Learning rate   Weight decay   Dropout   Batch size
0,001           0,0001         0,075     50

TABLE 4.5: Table of the parameter configuration of the second siamese neural network.

With this new shape and configuration, I applied all the optimization techniques to this model, including pre-training, embedding extraction both from the pre-trained model and from the siamese with the weights loaded, and PLDA scoring with those embeddings. The results of all the experiments are detailed and commented in the following chapter.


Chapter 5

Results

In this chapter I cover the results of the two parts of this project. Those results are given in terms of Equal Error Rate (EER) and, some of them, in terms of test dataset accuracy. The EER is the operating point of a biometric security system where the False Acceptance Rate (FAR) and the False Rejection Rate (FRR) curves meet.

FAR indicates the percentage of times that an impostor has been identified as the true speaker, and is computed as follows:

FAR = Number of false acceptances / Total number of identification attempts

FRR indicates the percentage of times that a true speaker has been identified as an impostor, and is computed as follows:

FRR = Number of false rejections / Total number of identification attempts

In other words, the EER is the point where the FAR and the FRR are equal, so the lower the EER is, the better your system is.

For the testing phase, I have two files containing the labels of true speakers (clients) and impostors respectively. The clients' file has the labels for 708 speech files, and the impostors' file has the labels for 29665 speech files. I have another file with the labels for 2354 speech files corresponding to the model trials, which are the trials against which the client or impostor files are compared in order to get a score. These scores are saved in two files, one for the clients and the other for the impostors, which are then used to perform the evaluation.
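As an illustration, the EER can be computed from the client and impostor score files along the following lines; the project's evaluation system was written in Matlab, so this numpy sketch only mirrors the logic.

```python
import numpy as np

def compute_eer(client_scores, impostor_scores):
    """Threshold sweep: returns the point where FRR and FAR (approximately) meet."""
    clients = np.sort(client_scores)
    impostors = np.sort(impostor_scores)
    thresholds = np.concatenate([clients, impostors])
    # FRR: fraction of client scores below the threshold (false rejections)
    frr = np.searchsorted(clients, thresholds, side="left") / len(clients)
    # FAR: fraction of impostor scores at or above the threshold (false acceptances)
    far = 1.0 - np.searchsorted(impostors, thresholds, side="left") / len(impostors)
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0

# toy usage: client scores drawn higher than impostor scores
rng = np.random.default_rng(0)
print(compute_eer(rng.normal(2.0, 1.0, 708), rng.normal(0.0, 1.0, 29665)))
```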


5.1 PLDA scoring techniques

The ALIZÈ toolkit, which has two different scoring techniques available, has been used to evaluate the quality of the i-vectors extracted. The first scoring method is a simple cosine distance, with which I got an EER = 6,0472%. The second scoring method available in the toolkit is G-PLDA, which requires a long training of its matrices in order to perform the best possible scoring. My result was EER = 3,8567%.

For the fast variational Bayes algorithm part, several experiments have been performed (two with G-PLDA and two with HT-PLDA). In order to present the results clearly, I enumerate them in the following list:

• First G-PLDA experiment: The model has been trained, as expected by the algorithm, after previously subtracting the mean of the i-vectors and length normalizing them. This gave an EER = 3,8104%.

• Second G-PLDA experiment: This model has also been trained subtracting the mean of the i-vectors, but without length normalizing them. The result was EER = 6,5921%, as expected, because G-PLDA does not work well with vectors without length normalization.

• First HT-PLDA experiment: This model has been trained with i-vectors with zero mean and without length normalization. Also, the F and W matrices have been initialized as the identity matrix. The result for this experiment was EER = 4,0446%.

• Second HT-PLDA experiment: This model has been trained exactly the same way as the previous one, with the exception of the initialization of the F and W matrices. This time, these matrices have been initialized with the ones that I trained in the first G-PLDA experiment. The result was EER = 4,2054%.

As the results in the previous list show, G-PLDA scoring needs length normalized vectors to perform optimally. A G-PLDA system with zero-mean and length normalized vectors gives the best accuracy out of all the experiments in this section. Very close to this accuracy is the HT-PLDA model with its F and W matrices initialized to the identity and, taking into account that this accuracy is achieved with a speed similar to G-PLDA and without the need to length normalize the vectors, it is an option to take into account when working with large vectors and datasets.


5.2 Siamese Neural Network

The first three experiments with the first siamese neural network configuration are the baseline training, the training of the network with tuned parameters, and the training of the tuned network with previously trained weights. The baseline result with which I started working was EER = 31,51% (test accuracy of 68,79%). After this result, and by only changing the value of the learning rate, I managed to get a model with an EER = 28,05% (test accuracy of 72,93%). Then, I pre-trained a network with exactly the same shape as one of the two branches of the siamese network, from the input to the bottleneck layer, loaded the weights into both branches of a siamese model and trained that model. The result of this experiment was the best of the whole siamese networks part: EER = 13,73% (test accuracy of 86,53%).

The following figure illustrates the DET curves of the three experiments described in the previous paragraph:

FIGURE 5.1: DET curves of the baseline siamese model, the tuned siamese model and the tuned siamese with pre-trained weights.

Then, I extracted the embeddings at the output of the bottleneck layer of the two last models trained (the model whose weights I used to initialize the last siamese model, and the siamese model with the weights pre-loaded), and used them to train two G-PLDA models. The model trained with the embeddings from the pre-training model gave an EER = 16,78%, and the one trained with the embeddings from the model with pre-trained weights gave an EER = 15,23%.


The following graphs illustrate the DET curves of the experiments described in the previous paragraph:

FIGURE 5.2: DET curves of, from left to right respectively, G-PLDA of the embeddings extracted from the pre-trained network and G-PLDA of the embeddings extracted from the siamese model with pre-trained weights.

As can be clearly seen in the previous figure, the performance of the two PLDA systems using the two different embeddings I extracted is very similar. Taking into account that one of the embedding types requires much less effort to obtain, this shows that it is not worth using the embeddings of the whole siamese system, because it improves the result by about 1%, but at the cost of having to train a siamese system, which can take days.

The following results are those corresponding to the second configuration detailed in section 4.2.7. First, the tuned siamese model with a reduced number of nodes per layer gave an EER = 24,41% (test accuracy of 75,56%). Then I used the pre-training technique on this model, obtaining an EER = 14,2% (test accuracy of 86,31%).

The following figure illustrates the DET curves of the experiments described in the previous paragraph:


FIGURE 5.3: DET curves of, from left to right respectively, the siamese model with the second configuration and the siamese model with pre-trained weights and also the second configuration.

The results corresponding to the evaluation using the embeddings extracted at the output of the bottleneck layer with the new network configuration and with PLDA were not good enough to detail here, because the EER was higher than 25% in all the G-PLDA models I trained.


Chapter 6

Impact

In the degree I have studied (telecommunications systems engineering), very few things about sustainability and impact have been explained to us. More specifically, my only contact with these important concepts was in a short seminar called Tecnologies per al desenvolupament sostenible, which can be translated as technologies for sustainable development, where the teacher and the speakers tried to awaken a sense of responsibility in us. That seminar completely changed the way I think, and brought me to perform this brief analysis of the impact this project could have on our society.

6.1 Environmental

I consider the environmental impact of this project to be minimal, due to the fact that the only material used has been my two personal computers, which were completely necessary. The energy consumed has only been that consumed by both computers (never on at the same time; I used the tower at home and the laptop everywhere else) and could only have been improved by buying a more energy-efficient computer, which was not an option.

Regarding the public transport used to get to the meetings, nothing could have been improved by me, because even if I had not used it, it would not have stopped working.

The electrical consumption of the Calcula server is also something I cannot control, because even when I do not use it, it continues working.

In general, my conclusion is that I am quite satisfied with the environmental impact of my project, because it is very low. Of course it can always be improved further, but I did my best to keep it as low as possible.


6.2 Social

Socially speaking, and due to the fact that this is a project aimed at research and not at commercial purposes, the impact of this project can only be considered in terms of other researchers, who can use the results of the tests performed here for their own work.

Personally speaking, this project has opened some doors for me that I was not considering before. The deep learning world is very big and powerful, and all the concepts I learned about it and about speaker recognition will be very useful for me in the future.


Chapter 7

Conclusions and future work

In the first part of this project, I extracted my own i-vectors and submitted them to some PLDA techniques in order to see how they performed. The best result obtained was with the G-PLDA using the fast variational Bayes algorithm (EER = 3,8104%), which was better than both the G-PLDA of the ALIZÈ toolkit and the HT-PLDA of the fast variational Bayes algorithm (EER = 4,0446%), but at the cost of having to length normalize the vectors. If I were designing a system in which the vectors used were very large, VBHT-PLDA would definitely be an option, because the speed of this algorithm is approximately the same as that of the traditional G-PLDA but without the need to apply length normalization.

In the second part of this project, I performed some siamese neural network experiments using the supervectors of the same speech files I used to extract the i-vectors. Despite the lack of time to do an exhaustive optimization, I managed to improve a system with an EER = 31,51% to a system with an EER = 13,73% by only applying pre-training. I also tried to get better results with different strategies, but none of them obtained a better EER than 13,73%:

• The best G-PLDA model using the embeddings at the output of the bottleneck layer gave an EER = 15,23%, which, considering the extra time required for extracting the embeddings and training the PLDA model, I do not consider worthwhile.

• Then, having reduced the dimensions of the layers, I repeated all the experiments. At the beginning I thought that this could produce better results, because the EER of the siamese model without pre-training improved from 28,05% to 24,41%, but after applying the pre-training technique to this new and better model, it gave an EER of 14,2%, which is very close to the best result I got but still not better.

• Last, I extracted the embeddings of this last model, but the results were far worse, so I did not continue down this path.


Even though I managed to improve the baseline EER by 17,78 points, I consider that 13,73% is not a good enough result compared to other systems in the state of the art, such as G-PLDA applied to i-vectors with length normalization (EER = 3,8104%) as implemented in the first part of this project. So, for a future project, I would start by optimizing much further the siamese model without any techniques applied, changing its parameters, its number of layers and the dimensions of its layers. Then, applying optimization techniques such as pre-training would probably give results similar to the state of the art. Also, focusing on getting a good PLDA model to evaluate the siamese neural network could be an interesting path to follow, taking into account the computational power and robustness of this technique.

Overall, I consider that this project has accomplished all the goals stated in section 1.2. For the first part, I extracted i-vectors that I later evaluated with different PLDA systems, getting results of the order of the current state of the art. For the second part, I have been able to understand siamese neural networks, where they are worth applying and why, and I performed several experiments, being able to considerably improve the baseline model.


Bibliography

[1] B. Ma, A. Larcher, K. A. Lee and H. Li. "Text-dependent speaker verification: Classifiers, databases and RSR2015." In: Speech Communication 60 (2014), pp. 56–77.

[2] D. Garcia-Romero, D. Snyder, A. Silnova, N. Brümmer and L. Burget. "Fast variational Bayes for heavy-tailed PLDA applied to i-vectors and x-vectors." In: Interspeech 2018 (2018). URL: https://arxiv.org/abs/1803.09153.

[3] D. Povey, D. Snyder, D. Garcia-Romero and S. Khudanpur. "Deep Neural Network Embeddings for Text-Independent Speaker Verification." In: Interspeech 2017 (2017). URL: https://www.isca-speech.org/archive/Interspeech_2017/pdfs/0620.PDF.

[4] O. Ghahabi and J. Hernando. "Restricted Boltzmann machines for vector representation of speech in speaker recognition." In: Computer Speech and Language (2018), p. 27. URL: https://www.sciencedirect.com/science/article/pii/S0885230816302923.

[5] Patrick Kenny. "Bayesian Speaker Verification with Heavy-Tailed Priors." In: Odyssey: The Speaker and Language Recognition Workshop (2010). URL: https://www.crim.ca/perso/patrick.kenny/kenny_Odyssey2010.pdf.

[6] T. Kinnunen and H. Li. "An Overview of Text-Independent Speaker Recognition: from Features to Supervectors." In: Speech Communication (2009). DOI: 10.1016/j.specom.2009.08.009.

[7] Skanda Koppula. "Energy-Efficient Speaker Identification with Low-Precision Networks." In: (2016). URL: https://groups.csail.mit.edu/sls/publications/2018/SkandaKoppula_MEng_Thesis.pdf.

[8] J. Glass, D. Reynolds, N. Dehak, R. Dehak and P. Kenny. "Cosine Similarity Scoring without Score Normalization Techniques." In: (2010). URL: https://groups.csail.mit.edu/sls/publications/2010/Dehak_Odyssey.pdf.

[9] R. Dehak, P. Dumouchel, N. Dehak, P. J. Kenny and P. Ouellet. "Front End Factor Analysis for Speaker Verification." In: IEEE Transactions on Audio, Speech, and Language Processing 19.4 (2011). URL: http://habla.dc.uba.ar/gravano/ith-2014/presentaciones/Dehak_et_al_2010.pdf.

[10] NIST. "The NIST year 2010 speaker recognition evaluation plan." In: NIST (2010). URL: https://www.nist.gov/itl/iad/mig/speaker-recognition-evaluation-2010.


[11] S. Prince and J. Elder. "Probabilistic Linear Discriminant Analysis for Inferences About Identity." In: IEEE International Conference on Computer Vision (2007).

[12] Douglas A. Reynolds and Richard C. Rose. "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models." In: IEEE Transactions on Speech and Audio Processing 3.1 (1995), pp. 72–83.

[13] "Toolkit for i-vector extraction." In: (). URL: http://alize.univ-avignon.fr/.

[14] M. Hrúz, Z. Zajíc and L. Müller. "Speaker Diarization Using Convolutional Neural Network for Statistics Accumulation Refinement." In: Interspeech 2017 (2017).