
EFFECTIVE ACOUSTIC MODELING FOR ROBUST

SPEAKER RECOGNITION

by

Taufiq Hasan Al Banna

APPROVED BY SUPERVISORY COMMITTEE:

Dr. John H. L. Hansen, Chair

Dr. Carlos Busso

Dr. Hlaing Minn

Dr. P. K. Rajasekaran


Copyright © 2013

Taufiq Hasan Al Banna

All rights reserved


Dedicated to my daughter Maryam


EFFECTIVE ACOUSTIC MODELING FOR ROBUST

SPEAKER RECOGNITION

by

TAUFIQ HASAN AL BANNA, BS, MS

DISSERTATION

Presented to the Faculty of

The University of Texas at Dallas

in Partial Fulfillment

of the Requirements

for the Degree of

DOCTOR OF PHILOSOPHY IN

ELECTRICAL ENGINEERING

THE UNIVERSITY OF TEXAS AT DALLAS

December 2013


ACKNOWLEDGMENTS

First and foremost, I thank Almighty God, who has created me and blessed me with everything that made this research possible. Second, I thank my parents, who always encouraged education, learning, and creativity. I also want to thank my wife, who has been supportive throughout my PhD, coping with my random work schedule, absorbing my frustrations, and tolerating my stress during deadlines.

I am extremely fortunate and grateful to have Dr. John Hansen as my PhD advisor. His advice and guidance pushed me far beyond my capabilities and encouraged me to aim for the highest achievements. His supervision struck a delicate balance between instructive guidelines and complete freedom that enabled me to be focused, creative, independent, and productive. Among other faculty members, I would like to thank the late Dr. Philipos Loizou, whom I greatly admired for his research and teaching. I would also like to thank my committee members, Dr. Rajasekaran, Dr. Minn, and Dr. Busso, for their advice and suggestions on my research.

I am immensely grateful to my colleague and friend Dr. Hynek Boril. In addition to working together on a number of projects in different areas, he has been very encouraging and supportive throughout my time at UT Dallas. I would also like to thank my colleagues Dr. Abhijeet Sangwan, Gang Liu, Seyed Omid Sadjadi, and Keith Godin for their support on various occasions. I would especially like to thank our alumnus Yun Lei, who developed many software components that were useful for my research. Finally, I would like to thank the sponsors who provided financial support during my PhD.

July 2013


PREFACE

This dissertation was produced in accordance with guidelines which permit the inclusion as part of the dissertation of the text of an original paper or papers submitted for publication. The dissertation must still conform to all other requirements explained in the “Guide for the Preparation of Master’s Theses and Doctoral Dissertations at The University of Texas at Dallas.” It must include a comprehensive abstract, a full introduction and literature review, and a final overall conclusion. Additional material (procedural and design data as well as descriptions of equipment) must be provided in sufficient detail to allow a clear and precise judgment to be made of the importance and originality of the research reported.

It is acceptable for this dissertation to include as chapters authentic copies of papers already published, provided these meet type size, margin, and legibility requirements. In such cases, connecting texts which provide logical bridges between different manuscripts are mandatory. Where the student is not the sole author of a manuscript, the student is required to make an explicit statement in the introductory material to that manuscript describing the student’s contribution to the work and acknowledging the contribution of the other author(s). The signatures of the Supervising Committee which precede all other material in the dissertation attest to the accuracy of this statement.


EFFECTIVE ACOUSTIC MODELING FOR ROBUST

SPEAKER RECOGNITION

Publication No.

Taufiq Hasan Al Banna, PhD

The University of Texas at Dallas, 2013

Supervising Professor: Dr. John H. L. Hansen

Robustness to mismatched train/test conditions is the biggest challenge facing the speaker recognition community today, with transmission channel and environmental noise degradation being the most prominent factors. State-of-the-art speaker recognition methods aim to mitigate these factors by effectively modeling speech in multiple recording conditions, so that the system can learn to distinguish between inter-speaker and intra-speaker variability. The increasing demand for, and availability of, large development corpora introduces difficulties in effective data utilization and computationally efficient modeling. Traditional compensation strategies operate on high-dimensional utterance features, known as supervectors, which are obtained from the acoustic modeling of short-time features, while feature compensation is performed during front-end processing. Motivated by the covariance structure of conventional acoustic features, we envision that feature normalization and compensation can be integrated into the acoustic modeling. In this dissertation, we investigate the following fundamental research challenges: (i) analysis of data requirements for effective and efficient background model training, (ii) introduction of latent factor analysis modeling of acoustic features,


(iii) integration of channel compensation strategies within mixture models, and (iv) development of noise-robust background models using factor analysis. The effectiveness of the proposed solutions is demonstrated in various noisy and channel-degraded conditions using recent evaluation datasets released by the National Institute of Standards and Technology (NIST). These research accomplishments represent an important step toward improving speaker recognition robustness in diverse acoustic conditions.


TABLE OF CONTENTS

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

PREFACE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii

LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx

CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Motivation of the work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Dissertation contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2.1 Effective Universal Background Model (UBM) construction . . . . . . 4

1.2.2 Acoustic Factor Analysis (AFA) . . . . . . . . . . . . . . . . . . . . . 4

1.2.3 Mixture-dependent feature transforms . . . . . . . . . . . . . . . . . 5

1.2.4 Maximum Likelihood based Acoustic Factor Analysis (ML-AFA) . . . 5

1.3 Organization of this dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 5

CHAPTER 2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1 Speaker recognition fundamentals . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Human speech production mechanism . . . . . . . . . . . . . . . . . . . . . . 9

2.3 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3.1 Properties of ideal features . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3.2 Mel-Frequency Cepstral Coefficients (MFCC) . . . . . . . . . . . . . 13

2.3.3 Voice Activity Detection (VAD) . . . . . . . . . . . . . . . . . . . . . 14

2.3.4 Feature normalization . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.4 Speaker modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.5 Performance evaluation with standardized datasets . . . . . . . . . . . . . . 17

2.5.1 Types of errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18


2.5.2 Equal Error Rate (EER) . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.5.3 Detection Cost Function (DCF) . . . . . . . . . . . . . . . . . . . . . 20

2.5.4 Detection Error Trade-off (DET) curve . . . . . . . . . . . . . . . . . 22

CHAPTER 3 ROBUST SPEAKER MODELING . . . . . . . . . . . . . . . . . . . 25

3.1 Vector Quantization (VQ) based methods . . . . . . . . . . . . . . . . . . . 26

3.2 Gaussian Mixture Model (GMM) based method . . . . . . . . . . . . . . . . 27

3.2.1 GMM formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2.2 GMM training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.3 Adapted GMMs: The GMM-UBM speaker verification system . . . . . . . . 30

3.3.1 The likelihood ratio test . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.3.2 Maximum A Posteriori (MAP) adaptation of UBM . . . . . . . . . . 32

3.3.3 The GMM supervectors . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.3.4 GMM supervector Support Vector Machine (GMM-SVM) . . . . . . . 34

3.3.5 Support Vector Machines (SVM) . . . . . . . . . . . . . . . . . . . . 36

3.4 Factor analysis of the GMM supervectors . . . . . . . . . . . . . . . . . . . . 37

3.4.1 Linear distortion model . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.4.2 Linear Gaussian models . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.4.3 Classical MAP adaptation . . . . . . . . . . . . . . . . . . . . . . . . 40

3.4.4 Eigenvoice adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.4.5 Eigenchannel adaptation . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.4.6 Joint Factor Analysis (JFA) . . . . . . . . . . . . . . . . . . . . . . . 42

3.4.7 The i-Vector approach . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.4.8 Channel compensation in the i-Vector domain . . . . . . . . . . . . . 46

3.4.9 Speaker verification using i-Vectors . . . . . . . . . . . . . . . . . . . 49

3.5 Research progress time-line . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

CHAPTER 4 EFFECTIVE UNIVERSAL BACKGROUND MODEL TRAINING . 53

4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.1.1 Parameters of the UBM . . . . . . . . . . . . . . . . . . . . . . . . . 55

4.2 The ideal UBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55


4.2.1 Data balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.2.2 Data amount . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.3 Baseline system description . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.3.1 Front-end processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.3.2 UBM training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.3.3 UBM database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.3.4 Speaker modeling and scoring . . . . . . . . . . . . . . . . . . . . . . 60

4.4 UBM data: What is a sufficient amount . . . . . . . . . . . . . . . . . . . . 60

4.4.1 Average Weighted Variance (AWV) . . . . . . . . . . . . . . . . . . . 61

4.5 Sub-sampling of feature frames . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.5.1 Feature selection based on Euclidean distance . . . . . . . . . . . . . 67

4.5.2 Performance of sub-sampling schemes . . . . . . . . . . . . . . . . . . 71

4.6 UBM data: Number of unique speakers . . . . . . . . . . . . . . . . . . . . . 72

4.7 Selection of UBM speakers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.7.1 KL divergence based speaker selection (KL-D) . . . . . . . . . . . . . 75

4.7.2 Speaker selection using prototype UBM (P-UBM) . . . . . . . . . . . 75

4.7.3 Results of speaker selection methods . . . . . . . . . . . . . . . . . . 76

4.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

CHAPTER 5 ACOUSTIC FACTOR ANALYSIS . . . . . . . . . . . . . . . . . . . 79

5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.1.1 Analysis on full-covariance UBMs . . . . . . . . . . . . . . . . . . . . 81

5.1.2 Limitations of factor analysis on GMM supervectors . . . . . . . . . . 82

5.1.3 Feature dimensionality reduction . . . . . . . . . . . . . . . . . . . . 83

5.1.4 Implications of the proposed method . . . . . . . . . . . . . . . . . . 85

5.2 Proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.2.1 Acoustic Factor Analysis (AFA) . . . . . . . . . . . . . . . . . . . . . 86

5.2.2 Mixture dependent transformation . . . . . . . . . . . . . . . . . . . 87

5.3 Properties of the AFA transform . . . . . . . . . . . . . . . . . . . . . . . . . 89

5.3.1 Probability distribution of the transformed features . . . . . . . . . . 89


5.3.2 Acoustic feature enhancement . . . . . . . . . . . . . . . . . . . . . . 91

5.3.3 Acoustic feature variance normalization . . . . . . . . . . . . . . . . . 93

5.4 AFA integrated i-Vector system . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.4.1 UBM and AFA model training . . . . . . . . . . . . . . . . . . . . . . 94

5.4.2 UBM transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.4.3 Baum-Welch statistics estimation . . . . . . . . . . . . . . . . . . . . 95

5.4.4 Hyper-parameter estimation . . . . . . . . . . . . . . . . . . . . . . . 96

5.5 System description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.5.1 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5.5.2 UBM training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5.5.3 Total variability modeling . . . . . . . . . . . . . . . . . . . . . . . . 98

5.5.4 Session variability compensation and scoring . . . . . . . . . . . . . . 98

5.6 Evaluation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5.6.1 Performance evaluation of AFA systems . . . . . . . . . . . . . . . . 100

5.6.2 Effect of different AFA dimension . . . . . . . . . . . . . . . . . . . . 102

5.6.3 Effect of UBM variance flooring . . . . . . . . . . . . . . . . . . . . . 103

5.6.4 Performance in microphone conditions . . . . . . . . . . . . . . . . . 107

5.6.5 Fusion of multiple systems . . . . . . . . . . . . . . . . . . . . . . . . 107

5.6.6 Computational advantages . . . . . . . . . . . . . . . . . . . . . . . . 108

5.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

CHAPTER 6 MIXTURE-DEPENDENT FEATURE TRANSFORMATIONS . . . . 111

6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

6.2 Proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

6.2.1 Mixture-wise feature transformation . . . . . . . . . . . . . . . . . . . 113

6.2.2 Integration within the i-Vector system . . . . . . . . . . . . . . . . . 114

6.2.3 Mixture-wise PCA (m-PCA) . . . . . . . . . . . . . . . . . . . . . . . 115

6.2.4 Mixture-wise Whitening (m-WHT) . . . . . . . . . . . . . . . . . . . 115

6.2.5 Mixture-wise LDA (m-LDA) . . . . . . . . . . . . . . . . . . . . . . . 116

6.2.6 Mixture-wise NAP (m-NAP) . . . . . . . . . . . . . . . . . . . . . . . 118


6.2.7 Mixture-wise Nuisance Attribute Elimination (m-NAE) . . . . . . . . 119

6.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

6.3.1 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

6.3.2 UBM training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

6.3.3 Total variability modeling . . . . . . . . . . . . . . . . . . . . . . . . 120

6.3.4 Back-end channel compensation and scoring . . . . . . . . . . . . . . 120

6.4 Evaluation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

6.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

CHAPTER 7 MAXIMUM-LIKELIHOOD ACOUSTIC FACTOR ANALYSIS . . . . 124

7.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

7.1.1 Connection with AFA . . . . . . . . . . . . . . . . . . . . . . . . . . 125

7.1.2 AFA in noisy conditions . . . . . . . . . . . . . . . . . . . . . . . . . 126

7.2 Proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

7.2.1 Maximum Likelihood - Acoustic Factor Analysis (ML-AFA) . . . . . 128

7.2.2 Isotropic residual noise . . . . . . . . . . . . . . . . . . . . . . . . . . 130

7.2.3 Diagonal residual noise . . . . . . . . . . . . . . . . . . . . . . . . . . 131

7.2.4 i-Vector extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

7.2.5 Model interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

7.3 System description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

7.3.1 Voice activity detection . . . . . . . . . . . . . . . . . . . . . . . . . . 135

7.3.2 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

7.3.3 Noisy file generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

7.3.4 UBM and AFA model training . . . . . . . . . . . . . . . . . . . . . . 137

7.3.5 i-Vector extractor training . . . . . . . . . . . . . . . . . . . . . . . . 137

7.3.6 PLDA classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

7.4 Evaluation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

7.4.1 Effect of the modeling method . . . . . . . . . . . . . . . . . . . . . . 139

7.4.2 Variation of acoustic factor dimension . . . . . . . . . . . . . . . . . 141

7.4.3 System fusion and calibration . . . . . . . . . . . . . . . . . . . . . . 143


7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

CHAPTER 8 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

8.1 Dissertation contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

8.1.1 Study on UBM training . . . . . . . . . . . . . . . . . . . . . . . . . 145

8.1.2 Acoustic Factor Analysis (AFA): UBM based linear transforms . . . . 146

8.1.3 Feature domain channel compensation within UBM mixtures . . . . . 147

8.1.4 ML - Acoustic Factor Analysis: An alternative to the UBM . . . . . . 148

8.2 Directions for future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

APPENDIX A CLASSICAL MAP AND THE GMM-UBM APPROACH . . . . . . 151

APPENDIX B EM ALGORITHM FOR AFA WITH UN-CORRELATED NOISE . 154

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

VITA


LIST OF FIGURES

2.1 Block diagram of a basic speaker verification system. . . . . . . . . . . . . . . 8

2.2 Anatomy of the human speech production organs. Figure reprinted from MIT OpenCourseWare material prepared by Joseph S. Perkell [1]. . . . . . . . . . . 10

2.3 Various steps in the MFCC feature extraction procedure from a speech frame. (a) A 200-sample frame representing 25 ms of speech sampled at a rate of 8 kHz, (b) the DFT power spectrum showing the first 101 points, (c) a 24-channel triangular Mel-filterbank, (d) the log filter-bank energy output values from the Mel-filterbank, (e) the 12 static MFCC coefficients obtained by performing a DCT on the filter-bank energy coefficients and retaining the first 12 values. . . 12

2.4 (a) A speech waveform with voice activity decisions overlaid. Values of 1 and 0 indicate speech and non-speech, respectively. (b) Spectrogram plot of the speech waveform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.5 An illustration of target and non-target score distributions and the decision threshold. Areas under the curves in blue and red represent FAR and FRR errors, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.6 A Detection Error Trade-off curve. The points on the curve corresponding to the threshold that yields the Equal Error Rate (EER) and minimum Detection Cost Function (DCF) (as defined in NIST SRE 2008), and the direction of an increasing threshold, are shown. . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.7 (a) An example DET curve without conversion of scores to standard normal deviates. (b) Traditional Receiver Operating Characteristics (ROC) curve. . . 24

3.1 Schematic diagram of a GMM-UBM system for a 4-mixture UBM. The MAP adaptation procedure and supervector formation by concatenating the mean vectors are also illustrated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2 Conceptual illustration of an SVM classifier. Positive (+) and negative (-) examples are labeled; the optimal linear separator and support vectors are shown. . 35

3.3 A graphical representation of 79 utterances spoken by 10 individuals collected from the NIST SRE-2004 corpus. The i-Vector [2] representation is used for each segment. This plot is generated using GUESS, an open-source graph exploration software [3] that can visualize higher dimensional data using distance measures between samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.4 Research progress on speaker recognition: Time-line view of the past 30 years . 52


4.1 (a) UBM training CPU time variation with changing amount of UBM data (hrs). (b) Variation of system EER with total amount of UBM data (hrs). Feature frames are selected uniformly from each utterance. . . . . . . . . . . . . . . . . 63

4.2 (a) Variation of UBM average weighted variance (Σ) with total amount of UBM data (hrs). (b) Scatter plot showing the correlation between EER and average weighted variance (Σ). All data points from Figure 4.1(b) and Figure 4.2(a) are used to generate this scatter plot. . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.3 Conceptual illustration of the feature selection schemes. (Selected frames are shown in dark.) (a) Original utterance spectrogram, (b) LFS, (c) UFS, (d) RFS and (e) IFS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.4 (a) Comparison of the theoretical PDF, its Gaussian approximation, and the actual PDF obtained from a feature-distance histogram. 13-dimensional MFCC coefficients were used and the parameter λ was calculated directly from the data. For this data, λ = 281.6836, µD = 83.9506 and σ²D = 276.07. (b) A PDF of the inter-feature Euclidean distance and the proposed distance threshold (shown for α = 0.1 and 0.2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.5 Variation of (a) AWV (Σ) and (b) system performance with the change in number of UBM speakers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.6 A schematic diagram of the speaker space in the UBM. . . . . . . . . . . . . . 74

4.7 System performance variation with the change in number of UBM speakers selected using different methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.1 Analysis of full covariance matrices of a UBM trained using 60-dimensional MFCC features (20 static+∆+∆∆). (a) A 3-D surface plot of the covariance matrix showing high values on the diagonal and significant off-diagonal values indicating correlation among different feature coefficients. (b) Sorted eigenvalues of the same covariance matrix demonstrating that most of the energy is accounted for in the first few dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.2 Distribution of top posterior probabilities p(g|xn) obtained from a subset of development data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.3 Input SNR [dB] (ξ) vs. Wiener gains. Wiener gain and square-root Wiener gain are shown with a solid (-) and dashed (- -) line, respectively. . . . . . . . . . . 91

5.4 A block diagram of the proposed AFA integrated i-Vector system. The system is shown in two phases: (a) development and (b) evaluation. In the evaluation phase, only the i-Vector extraction procedure is depicted, assuming an arbitrary classifier. For details on the PLDA classifier used, refer to Section 5.5.4. . . . . 99

5.5 Performance comparison between proposed AFA and baseline i-Vector systems with respect to (a) %EER, (b) minDCF'08 and (c) minDCF'10 for different eigenvoice sizes Nev of the PLDA model. Evaluation is performed on NIST SRE 2010 core condition-5 using the extended trials. . . . . . . . . . . . . . . . . . . . . . 101


5.6 Performance comparison of the AFA system for different values of q with respect to % Relative Improvement (RI) in %EER, minDCF'08 and minDCF'10 compared to the corresponding baseline system performance metric. Evaluation is performed on NIST SRE 2010 core condition-5 using the extended trials. The figure clearly reveals that the system performance drastically degrades as the value of q is reduced. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5.7 Performance comparison of baseline, AFA and fusion systems using DET curves. Evaluation is performed by pooling results of the core conditions 1-5 of NIST SRE 2010 extended trials. (i) Baseline i-Vector system using full covariance UBM (Baseline full-cov), (ii) AFA i-Vector system (q = 42), and (iii) equal-weight linear fusion of systems (i) & (ii). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

6.1 Distribution of mixture-wise probabilistic feature counts. The distributions p(NX(g)) and p(Ns(g)) are obtained from 1024 mixture counts for all data, and by computing the same for each of 984 speakers, respectively. . . . . . . . . . . 118

6.2 Performance comparison between proposed, baseline and fusion systems demonstrated using Detection Error Trade-off (DET) curves. . . . . . . . . . . . . . . 123

7.1 Probabilistic graphical model of a Mixture of Factor Analyzers (MFA) model for acoustic features. The box on the right denotes a 'plate' representing a dataset of N independent observations of acoustic features xn. Here, yn are the hidden variables, or acoustic factors, and g indicates the responsible mixture component in the model. The box on the left represents the parameters of the g-th model component out of a total of M mixtures. . . . . . . . . . . . . . . . . . . . . . . 127

7.2 Scatter plot of synthetic 2D Gaussian data with four clusters and trained mixture models. Means are shown as blue points while ellipses depict the covariance matrices. (a) Diagonal covariance GMM, (b) full covariance GMM, (c) mixture of PPCA model, and (d) mixture of factor analyzers (MFA) model showing a single dominant direction (of two dimensions) in each mixture component. . . . 128

7.3 Partial super-covariance matrices and a UBM covariance matrix obtained from a GMM and an AFA model. The super-covariance is estimated using the total variability matrix T. (a) Partial super-covariance matrix of mixture-1 for a full covariance GMM-UBM. (b) Partial super-covariance matrix of mixture-1 for an AFAiso UBM model (q = 42). (c) The full-covariance matrix of the GMM-UBM obtained from mixture-1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132


LIST OF TABLES

2.1 Terminologies in a confusion matrix of a two-class recognizer . . . . . . . . . . 18

3.1 Summary of the linear statistical models used in speaker recognition . . . . . . 45

4.1 Comparison of different UBM training schemes with respect to EER and training CPU time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.2 Comparison of different speaker selection approaches for UBM training with respect to EER and number of speakers. . . . . . . . . . . . . . . . . . . . . . . . 78

5.1 Performance comparison between baseline i-Vector and proposed AFA systems for different values of Nev and q. Evaluation performed on NIST SRE 2010 core condition-5 extended trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5.2 UBM covariance matrix flooring function (vFloor-2) [4] . . . . . . . . . . . . . 104

5.3 Performance comparison between baseline i-Vector and different AFA systems using alternate UBM flooring. Evaluations performed on NIST SRE 2010 core condition-5 extended trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.4 Common evaluation conditions in NIST SRE 2010 . . . . . . . . . . . . . . . . 105

5.5 Performance comparison between baseline i-Vector and different AFA systems. Evaluation performed on NIST SRE 2010 core condition-1 extended trials . . . 106

5.6 Performance comparison between baseline i-Vector and different AFA systems. Evaluation performed on NIST SRE 2010 core condition-2 extended trials . . . 106

5.7 Performance comparison between baseline i-Vector and different AFA systems. Evaluation performed on NIST SRE 2010 core condition-3 extended trials . . . 106

5.8 Performance comparison between baseline i-Vector and different AFA systems. Evaluation performed on NIST SRE 2010 core condition-4 extended trials . . . 106

5.9 Linear equal-weight score fusion performance of baseline i-Vector and proposed systems for NIST SRE 2010 core condition-5 . . . . . . . . . . . . . . . . . . . 109

5.10 Linear equal-weight score fusion performance of baseline i-Vector and proposed systems for NIST SRE 2010 core conditions 1-5 pooled . . . . . . . . . . . . . . 109

6.1 Comparison between baseline i-Vector and proposed systems with respect to %EER, minDCF'08 and minDCF'10 for Nev = 150. Percent relative improvement (%r) and supervector compression ratio (α) are also shown. . . . . . . . . 121

6.2 Linear score fusion of baseline and proposed systems . . . . . . . . . . . . . . . 122


7.1 UBM training list description for NIST SRE 2012. The number of files used in different categories is presented for both genders . . . . . . . . . . . . . . . . . 135

7.2 Common evaluation conditions in NIST SRE 2012 . . . . . . . . . . . . . . . . 139

7.3 Comparison of system performance when the proposed models are used as GMMs vs. AFAs for the i-Vector system. Results are shown for five NIST SRE 2012 common conditions of the extended trials (male) . . . . . . . . . . . . . . . . . 139

7.4 Performance comparison between baseline and the proposed systems on NIST SRE 2012 extended trials condition-1 . . . . . . . . . . . . . . . . . . . . . . . 140

7.5 Performance comparison between baseline and the proposed systems on NIST SRE 2012 extended trials condition-2 . . . . . . . . . . . . . . . . . . . . . . . 140

7.6 Performance comparison between baseline and the proposed systems on NIST SRE 2012 extended trials condition-3 . . . . . . . . . . . . . . . . . . . . . . . 141

7.7 Performance comparison between baseline and the proposed systems on NIST SRE 2012 extended trials condition-4 . . . . . . . . . . . . . . . . . . . . . . . 141

7.8 Performance comparison between baseline and the proposed systems on NIST SRE 2012 extended trials condition-5 . . . . . . . . . . . . . . . . . . . . . . . 142

7.9 Fusion performance of baseline and proposed systems. Absolute and % relative performance is shown for fusion systems . . . . . . . . . . . . . . . . . . . . . . 143


LIST OF ABBREVIATIONS

AFA Acoustic Factor Analysis

AWV Average Weighted Variance

ActDCF Actual Detection Cost Function

CDS Cosine Distance Scoring

CMS Cepstral Mean Subtraction

DCF Detection Cost Function

DCT Discrete Cosine Transformation

DET Detection Error Trade-off

EER Equal Error Rate

FA Factor Analysis

FAR False Alarm/Accept Rate

FRR False Reject Rate

GMM Gaussian Mixture Model

JFA Joint Factor Analysis

LDA Linear Discriminant Analysis

LLR Log-Likelihood Ratio

LPC Linear Predictive Coding

MAP Maximum A Posteriori

MFA Mixture of Factor Analyzers

MFCC Mel-frequency Cepstral Coefficient

ML Maximum Likelihood

MinDCF Minimum Detection Cost Function

NAP Nuisance Attribute Projection


NIST National Institute of Standards and Technology

PCA Principal Component Analysis

PPCA Probabilistic Principal Component Analysis

SRE Speaker Recognition Evaluation

SVM Support Vector Machine

UBM Universal Background Model

VQ Vector Quantization

WCCN Within Class Covariance Normalization


CHAPTER 1

INTRODUCTION

Speaker recognition deals with identifying a person from his/her voice. This type of biometric recognition is important in a number of applications, including security systems, network access control, forensics, and automated customer services. Individuals generally have distinct voices due to their unique vocal tract shape, larynx size, and other physiological properties of the human speech production system. Also, each speaker has their own way of speaking, including their accent, rhythm, intonation style, pronunciation patterns, and choice of language and vocabulary. The most advanced speaker recognition systems today utilize many of these speaker characteristics in different ways, with short-term voiced excitation characteristics being exploited in most scenarios.

As of today, the problem of accurate speaker recognition can be considered solved for ideal recording conditions. It is the mismatch between training and test conditions that represents the most challenging problem faced by speaker recognition researchers. Many different sources of mismatch can be present, including transmission channel differences [5], handset variability [6], background noise [7, 8], room reverberation [9], variability due to cognitive or physical stress [10], aging or health, different levels of vocal effort (e.g., whisper [11, 12], or exposure to noise resulting in the Lombard effect [13]), and spontaneity of speech, to name but a few. Many different compensation strategies have been proposed in the past to reduce the impact of unwanted variability between training and test utterances, while retaining the primary speaker identity information. Some techniques focus on improving robustness within the acoustic features, some aim at devising modeling methods that are less sensitive to nuisance variability, while others operate on the final decision scores.


One of the key driving forces of the research community has been the ongoing NIST Speaker Recognition Evaluations (SRE) of the past decade [14, 15, 16]. These evaluations present well-defined speaker verification tasks (open-set single-speaker recognition) using recordings of a large number of speakers. They pose challenging problems since they generally involve diverse transmission channel, microphone, handset, and background noise variability. Along with the speaker verification task, a specific cost function to be optimized is also defined in each year's evaluation. Throughout this work, these standardized NIST datasets and cost metrics are utilized for performance evaluation of the proposed speaker verification systems.

1.1 Motivation of the work

Channel compensation for speaker recognition can be performed in three main domains: (i) the acoustic feature domain, (ii) the model domain, and (iii) the score domain. At present, most of the effective channel compensation strategies for speaker recognition operate in the model domain. While feature domain compensation is still used, score domain compensation approaches have become largely unnecessary with recent modeling techniques [17, 4]. The continual development of speaker recognition systems can be traced back to the classic Gaussian Mixture Model (GMM) and Universal Background Model (UBM) based approach introduced by Reynolds et al. [18, 19]. In this approach, a GMM-based UBM, which models the speaker-independent characteristics of the features, is shown to be an effective impostor model and also functions as the initial model for the Maximum A Posteriori (MAP) adapted speaker model [19].
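For reference, the basic quantities behind this approach can be stated directly (a standard summary; Chapter 3 develops the details). The UBM is a GMM over acoustic feature vectors, and verification amounts to a log-likelihood ratio test between the MAP-adapted speaker model and the UBM:

p(\mathbf{x} \mid \lambda) = \sum_{g=1}^{M} w_g \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g), \qquad \sum_{g=1}^{M} w_g = 1,

\Lambda(X) = \log p(X \mid \lambda_{\mathrm{speaker}}) - \log p(X \mid \lambda_{\mathrm{UBM}}) \;\gtrless\; \tau,

where X = {x_1, ..., x_N} denotes the feature vectors of a test utterance and τ is the decision threshold.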

Built on the success of the GMM-UBM approach, GMM supervectors (concatenations of GMM mean vectors) were introduced as input features to SVM classifiers [20]. This was among the first approaches in which utterance-level features began to emerge, and a whole new set of mismatch compensation strategies based on linear statistical models started to evolve.


One of the important developments was the Nuisance Attribute Projection (NAP) approach [21], which learned a projection matrix to transform the GMM supervectors into a channel/handset-invariant subspace. The most recent compensation techniques, however, are dominated by factor analysis based methods, including Eigenvoice, Eigenchannel, Joint Factor Analysis (JFA) [22, 23, 17], and the i-Vector approach [2, 4]. Analyzed closely, factor analysis techniques are also based on the principle of effective modeling and utilization of development data. Eigenvoice aims at finding a lower dimensional subspace for the speaker supervector space, whereas Eigenchannel attempts to find a lower dimensional representation of the channel variability; JFA combines both of these models, and finally the i-Vector approach combines the speaker and channel variability in a single subspace model and yields reduced-dimensional utterance features to be used with various classifiers. In subsequent NIST SRE results, it was consistently observed that systems that were able to effectively harvest development data from various channel/handset/noise conditions provided better results [24]. A thorough discussion of the relevant modeling strategies for improved robustness is provided in Chapter 3.
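These linear models can be contrasted compactly in supervector notation (a standard summary of the methods cited above; Table 3.1 and Chapter 3 give the details). With m the speaker- and channel-independent UBM supervector, an utterance supervector s is modeled as

\text{Eigenvoice:} \quad \mathbf{s} = \mathbf{m} + \mathbf{V}\mathbf{y},
\text{Eigenchannel:} \quad \mathbf{s} = \mathbf{m} + \mathbf{U}\mathbf{x},
\text{JFA:} \quad \mathbf{s} = \mathbf{m} + \mathbf{V}\mathbf{y} + \mathbf{U}\mathbf{x} + \mathbf{D}\mathbf{z},
\text{i-Vector:} \quad \mathbf{s} = \mathbf{m} + \mathbf{T}\mathbf{w},

where V, U, and T are low-rank matrices spanning the speaker, channel, and total variability subspaces, respectively, y, x, and z are latent factors, and the i-vector w serves as the reduced-dimensional utterance feature.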

One important aspect of traditional speaker recognition systems is that they discard or de-emphasize the acoustic feature covariances by using a diagonal model. Recently, it was observed that using a full-covariance UBM provides advantages in speaker recognition performance [4]. However, these full-covariance matrices are observed to have a low-rank structure, suggesting that the acoustic features can be modeled so as to lie in lower dimensional subspaces within individual mixture components. Also, it may be worthwhile to consider the full covariance structure of the models in noisy conditions as well. These findings imply that latent variable models, discriminative approaches, and channel compensation can be effectively applied in the acoustic feature domain within the mixture models. To summarize, we identify the following problems in traditional speaker recognition systems:

• Large training data and computational requirements for background model training


• Discarding or de-emphasizing of the acoustic feature covariances during modeling

• Sub-optimal usage of the discriminatory information contained in the acoustic features

• Assumption of stationarity of the environment/channel/noise distortions

1.2 Dissertation contributions

In this work, we propose to address a number of these problems noted above for speaker

recognition systems in several parts. These are described below, and represent the core

contributions from this study.

1.2.1 Effective Universal Background Model (UBM) construction

Virtually all recent speaker recognition systems, including JFA, Eigenchannel, and i-Vector, assume the UBM supervector to be at the center of the acoustic space. Also, the parameters of these systems are not a function of the acoustic features; rather, they utilize the UBM for probabilistic clustering and to obtain the sufficient statistics. Recognizing this central role of the UBM in current speaker recognition systems, we dedicate the first part of the proposed research to analyzing the data requirements for effective UBM training. We first quantify how much data is required for training a UBM in the context of a GMM-UBM system and investigate feature sub-sampling based data reduction strategies for fast training [25]. Next, we experiment with speaker selection strategies for UBM training in an attempt to observe their impact on speaker recognition performance [26, 27].

1.2.2 Acoustic Factor Analysis (AFA)

Observing the covariance structure of conventional acoustic features, we attempt to utilize latent factor analysis models in this domain. We propose the Acoustic Factor Analysis (AFA) framework [28, 29, 30], utilizing a mixture of factor analyzers derived from the UBM that operates in each mixture component through a linear transformation.


These transformations aim at projecting the acoustic features onto a mixture-dependent subspace which is less affected by channel/handset degradations.
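In the standard mixture-of-factor-analyzers form, this amounts to modeling the features within each mixture component g as (a generic sketch with assumed notation; the precise AFA construction is developed in Chapter 5)

\mathbf{x} = \boldsymbol{\mu}_g + \mathbf{A}_g \mathbf{y} + \boldsymbol{\epsilon}, \qquad \mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Psi}_g),

where A_g is a D × q factor loading matrix with q < D, so that the transformed features live in the q-dimensional column space of A_g.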

1.2.3 Mixture-dependent feature transforms

Building on the AFA strategy, we show that discriminative linear projections can also be applied within the acoustic feature space using mixture-dependent transforms [31]. We utilize conventional Linear Discriminant Analysis (LDA) and Nuisance Attribute Projection (NAP) methods for acoustic features using this framework. A new linear transformation is also proposed in this context.

1.2.4 Maximum Likelihood based Acoustic Factor Analysis (ML-AFA)

In noisy conditions, we observe that the full-covariance models and the subsequently derived subspace models become highly dependent on the development datasets and tend to provide sub-optimal performance. Motivated by the AFA strategy, we utilize a mixture of factor analyzers trained directly from the data. This, in essence, is an AFA model trained using the Maximum-Likelihood (ML) framework that replaces the conventional GMM-UBM model [32]. The iterative training strategy discards the noisy directions from the acoustic features in each step and is shown to be robust against different noisy conditions.
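Concretely, the resulting background model takes the standard mixture-of-factor-analyzers density (shown here under the usual MFA assumptions; the exact formulation and its EM updates are given in Chapter 7)

p(\mathbf{x}) = \sum_{g=1}^{M} w_g \, \mathcal{N}\!\left(\mathbf{x}; \boldsymbol{\mu}_g, \mathbf{A}_g \mathbf{A}_g^{T} + \boldsymbol{\Psi}_g\right),

where each component covariance is constrained to a rank-q subspace plus a residual noise term Ψ_g, taken as isotropic or diagonal (cf. Sections 7.2.2 and 7.2.3).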

1.3 Organization of this dissertation

This dissertation is organized as follows. In Chapter 2, we present and summarize the preliminary concepts of automatic speaker recognition technology. In Chapter 3, we provide details of the various robust modeling strategies proposed in the past, in roughly chronological order; this discussion focuses on approaches based on short-term acoustic features and explains the evolution that speaker recognition systems have undergone.


Chapter 4 details the research proposed on efficient and effective UBM training, the first dissertation contribution. The proposed AFA strategy is introduced in Chapter 5, and mixture component dependent transforms are detailed in Chapter 6, representing the second and third contributions, respectively. Finally, the ML-based formulation of AFA in noisy conditions is discussed in Chapter 7, and the dissertation is concluded in Chapter 8.


CHAPTER 2

BACKGROUND

This chapter provides a brief introduction to the problem of speaker recognition. The methods and terminology are introduced from a high-level perspective, and the blocks of a conventional speaker recognition system most relevant to the research presented here are described. Since this dissertation proposes alternate modeling schemes, discussion of the various modeling strategies found in the literature is left out of this chapter, with a thorough treatment included in Chapter 3.

2.1 Speaker recognition fundamentals

The task of speaker recognition is generally described as identifying a person by their voice. Speaker recognition can be classified into two broad groups: a) speaker identification and b) speaker verification. In speaker identification, the task is to identify a speaker among a set of speakers known to the automated system. This task can again be of two types: in-set and out-of-set. In the in-set scenario, the automated system assumes that the test speaker (the speaker asking for authentication) must be among the known speakers, while in the out-of-set scenario, it may decide that the test speaker is none of the known speakers (i.e., he/she does not sound similar to any of the known speakers). Speaker verification, on the other hand, deals with the scenario where the system is provided with a test utterance and an identity claim. The task is then simply to make a binary decision: whether or not the test utterance was indeed spoken by the claimed speaker. Different terms are used in the literature for speaker verification, most of the form speaker/voice/talker authentication/verification.


Figure 2.1. Block diagram of a basic speaker verification system.

Again, the speaker recognition task can be classified as either text-dependent or text-independent. In the text-dependent scenario, the user is required to utter a specific sentence for identification/verification, whereas in text-independent systems this is not the case. In this work, we deal exclusively with text-independent speaker verification. For a comprehensive tutorial review of speaker recognition, the reader may refer to [33, 34, 35, 36, 37, 38, 39, 40, 41, 42]. Also, we concentrate on the verification task since we mostly utilize the standardized NIST SRE datasets [14, 15, 16], which provide predefined tasks of this nature.

A simple block diagram representation of a speaker verification system is shown in Figure 2.1. Predefined feature parameters, designed to capture the idiosyncratic characteristics of a person's speech in mathematical form, are first extracted from the audio recordings. The features obtained from an enrollment speaker are used to build/train mathematical models that summarize the speaker-dependent properties. For an unknown test segment, the same features are then extracted and compared against the model of the enrollment/claimed speaker. The models are designed so that such a comparison provides a likelihood score (a scalar value) indicating whether the two utterances are obtained from the same speaker. If this likelihood score is higher/lower than a predefined threshold, the system accepts/rejects the test speaker.
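A minimal sketch of this accept/reject decision follows (illustrative only; the model objects and their log_likelihood interface are placeholders, and the actual scoring methods are developed in Chapter 3):

import numpy as np

def verify(test_features, speaker_model, background_model, threshold):
    # test_features: (N, D) array of per-frame acoustic features.
    # speaker_model / background_model: placeholder objects exposing a
    # per-frame log_likelihood(x) method (e.g., an enrolled speaker model
    # and a speaker-independent background model).

    # Average frame-level log-likelihoods under each model.
    ll_speaker = np.mean([speaker_model.log_likelihood(x) for x in test_features])
    ll_background = np.mean([background_model.log_likelihood(x) for x in test_features])

    # Likelihood-ratio score compared against the predefined threshold.
    score = ll_speaker - ll_background
    return "accept" if score > threshold else "reject"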


It should be noted that the block diagram of Figure 2.1 is overly simplified. As we will discuss further regarding the standard speaker recognition systems available today, features can be extracted from short-time segments of speech, a relatively longer duration of speech, or the entire utterance. Also, some feature extraction methods may depend on other speech utterances spoken by a diverse speaker population, as well as on the enrollment speaker. In these approaches, the modeling and the feature extraction from utterances sometimes become intertwined. In short, recent techniques make use of the general properties of human speech, observed across many different speech recordings, to make effective speaker verification decisions. This is also intuitive, since we as humans also learn how human speech varies across conditions over time. For example, if we had only ever heard male voices, we would not be as good at distinguishing between female speakers. In the following sections, we briefly discuss the human speech production mechanism, followed by conventional short-time acoustic feature extraction methods.

2.2 Human speech production mechanism

It is helpful to briefly review the human speech production mechanism in order to understand the guiding principles of acoustic feature design. The main components responsible for human speech production are illustrated in the schematic diagram in Figure 2.2. In the speech processing literature, the term vocal tract is widely used; it refers to the entire pathway between the larynx and the lips. The vocal tract of an adult male is approximately 17 cm long [43]. Speech is produced as air flows out from the larynx and is modulated through vocal fold activity and the cavities of the vocal tract, producing different phonetic sounds. The frequency content of the acoustic signal is modified by the resonance properties of the different cavities along the path. The resulting resonances are known as formants.


Figure 2.2. Anatomy of the human speech production organs. Figure reprinted from MIT OpenCourseWare material prepared by Joseph S. Perkell [1].

The exact location and shape of each organ along the vocal tract contribute to its overall shape, which in turn gives each speaker's voice its unique resonance structure. However, as speech is a non-stationary signal, these unique patterns also vary with time and context. For example, speech waveforms of the same speaker producing different vowel sounds will exhibit completely different patterns, yet those patterns may be quite unique to the speaker for the specific vowels.


Also, it should be noted that speaker identity is secondary, non-linguistic information embedded in the speech signal; thus, it is unlikely that a straightforward parameter measure will uniquely characterize a speaker at all times [35].

Motivated by the above observations on speech production, automatic speaker recognition systems mostly utilize the spectral characteristics of the speech waveform over short-time segments, aiming to capture the resonance properties that are potentially unique to the speaker. A more elaborate discussion of the human speech production mechanism can be found in [43] and [44].

2.3 Feature extraction

A unique aspect of voice biometrics is the variability in its sample length/duration. In other

popular forms of biometrics, such as face, fingerprint, iris, hand geometry, etc., the input

data/image is of a fixed dimension. Features extracted and organized from these biometric

data can thus be conveniently converted to fixed-size vectors. It is clear that extracting a single fixed-dimensional vector from any biometric pattern would be most convenient

for direct comparisons. However, for speech, due to its time-varying nature and context-

dependency, features based on averaged parameters over an entire speech utterance are not

very effective [45, 46]. Take, for example, speaking rate as a feature: it is quite common for two people to have the same speaking rate, and thus this feature by itself

may not be very useful. Researchers noted early on that a specific speaker’s idiosyncratic

features will be time-varying and context/speech sound dependent [47, 34]. Thus, effective

feature parameters are based on short-time speech characteristics, instead of time averaged

behavior of speech across the utterance, which is more susceptible to mimicry. However,

high-level and long-term features such as dialect, accent, speaking style/rate, prosody, etc.

are also useful and can be beneficial when used together with “low-level” acoustic features

[48, 49].

Figure 2.3. Various steps in the MFCC feature extraction procedure for a speech frame. (a) A 200-sample frame representing 25 ms of speech sampled at a rate of 8 kHz; (b) the DFT power spectrum, showing the first 101 points; (c) a 24-channel triangular Mel-filterbank; (d) the log filter-bank energy output values from the Mel-filterbank; (e) the 12 static MFCC coefficients obtained by performing a DCT on the filter-bank energy coefficients and retaining the first 12 values.


2.3.1 Properties of ideal features

In [47], Wolf outlined the ideal characteristics of speech features for speaker recognition.

Based on that discussion, we define an ideal feature parameter as one that:

Property 1: It is found naturally and frequently in human speech,

Property 2: It can be easily estimated,

Property 3: It varies among speakers, but is consistent for each individual speaker,

Property 4: It does not change with a speaker’s age, health, emotional state and is indepen-

dent of speech content, language, dialect and accent,

Property 5: It is not affected by transmission channel/handset, room acoustics or background

noise,

Property 6: It is not modifiable by the subject's conscious effort.

Clearly, meeting all these conditions simultaneously is not practical. However, they can

be considered as idealistic design goals for speech short-time features. In the following

section, we discuss the most popular short-time spectral features, known as MFCCs.

2.3.2 Mel-Frequency Cepstral Coefficients (MFCC)

The most popular short-time acoustic features are the Mel-Frequency Cepstral Coefficients

(MFCC) [50] and Linear Predictive Coding (LPC) [51] based features. For a review on

different acoustic features for speaker recognition, the reader may refer to [39, 41]. To

obtain the MFCC features from an audio recording, first the audio samples are divided

into short overlapping segments. These segments are typically 20–25 ms in duration. A

typical 25 ms speech signal frame is shown in Figure 2.3(a). The signal obtained in these

segments/frames is then multiplied by a tapered window function (e.g., Hamming, Hanning,


etc.) and the Fourier power-spectrum is obtained. Next, the logarithm of the spectrum

is computed and a non-linearly spaced Mel-scale filter-bank analysis is performed. The latter process mimics the way the human auditory system decomposes audio signals. The

filter-bank analysis produces the spectrum energy in each channel (also known as the filter-

bank energy coefficients), representing different frequency bands. A typical 24-channel filter-

bank and its output is shown in Figure 2.3(c) and (d), respectively. As is evident here,

the filter-bank is designed so that it is more sensitive to frequency variations in the lower-

end of the spectrum, which is similar to the human auditory system [50]. Finally, the

MFCC coefficients are obtained by performing a Discrete Cosine Transform (DCT) on the

filter-bank energy parameters and retaining a number of leading coefficients. DCT has

two important properties: a) it compresses the energy of a signal to a few coefficients,

and b) it de-correlates the coefficients. For these reasons, removing some dimensions using

the DCT improves modeling efficiency and reduces some nuisance components. Also, the

de-correlation property of DCT helps the models that assume feature coefficients are un-

correlated (i.e., acoustic models with covariance matrices can now assume a diagonal matrix

structure).

In summary, the following sequence of operations: power spectrum, logarithm and DCT,

produces the well known cepstral representation of a signal [52]. Figure 2.3(e) shows the

static MFCC parameters, retaining the first 12 coefficients after the DCT. Generally, veloc-

ity and acceleration parameters, computed across multiple frames of speech and representing the first- and second-order derivatives, are appended to the static MFCC coefficients. These

parameters (known as deltas and double deltas, respectively) represent the dynamic proper-

ties of the short-time feature coefficients.
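To make this pipeline concrete, the following is a minimal NumPy/SciPy sketch of the static MFCC computation for a single frame, mirroring the configuration of Figure 2.3 (8 kHz sampling, a 200-sample frame, a 24-channel filterbank, 12 retained coefficients). The helper names and the 256-point FFT size are illustrative assumptions rather than a reference implementation.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    # Standard Mel-scale mapping used when placing MFCC filterbank centers.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=256, fs=8000):
    # Triangular filters whose centers are equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def mfcc_frame(frame, fs=8000, n_fft=256, n_filters=24, n_ceps=12):
    # (1) taper the frame, (2) DFT power spectrum, (3) Mel filterbank,
    # (4) logarithm, (5) DCT, retaining the first n_ceps coefficients.
    windowed = frame * np.hamming(len(frame))
    power_spectrum = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    fbank_energies = mel_filterbank(n_filters, n_fft, fs) @ power_spectrum
    log_energies = np.log(fbank_energies + 1e-10)   # floor to avoid log(0)
    return dct(log_energies, type=2, norm='ortho')[:n_ceps]

frame = np.random.randn(200)      # stand-in for a 25 ms frame at 8 kHz
print(mfcc_frame(frame))          # 12 static MFCCs
```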

2.3.3 Voice Activity Detection (VAD)

It is desirable that the features are extracted from only the speech segments, and not si-

lence/noise, of the audio waveform. For this reason, the audio signal is processed by a Voice

Figure 2.4. (a) A speech waveform with voice activity decisions overlaid; values of 1 and 0 indicate speech and non-speech, respectively. (b) Spectrogram of the speech waveform. (c) Static MFCC coefficients.

Activity Detection (VAD) algorithm [53, 54]. Different approaches can be considered for this

process. Detecting the speech segments becomes critical when highly noisy/degraded acoustic

conditions are considered. Since effective VAD is not the focus of this study, we will not

discuss various approaches available in the literature. The function of a VAD is illustrated

in Figure 2.4(a), where the speech presence/absence is indicated by a binary signal overlaid

on the speech samples. The corresponding speech spectrogram and static MFCC coefficients

are shown in Figures 2.4(b) and (c), respectively. The VAD algorithm used in this plot is

presented in [53], and is typically referred to as Sohn VAD.
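The statistical model-based VAD of Sohn et al. [53] is beyond the scope of a short example, but a simple energy-based detector illustrates the role a VAD plays in the front-end. The frame size, hop, percentile-based noise-floor estimate, and margin below are illustrative assumptions, not the method used in this dissertation.

```python
import numpy as np

def energy_vad(x, frame_len=200, hop=80, margin_db=6.0):
    """Label each frame as speech (1) or non-speech (0) by comparing its
    log-energy against a noise floor estimated from the quietest frames.
    A crude stand-in for a proper statistical VAD such as Sohn's."""
    n_frames = 1 + (len(x) - frame_len) // hop
    log_energy = np.array([
        10.0 * np.log10(np.sum(x[i * hop:i * hop + frame_len] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
    noise_floor = np.percentile(log_energy, 10)  # assume >=10% of frames are silence
    return (log_energy > noise_floor + margin_db).astype(int)

x = np.random.randn(8000)       # stand-in for 1 s of audio at 8 kHz
decisions = energy_vad(x)       # binary speech/non-speech decision per frame
```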


2.3.4 Feature normalization

As stated earlier, one of the desirable properties of acoustic features (and any feature pa-

rameter in a pattern recognition problem) is robustness to degradation. This is one of the

desirable characteristics of an ideal feature parameter [47]. In reality, it is not possible

to design a feature parameter that will not vary as acoustic conditions change. However,

these changes can be minimized in various ways using feature normalization techniques such

as Cepstral Mean Subtraction (CMS) [55], feature warping [56], RASTA processing [57], Quantile-based Cepstral Normalization (QCN) [58], etc. It should be noted that the goal of

normalization is to modify the feature parameters extracted from a single utterance to be

more uniform across its entire duration. Normalization techniques are not designed to en-

hance the discriminative ability of the features (ideal Property 3 from Section 2.3.1); rather, they aim

to modify the features so that they are more consistent among different speech utterances

(ideal Property 5 from Section 2.3.1). In this dissertation, we mostly utilize feature warping and cepstral

mean and variance normalization (CMVN) methods.
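As a simple illustration, per-utterance CMVN can be written in a few lines of NumPy: each cepstral dimension is shifted to zero mean and scaled to unit variance over the utterance. Feature warping, which maps each dimension to a standard normal distribution via rank statistics, is more involved and is omitted here; this minimal sketch uses our own function name.

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalization over one utterance.
    `features` is a (num_frames x num_coefficients) matrix; each
    coefficient stream becomes zero-mean and unit-variance."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-10   # guard against zero variance
    return (features - mu) / sigma

feats = np.random.randn(300, 12)           # stand-in for 300 frames of 12-D MFCCs
normalized = cmvn(feats)
```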

2.4 Speaker modeling

As mentioned earlier, modeling refers to the process of describing the feature parameters in

an effective way so that comparison between two speech utterances becomes convenient and

meaningful. A model can be generative or discriminative, parametric or non-parametric.

If the short-time spectral features follow the ideal feature properties [47], we would not

need any sophisticated modeling methods. However, since features do vary across different

acoustic conditions due to noise/channel or other distortions, modeling techniques need to

be designed specifically to handle these scenarios. We dedicate the next chapter to robust

modeling, where we discuss the various modeling schemes that have been proposed in the

literature along with their motivation.


2.5 Performance evaluation with standardized datasets

Evaluating the performance of a speaker verification task using a standardized dataset is

an important element of the research cycle. Over the years, new datasets and performance

metrics have been introduced to match realistic scenarios. These, in turn, have motivated

researchers to discover new strategies to address challenges, compare results among peers,

and exchange ideas to further the research paradigm.

The National Institute of Standards and Technology (NIST) has been organizing a

Speaker Recognition Evaluation (SRE) campaign for the past several years aiming at pro-

viding standard datasets, verification tasks and performance metrics. Every year’s evalua-

tion introduces new challenges for the research community. These challenges include newly

introduced recording conditions (microphone, handset, room acoustics, etc.), short test ut-

terance duration, varying vocal effort, artificial and real-life additive noise, restrictions or

allowances in data utilization strategy, new performance metrics to be optimized, etc.

It is clear that the performance metric defined for a speaker recognition task depends

on the dataset and train-test pairs of speech (also known as trials) used for evaluation.

A sufficient number of trials needs to be provided for a statistically significant evaluation

measure [59].1 The performance measures can be based on hard verification decisions or soft

scores, they may require the log-likelihood ratio as scores, and depend on the prior probability

of encountering a target speaker. For a given dataset and task, systems evaluated using a

specific error/cost criterion can be compared. Before discussing the common performance measures, we would like to introduce the types of errors encountered in speaker verification.

1Recent NIST evaluations generally consist of millions of such trials.


2.5.1 Types of errors

There are two main types of errors in speaker verification (or any other biometric authentica-

tion) when a hard decision is made by the automatic system. From the speaker authentication

point of view, we define them as:

False Accept (FA): Granting access to an impostor speaker

False Reject (FR): Denying access to a legitimate enrolled speaker

From the speaker-detection point of view (a target speaker is sought), these are termed

False alarm and Miss errors, respectively. According to these definitions, two error rates are

defined as:

False Acceptance Rate (FAR) = (Total number of FA errors) / (Total number of impostor speaker attempts)    (2.1)

False Reject Rate (FRR) = (Total number of FR errors) / (Total number of enrolled speaker attempts)    (2.2)

As a side-note, we will relate this error terminology to the generic error types encountered

in a two-class pattern recognition problem. Some of these terms are more familiar in other

research areas and thus may help the reader to associate with them. If the target class

is assumed positive, and impostor class is assumed negative, we obtain a confusion matrix

including four elements considering the predicted and actual classes, as shown in Table 2.1.

Table 2.1. Terminology in a confusion matrix of a two-class recognizer

                     Actual Positive        Actual Negative
Predicted Positive   True Positive (TP)     False Positive (FP)
Predicted Negative   False Negative (FN)    True Negative (TN)
Total                Total Positive (P)     Total Negative (N)

Utilizing these basic definitions, all errors and several performance measures can be defined for a two-class recognition problem:

Precision / Positive Predictive Value = TP / (TP + FP)

True Positive Rate (TPR) / Recall / Sensitivity = TP / (TP + FN) = TP / P

False Positive Rate (FPR) / FAR / Type I error = FP / (FP + TN) = FP / N

False Negative Rate (FNR) / Miss Rate / FRR / Type II error = FN / P = 1 − TPR

True Negative Rate (TNR) / Specificity = TN / N = 1 − FPR

Negative Predictive Value = TN / (FN + TN)

Accuracy = (TP + TN) / (P + N)

The terms precision and recall are common in the general pattern recognition literature,

whereas sensitivity, specificity, and positive and negative predictive values are mostly used

for reporting medical test performance. Also, in some biometric recognition literature, the

terms Type I and II errors are quite commonly used.
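For concreteness, the sketch below collects the above definitions into a single helper; the counts are assumed to come from hard decisions over a trial list, and the function name is illustrative.

```python
def detection_metrics(tp, fp, fn, tn):
    """Standard two-class measures computed from confusion-matrix counts."""
    p, n = tp + fn, fp + tn                # total positives / negatives
    return {
        'precision': tp / (tp + fp),
        'recall_tpr': tp / p,              # sensitivity
        'fpr_far': fp / n,                 # Type I error
        'fnr_frr': fn / p,                 # miss rate, Type II error
        'tnr': tn / n,                     # specificity
        'npv': tn / (fn + tn),
        'accuracy': (tp + tn) / (p + n),
    }

print(detection_metrics(tp=90, fp=20, fn=10, tn=180))
```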

Speaker verification systems generally output a match score between the training speaker

and the test utterance. This is true for most two-class recognition/binary detection problems.

This score is a scalar variable that represents the similarity between the enrolled speaker and

the test speaker, with higher values indicating greater similarity. To make a decision, the

system needs to use a threshold (τ) as illustrated in Figure 2.1. If the threshold is too low,

the number of FA errors will be high, whereas if the threshold is too high, there will be many

FR/miss errors. This is also illustrated in Figure 2.5.

2.5.2 Equal Error Rate (EER)

The Equal Error Rate (EER) is defined as the operating point at which the FAR and FRR become

equal. That is, by changing the threshold, we find a point where the FAR and FRR become

Figure 2.5. An illustration of target and non-target score distributions and the decision threshold (EER = 10.38% in this example). Areas under the curves shaded in blue and red represent the FA and FR errors, respectively.

equal. This is shown in Figure 2.5, where the shaded FAR and FRR areas are the same.

EER is a very popular performance measure for speaker verification systems, since it provides a single number for system comparison. The soft scores from the automatic system are

required to compute the EER. No actual hard decisions are made. It should be noted

that, operating a speaker verification system on the threshold corresponding to the EER

may not be desirable for practical purposes. For high security applications, one should set

the threshold higher, lowering the FA errors at the cost of miss errors. However, for high

convenience, the threshold should be set lower. In the latter case, accepting an impostor

speaker is not as critical as in high security applications.
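A minimal sketch of how the EER can be estimated from soft scores: sweep the threshold over the pooled scores and locate the point where the FAR and FRR curves cross. The coarse grid sweep (rather than exact interpolation) and the function name are illustrative assumptions.

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Approximate the EER by sweeping the decision threshold over all
    observed scores and locating where FAR and FRR are closest."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return 0.5 * (far[idx] + frr[idx]), thresholds[idx]

tgt = np.random.normal(10.0, 5.0, 2000)    # stand-in target trial scores
imp = np.random.normal(-5.0, 5.0, 20000)   # stand-in impostor trial scores
eer, threshold = compute_eer(tgt, imp)
print(f"EER = {100 * eer:.2f}% at threshold {threshold:.2f}")
```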

2.5.3 Detection Cost Function (DCF)

In this section, we note the performance measures introduced by NIST over the years. As

mentioned before, the EER does not differentiate between the two FA/FR errors, which

sometimes is not a realistic performance measure. The DCF, thus, introduces numerical


costs/penalties for the two types of errors (FA and Miss), which are predefined. The a priori

probability of encountering a target speaker is also provided. The DCF is computed over

the full range of decision threshold values as:

DCF(τ) = C_Miss P_Miss(τ) P_Target + C_FA P_FA(τ) (1 − P_Target).    (2.3)

Here,

C_Miss = cost of a miss/FR error,    (2.4)

C_FA = cost of a FA error,    (2.5)

P_Miss(τ) = Pr(Miss error | target speaker, threshold = τ),    (2.6)

P_FA(τ) = Pr(FA error | non-target speaker, threshold = τ),    (2.7)

P_Target = prior probability of observing a target speaker.    (2.8)

The DCF is usually normalized by the best possible cost without any processing to

improve its intuitive meaning. This cost is obtained by either always accepting or always

rejecting each test utterance, whichever gives the lower cost [14]. This is known as the default

cost, C_default, given by

C_default = min{ C_Miss × P_Target, C_FA × (1 − P_Target) }.    (2.9)

In NIST SRE 2008 [14], the DCF parameters were set as C_Miss = 10, C_FA = 1 and P_Target = 0.01. Usually, the DCF is normalized by dividing by a constant [14, 15]. By

processing the DCF, two performance measures are derived: i) the minimum DCF (MinDCF)

and ii) the actual DCF (ActDCF). The MinDCF is basically the minimum value of DCF that

can be obtained by changing the threshold, τ . The MinDCF parameter can be computed

only when the soft scores are provided by the systems.

MinDCF = min_τ [ C_Miss P_Miss(τ) P_Target + C_FA P_FA(τ) (1 − P_Target) ]    (2.10)


When the system provides hard decisions, the actual DCF can be utilized where the

probability values involved are simply computed by counting the errors using (2.1) and (2.2).

Both of these performance measures have been extensively used in the past NIST evaluations

[14, 15]. The most recent evaluation in 2012 introduced a DCF that is dependent on two

different operating points [16], instead of one.

It is important to note here that the MinDCF (or ActDCF) parameter is not an error rate

in the general sense. Thus, the interpretation of what it represents is not straightforward. Obviously, the lower the MinDCF, the better the system performance. However, the exact value

of the MinDCF can only be used to compare other systems evaluated using the same trials

and performance measure. Generally, when the system EER improves, the DCF parameters

also improve. An elaborate discussion on the relationship between EER and DCF can be

found in [60].
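A sketch of the normalized MinDCF under the SRE 2008 parameters mentioned above (C_Miss = 10, C_FA = 1, P_Target = 0.01). The threshold sweep mirrors the EER sketch, and the helper name is our own.

```python
import numpy as np

def min_dcf(target_scores, impostor_scores,
            c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum of Eq. (2.10) over all thresholds, normalized by the
    default cost of Eq. (2.9)."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    c_default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return dcf.min() / c_default

tgt = np.random.normal(10.0, 5.0, 2000)     # stand-in scores
imp = np.random.normal(-5.0, 5.0, 20000)
print(f"MinDCF'08 = {min_dcf(tgt, imp):.4f}")
```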

2.5.4 Detection Error Trade-off (DET) curve

When speaker verification performance needs to be evaluated in a range of operating points,

the DET curve is generally employed. The DET curve is a plot of the FA vs. FR/miss errors.

An example DET curve is shown in Figure 2.6. During the preparation of the DET curve,

the Cumulative Distribution Functions (CDF) of the true and impostor scores are transformed to

their normal deviates. This means, the true/impostor score CDF value for a given threshold

is transformed by the standard normal Inverse Cumulative Distribution Function (ICDF) and the

resulting values are used to make the plot. This transform yields a linear DET curve when

the two distributions are normal and have equal variances. Thus, even though the labels

indicate the axes are error probabilities, they are actually the corresponding normal deviate

values.

The DET curve representation facilitates a comparison between two systems across multi-

ple operating points. The same DET curve when plotted with the actual probability values

Figure 2.6. A Detection Error Trade-off (DET) curve (EER = 10.29%, minDCF'08 = 0.0770). The points on the curve corresponding to the thresholds that yield the Equal Error Rate (EER) and the minimum Detection Cost Function (DCF) (as defined in NIST SRE 2008), and the direction of an increasing threshold, are shown.

instead of their normal deviates is shown in Figure 2.7(a). Alternatively, a conventional

Receiver Operating Characteristics (ROC) curve as shown in Figure 2.7(b) could also be

used. In this case, the FAR/FPR is plotted against TPR. Comparing Figure 2.7(a) and

Figure 2.7(b), the relationship between an ROC curve and a DET curve becomes appar-

Figure 2.7. (a) An example DET curve without conversion of scores to standard normal deviates (EER = 10.29%, minDCF'08 = 0.0770). (b) A traditional Receiver Operating Characteristics (ROC) curve (at the EER point, FAR = 1 − TPR).

ent. In these plots, the locations of the EER and the MinDCF as defined in NIST SRE 2008 (minDCF'08) are shown.
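The normal-deviate warping that linearizes the DET axes is simply the standard normal ICDF applied to the two error rates; a minimal sketch using scipy.stats.norm.ppf follows (the clipping constant is an illustrative guard, since the ICDF is unbounded at 0 and 1).

```python
import numpy as np
from scipy.stats import norm

def det_coordinates(p_fa, p_miss, eps=1e-6):
    """Map FA and miss probabilities to their standard normal deviates,
    i.e., the coordinates actually plotted on a DET curve."""
    p_fa = np.clip(p_fa, eps, 1.0 - eps)     # ppf is undefined at 0 and 1
    p_miss = np.clip(p_miss, eps, 1.0 - eps)
    return norm.ppf(p_fa), norm.ppf(p_miss)

x, y = det_coordinates(np.array([0.01, 0.05, 0.10]),
                       np.array([0.20, 0.10, 0.05]))
```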

The next chapter focuses on speaker modeling concepts employed throughout the remain-

ing dissertation formulations and evaluations.


CHAPTER 3

ROBUST SPEAKER MODELING

Once audio segments are converted to feature parameters, the next task of the speaker recog-

nition process is acoustic modeling. In general terms, we can define modeling as a process

of characterizing the feature properties for a given speaker. The model must also provide

a means of comparison with an unknown utterance. A modeling method is robust when

the characterizing process of the features is not significantly affected by unwanted distor-

tions, even though the features themselves incorporate the degradations. Any distortion in

the acoustic signal directly affects the acoustic feature parameters, which in turn affects the

match score with the speaker model, especially if the acoustic condition is unseen. Ideally,

if features could be designed in such a way that no intra-speaker variation is present while

inter-speaker discrimination is maximum (as discussed in Section 2.3.1), the simplest meth-

ods of modeling might suffice.1 In essence, the non-ideal properties of the feature extraction

stage force us to consider applying various techniques during the modeling phase so that

the effect of the nuisance variations observed in the signal are minimized during the speaker

verification process.

Most speaker modeling techniques make various mathematical assumptions on the fea-

tures (Gaussian distributed, for example). If these properties are not met by the data, we

are essentially introducing imperfections during the modeling phase as well. The normaliza-

tion of features can alleviate these problems to some extent, but not entirely. Consequently,

mathematical models are forced to fit the features, and recognition scores are derived based

1For example, ideal features extracted from a single frame of the vowel sound “a” would be enough. A Euclidean distance between such features would yield zero/non-zero for the same/different speaker.


on these models along with the test data. Thus, this process introduces artifacts in the

detection scores and a family of score normalization techniques have been proposed in the

past to address this final stage mismatch [6].

In summary, degradations in the acoustic signal affect features, models and scores. Thus,

improving robustness of speaker recognition systems is important in these three domains.

In recent times, it has been observed that as speaker modeling techniques are improved,

score normalization techniques become less effective [17, 4]. Similarly, we can argue that if

acoustic features are improved, simple modeling techniques will be sufficient. However, from

the speaker recognition research trend in the last decade, it seems that improving feature

robustness beyond a certain level (for a variety of degradations) is extremely difficult, if not

impossible. Modeling techniques that aim at learning the behavior of the degradations from

example speech utterances seems to be at an advantage here in improving robustness. For

example, an automatic system that has observed several examples of speech recordings of

different speakers in roadside noise will be better at distinguishing speakers in that specific

acoustic environment.

In this chapter, we will review the major modeling techniques that have been proposed in

the past few decades and analyze how these methods help improve the robustness of speaker

recognition systems compared to their predecessors.

3.1 Vector Quantization (VQ) based methods

If an acoustic feature parameter could be designed that obeys the properties of the ideal

features outlined in Section 2.3.1, we can imagine that every speech frame will contain sig-

nificant speaker identity information. In this case, a simple average of the features across

the utterance should be a robust way of modeling the speaker. These approaches have been

studied in [45], with limited success when there are short duration utterances or other ro-

bustness issues [46]. Thus, it has been understood that even for text-independent systems,


VQ based speaker recognition: summary

First proposed:    In 1983 by Li et al. [46]
Previous methods:  Averaging of long-term features, vowel matching
Proposed method:   Group features using VQ codebooks and use them to compare utterances
Why robust?        Clustering of speech data allows an unsupervised way of comparing features within the same/similar phonetic context

acoustic features need to be grouped together according to phonetic/acoustic content (de-

tected in a supervised or unsupervised manner) and compared to unknown speech features

within the same group. This understanding leads to investigating clustering methods applied

to acoustic features.

Vector quantization based methods aim at modeling the probability density functions of

the acoustic feature distributions by utilizing a limited number of prototype vectors [46, 61,

62, 63]. In this approach, each speaker’s acoustic observations are modeled by a codebook of

vectors/templates that represent the (supposedly) unique phonetic clusters of that person’s

speech. The template vectors in each cluster are also known as cluster centroids. This method

is in part motivated by the computational benefit this technique offers by compressing an

arbitrary number of feature vectors into a small fixed-size VQ codebook. Also, since the

number of clusters in the codebook is fixed, VQ codebooks extracted from a training and

test utterance can be easily compared using different distance measures [62]. The discussions pertinent to this dissertation are not directly related to VQ-based approaches, and thus we will not discuss this method further. However, this approach laid the foundation for cluster-based modeling of speech for speaker recognition, which continues to be effective to this day.

3.2 Gaussian Mixture Model (GMM) based method

A Gaussian Mixture Model (GMM) is a combination of Gaussian probability density func-

tions (PDF) generally used to model multivariate data. Similar to the VQ model, the GMM

clusters the data in an unsupervised way, but it provides a PDF of the data instead of simple


prototype vectors. Using GMMs to model a speaker’s features results in a speaker depen-

dent PDF. Evaluating the PDF at different data points (e.g., features obtained from a test

utterance) provides a probability score that can be used to compute the similarity between

a speaker GMM and unknown speaker data. For a simple speaker identification task, a GMM is first obtained for each speaker. During testing, the utterance is compared against each

GMM and the most likely speaker (i.e., the highest scoring GMM) is selected.

In text-independent speaker recognition tasks, where there is no a priori knowledge about

the speech content, using GMMs to model acoustic features has been found to be most effective for acoustic modeling. This is expected since the average behavior of the short-term spectral features is more speaker dependent than their temporal characteristics. The GMM was first utilized in a speaker recognition method in [18]. It has proven to be a better speaker model than VQ due to its probabilistic nature, which allows for greater variability. This means that even when the test utterance comes from a different acoustic condition, GMM

models can relate to the data better than the more restrictive VQ model.

3.2.1 GMM formulation

A GMM is a mixture of (usually) multivariate Gaussian probability density functions (PDF)

parameterized by a number of mean vectors, covariance matrices and weights of the indi-

vidual mixture components. The model is represented by a weighted sum of the individual

PDFs. If a random vector xn ∈ Rd can be modeled by M Gaussian components with mean

vectors µg, covariance matrices Σg, where g = 1, 2 · · ·M indicate the component indices, the

PDF of xn is given by,

f(x_n|λ) = Σ_{g=1}^{M} π_g N(x_n | µ_g, Σ_g)    (3.1)

         = Σ_{g=1}^{M} [ π_g / ( (2π)^{d/2} |Σ_g|^{1/2} ) ] exp[ −(1/2) (x_n − µ_g)^T Σ_g^{−1} (x_n − µ_g) ],    (3.2)


Gaussian Mixture Model (GMM) based method: summary

First proposed:    In 1995 by Reynolds et al. [18]
Previous methods:  Averaging of long-term features, VQ based methods
Proposed method:   Model features using GMMs, compute similarity using feature likelihood
Why robust?        The probabilistic nature of the GMM allows more variability in the data; VQ based models are more restrictive and less robust

where the values π_g are the weights of the mixture components, such that

Σ_{g=1}^{M} π_g = 1.    (3.3)

We denote the GMM model as λ = {πg,µg,Σg|g = 1, . . . ,M}. Equation (3.2) can be used to

evaluate the likelihood of a feature vector given the GMM model. Acoustic feature vectors are

generally assumed to be independent. For a sequence of feature vectors X = {x_n | n ∈ 1...T}, the probability of observing these features given the GMM model is computed as

p(X|λ) = Π_{n=1}^{T} p(x_n|λ).    (3.4)

Note that the time-domain order of the features is irrelevant in computing the likelihood. This can be seen as an advantage for text-independent speaker recognition.
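In practice, Eq. (3.4) is evaluated in the log domain for numerical stability. The following sketch assumes diagonal covariance matrices (a restriction commonly justified by the DCT decorrelation discussed in Section 2.3.2) and uses the log-sum-exp trick; the function name and array layout are our own assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def gmm_log_likelihood(X, weights, means, variances):
    """log p(X|lambda) of Eqs. (3.1)-(3.4) for a diagonal-covariance GMM.
    X: (T x d) features; weights: (M,); means, variances: (M x d)."""
    T, d = X.shape
    # (T x M) matrix of log[pi_g * N(x_n | mu_g, Sigma_g)].
    log_comp = (np.log(weights)
                - 0.5 * d * np.log(2.0 * np.pi)
                - 0.5 * np.sum(np.log(variances), axis=1)
                - 0.5 * np.sum((X[:, None, :] - means[None, :, :]) ** 2
                               / variances[None, :, :], axis=2))
    # Sum over mixtures per frame, then over the (independent) frames.
    return logsumexp(log_comp, axis=1).sum()
```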

3.2.2 GMM training

For a given dataset, there is no closed-form solution for obtaining the best-fitting GMM parameters; thus, a Maximum Likelihood (ML) formulation is used. A GMM is usually trained using the Expectation-Maximization (EM) algorithm [64], which iteratively increases the likelihood of the data given the model. Let X = {x_n | n ∈ 1...T} denote all the feature vectors in the

training data. The model mean vectors are first initialized randomly or using a clustering

method. The weight parameters πg are set to a constant so that condition Eq. (3.3) is sat-

isfied. If the initial/old model is denoted by λ = {µg,Σg, πg}, the updated/new parameters

Figure 3.1. Schematic diagram of a GMM-UBM system for a 4-mixture UBM. The MAP adaptation procedure and supervector formation by concatenating the mean vectors are also illustrated.

λ̂ = {π̂_g, µ̂_g, Σ̂_g} are computed as:

π̂_g = (1/T) Σ_{n=1}^{T} p(g|x_n, λ),    (3.5)

µ̂_g = [ Σ_{n=1}^{T} x_n p(g|x_n, λ) ] / [ Σ_{n=1}^{T} p(g|x_n, λ) ],    (3.6)

Σ̂_g = [ Σ_{n=1}^{T} (x_n − µ̂_g)(x_n − µ̂_g)^T p(g|x_n, λ) ] / [ Σ_{n=1}^{T} p(g|x_n, λ) ].    (3.7)

Here, the posterior probability of the mixture component g given the data vector x_n is computed as:

p(g|x_n, λ) = π_g N(x_n | µ_g, Σ_g) / [ Σ_{g'=1}^{M} π_{g'} N(x_n | µ_{g'}, Σ_{g'}) ],    (3.8)

where N (xn|µg,Σg) are the Gaussian mixture component PDFs from Eq. (3.1).
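A sketch of one EM iteration, Eqs. (3.5)-(3.8), for the diagonal-covariance case; initialization and convergence checks are omitted, and the array layout follows the log-likelihood sketch above.

```python
import numpy as np
from scipy.special import logsumexp

def em_step(X, weights, means, variances):
    """One EM update of a diagonal-covariance GMM."""
    T, d = X.shape
    # E-step: responsibilities p(g|x_n, lambda) of Eq. (3.8).
    log_comp = (np.log(weights)
                - 0.5 * d * np.log(2.0 * np.pi)
                - 0.5 * np.sum(np.log(variances), axis=1)
                - 0.5 * np.sum((X[:, None, :] - means[None, :, :]) ** 2
                               / variances[None, :, :], axis=2))
    post = np.exp(log_comp - logsumexp(log_comp, axis=1, keepdims=True))
    # M-step: Eqs. (3.5)-(3.7).
    n_g = post.sum(axis=0)                        # soft counts per mixture
    new_weights = n_g / T                         # Eq. (3.5)
    new_means = (post.T @ X) / n_g[:, None]       # Eq. (3.6)
    sq_diff = (X[:, None, :] - new_means[None, :, :]) ** 2
    new_vars = np.einsum('tg,tgd->gd', post, sq_diff) / n_g[:, None]  # Eq. (3.7)
    return new_weights, new_means, new_vars
```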

3.3 Adapted GMMs: The GMM-UBM speaker verification system

The GMM based speaker recognition system has proven to be most effective in a speaker

identification task. For speaker verification, apart from the claimed speaker model, an alter-

nate speaker model (representing speakers other than the target) is needed. In this way, these


two models can be compared with the test data and the more likely model selected, leading

to an accept or reject decision. The alternate speaker model, also known as background

or cohort models, initiated the idea of using a Universal Background Model (UBM) that

represents “everyone else” except the target speaker. It is essentially a large GMM trained

to represent the speaker independent distribution of the speech features for all speakers in

general. The block diagram of Figure 2.1 becomes much clearer now, as the background model is assumed to exist. Note that the UBM is assumed to be a “universal” model which

serves as the alternate model for all enrolled speakers. Some methods considered providing

a speaker dependent unique alternate model. However, using a single background model has

been the most effective and meaningful.

The UBM was first introduced as an alternate speaker model in [65]. Later in [19], the

UBM was used as an initial model for the enrollment speaker GMMs. This concept was a

significant advancement achieved by the so-called GMM-UBM method. In this approach,

a speaker’s GMM is adapted or derived from the UBM using Bayesian adaptation [66]. In

contrast to performing a Maximum-Likelihood (ML) training of the GMM for an enrollment

speaker, this model is obtained by updating the well-trained UBM parameters. This relation

between the speaker model and the background model provides better performance than

independently trained GMMs, and also lays the foundation for speaker model adaptation

techniques that were developed later. We will return to these relations as we proceed. In

the following sub-sections, we describe the formulations of this approach.

3.3.1 The likelihood ratio test

Given an observation O, and a hypothesized speaker s, the task of speaker verification can

be stated as a hypothesis test between:

H0: O is from speaker s,
H1: O is not from speaker s.    (3.9)


In the GMM-UBM approach, the hypothesis H0 and H1 are represented by a speaker depen-

dent GMM λs and the UBM λ0. Thus, for the set of observed feature vectors X = {xn|n ∈

1 . . . T}, the likelihood ratio test is performed by evaluating the following ratio:

p(X|λ_s) / p(X|λ_0)  { ≥ τ : accept H0,  < τ : reject H0 }    (3.10)

where τ is the decision threshold. Often, the likelihood ratio test is performed in the logarithmic domain, providing the so-called Log-Likelihood Ratio (LLR):

Λ(X) = log p(X|λ_s) − log p(X|λ_0).    (3.11)

3.3.2 Maximum A Posteriori (MAP) adaptation of UBM

Let X = {xn|n ∈ 1 . . . T} denote the set of acoustic feature vectors obtained from the

enrollment speaker s. Given a UBM model as in Eq. (3.1) and the enrollment speaker’s

data X , at first the probabilistic alignment of the feature vectors with respect to the UBM

components is calculated as:

p(g|x_n, λ_0) = π_g p(x_n|g, λ_0) / [ Σ_{g'=1}^{M} π_{g'} p(x_n|g', λ_0) ] ≜ γ_n(g).    (3.12)

Next, the γ_n(g) values are used to calculate the sufficient statistics for the weight, mean and covariance parameters as:

N_s(g) = Σ_{n=1}^{T} γ_n(g),    (3.13)

F_s(g) = Σ_{n=1}^{T} γ_n(g) x_n,    (3.14)

S_s(g) = Σ_{n=1}^{T} γ_n(g) x_n x_n^T.    (3.15)

These quantities are known as the zero, first and second order Baum-Welch statistics, re-

spectively. Using these parameters, the posterior mean and covariance matrix of the features


given the data vectors X can be found as:

E_g[x_n | X] = F_s(g) / N_s(g),    (3.16)

E_g[x_n x_n^T | X] = S_s(g) / N_s(g).    (3.17)

The following MAP adaptation update equations for weight, mean and covariance, are pro-

posed in [66] and utilized in [19] for speaker verification:

π̂_g = [ α_g N_s(g) / T + (1 − α_g) π_g ] β,    (3.18)

µ̂_g = α_g E_g[x_n | X] + (1 − α_g) µ_g,    (3.19)

Σ̂_g = α_g E_g[x_n x_n^T | X] + (1 − α_g)(Σ_g + µ_g µ_g^T) − µ̂_g µ̂_g^T.    (3.20)

The scaling factor β ensures that the sum of π̂_g across the mixtures is unity. Thus, the

new GMM parameters are a weighted summation of the UBM parameters and the sufficient

statistics obtained from the observed data. The variable αg is defined as:

α_g = N_s(g) / ( N_s(g) + r ).    (3.21)

Here, r is known as the relevance factor. This parameter controls how the adapted GMM

parameters will be affected by the observed speaker data. In the original paper [19], this

parameter was defined differently for the model weight, mean and covariance. However, since

only adaptation of the mean vectors turned out to be the most effective, we only use one

relevance factor in our discussion here. The typical range for r is 14–19. Figure 3.1 shows

an example of the MAP adaptation procedure for a 2-dimensional feature and 4-mixture

UBM case.
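A sketch of mean-only MAP adaptation, combining Eqs. (3.13), (3.14), (3.16), (3.19) and (3.21). The responsibilities gamma are assumed to have been computed against the UBM (exactly as in the EM sketch earlier), and r = 16 is an assumed value within the typical range.

```python
import numpy as np

def map_adapt_means(X, gamma, ubm_means, r=16.0):
    """Mean-only MAP adaptation of a UBM.
    X: (T x d) enrollment features; gamma: (T x M) responsibilities of the
    UBM components for X; ubm_means: (M x d)."""
    n_g = gamma.sum(axis=0)                                  # N_s(g), Eq. (3.13)
    f_g = gamma.T @ X                                        # F_s(g), Eq. (3.14)
    posterior_mean = f_g / np.maximum(n_g, 1e-10)[:, None]   # Eq. (3.16)
    alpha = (n_g / (n_g + r))[:, None]                       # Eq. (3.21)
    return alpha * posterior_mean + (1.0 - alpha) * ubm_means  # Eq. (3.19)
```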

3.3.3 The GMM supervectors

One of the issues with speaker recognition is that the training and test speech data can be

of different durations. This requires the comparison of two utterances of different lengths.


GMM-UBM system: summary

First proposed:    In 2000 by Reynolds et al. [19]
Previous methods:  GMM models for enrollment, cohort speakers as background
Proposed method:   Adapt speaker GMMs from a Universal Background Model (UBM)
Why robust?        Speaker models adapted from a well-trained UBM are more reliable than directly trained GMMs for each speaker

Thus, one of the efforts towards effective speaker recognition has always been to obtain a

fixed dimensional representation of a single utterance [45]. This is extremely useful since

many different classifiers from the machine learning literature can be utilized on these “utterance-level” features. One effective solution for obtaining a fixed-dimensional vector from a variable-duration utterance is the formation of a GMM supervector, which is essentially a

large vector obtained by concatenating the parameters of a GMM model. Generally, a GMM

supervector is obtained by concatenating the GMM mean vectors of a MAP adapted speaker

model, as illustrated in the far right portion of Figure 3.1.
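Forming the mean supervector is then just a concatenation of the adapted mean vectors; a short sketch with stand-in dimensions (M = 1024 mixtures, d = 60 coefficients):

```python
import numpy as np

adapted_means = np.random.randn(1024, 60)   # stand-in (M x d) MAP-adapted means
supervector = adapted_means.reshape(-1)     # (M*d,) GMM mean supervector
```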

The term supervector was first used in this context for Eigenvoice speaker adaptation

in speech recognition applications [67].2 For speaker recognition, supervectors were first

introduced in [68], motivating new model adaptation strategies involving Eigenvoice and

MAP adaptation. Researchers realized that these large dimensional vectors are a very good

platform for designing channel compensation methods. Various effective modeling techniques

were proposed to operate on the supervector space. The two dominating trends observed in

these efforts were based on factor analysis and Support Vector Machines (SVM). They will

be discussed in the following sections.

3.3.4 GMM supervector Support Vector Machine (GMM-SVM)

Support Vector Machines [69] are one of the most popular supervised binary classifiers in

machine learning. In [70], it was observed that GMM supervectors can be effectively uti-

2The term “super vector space” in mathematics is a different concept which is unrelated to the supervectors mentioned here.

Figure 3.2. Conceptual illustration of an SVM classifier. Positive (+) and negative (−) examples are labeled, and the optimal linear separator and support vectors are shown.

lized for speaker recognition/verification using SVMs. The supervectors obtained from the

training utterances were used as positive examples while a set of impostor utterances were

used as negative examples. Channel compensation strategies were also developed in this

domain, such as Nuisance Attribute Projection (NAP) [21], Within-Class Covariance Nor-

malization (WCCN) [71], etc. Other approaches utilized SVM models for speaker recognition

using short and long term features [72, 49]. However, using GMM supervectors with SVM

and NAP provided the most effective solution. In the following subsections we discuss the

fundamentals of SVM and NAP.


3.3.5 Support Vector Machines (SVM)

SVM aims at optimally separating multi-dimensional data points obtained from two classes

using a hyperplane (a higher dimensional plane). The model can then be used to predict the

class of an unknown observation depending on its location with respect to the hyperplane.

Given a set of training vectors and labels (xn, yn) for n ∈ {1 . . . T} where xn ∈ Rd and

y_n ∈ {−1, +1}, the goal of SVM is to learn the function f : R^d → R so that the class label

of an unknown vector x can be predicted as,

I(x) = sign(f(x)). (3.22)

For a linearly separable dataset [69], a hyperplane H, given by w^T x + b = 0, can be obtained that separates the two classes so that

y_n(w^T x_n + b) ≥ 1,  n = 1...T.

An optimal linear separator H provides a maximum margin between the classes. This means that the distance between H and the projections of the training data from the two different classes is maximum. The maximum margin is found to be 2/||w||, and data points x_n for which

y_n(w^T x_n + b) = 1, i.e., points that lie on the margins, are known as support vectors. In

a simple 2-dimensional case, the operation of the SVM is illustrated in Figure 3.2. The

optimization problem of an SVM can be summarized as:

max_w 2/||w||  subject to  y_n(w^T x_n + b) ≥ 1  for n = 1...T.

When the training data is not linearly separable, the features can be mapped into a higher

dimensional space using Kernel functions where the classes become linearly separable. For

more details on SVM training and Kernels, the reader may refer to [69, 73]. Compensation

strategies that are developed for SVM based speaker recognition (e.g., NAP and WCCN)

will be discussed in later sections.
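A sketch of the GMM-SVM recipe using scikit-learn's SVC with a linear kernel: the target speaker's supervectors form the positive class and a background set of impostor supervectors the negative class. The data and dimensions here are random stand-ins, and in a full system compensation such as NAP would be applied to the supervectors before training.

```python
import numpy as np
from sklearn.svm import SVC

dim = 512                                   # stand-in supervector dimension
target_sv = np.random.randn(5, dim) + 0.5   # enrollment supervectors (+1)
impostor_sv = np.random.randn(200, dim)     # background supervectors (-1)

X = np.vstack([target_sv, impostor_sv])
y = np.concatenate([np.ones(len(target_sv)), -np.ones(len(impostor_sv))])

clf = SVC(kernel='linear')                  # linear kernel in supervector space
clf.fit(X, y)

test_sv = np.random.randn(1, dim)           # supervector from a test utterance
score = clf.decision_function(test_sv)      # signed distance used as match score
```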


GMM-SVM system: summary

First proposed:    In 2006 by Campbell et al. [70]
Previous methods:  Adapted GMM based methods, GMM-UBM system
Proposed method:   GMM supervector as utterance dependent features, classify using SVMs
Why robust?        Combines the effectiveness of the adapted GMM as an utterance model and the discriminating ability of the SVM

3.4 Factor analysis of the GMM supervectors

Factor analysis aims at describing the variability in high-dimensional observable data vectors using a smaller number of unobservable/hidden variables. For speaker recognition, the

idea of explaining the speaker and channel dependent variability using factor analysis in the

GMM supervector space was first discussed in [74]. The variability in the supervector space

is always assumed to be around the UBM supervector, thereby using this model as the center

of the acoustic space. Many variants of factor analysis methods have been employed since

then, which finally led to the current state-of-the-art i-Vector approach. In this section, we

will discuss these methods briefly in order to illustrate how the techniques have evolved.

3.4.1 Linear distortion model

In the discussions to follow, a speaker dependent GMM supervector ms is generally assumed

to be a linear combination of four components. These components are as follows:

Component 1: Speaker/channel/environment independent component (m0)

Component 2: Speaker dependent component (mspk)

Component 3: Channel/environment dependent component (mchn)

Component 4: Residual (mres)

Component 1 is usually obtained from the UBM model and is a constant. Components 2–4

are random vectors and are responsible for representing the variability in the supervectors


due to a range of phenomena. Using this model, a GMM supervector obtained from speaker

s and session h is written as,

ms,h = m0 + mspk + mchn + mres. (3.23)

For acoustic features of dimension d, and a UBM with M mixture components, these GMM

supervectors are of dimension (Md×1). As an example, the speaker and channel independent

supervector m0 is the concatenation of the UBM mean vectors µg,

m_0 = [ µ_1^T  µ_2^T  ···  µ_M^T ]^T.    (3.24)

We denote the sub-vectors of m0 for the g-th mixture as m0[g], which equals µg. In the

following sections, we discuss how well-known linear Gaussian models, including factor anal-

ysis, can be utilized to develop methods based on this generic decomposition of the GMM

supervectors.

3.4.2 Linear Gaussian models

Factor analysis (FA) [75] is one of the most general forms of the latent variable model.

It provides a decomposition of the data vectors x ∈ Rd in the following manner:

x = Wy + µ + ε.    (3.25)

Here, W is a (d × q) factor loading matrix (q < d), y is the (q × 1) hidden variable vector

(containing the latent factors), µ is the (d× 1) mean vector and ε denotes a residual noise

process. Traditionally, the hidden variables are assumed to be independent and standard

normal, y ∼ N(0, I), and the noise model is Gaussian, ε ∼ N(0, Ψ), where Ψ is diagonal.

The covariance of the model is thus C = Ψ + WWT. In essence, the factor loading matrix


determines the correlations among the data dimensions, whereas the noise term is responsible for the residual variance (not covariance) that is unique to each dimension. This means that the dependencies among the observable variables x are explained by a smaller number of

hidden variables y. The model from Eq. (3.25) does not have a closed form solution and can

be derived from training data using Expectation-Maximization algorithms [76, 77].

Probabilistic Principal Component Analysis (PPCA): In the special case when

the noise term ε in Eq. (3.25) is restricted to be isotropic, i.e., Ψ = σ²I, where σ² is the noise

variance in each dimension, the factor analysis model becomes a Probabilistic Principal

Component Analyzer (PPCA) model. In this case, the factor loading matrix W spans the

principal subspace of the data. This means, orthonormal projections of the data on the first,

second, etc. columns of W have the highest, second highest, etc. variance. In PPCA, the

model covariance is represented as C = σ2I + WWT [77].

Principal Component Analysis (PCA): In the limiting case, when σ2 → 0 and the

noise is isotropic, standard PCA emerges from Eq. (3.25). PCA states that for d-dimensional

data vectors x_n, n ∈ 1...T, the q columns of the matrix W that span the principal subspace can be obtained by concatenating the dominant eigenvectors of the sample covariance matrix,

S = (1/T) Σ_{n=1}^{T} (x_n − µ)(x_n − µ)^T.    (3.26)

The optimal linear reconstruction of the data point x_n is

x̂_n = W y_n + µ,    (3.27)

where y_n = W^T(x_n − µ). This reconstruction of data points can be thought of as a “model”

of the data in this context. Therefore, the standard normal vector yn is interpreted as a

lower dimensional representation of xn. Similar to FA and PPCA, PCA parameters can also

be learned using an EM algorithm [78], although it is generally computed using eigenvalue

decomposition of the sample covariance matrix.


In FA and PPCA, the residual error is modeled by the term ε, which is not present in

PCA. We should also note that in none of the above methods are the model covariance C and the sample covariance S actually the same when q < d. When q = d, PCA provides a whitening transformation of the data, whereas the PPCA and FA models become pointless since the residual component becomes zero. In summary, PCA and PPCA seek the directions within the data that maximize the variance, whereas FA seeks those that contain the highest covariance.
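A sketch of Eqs. (3.26) and (3.27): estimate the sample covariance, take its q dominant eigenvectors as W, and reconstruct the data from the q-dimensional factors. The function name and stand-in data are our own.

```python
import numpy as np

def pca_fit_reconstruct(X, q):
    """PCA via eigendecomposition of the sample covariance, Eq. (3.26),
    with the optimal rank-q reconstruction of Eq. (3.27)."""
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / X.shape[0]   # sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :q]              # q dominant eigenvectors
    Y = (X - mu) @ W                         # latent factors y_n
    X_hat = Y @ W.T + mu                     # reconstructions of x_n
    return W, Y, X_hat

X = np.random.randn(500, 20)                 # stand-in data
W, Y, X_hat = pca_fit_reconstruct(X, q=5)
```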

In the following sections, we describe the linear models that have been proposed to operate

on the GMM supervectors for speaker verification and relate them to the linear distortion

model from Eq. (3.23), FA, PPCA and PCA, as discussed in this section.

3.4.3 Classical MAP adaptation

Here, we revisit the MAP adaptation technique discussed previously in the GMM-UBM

system in Section 3.3. If we examine the adaptation in Eq. (3.19), used to update the mean

vectors, it is clear that this is a linear combination of speaker-dependent and speaker-

independent components. In a more generalized way, MAP adaptation can be represented

as an operation on the GMM mean supervector as,

ms = m0 + Dzs, (3.28)

where D is an (Md × Md) diagonal matrix and z_s is an (Md × 1) standard normal random

vector. According to the linear distortion model of Eq. (3.23), mspk = Dzs. The procedure

for training the diagonal matrix D is detailed in Section II-C of [22]. As discussed in [22], under a special condition this model reduces to the MAP adaptation equations of [19] given in Eq. (3.19); this is derived in Appendix A. Thus, MAP adaptation of GMMs for speaker modeling can be seen as adding

a speaker dependent offset to the UBM mean vectors in the supervector space.


3.4.4 Eigenvoice adaptation

Perhaps the first factor analysis related model utilized in speaker recognition was the Eigen-

voice method [68]. Eigenvoice was initially proposed for speaker adaptation in speech recog-

nition [79]. In essence, this method restricts the speaker model parameters to lie in a lower

dimensional subspace, which is defined by the columns of the Eigenvoice matrix. In this

model, a speaker dependent GMM mean supervector ms is expressed as,

ms = m0 + Vys, (3.29)

where m0 is the speaker independent supervector obtained from the UBM, the columns of

the matrix V span the speaker subspace, and y_s are the standard normal hidden variables

known as speaker factors. In accordance with the linear distortion model from Eq. (3.23),

the speaker dependent component is mspk = Vys. Note that this model does not have

a residual noise term as in PPCA or FA. This means, the Eigenvoice model is essentially

equivalent to PCA. The model covariance is therefore VVT. Since supervectors are usually of

a large dimension, a full rank sample covariance matrix, i.e. the super-covariance matrix, is

difficult to estimate with limited amounts of data. Thus, EM algorithms [80, 78] are utilized

to estimate the Eigenvoices. The speaker factors need to be estimated for an enrollment

speaker. Computation of the likelihood score is carried out as provided in Eq. (19) of [23],

using the adapted supervector.

This model implies that the adaptation of the GMM supervector parameters is restricted by the Eigenvoice matrix. The advantage of this model is that when a small amount of data is available for adaptation, the adapted model is more robust since it is restricted to lie in the speaker-dependent subspace, and is therefore less affected by nuisance directions. However, it should be noted that the Eigenvoice model does not capture channel or intra-speaker (session) variability.


3.4.5 Eigenchannel adaptation

Similar to adapting the UBM towards a speaker model, a speaker model can also be adapted

to a channel model [68]. This can be useful when an unseen channel distortion is observed

during test and the enrollment speaker model can be adapted to that channel. Similar to the

Eigenvoice model, the channel variability can also be assumed to lie in a subspace spanned

by the principal eigenvectors of the channel co-variance matrix. According to our distortion

model from Eq. (3.23), for a specific channel h, the term mchn = Uxh, where U is a low

rank matrix that spans the channel subspace, and x_h ∼ N(0, I) are the channel factors.

When Eigenchannel adaptation is combined with classical MAP, we obtain the model for

the speaker and session dependent GMM supervector:

ms,h = m0 + Dzs + Uxh. (3.30)

More details on training the hyper-parameters D and U can be found in [22]. Finally,

likelihood computation can be carried out in a similar way as that used for the Eigenvoice

method.

3.4.6 Joint Factor Analysis (JFA)

The JFA model is formulated by combining Eigenvoice, Eigenchannel and MAP adaptation

together into a single model. This model assumes that both speaker and channel variability lie in lower dimensional subspaces of the GMM supervector space, spanned by the matrices V and U, as before. The model assumes that, for a randomly chosen utterance obtained from speaker s and session h, the GMM mean supervector can be represented by,

ms,h = m0 + Uxh + Vys + Dzs,h. (3.31)

This is the only model presented thus far that considers all four components of the linear distortion model discussed earlier. Indeed, JFA has been shown to outperform other contemporary methods [23]. More details on the implementation of JFA can be found in [23, 81].


Joint Factor Analysis: summary

First proposed: In 2005 by Kenny et al. [81]
Previous methods: MAP adapted GMM, GMM-SVM approach
Proposed method: Model speaker and channel variability in GMM supervectors
Why robust? Exploits the behavior of speakers' features in a variety of channel conditions, learned using factor analysis

Interestingly, a very similar model was being developed independently for face recognition

known as Probabilistic Linear Discriminant Analysis (PLDA) [82]. However, since JFA was

designed for GMM supervectors, the formulations involved processing the acoustic speech

frames and their statistics in different mixtures of the UBM. On the other hand, PLDA

simply assumed that the feature vectors (GMM supervectors in the case of JFA) were already extracted and available for modeling. We will discuss PLDA further in the following

sections.

3.4.7 The i-Vector approach

As discussed earlier, SVM classifiers on GMM supervectors have been a very successful

approach for robust speaker recognition. Factor analysis based methods, especially the JFA

technique, have contributed to some of the best state-of-the-art systems. In an attempt

to combine the strengths of these two approaches, Dehak et al. [83, 2, 84] attempted to

use JFA as a feature extractor for SVMs. In their initial attempt [83], the speaker factors

estimated using JFA were used as features for SVM classifiers. Observing the fact that the

channel factors also contain speaker dependent information, the speaker and channel factors

were combined into a single space termed the “total variability” space [84, 2]. In this factor

analysis model, a speaker and session dependent GMM supervector is represented by,

ms,h = m0 + Tws,h. (3.32)

The hidden variables ws,h ∼ N(0, I) in this case are termed "total factors". Similar

to all factor analysis methods discussed thus far, the hidden variables are not observable


but can be estimated by their posterior expectation. The estimates of the “total factors”,

which can be used as features for the next stage of classifiers, came to be known as “i-

Vectors". The term i-Vector is a short form of "identity vector", in reference to the speaker identification application, and also of "intermediate vector", referring to its intermediate dimension between that of a supervector and an acoustic feature vector [85, 2].

Unlike JFA or other factor analysis methods, the i-Vector approach does not make a

distinction between speaker and channel. It is simply a dimensionality reduction method

of the GMM supervector. In essence, Eq. (3.32) is a simple PCA model on the GMM

supervectors. Training the T matrix can be accomplished using the same procedure as the

Eigenvoice model. Since the i-Vector approach is the current state-of-the-art in speaker

recognition, we discuss the implementation of this method here in detail.

Step-1: Baum-Welch statistics estimation: For each utterance u of the develop-

ment dataset D, the zero and centralized first order Baum-Welch statistics are extracted

with respect to the UBM. Eqs. (3.13) and (3.14) are used for this. We drop the subscripts

s and h and replace them with u denoting an utterance. We have these statistics as

Nu(g) = Σ_n γn(g), and (3.33)

Fu(g) = Σ_n γn(g) xn. (3.34)

Here, the summation is computed over the number of frames available in the respective

utterance.
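For illustration, a minimal NumPy sketch of this statistics extraction is given below; the function name and array layout are our own choices (not from any toolkit), and a diagonal-covariance UBM is assumed.

```python
import numpy as np
from scipy.stats import multivariate_normal

def baum_welch_stats(X, weights, means, covs):
    """Zero- and first-order Baum-Welch statistics (Eqs. 3.33-3.34) of an
    utterance X (n_frames x D) with respect to a diagonal-covariance UBM."""
    M, D = means.shape
    # per-frame log-likelihood under each mixture, plus log mixture weight
    log_lkl = np.stack([multivariate_normal.logpdf(X, means[g], np.diag(covs[g]))
                        for g in range(M)], axis=1) + np.log(weights)
    # frame posteriors gamma_n(g)
    gamma = np.exp(log_lkl - np.logaddexp.reduce(log_lkl, axis=1, keepdims=True))
    N = gamma.sum(axis=0)          # Eq. (3.33), shape (M,)
    F = gamma.T @ X                # Eq. (3.34), shape (M, D)
    # for centralized statistics, subtract the UBM means: F - N[:, None] * means
    return N, F
```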

Step-2: EM Iterations: The T matrix of dimension (MD × R) is initialized with random values. For each utterance u ∈ D, the R × R precision matrix Lu and the R × 1 vector Bu are estimated as [86]:

Lu = I + Σ_{g=1}^{M} Nu(g) T[g]^T Σg^{-1} T[g], and (3.35)

Bu = Σ_{g=1}^{M} T[g]^T Σg^{-1} Fu(g), (3.36)


respectively, where T[g] is the g-th sub-matrix of T of dimension (D × R), and Σg is the covariance matrix of the g-th UBM mixture. The total factors/i-Vector for the utterance u are estimated as:

wu = Lu^{-1} Bu. (3.37)

In each iteration, the g-th block of the T matrix is updated using the following equation:

T[g] = [Σ_{u∈D} Fu(g) wu^T] [Σ_{u∈D} (Lu^{-1} + wu wu^T) Nu(g)]^{-1}. (3.38)

Step-2 is then repeated for re-estimating the T matrix. Note that, since the UBM is fixed,

the Baum-Welch statistics extracted in Step-1 remain unchanged.
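A compact sketch of one such EM pass, under the same notation, is given below; the names are illustrative and the routine assumes the Baum-Welch statistics from Step-1. The E-step evaluates Eqs. (3.35)-(3.37) per utterance, and the M-step applies the block-wise update of Eq. (3.38).

```python
import numpy as np

def ivector_em_step(stats, T, Sigma_inv, R):
    """One EM iteration. stats: list of (N, F) Baum-Welch statistics per
    utterance; T: (M*D, R) total variability matrix; Sigma_inv: (M, D)
    inverse diagonal UBM covariances."""
    M, D = Sigma_inv.shape
    Tb = T.reshape(M, D, R).copy()                 # per-mixture blocks T[g]
    A = np.zeros((M, R, R))                        # accumulators for Eq. (3.38)
    C = np.zeros((M, D, R))
    for N, F in stats:
        TS = Tb * Sigma_inv[:, :, None]            # Sigma_g^{-1} T[g]
        L = np.eye(R) + np.einsum('g,gdr,gds->rs', N, TS, Tb)   # Eq. (3.35)
        B = np.einsum('gdr,gd->r', TS, F)                        # Eq. (3.36)
        Linv = np.linalg.inv(L)
        w = Linv @ B                                             # Eq. (3.37)
        ww = Linv + np.outer(w, w)                 # posterior E[w w^T]
        A += N[:, None, None] * ww
        C += F[:, :, None] * w[None, None, :]
    for g in range(M):                             # Eq. (3.38), block-wise
        Tb[g] = C[g] @ np.linalg.inv(A[g])
    return Tb.reshape(M * D, R)
```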

To extract an i-Vector during testing, the Baum-Welch statistics and the Lu and Bu terms are first computed from the utterance, and then Eq. (3.37) is applied. An illustration of the

effectiveness of the i-Vector representation is shown in Figure 3.3, where i-Vectors are shown

to reside in clusters in a higher dimensional space when an LDA projection is used.

The i-Vector system: summary

First proposed: In 2009 by Dehak et al. [84]
Previous methods: JFA and GMM-SVM
Proposed method: Reduce supervector dimension using factor analysis before classification
Why robust? i-Vectors effectively summarize utterances and allow compensation methods that were not practical with large dimensional supervectors

Table 3.1 highlights the key formulations and remarks for the various factor analysis based models discussed in this chapter.

Table 3.1. Summary of the linear statistical models used in speaker recognition

Model                   Formulation                        Remarks
Classical MAP           ms = m0 + Dzs                      D is diagonal, zs ∼ N(0, I)
Eigenvoice              ms = m0 + Vys                      V is low rank, ys ∼ N(0, I)
Eigenchannel            ms,h = m0 + Dzs + Uxh              U is low rank, (zs, xh) ∼ N(0, I)
Joint Factor Analysis   ms,h = m0 + Uxh + Vys + Dzs,h      U, V are low rank, (xh, ys, zs,h) ∼ N(0, I)
i-Vector                ms,h = m0 + Tws,h                  T is low rank, ws,h ∼ N(0, I)



Figure 3.3. A graphical representation of 79 utterances spoken by 10 individuals collected from the NIST SRE-2004 corpus. The i-Vector [2] representation is used for each segment. This plot is generated using GUESS, an open-source graph exploration software [3], which can visualize higher dimensional data using distance measures between samples.

3.4.8 Channel compensation in the i-Vector domain

The i-Vector approach itself does not directly perform any compensation. In essence, it only

provides a meaningful lower dimensional (∼400–800) representation of a GMM supervec-

tor. Thus, it has most of the advantages available with the supervectors, but due to its


lower dimension, many conventional compensation strategies could be applied for speaker

recognition which were previously not practical with large dimensional supervectors.

Linear Discriminant Analysis (LDA)

LDA is a commonly employed technique in statistical pattern recognition that aims at finding

linear combinations of feature coefficients to facilitate discrimination of multiple classes. It

finds orthogonal directions in the feature space that are more effective in discriminating the

classes. Projecting the original features in these directions improves classification accuracy.

Let D denote the set of all development utterances, let ws,i denote an utterance feature vector (e.g., supervector or i-Vector) obtained from the i-th utterance of speaker s, let ns denote the total number of utterances belonging to speaker s, and let S denote the total number of speakers in D. With this representation, the between and within class covariance matrices are given by,

Sb = (1/S) Σ_{s=1}^{S} (w̄s − w̄)(w̄s − w̄)^T, and (3.39)

Sw = (1/S) Σ_{s=1}^{S} (1/ns) Σ_{i=1}^{ns} (ws,i − w̄s)(ws,i − w̄s)^T, (3.40)

where the speaker independent and speaker dependent mean vectors are given by,

w̄ = (1/S) Σ_{s=1}^{S} (1/ns) Σ_{i=1}^{ns} ws,i, and (3.42)

w̄s = (1/ns) Σ_{i=1}^{ns} ws,i, (3.43)

respectively. The LDA optimization thus aims at maximizing the between class variance while minimizing the within class variance (which is postulated to be due to channel variability). The formulation requires the maximization of the Rayleigh coefficient for a projection vector v,

J(v) = (v^T Sb v) / (v^T Sw v). (3.44)


The projections obtained from this optimization are found by solving the following generalized eigenvalue problem:

Sb v = Λ Sw v. (3.45)

Here, Λ is the diagonal matrix containing the eigenvalues. If the matrix Sw is invertible, the solution can be found from the eigenvectors of the matrix Sw^{-1} Sb. Generally, the eigenvectors corresponding to the largest k < R eigenvalues are used to form a matrix ALDA of dimension R × k given by,

ALDA = [v1 . . . vk], (3.46)

where v1 . . . vk denote the first k eigenvectors obtained by solving Eq. (3.45). The LDA transformation of the utterance feature w is thus obtained by,

ΦLDA(w) = ALDA^T w. (3.47)
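A minimal sketch of this procedure for i-Vectors is shown below; it assumes SciPy's generalized symmetric eigensolver and a well-conditioned (invertible) Sw, and all names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(W, labels, k):
    """Compute the R x k LDA matrix A_LDA (Eqs. 3.39-3.46).
    W: (n_utts, R) stacked i-Vectors; labels: speaker id per row."""
    speakers = np.unique(labels)
    w_bar = np.mean([W[labels == s].mean(axis=0) for s in speakers], axis=0)
    R = W.shape[1]
    Sb = np.zeros((R, R)); Sw = np.zeros((R, R))
    for s in speakers:
        Ws = W[labels == s]
        d = (Ws.mean(axis=0) - w_bar)[:, None]
        Sb += d @ d.T                               # Eq. (3.39) accumulation
        Sw += np.cov(Ws, rowvar=False, bias=True)   # Eq. (3.40) accumulation
    Sb /= len(speakers); Sw /= len(speakers)
    vals, vecs = eigh(Sb, Sw)                       # Sb v = lambda Sw v, Eq. (3.45)
    order = np.argsort(vals)[::-1]                  # largest eigenvalues first
    return vecs[:, order[:k]]                       # A_LDA, Eq. (3.46)

# usage: w_lda = lda_projection(W, labels, k=200).T @ w   # Eq. (3.47)
```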

Nuisance Attribute Projection (NAP)

The NAP algorithm was originally proposed in [21]. In this approach, the feature space is

transformed using an orthogonal projection in the channel’s complementary space, which

depends only on the speaker. The projection is calculated using the within-class covariance

matrix. Define the d × d projection matrix [21] of co-rank k < d:

P = I − u[k] u[k]^T,

where u[k] is a rectangular matrix of low rank whose columns are the k principal eigenvectors of the within class covariance matrix Sw given by Eq. (3.40). The NAP projection is performed on the utterance feature vector w as,

ΦNAP(w) = Pw. (3.48)
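A minimal sketch follows; it assumes Sw has already been estimated via Eq. (3.40), and the names are illustrative.

```python
import numpy as np

def nap_projection(Sw, k):
    """NAP (Eq. 3.48): remove the k principal within-class (nuisance)
    directions via a co-rank k orthogonal projection."""
    vals, vecs = np.linalg.eigh(Sw)          # eigenvalues in ascending order
    U = vecs[:, -k:]                         # k principal eigenvectors u_[k]
    return np.eye(Sw.shape[0]) - U @ U.T     # P = I - U U^T

# usage: w_nap = nap_projection(Sw, k=64) @ w
```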


Within Class Covariance Normalization (WCCN)

This normalization was originally proposed for improving robustness in an SVM based

speaker recognition framework [71] using a one-versus-all decision approach. The WCCN

projection aims at minimizing the false alarm and miss error rates during SVM training.

Implementation of the strategy begins with using a dataset D, similar to that described

in the previous section. The within class covariance matrix Sw is calculated using Eq. (3.40)

and the WCCN projection is performed as,

ΦWCCN(w) = AWCCN^T w, (3.49)

where the WCCN transformation matrix AWCCN is computed through the Cholesky factorization of Sw^{-1} such that,

Sw^{-1} = AWCCN AWCCN^T. (3.50)

In contrast to LDA and NAP, a WCCN projection conserves the directions of the feature

space.
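A minimal sketch of Eqs. (3.49)-(3.50) is given below; it assumes Sw is invertible, and the names are illustrative.

```python
import numpy as np

def wccn_matrix(Sw):
    """WCCN transform: lower-triangular Cholesky factor A such that
    inv(Sw) = A A^T (Eq. 3.50)."""
    return np.linalg.cholesky(np.linalg.inv(Sw))

# usage: w_wccn = wccn_matrix(Sw).T @ w   # Eq. (3.49)
```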

i-Vector Length normalization

A very common pre-processing step applied to i-Vectors is length normalization [87]. This process simply normalizes each i-Vector by its L2 norm (i.e., the square root of the sum of its squared entries). The normalized form of an i-Vector w is given by,

wnorm = w / ||w||. (3.51)

Subtracting the mean of the i-Vectors computed over a large dataset is also commonly

employed for normalization.
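Both steps together amount to the following minimal sketch (illustrative names, assuming i-Vectors stacked row-wise):

```python
import numpy as np

def length_normalize(W, mean=None):
    """Length normalization (Eq. 3.51), with optional subtraction of a
    mean i-Vector estimated on a large dataset."""
    if mean is not None:
        W = W - mean
    return W / np.linalg.norm(W, axis=-1, keepdims=True)
```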

3.4.9 Speaker verification using i-Vectors

After i-Vectors were introduced, many previously available pattern recognition methods were, in essence, effectively applied in this domain. We discuss some of the more popular methods of classification using i-Vectors in this section.


SVM classification

As discussed earlier, the i-Vector representation was discovered in an attempt to utilize JFA as a feature extractor for SVMs. Thus, i-Vectors were initially used with SVMs employing different kernel functions [2]. The idea here is the same as an SVM operating on GMM supervectors, except that the i-Vectors are now used as the utterance dependent features. Due to the lower dimension of i-Vectors compared to supervectors, applying the LDA and WCCN projections together became practical and effective.

Cosine Distance Scoring (CDS)

In [2], cosine similarity based scoring was proposed for speaker verification. For this measure, the match score between a target i-Vector wtarget and a test i-Vector wtest is computed as their normalized dot product,

CDS(wtarget, wtest) = (wtarget · wtest) / (||wtarget|| ||wtest||). (3.52)

This approach turned out to be very effective, despite its very simple nature, since no

enrollment process of a speaker was required.
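The scoring rule of Eq. (3.52) reduces to a one-line computation, as in the sketch below (illustrative names):

```python
import numpy as np

def cds_score(w_target, w_test):
    """Cosine distance scoring, Eq. (3.52)."""
    return (np.dot(w_target, w_test)
            / (np.linalg.norm(w_target) * np.linalg.norm(w_test)))
```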

Probabilistic Linear Discriminant Analysis (PLDA) classification

PLDA was first utilized for session variability compensation in face recognition [82]. It essentially follows the same modeling assumptions as JFA, i.e., a pattern vector contains class dependent and session dependent variability that lie in lower dimensional subspaces. An i-Vector extracted from session h of speaker s is decomposed as,

ws,h = w0 + Φβs + Γαh + ns,h. (3.53)

Here, w0 ∈ RR is the speaker independent mean i-Vector, Φ is the R×Nev low rank matrix

representing the speaker dependent basis functions/eigenvoices, Γ is the R × Nec low rank


matrix spanning the channel subspace, βs ∼ N (0, I) is an Nev × 1 hidden variable (i.e.,

speaker factors), αh ∼ N (0, I) is an Nec × 1 hidden variable (i.e., channel factors), and

ns,h ∈ RR is a random vector representing the residual noise.
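To make the modeling assumptions of Eq. (3.53) concrete, the following sketch draws synthetic i-Vectors for one speaker from the PLDA model; it illustrates the generative assumptions only (not a training or scoring routine), and all names are illustrative.

```python
import numpy as np

def sample_plda_ivectors(w0, Phi, Gamma, noise_cov, n_sessions, rng):
    """Draw n_sessions i-Vectors for one speaker per Eq. (3.53):
    w = w0 + Phi*beta_s + Gamma*alpha_h + n_{s,h}."""
    R, Nev = Phi.shape
    beta = rng.standard_normal(Nev)                   # speaker factors, shared
    W = []
    for _ in range(n_sessions):
        alpha = rng.standard_normal(Gamma.shape[1])   # channel factors
        n = rng.multivariate_normal(np.zeros(R), noise_cov)  # residual noise
        W.append(w0 + Phi @ beta + Gamma @ alpha + n)
    return np.stack(W)

# usage: W = sample_plda_ivectors(w0, Phi, Gamma, noise_cov, 5,
#                                 np.random.default_rng(0))
```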

PLDA was first introduced for speaker verification in [17] using a heavy-tailed distribution assumption on i-Vectors instead of a Gaussian assumption. Later, it was shown that when i-Vectors are length normalized, a Gaussian PLDA model performs equivalently to its heavy-tailed version [87]. Since the latter is computationally more expensive, Gaussian PLDA models are more commonly used. Also, the use of a full-covariance noise model for ns,h is feasible in this formulation, since it allows one to drop the Eigenchannel term from Eq. (3.53) without loss of performance.

3.5 Research progress time-line

Much progress in the field of speaker recognition has been made recently. A time-line view of the research progress over the past three decades is summarized in Figure 3.4. Only research works of significant impact that are relevant to this work are shown in the plot. Some techniques shown in the time-line are not discussed here, including score normalization, Automatic Speech Recognition (ASR) based, and long-term feature based methods; however, sample references are included for the interested reader.


• VQ based speaker recognition [62, 61, 63]
• GMM based text-independent speaker identification [18]
• Universal background model used as an anti-model [65]
• Adapted GMMs based method (GMM-UBM) [19]
• Score normalization methods [6]
• Speaker recognition using word N-grams [88]
• Text-constrained speaker identification using GMMs [89]
• SVM Generalized Linear Discriminant Sequence kernel [72]
• SuperSID: using high level features [48]
• Feature mapping [90]
• Phonetic speaker recognition using SVM [91]
• Speaker adaptive cohort selection (GMM ATNorm) [92]
• Phonetic speaker recognition using lattice decoding [93]
• Nuisance attribute projection [21]
• ASR MLLR transforms as features [94]
• Eigenvoice based speaker recognition [80]
• GMM supervector SVM (GMM-SVM) [70]
• Eigenchannel based speaker recognition [23]
• Joint factor analysis (JFA) [22]
• i-Vector and SVM [2]
• i-Vector and Heavy-tailed PLDA [4]
• i-Vector length norm G-PLDA [87]

Figure 3.4. Research progress on speaker recognition: Time-line view of the past 30 years (1985-2015); milestones are listed in chronological order.


CHAPTER 4

EFFECTIVE UNIVERSAL BACKGROUND MODEL TRAINING

State-of-the-art Gaussian Mixture Model (GMM) based speaker recognition/verification sys-

tems utilize a Universal Background Model (UBM), which typically requires extensive re-

sources, especially if multiple channel and microphone categories are considered. As discussed

in Chapter 3, traditional acoustic modeling begins with the UBM for most recent speaker

recognition systems. Since little research has considered effective UBM acoustic model train-

ing, in this chapter, we systematically analyze speaker verification system performance when

UBM training data is selected and purposefully altered in different ways. These alterations

include: i) variation of the amount of data, ii) sub-sampling of the feature frames, and iii)

variation in the number of speakers. An objective measure is formulated from the UBM

covariance matrix, and is found to be highly correlated with system performance both when the data amount is varied while keeping the speaker population constant, and when the number of UBM speakers is increased while keeping the total data amount constant. The advantages of feature sub-sampling for improving UBM training speed are also discussed, and a novel and

effective phonetic distance based frame selection method is developed. The sub-sampling

methods presented are shown to retain baseline EER system performance using only 1%

of the original UBM data, resulting in a drastic reduction in UBM training computation

time. With respect to UBM speakers, the effect of systematically controlling the number of

training (UBM) speakers versus overall system performance is analyzed. It is shown experi-

mentally that increasing the inter-speaker variability in the UBM data while maintaining a

constant overall total data size gradually improves system performance. Finally, two alterna-

tive speaker selection methods based on different speaker diversity measures are presented.


Using the proposed schemes, it is shown that by selecting a diverse set of UBM speakers,

baseline system performance can be retained using less than 30% of the original UBM speak-

ers. The experiments conducted in this chapter are based on a GMM-UBM system, mostly

due to its low computational complexity. The NIST SRE-2008 [14] framework was used for the evaluations. The research presented in this chapter was previously published in [27] (©2010 IEEE) and [26] (©2011 IEEE).

4.1 Motivation

The UBM is essentially a very large GMM trained to represent the speaker independent

distribution of speech features [19], and is employed as the expected alternative speaker model

during the verification task. It is also employed in open-set speaker recognition systems. In the primary GMM based systems (GMM-UBM, GMM-SVM, JFA [22] and i-Vector

[2]), all speaker models are dependent on the UBM, making it a key element. However,

despite its importance, focused research on UBM training has not yet been conducted in

the literature. The general strategy is to use as much speech from as many speakers as possible, comprising a wide range of speech/channel conditions, without much thought regarding data size or performance tradeoffs. Although recent research has considered other aspects of the UBM, such as adaptive individual background model training [95] and the application of speaker normalization techniques to UBM data [96], basic and fundamental questions regarding the construction of a UBM and its implications for system performance remain open. In this chapter, we present an in-depth consideration of the UBM training process and attempt to gain insight into how system performance is related to specific UBM composition.


4.1.1 Parameters of the UBM

There are a number of distinct parameters involved in the UBM training process. It is

possible to classify these parameters into two broad categories: a) algorithm parameters and b) data parameters. Algorithm parameters are variations in the training process

which include the number of mixture components, method of training, number of iterations,

method of initialization, etc. The data parameters include different ways of defining the

subset of available training data. These parameters consider the corpus, the amount of data,

number of speakers in the data, amount of data per speaker, method of selecting speakers,

ways of using the feature vectors, data balancing according to channel, microphone, language,

or other variability, and so on. Since the only available objective criterion that can measure the quality of a UBM is overall/final system performance, finding a better UBM becomes a challenge: it generally relies on trial and error, making it impractical to vary all the mentioned parameters and find the optimal combination that gives the best performance. Thus, in order to limit the scope of this research, we focus only

on a limited set of the data parameters, and attempt to analyze their effects on system

performance in order to answer some fundamental questions concerning UBM training.

4.2 The ideal UBM

As noted earlier, the UBM is a speaker-independent GMM trained with acoustic features

from a large set of speakers to represent the general, speaker independent distribution of

features. In the context of speaker verification, we would like to revisit the likelihood ratio

test discussed in Section 3.3.1. Given an observation O, and a hypothesized speaker s, the

task of speaker verification can be stated as a hypothesis test between:

H0 : O is from speaker s,

H1 : O is not from speaker s. (4.1)


In general, the hypotheses H0 and H1 are represented by a speaker dependent model Λs and a background model Λs̄, respectively. Thus, for the observed feature vectors X, the likelihood ratio test is performed by evaluating:

p(X|Λs) / p(X|Λs̄) ≥ τ : accept H0,
p(X|Λs) / p(X|Λs̄) < τ : reject H0. (4.2)

Thus, ideally, the background model Λs̄ should represent the entire space of all possible alternative speakers to the hypothesized speaker s, which leads to a speaker

specific background model. This approach has been adopted by many researchers in the past

[95, 97, 98, 99]. However, creating a speaker specific background model for each enrolled

speaker can be computationally expensive, especially for a large number of speakers, which

is typically the case in NIST SRE evaluations [14]. Also, it may not always be necessary

to represent all outside speakers, versus only those that may attempt to enter the speaker

verification system as imposters. Thus, most modern speaker verification systems use a

single speaker independent background model, (i.e., the UBM) for modeling the alternative

hypothesis in the likelihood ratio test. Generally, the UBM is trained using a large amount

of data coming from a variety of different speakers and channel/microphone conditions, so

that the model contains at least some aspects of the variabilities that could be encountered

within the unknown test data.
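In practice, the test of Eq. (4.2) is evaluated in the log domain on a frame-averaged basis, as in the minimal sketch below (illustrative names; per-frame log-likelihoods are assumed to have been computed already):

```python
import numpy as np

def llr_verify(loglkl_spk, loglkl_ubm, tau):
    """Log-domain likelihood ratio test of Eq. (4.2): accept H0 when the
    frame-averaged log-likelihood ratio exceeds log(tau)."""
    llr = np.mean(loglkl_spk - loglkl_ubm)   # average LLR over frames
    return llr >= np.log(tau)
```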

4.2.1 Data balancing

Inspired by the general guideline of UBM data as presented in [19], the data requirements for

an ideal UBM can be specified as follows. Assume that there is only transmission channel

and microphone variability present in the available development data. Let the following


variables,

S = {si}, 1 ≤ i ≤ Ns, (4.3)

M = {mi}, 1 ≤ i ≤ Nm, and (4.4)

C = {ci}, 1 ≤ i ≤ Nc (4.5)

denote the set of all available speakers S, microphone types M , and transmission channel

types C, respectively, in a single gender database. Here, Ns, Nm, and Nc denote the number

of speakers, microphones and transmission channels, respectively. Let X denote the set of

all available features in the database. Next, define the following data sets:

Xsi = {x|x belongs to speaker si},

Xmi= {x|x was recorded with mic. mi}, (4.6)

Xci = {x|x comes from channel ci}.

Obviously, considering the total number of speakers, microphones, and channels, the union

of each reflects the total data available.

X = (∪_{i=1}^{Ns} Xsi) = (∪_{i=1}^{Nm} Xmi) = (∪_{i=1}^{Nc} Xci). (4.7)

If XI ⊂ X denotes the set of features that should be used for the ideal UBM, it should contain features from all these variabilities in proportions necessary to provide consistent performance with respect to the test data. If the prior probabilities of the occurrence of a speaker, microphone type, or channel condition in the test data are known, the feature set XI should fulfill the following constraints:

n(Xsi ∩ XI) = R[p(si) n(XI)] ∀ si ∈ S, (4.8)
n(Xmi ∩ XI) = R[p(mi) n(XI)] ∀ mi ∈ M, (4.9)
n(Xci ∩ XI) = R[p(ci) n(XI)] ∀ ci ∈ C, (4.10)


where n(·), R[·] and p(·) indicate the number of elements in a set, the round-off operation,

and the prior probability of an element in the test data, respectively. Typically, there is no

prior knowledge concerning the test data condition distribution, leaving the system designer

with the only option of assuming these prior probabilities to be equal. This is known as

balancing the UBM data as discussed in [19]. It should be noted that Eq. (4.8) assumes all

speakers to be considerably diverse in nature; otherwise similar speakers (i.e., cohorts) may

introduce an imbalance in the data. This issue is further discussed in Section 4.7.
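As a concrete illustration of Eqs. (4.8)-(4.10), the sketch below allocates per-condition frame quotas from assumed priors; the condition labels and counts are hypothetical, and uniform priors reflect the no-prior-knowledge case discussed above.

```python
def balanced_frame_quota(priors, total_frames):
    """Per-condition frame quotas for UBM data balancing:
    n(X_c ∩ X_I) = round(p(c) * n(X_I)), Eqs. (4.8)-(4.10)."""
    return {c: int(round(p * total_frames)) for c, p in priors.items()}

# usage with assumed equal priors over four hypothetical microphone types:
mics = ["carbon", "electret", "headset", "speakerphone"]
quota = balanced_frame_quota({m: 1 / len(mics) for m in mics}, 540000)
```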

4.2.2 Data amount

It is clear that the set X of all available data features should be large enough to represent all

variabilities faithfully. For example, if there is only 5 min of data for a cordless phone in the

entire 10 hour UBM data set, trying to balance the microphone type would require us to use

only 5 min of data from each microphone type, which may lead to insufficient amounts of

UBM data. However, what constitutes a sufficient data amount is system dependent; with only clean train and test data (e.g., an experiment with the TIMIT dataset), a small amount of data may be required, whereas when cross-channel conditions are encountered, the UBM

dataset needs to be expanded to cover more variations. Thus, for a given data set, the

amount of data that appropriately represents the variability of the entire corpus should be

sufficient for the UBM. Since human speech features can only occupy a limited region in the

feature space due to physiological constraints, it is expected that the variability of the data

would be saturated when the data amount becomes very large, assuming other conditions

(e.g., channel, microphone, or language variability) are kept the same. Let ϑ(·) represent a

function that can measure the variability of the UBM data, then as the size of the set X

increases, ϑ(X ) should approach some constant value. Mathematically, this can be written

as follows,

lim_{n(X)→∞} ϑ(X) = υX. (4.11)


Here, υX is a constant that represents the amount of variability present in the entire dataset

according to the defined measure. Having defined the scope of the data characteristics, it is

now possible to consider alternative UBM training schemes. A baseline system scenario is

first considered in the next section.

4.3 Baseline system description

Since the objective of this research is focused only on UBM training, a fairly standard

GMM-UBM [19] baseline system is employed without any mismatch compensation or score

normalization. The male portion of the 5 min telephone train/test condition trials [14] of the

NIST 2008 SRE are used for all evaluations. A prime reason for not using expanded or en-

hanced GMM based systems (i.e., Joint Factor Analysis (JFA), Eigenchannel [22], or i-Vector [2] with PLDA [100]) is that they require time consuming training of the Eigenchannel and Eigenvoice matrices each time the UBM is retrained. Given the number and extent of

the experiments required in this study, it was decided that using such enhancements would

be impractical. Future studies could further explore the impact of UBM construction on other system processing tasks.

4.3.1 Front-end processing

For the front-end, 39-dimensional MFCC features (MFCC+∆+∆∆) are extracted using a

25 ms analysis window with a 10 ms shift. Next, feature warping [56], based on applying a 3-s sliding window, is performed. To remove silence frames, a phoneme recognizer [101] based

voice activity detector (VAD) is utilized.


4.3.2 UBM training

For baseline UBM training, 1024 mixtures are used. UBM training is performed using the

Maximum Likelihood (ML) criterion with HTK [102] tools, performing 15 iterations per

mixture split.

4.3.3 UBM database

Two different databases are used for UBM training. They are summarized below including

the system performance using these two sets in the baseline system.

ID          # utterances   # speakers   Source                        % EER
Dataset-1   2019           126          SRE-04 1-s (5 min training)   11.43
Dataset-2   5685           392          SRE-04 & SRE-06               11.41

4.3.4 Speaker modeling and scoring

For modeling, the gender dependent UBM is adapted to each enrollment speaker to obtain a speaker dependent model, using classical MAP adaptation [19] with one iteration and a relevance factor of 19.

During scoring, a standard 20-best expected log likelihood ratio scoring is employed.

4.4 UBM data: What is a sufficient amount

The first question that arises concerning the UBM data is the required amount. A common

assumption in UBM training is that the more data used, the better the expected system

performance. UBMs with 512, 1024, 2048 or more mixtures are sought after, with the

assumption that they represent the definitive world speaker acoustic space. Research groups

involved in the NIST SRE typically use 5 min utterances from all NIST 2004-2005 data along

with the Switchboard Cellular I and II data [103]. However, there is no concrete evidence

that using the maximum amount of data guarantees better overall performance. According


to [19], as long as the development speaker population is kept the same, a small amount

of data is sufficient for reasonable system performance. This suggests that the degree of

inter-speaker variation in the data is more important than the absolute amount of data per

speaker. In this section, system performance variation is examined by increasing the total

amount of data for UBM training, while keeping the speaker population the same.

An experiment was performed using the UBM database-1 as described in Section 4.3.

It should be noted that in this UBM data set, a single speaker may occur multiple times

in different channel/mic conditions. This database is used mainly because of the following

reasons:

• The NIST SRE-2004 1-s corpus contains sufficient variability for the task;

• Other research groups have shown success in UBM training using this set alone [104,

105, 106];

• The actual number of speakers in this experiment is not a primary concern;

• Each utterance is 5 min in duration, making it convenient to extract equal amounts of

data from each utterance uniformly.

The total duration of the training data is 168.25 hours. Next, the UBM is trained using only

the first n feature frames from each utterance, and evaluated within the GMM-UBM system

(described in Section 4.3) for the male trials only. For different values of n, the equivalent

total data amount in hours, the EER and CPU times required for training are obtained.

How the UBM data amount affects the CPU training time and system EER is illustrated in

Figure 4.1(a) and (b), respectively.

4.4.1 Average Weighted Variance (AWV)

A simple formula is proposed to measure the variability of the data. Since a variance measure

of a 39 dimensional feature vector may not provide an accurate measure of variability of the


data, an analysis of the UBM covariance matrices is performed. Here, a parameter, the Average Weighted Variance (AWV) Σ̄, is defined, which is computed from the UBM diagonal covariance matrices as follows. For the acoustic features x ∈ R^K, let the UBM Λ0 be expressed as,

f(x|Λ0) = Σ_{g=1}^{M} [πg / ((2π)^{K/2} |Σg|^{1/2})] exp[−(1/2)(x − µg)^T Σg^{-1} (x − µg)], (4.12)

where πg, µg, Σg, K and M denote the weights, mean vectors, covariance matrices, feature

dimension and number of Gaussian mixtures, respectively. Assuming a diagonal covariance

matrix Σg, let

Σg = diag(σ²g,1, ..., σ²g,K). (4.13)

Next, define the average weighted variance (AWV), Σ̄, as

Σ̄ = (1/K) Σ_{g=1}^{M} πg Σ_{j=1}^{K} σ²g,j. (4.14)

Thus, this measure will serve as our data diversity measure υX as defined in Eq. (4.11).
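Eq. (4.14) reduces to a weighted sum over the UBM's diagonal variances, as in the minimal sketch below (illustrative names):

```python
import numpy as np

def average_weighted_variance(weights, covs):
    """AWV of Eq. (4.14). weights: (M,) mixture weights; covs: (M, K)
    diagonal covariance entries of the UBM."""
    K = covs.shape[1]
    return float(weights @ covs.sum(axis=1)) / K
```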

From Figure 4.1(b), it is clear that performance comparable to the baseline system is obtained using only ∼1.5 hours of UBM data, which amounts to about ∼2.7 seconds of data from each utterance. This is not surprising, since this ∼1.5 hours of data contains all the inter-speaker variability present across the utterances of the original NIST SRE-2004 corpus used. Very interestingly, from Figure 4.2(a), a clear relation between Σ̄ and system EER is observed. After using more than 1.5 hours of data, the variance parameter Σ̄ saturates, which indicates that the 2.7 seconds of data used from each utterance is actually sufficient to represent the variability, with greater amounts having less impact on the UBM in this case. Obviously, this duration cannot be completely generalized. The scatter plot in Figure 4.2(b) shows the correlation between Σ̄ and EER more clearly.

Figure 4.1. (a) UBM training CPU time variation with changing amount of UBM data (hrs). (b) Variation of system EER with total amount of UBM data (hrs). Feature frames are selected uniformly from each utterance.

These two parameters have a correlation coefficient of −0.8372. From Figure 4.1(a), an exponential relation can be seen between UBM training CPU time and the total UBM data amount. In these experiments, it was observed that using about 1.5 hours of data requires about 30 CPU minutes to train the UBM, while more than 200 CPU minutes are needed if all the data is used. Since CPU computation time can be considered linearly proportional to the complexity of training, this indicates that, contrary to the popular belief that "more training data is better", adding five times the computational resources with more than 160 hours of training data has a negligible contribution to improving overall system performance.

Thus, the conclusion drawn here is that if the selected UBM data set is well chosen (i.e., contains sufficient speech/speaker variability), using all the features of each utterance is not


Figure 4.2. (a) Variation of UBM average weighted variance (Σ̄) with total amount of UBM data (hrs). (b) Scatter plot showing the correlation between EER and average weighted variance (Σ̄). All data points from Figure 4.1(b) and Figure 4.2(a) are used to generate this scatter plot.

necessary. In the next section, we investigate how system performance is affected by alternative feature selection methods.

4.5 Sub-sampling of feature frames

The idea of using a subset of features from a given UBM utterance arises as an inevitable

consequence of using a reduced amount of data. The simplest methods for sub-sampling

feature frames would include decimation and random feature selection, which have already

been utilized to improve the CPU training time of a GMM [107, 25]. Clearly, these feature

sub-sampling methods do not consider the actual acoustic content of the features, and select


features in a blind manner. Here, we consider an adaptive phone dependent feature sub-

sampling scheme for effectively capturing the subtle nuances of features in each utterance

using a very small amount of data. The goal is to maximize the data variability that can

be captured from a given UBM utterance using a minimal number of features. This is

motivated by considering the feature selection issue at the phone level. Since inter-speaker

variability in the UBM data has a higher contribution to system performance [19], intra-

speaker phone variation should be less relevant for the UBM. When a long duration utterance

is used for a speaker, some phones will occur more frequently and with greater duration, and

therefore would contribute to the probability density function (PDF) components in the

UBM that represent the intra-speaker distribution of that phone, causing an imbalance.

Thus, reducing the development data by means of proper selection of the training feature

vectors will obviously improve computation speed, with a possible improvement in overall

system performance as well.

In the previous section, it was established that only 2.7 seconds of data from each 5 min

utterance of the UBM data is sufficient for system performance equivalent to the baseline

configuration. In this section, several alternative approaches for selecting this subset of

development data for UBM training are considered for more effective representation of the

feature space of each utterance. Time versus frequency spectrograms of these approaches are

illustrated in Figure 4.3 (b)–(e) using a spectrogram of an original TIMIT utterance, shown

in Figure 4.3(a). The use of the first n feature vectors from each utterance, as performed

in the previous section, is termed "leading feature selection" (LFS) and is depicted in

Figure 4.3(b). As noted earlier, sub-sampling the feature frames can also be done uniformly

or randomly [107, 25]. These methods are denoted as UFS (“uniform feature selection”)

and RFS (“random feature selection”), and illustrated in Figure 4.3(c) and (d), respectively.

Now, though the sub-sampling methods LFS, UFS and RFS would reduce computation time,

they are overly simplified and completely data independent. These methods do not consider


Figure 4.3. Conceptual illustration of the feature selection schemes (selected frames are shown in dark). (a) Original utterance spectrogram, (b) LFS, (c) UFS, (d) RFS, and (e) IFS.

the specific distribution of the phonetic content over time. Thus, we propose a generic

method termed “intelligent” feature/frame selection (IFS), which aims to select a diverse

set of n training feature frames from the set of input training utterances. This method

assesses the similarity of successive frames using a phonetically motivated distance measure,

with selection of a feature frame only if the corresponding dissimilarity is higher than some

threshold. In Figure 4.3 (e), a conceptual IFS method that attempts to select a frame from

the beginning of each distinct phone is illustrated.

Clearly, there can be variations in this approach if alternative distance criteria between features are used. Since one design criterion here is training speed, in this phase the Euclidean distance is used due to its simplicity.


4.5.1 Feature selection based on Euclidean distance

In this section, an intelligent feature selection (IFS) scheme is proposed based on the simple

Euclidean distance between features (IFS-EU). The aim here is to estimate the similarity

between successive feature frames using this distance measure, and select a feature frame

only if the frame is sufficiently different from those previously selected. From an intuitive

understanding of the Euclidean distance, it is noted that this distance measure is related

to the smoothed log-spectral distance [108] when applied to cepstral feature vectors. The

formulation is initiated by deriving the probability density function (PDF) of the distance

function between feature vectors.

PDF of Euclidean distance between features

Assume that the K dimensional feature vectors of the development speaker data, originating

from a specific phone, can be modeled by an independent, wide sense stationary (WSS), white

Gaussian vector random sequence x[n] with a covariance function matrix KXX [m,n] given

by,

KXX[m, n] = diag(λ1, ..., λK) δ[m − n], (4.15)

where m, n denote the frame indices, and λp (p = 1, ..., K) are the variances of the individual cepstral coefficients. The Euclidean distance between the mth and nth feature vectors will be,

d(m, n) = ||x[m] − x[n]||. (4.16)

The feature vectors have a common mean, and thus the difference inside the norm in Eq. (4.16) will be a zero mean vector random sequence. Also, due to the independence assumption, the distribution of d(m, n) is independent of m and n. Thus,

d(m, n) = d = ||Z||, (4.17)


Figure 4.4. (a) Comparison of the theoretical PDF, its Gaussian approximation, and the actual PDF obtained from a feature-distance histogram. 13 dimensional MFCC coefficients were used and the parameter λ was calculated directly from the data. For this data, λ = 281.6836, µD = 83.9506 and σ²D = 276.07. (b) The PDF of the inter-feature Euclidean distance and the proposed distance threshold (shown for α = 0.1 and 0.2).

where Z = x[m] − x[n] is a zero mean Gaussian random vector with covariance matrix KZZ, which is found to be,

KZZ = diag(2λ1, ..., 2λK). (4.18)

The factors of 2 are introduced because each element of Z is constructed from the subtraction of two independent white Gaussian random variables having variances λp, where p = 1, ..., K. From Eq. (4.17), it is possible to write,

d² = Σ_{i=1}^{K} Z²i = Σ_{i=1}^{K} (2λi) W²i, where Wi ∼ N(0, 1). (4.19)

For simplification, assume that the effect of the individual λi values in Eq. (4.19) can be

approximated using a lumped parameter λ. Thus,

d² ≈ 2λ Σ_{i=1}^{K} W²i = 2λY, (4.20)

where λ is defined as the average variance given by

λ = (1/K) Σ_{i=1}^{K} λi. (4.21)


In Eq. (4.20), Y = Σ_{i=1}^{K} W²i is a squared sum of zero mean independent Gaussian random variables, and thus follows a chi-squared distribution given by,

fY(y) = [(1/2)^{K/2} / Γ(K/2)] y^{(K/2 − 1)} e^{−y/2}. (4.22)

From Eq. (4.20), d = √(2λY). Using this transformation in Eq. (4.22), the PDF of d can be obtained as,

fD(d) = [2^{1−K} / Γ(K/2)] [d^{K−1} / λ^{K/2}] exp(−d² / (4λ)). (4.23)

The mean and variance of this distribution can be found as,

µD = 2√λ Γ((1 + K)/2) / Γ(K/2), and (4.24)

σ²D = 2Kλ − µ²D, (4.25)

respectively.

Thus, the PDF of d provides the distribution of the distances between any two randomly

selected feature vectors in the data set. In other words, if the goal is to select feature vectors that are farther apart on average, a set of features should be selected in which each pair has a distance greater than µD, provided the PDF parameters are known.

Calculation of distance threshold

In this feature selection problem, the data is processed on a frame-by-frame basis. Assuming

that the PDF parameters are known for the current frame, select the next frame if its distance

from the current frame is greater than a threshold dth. For a fixed value α ∈ [0, 1], define

dth as,

P[d > dth] = ∫_{dth}^{∞} fD(z) dz = α. (4.26)

The process is illustrated in Figure 4.4(b) for α = 0.1 and 0.2. This implies that a feature vector is selected only if its distance from the current feature is so large that the event is less probable than α, suggesting a high likelihood of a change in the phone represented by the feature. It is observed that the PDF fD(d) can be closely approximated by a Gaussian


distribution having mean µD and variance σ²D. Figure 4.4(a) compares the function fD(d), its Gaussian approximation, and a histogram estimated from 13 dimensional MFCC coefficients of a UBM utterance. Using the Gaussian approximation, it is possible to obtain from Eq. (4.26),

dth = µD + √2 σD erfc^{-1}(2α), (4.27)

where erfc^{-1} is the inverse of the complementary error function (erfc). Here, erfc(·) is defined as:

erfc(x) = (2/√π) ∫_{x}^{∞} e^{−t²} dt.
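The threshold of Eq. (4.27) can be evaluated directly from λ and K, as in the minimal sketch below (illustrative names, using SciPy's erfcinv and gamma functions):

```python
import numpy as np
from scipy.special import erfcinv, gamma

def distance_threshold(lam, K, alpha):
    """Frame-selection threshold d_th of Eq. (4.27), via the Gaussian
    approximation of f_D(d) with moments from Eqs. (4.24)-(4.25)."""
    mu_D = 2.0 * np.sqrt(lam) * gamma((1 + K) / 2) / gamma(K / 2)  # Eq. (4.24)
    var_D = 2.0 * K * lam - mu_D ** 2                              # Eq. (4.25)
    return mu_D + np.sqrt(2.0 * var_D) * erfcinv(2.0 * alpha)      # Eq. (4.27)
```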

Estimation of PDF parameters

Here, a recursive method is employed for estimating the feature vector mean and variance, similar to [109]. Denoting λX[n] as the vector containing the diagonal elements of KXX[0, 0], and µX[n] as the mean vector at the nth frame, the update equations,

µX[n] = βm µX[n−1] + (1 − βm) x[n], and (4.28)

λX[n] = βv λX[n−1] + (1 − βv)(x[n] − µX[n])·², (4.29)

are used, where βm, βv ∈ [0, 1) are smoothing parameters.

Implementation

Let i denote the current frame index and set j = i + 1. For initialization (i = 1), x[i] is

always selected, and µX [i] and λX [i] are calculated from x[i] and x[j] as,

µX[i] = 0.5(x[i] + x[j]), and (4.30)

λX[i] = 0.5(x[i]·² + x[j]·²) − µX[i]·², (4.31)

where ()·2 denotes an element-wise square operation. Next, λ and dth are calculated using

Eq. (4.21) and Eq. (4.27). Now, j is iteratively incremented by 1 and d(i, j) is calculated


from Eq. (4.16). The values µX[i] and λX[i] are updated at each step using Eq. (4.28) and Eq. (4.29), along with the threshold dth. If d(i, j) > dth, x[j] is selected. Next, i = j and j = i + 1 are set, and the process is repeated until the desired number of feature frames is selected. In our experiments, the settings used are α = 0.1, βm = 0.8 and βv = 0.6.
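The complete selection loop, combining Eqs. (4.27)-(4.31), is sketched below; the function and variable names are illustrative, and the exact update ordering is one reasonable reading of the procedure above.

```python
import numpy as np
from scipy.special import erfcinv, gamma

def ifs_eu(X, n_select, alpha=0.1, beta_m=0.8, beta_v=0.6):
    """IFS-EU frame selection: keep frame j only when its Euclidean
    distance from the last selected frame i exceeds the adaptive
    threshold of Eq. (4.27). X: (n_frames, K) feature matrix."""
    K = X.shape[1]
    selected = [0]                                   # first frame always kept
    mu = 0.5 * (X[0] + X[1])                         # Eq. (4.30)
    lam_v = 0.5 * (X[0]**2 + X[1]**2) - mu**2        # Eq. (4.31)
    i = 0
    for j in range(1, len(X)):
        lam = lam_v.mean()                           # lumped variance, Eq. (4.21)
        mu_D = 2 * np.sqrt(lam) * gamma((1 + K) / 2) / gamma(K / 2)
        sig_D = np.sqrt(max(2 * K * lam - mu_D**2, 1e-12))
        d_th = mu_D + np.sqrt(2) * sig_D * erfcinv(2 * alpha)   # Eq. (4.27)
        mu = beta_m * mu + (1 - beta_m) * X[j]                  # Eq. (4.28)
        lam_v = beta_v * lam_v + (1 - beta_v) * (X[j] - mu)**2  # Eq. (4.29)
        if np.linalg.norm(X[j] - X[i]) > d_th:                  # Eq. (4.16)
            selected.append(j)
            i = j
            if len(selected) >= n_select:
                break
    return X[selected]
```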

4.5.2 Performance of sub-sampling schemes

The EER performance, along with the computation time required for UBM training using the presented approaches, is shown in Table 4.1.

Table 4.1. Comparison of different UBM training schemes with respect to EER and training CPU time.

Method     % Data used   EER (%)   CPU Time (h:mm)
Baseline   100%          11.43     3:46
LFS        1%            11.48     0:24
UFS        1%            11.54     0:22
RFS        1%            11.41     0:18
IFS-EU     1%            10.99     0:27

Baseline performance with 100% of the data used to train the UBM is 11.43%. It is clear that all four sub-sampling methods considered here, using a mere 1% of the available UBM data, provide performance equivalent to the baseline system with up to a 7-fold reduction in CPU computation time. In addition, using the proposed feature selection scheme, denoted IFS-EU, a ∼0.4% absolute reduction in EER is achieved in comparison to the baseline system.

This is because the selected features in the IFS-EU method are better able to represent the

diverse speaker pool, while suppressing some of the fine model traits of the intra-speaker

phone variability, which, it is believed, are less important for construction of an effective

UBM. However, we clarify that we do not claim that the 99% of the data not used for UBM training is specifically harmful to the system. We simply emphasize that more data is not necessarily better, and that sub-sampling the data properly can provide equivalent or even better performance than using all of the available data.


4.6 UBM data: Number of unique speakers

Variability in the UBM data is also related to the number of speakers present in the data set. In this section, the impact of changing the number of unique speakers in the UBM training data on system performance is considered. It was shown in the previous section that if the data contains sufficient variability, a very small portion of data should be sufficient for training the UBM. Now, different speakers possess different speech/physiological characteristics, indicating that an increase in the number of speakers in the UBM data should also lead to an increase in the variance of the data. Intuitively, this should be beneficial for overall system performance. An experiment is performed to validate this hypothesis.

In this experiment, the amount of data was kept fixed and the number of unique speakers

was varied from 10 to 320 in an exponential manner. The UBM dataset-2 was used in this

case because it has a larger number of speakers. For each UBM training run, the specified

number of speakers are selected randomly from the pool and system performance is computed

for the UBM trained with those speakers’ features/frames. Five independent experimental

runs are performed and the average of those EERs are calculated. The average weighted

variance (AWV) values are also calculated for each UBM using Eq. (4.14). In Figure 4.5(a)

and (b), AWV (Σ) and EER are plotted against the number of UBM speakers. As we

expected, system performance is improved drastically as the number of UBM speakers are

increased, with an increase of the AWV. At some level, overall performance saturates. Thus,

we justify that introducing a new speaker increases the variability of the UBM data (as

reflected in the increase in AWV), which can benefit system performance. This in effect

helps further justify the argument of [19] regarding the amount of training data for both the

UBM and specific speaker models.


Figure 4.5. Variation of (a) AWV (Σ̄) and (b) system performance with the change of the number of UBM speakers.

4.7 Selection of UBM speakers

We would like to investigate the issue of using a subset of all available speakers in the UBM. It has been established that having a large number of dissimilar speakers in the UBM data aids in improving system performance. However, it is known that many speakers have similar acoustic properties (i.e., cohort speakers), which may again introduce an imbalance in the UBM data. This issue can be illustrated with the hypothetical feature space in Figure 4.6. In this data set, if equal amounts of data from all speakers are used to train the UBM, the similar speakers that are clustered together (i.e., cohort groups 1, 2 and 3) will be over-emphasized in the UBM. This would result in a higher score from the UBM in the likelihood ratio test if a test speaker is from one of these cohort groups. Now, it would be better


Figure 4.6. A schematic diagram of the speaker space in the UBM.

if these speakers are spread out in the feature space as much as possible so that the entire

acoustic space is uniformly covered (assuming a uniform open speaker test space). However,

practically there are some problems in such a scenario.

• Since the feature space is multidimensional, uniformly covering the entire acoustic

space of speakers would require a very large number of speakers.

• In reality, the speaker features are not as easily distinguishable as in the simplistic illustration in Figure 4.6; rather, they are highly overlapping.

Thus, the motivation here is to use a reduced number of speakers versus all the available

speakers according to some speaker divergence criteria, so that closely related speakers are

not used in the UBM (i.e., if a speaker is already included in the training set, do not

include an acoustically close neighbor as well).


4.7.1 KL divergence based speaker selection (KL-D)

Here a UBM speaker selection method is developed using the Kullback-Leibler (KL) divergence between speaker models. For each UBM speaker s_i, i ∈ (1, N_s), a GMM model Λ_i is trained. To calculate the similarity between GMMs, the symmetric KL divergence [110, 111] is used, given by,

$$D_{sKL}(\Lambda_i, \Lambda_j) = E_{\Lambda_i(x)}\left[\log \frac{\Lambda_i(x)}{\Lambda_j(x)}\right] + E_{\Lambda_j(x)}\left[\log \frac{\Lambda_j(x)}{\Lambda_i(x)}\right], \qquad (4.32)$$

where Λ_i(x) and Λ_j(x) are the likelihoods of occurrence of the observation vector x, given that it belongs to speaker model Λ_i or Λ_j, respectively. Next, the N_s × N_s divergence matrix is computed, obtaining the KL score for each pair of speakers. To measure how diverse speaker i is from all other speakers, we define a diversity factor D_i^{(KL)}, given by

$$D_i^{(KL)} = \frac{1}{N_s} \sum_{j \in N_s,\, j \neq i} D_{sKL}(\Lambda_i, \Lambda_j). \qquad (4.33)$$

This relation means that D_i^{(KL)} is a measure of the average divergence of the model Λ_i from all other speaker models. Thus, after computing all D_i^{(KL)} values, they are sorted according to their absolute value, and the top N_D most divergent speakers are selected for the UBM.
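As an illustration only (not the exact tooling used in this work), the following Python sketch approximates the symmetric KL divergence of Eq. (4.32) by Monte Carlo sampling, since no closed form exists for GMMs, and ranks speakers by the diversity factor of Eq. (4.33); the list of fitted models and the sample count are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def symmetric_kl(gmm_i, gmm_j, n_samples=5000):
    # Monte Carlo estimate of Eq. (4.32): GMM-to-GMM KL divergence has no
    # closed form, so each directed divergence is estimated from samples.
    xi, _ = gmm_i.sample(n_samples)
    xj, _ = gmm_j.sample(n_samples)
    d_ij = np.mean(gmm_i.score_samples(xi) - gmm_j.score_samples(xi))
    d_ji = np.mean(gmm_j.score_samples(xj) - gmm_i.score_samples(xj))
    return d_ij + d_ji

def select_kl_divergent(gmms, n_select):
    # Diversity factor of Eq. (4.33): average divergence of each model from
    # all others; the n_select most divergent speakers are kept for the UBM.
    n = len(gmms)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = symmetric_kl(gmms[i], gmms[j])
    diversity = D.sum(axis=1) / n
    return np.argsort(np.abs(diversity))[::-1][:n_select]
```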

4.7.2 Speaker selection using prototype UBM (P-UBM)

In this method, to find the most divergent speakers, all the data is pooled and a prototype UBM model Λ_0 is trained. Assuming this UBM holds a central position in the GMM space, we attempt to find the speakers that are most divergent from it. The diversity factor for each speaker i is computed simply from the likelihood of occurrence of that speaker's features X_i given the model Λ_0:

$$D_i^{(P)} = \Lambda_0(X_i). \qquad (4.34)$$

In this computation, the individual feature frames are assumed to be independent, as discussed in Chapter 3. In a similar way, the D_i^{(P)} values are sorted and the top N_D most divergent speakers with respect to the prototype UBM are selected.


Figure 4.7. System performance variation with the change of number of UBM speakers selected using different methods (KL-D and P-UBM), with the baseline EER shown for reference.

It is noted that this is a simplistic method for speaker selection: it does not guarantee that the selected speakers are diverse among themselves, only that they are diverse from the given initial UBM.
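A minimal sketch of this selection rule is shown below, assuming (as one plausible reading of Eq. (4.34)) that a low average log-likelihood under the prototype UBM marks a divergent speaker; `ubm` is a fitted sklearn GaussianMixture and `speaker_features` a hypothetical list of per-speaker frame matrices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_pubm_divergent(ubm: GaussianMixture, speaker_features, n_select):
    # Diversity factor of Eq. (4.34): likelihood of each speaker's pooled
    # frames X_i under the prototype UBM Lambda_0, with frames assumed
    # independent. Lowest-likelihood speakers are treated as most divergent.
    scores = np.array([ubm.score(X) for X in speaker_features])
    return np.argsort(scores)[:n_select]  # ascending: least likely first
```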

4.7.3 Results of speaker selection methods

In these experiments, the UBM database-2 is used and the total amount of data is held fixed at 1.5 hours. The system is evaluated by varying the number of speakers, N_D, from 20 to 300 in an exponential fashion, while the proposed KL-D and P-UBM methods are used for selecting the best speakers. The exact number of frames is selected from each utterance of the selected speakers' data using the LFS method (described in Section V) such that the total amount of data equals 1.5 hours. For all values of N_D, the EER values

obtained are plotted in Figure 4.7. It should be noted that 1.5 hours of data used in this

case is only 0.3% of the complete UBM database-2 (which contains 473.75 hours of data).

Thus, according to experiments in Section 4.4, this amount is not sufficient to retain the


baseline system performance. However, this lower amount is still employed so that (a) it is

more convenient to analyze the effect of the selected speakers' data, and (b) enough data per

speaker is available for the case of a lower number of selected speakers.

From Figure 4.7 we observe that, despite some fluctuations,1 both methods perform very close to the baseline system using a significantly lower number of speakers (i.e., 60 and 100 speakers for the KL-D and P-UBM methods, respectively). Notice that this

close to baseline performance is achieved using only 1.5 hours of training data, instead of the total 473.75 hours of data. Interestingly, after a certain point, system performance actually degrades as the number of speakers is increased in the proposed methods, which we believe is due to the introduction of similar/redundant speakers that create an imbalance in the UBM data. As expected, the performance does not reach the baseline for a larger number of speakers for either method, since 1.5 hours of data is not sufficient to retain the entire variability of the dataset. Now that we have identified the regions in the plot where the proposed methods perform best, we attempt to increase the data amount for the selected

speakers for further performance improvement. For the KL-D approach using 60 speakers,

we increased the amount of data from 1.5 hours to 3 hours and obtained an improvement

in EER from 11.46% to 11.30%, which is better than the baseline EER. For the P-UBM

method, using 100 speakers and 1.5 hours of data already provides a slight improvement

over the baseline system which uses the full UBM database-2. The results are summarized

in Table 4.2.

Note that there are 392 speakers in UBM database-2, which means less than 30% of the

speakers were used in both proposed methods. Also, less than 1% of the total amount of

data was used for training the UBM. Thus, we conclude that if a diverse set of speakers

1We believe this fluctuation in EER is due to the fact that the GMMs are trained on multiple utterances of the same speaker from different channels. This creates a mild bias toward the dominating channel type in each GMM, resulting in slight channel dependent clustering in some cases. Future studies could explore UBM construction by suppressing channel effects from the UBM utterances using techniques like JFA, total variability features [2], etc.


Table 4.2. Comparison of different speaker selection approaches for UBM training with respect to EER and number of speakers.

Method     % Data used   No. of speakers   EER (%)
Baseline   100%          392               11.41
KL-D       0.31%         60                11.46
KL-D       0.62%         60                11.30
P-UBM      0.31%         100               11.32

can be carefully selected, a much smaller amount of speaker data can provide performance equivalent to or better than the baseline system for speaker recognition.

4.8 Conclusions

In this chapter, an organized method was developed for determining the data to be selected

for effective UBM training. Rigorous experiments were performed showing the relationship

between data variance and overall speaker verification system performance. Four efficient

sub-sampling schemes for frame selection were presented with potential benefits of reducing

the computation time by up to 7-fold. A new intelligent frame sub-sampling algorithm was proposed, which is experimentally shown to outperform the baseline system that uses all the

available data. The implication of selectively using speaker data for UBM construction was

analyzed and two effective speaker selection methods were proposed and evaluated. The

results showed that a carefully selected, reduced speech data size and speaker count are sufficient to achieve effective speaker verification performance.

After analyzing the data requirements for the UBM, in the next chapter, we aim to focus

on alternate acoustic modeling strategies. Since the UBM is the central acoustic model in

state-of-the-art speaker recognition schemes, in the next part of this dissertation, we will

provide further analysis on the UBM aiming to obtain a more robust acoustic model. In this

chapter, we exclusively dealt with diagonal covariance UBM models, whereas, in the next

chapter, we will consider full-covariance UBM models, their variants and derivatives.


CHAPTER 5

ACOUSTIC FACTOR ANALYSIS

Factor analysis based channel mismatch compensation methods for speaker recognition are

based on the assumption that speaker/utterance dependent Gaussian Mixture Model (GMM)

mean supervectors can be constrained to reside in a lower dimensional subspace. Different

methods based on various subspace assumptions in this space are discussed in Section 3.4.

These approaches, however, do not consider the fact that conventional acoustic feature vectors also reside in a lower dimensional manifold of the feature space when feature covariance matrices contain close-to-zero eigenvalues. In this chapter, based on observations of the covariance structure of acoustic features, we propose a factor analysis modeling scheme in the acoustic feature space instead of the GMM supervector space, and derive a mixture dependent feature transformation.

feature dimensionality reduction, de-correlation, normalization and enhancement, all at once.

The proposed transformation will be shown to be closely related to signal subspace based

speech enhancement schemes. This factor analysis model will be shown to be derived from

a well-trained full-covariance UBM, thus providing an improved acoustic modeling scheme.

In contrast to traditional front-end mixture dependent feature transformations, where feature alignment is performed using the highest scoring mixture, the proposed transformation is integrated within the speaker recognition system using a probabilistic feature alignment technique, which eliminates the need for regenerating the features or retraining the Universal Background Model (UBM). Incorporating the proposed method with a state-of-the-art i-Vector and Gaussian Probabilistic Linear Discriminant Analysis (PLDA) framework, we perform evaluations on National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) 2010 core telephone and microphone tasks. The experimental results


Figure 5.1. Analysis of full covariance matrices of a UBM trained using 60-dimensional MFCC features (20 static + ∆ + ∆∆). (a) A 3-D surface plot of the covariance matrix showing high values on the diagonal and significant off-diagonal values, indicating correlation among different feature coefficients. (b) Sorted eigenvalues of the same covariance matrix, demonstrating that most of the energy is accounted for by the first few dimensions (90% of the energy is accounted for by the first 32 eigenvalues).

demonstrate the superiority of the proposed scheme compared to both full-covariance and

diagonal covariance UBM based systems. Also, simple equal-weight based fusion of the baseline and proposed systems yields significant performance gains, suggesting a complementary basis in the proposed solution.

The research methods and evaluation results discussed in this chapter were previously

published in [30, 29] and [28] ©2012 IEEE.

5.1 Motivation

One limitation of the conventional GMM supervector domain representation and subsequent

factor analysis modeling is that it does not take into account the fact that the original

acoustic features contain redundancy. In general, the speech short-time spectrum is known to

be representable in a lower dimensional subspace, which motivates a separate class of speech


enhancement methods known as signal subspace approaches [112, 113]. Linear correlation among the speech spectral components is quite high, which justifies the success of these

methods. This phenomenon is also valid for popular acoustic features, such as Mel-frequency

Cepstral Coefficients (MFCC) [50, 114], even though these features are processed through a

Discrete Cosine Transform (DCT) to achieve de-correlation before use in training or test.

5.1.1 Analysis on full-covariance UBMs

To motivate the proposed work, we first demonstrate that the conventional acoustic features

can be constrained to reside in a lower dimensional subspace. For this purpose, we train a

1024 mixture full covariance GMM UBM using 60 dimensional MFCC features on a large

background speech data set.1 For a typical mixture of this UBM, the covariance matrix

and the distribution of its eigenvalues are shown in Figure 5.1. From Figure 5.1(a) it is clear that the full covariance matrix, which shows strong diagonal terms, has significant non-zero off-diagonal elements, indicating that the feature coefficients are not fully uncorrelated. Figure 5.1(b) shows the sorted eigenvalues of the same covariance matrix, revealing that most of its energy is accounted for by the first few dimensions only. This shows that the acoustic

feature space is actually lower dimensional and features can thus be further compacted or

enhanced by using a factor analysis model. Also, it is known that the first few directions

obtained by the Eigen-decomposition of acoustic feature covariance matrices are mostly

speaker dependent (e.g. see Zhou and Hansen (2005) [115] for a quantitative analysis), while

other directions are more phoneme dependent. Considering these noted observations on the

acoustic features, we aim at investigating a factor analysis scheme on the acoustic features

for speaker recognition. We name this method acoustic factor analysis.

1More details on feature extraction and development data are given in Sections 5.5.1 and 5.5.2, respectively.
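The eigenvalue analysis of Figure 5.1(b) is easy to reproduce for any covariance matrix; a minimal numpy sketch (the matrix itself is assumed to come from a trained full-covariance UBM) is:

```python
import numpy as np

def energy_compaction(Sigma, thresh=0.90):
    # Sorted eigenvalues of a mixture covariance matrix and the number of
    # leading dimensions that account for `thresh` of the total energy.
    eigvals = np.linalg.eigvalsh(Sigma)[::-1]   # descending order
    frac = np.cumsum(eigvals) / eigvals.sum()
    q = int(np.searchsorted(frac, thresh)) + 1
    return eigvals, q
```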


5.1.2 Limitations of factor analysis on GMM supervectors

Before proceeding with the formulation of the factor analysis scheme on front-end features,

we first defend the argument that the traditional factor analysis schemes do not take full

advantage of the acoustic feature covariances. In a standard i-Vector system, the GMM

supervectors are dimensionality reduced by a total factor analysis model, which is based on

the idea that utterance supervectors lie in a lower dimensional subspace. Let mu denote a

GMM supervector extracted from an utterance u, and let xn denote the acoustic features.

For a randomly chosen utterance u, it is generally assumed that mu is normally distributed

with mean m0 and covariance matrix B [80]. Here, m0 denotes the speaker independent

mean vector obtained by concatenating the UBM mean vectors m0[g]. Let the UBM covariance matrices be Σg, where g denotes the mixture number. The main motivation of both Eigenvoice and total variability modeling is that the super-covariance matrix B contains

zero eigenvalues and thus some dimensions of mu can be disregarded. For the g-th Gaussian

mixture, the utterance dependent mean vector mu[g] is estimated from the posterior mean

of the acoustic features that belong to u, that is xn ∈ u.2 Thus, mu[g] is a deterministic

parameter given the utterance u. However, when the utterance u is randomly selected, the

sub-vectors mu[g] become normally distributed random vectors having covariance matrix

B[g], which is the g-th sub-matrix of the super-covariance matrix B. Clearly, the matrices

B[g] are not related to the feature covariance matrices Σg, since the former represents the

covariance of the mean sub-vectors mu[g] obtained from separate utterances, while the latter

represents the covariance of the acoustic features xn which is independent of the utterance.3

Thus, assuming that the matrix B contains zero eigenvalues is not equivalent to assuming the

2In this notation, we assume that the frame indices n are independent of an utterance. Thus, any feature vector xn is uniquely identified by n for a given dataset.

3Utterance dependent covariance matrices can also be extracted through MAP adaptation. However, we assume that each utterance GMM shares a common UBM covariance and corresponding weights.


same for the Σg matrices. Though this reasoning is based on full covariance UBM models,

similar arguments can be made for a diagonal covariance based system.

5.1.3 Feature dimensionality reduction

Given that the conventional acoustic features reside in a lower dimensional subspace, it is now important to ask how we can use this knowledge to effectively extract

utterance level features. Since speaker dependent information is contained in the leading

eigen-directions of the acoustic features [115], using all the feature coefficients for modeling

channel degraded data will result in retaining some nuisance components along with speaker

dependent information in the GMM supervectors and i-Vectors. Therefore, we propose a

dimensionality reduction transformation of the acoustic features for each GMM mixture

that emphasizes the speaker dependent information in the leading eigenvectors of the corresponding mixture covariance matrix, while suppressing some unwanted channel components.

In this manner, the GMM supervectors will be “enhanced” in the sense that they will be

more speaker discriminative, while the subsequently extracted i-Vectors will also inherit this

quality.

Dimensionality reduction of the acoustic features for de-correlation/enhancement is not

a new concept. There are many techniques found in the literature that perform this task,

including DCT, Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA),

Heteroscedastic LDA (HLDA), to name but a few [116, 117, 118]. The main goal for this process has been to be able to model the features using diagonal covariance matrix GMM/HMMs

for speech/speaker recognition. These techniques can be classified mainly into two groups

based on their mode of operation including: 1) the signal processing domain, and 2) the

model domain. In the first scenario, some transformation (supervised/unsupervised) is used

at the signal/acoustic feature level in order to achieve improved information compression

(i.e., energy compaction). The most common technique is the application of the DCT for


the log-filterbank energies [50] popularized by the MFCC representation. PCA can also be

used [117] by learning the principal directions from the Eigen-decomposition of the covariance matrix trained on the utterance data itself.

depends on the speech data under consideration and does not use any outside knowledge.

In the second scenario, raw acoustic features (e.g., filter-bank energies) are initially used

to train a large model, which is then used to derive the feature transformations. One such

technique used in speaker recognition is HLDA [116], where first a GMM-UBM is trained

on the raw acoustic features. Each mixture is then assumed to represent a separate class,

and an HLDA transformation is trained so that discrimination between these classes is maximized. In a similar fashion, PCA projections can also be used in each GMM mixture as a

transformation [119]. In these methods, after the initial training phase, the acoustic features

are aligned to the mixture component providing the highest posterior probability and the

corresponding transformation is used for dimensionality reduction.

Both the signal processing domain and model domain feature dimensionality reduction

techniques previously used in essence have one common property: they re-generate the

acoustic features after a dimensionality reduction. This means the subsequent procedures

for the speaker recognition system require that we begin the training process from these newly

extracted features. Model domain dimensionality reduction has an additional inconvenience

of mixture-alignment. Speech features are known to be highly intertwined and overlapped

in the vector space for different acoustic conditions and generally do not form meaningful

clusters [120]. Thus, using the top posterior probability for aligning a feature vector to a

single mixture may not be appropriate. To demonstrate this, we select MFCC feature vectors

xn from 10 development utterances that were used in UBM training, and for each feature

vector, we find the highest posterior probability among the 1024 mixtures of the UBM,

maxg p(g|xn). A histogram of these top mixture probabilities is shown in Figure 5.2, which

clearly demonstrates that only a few frames are unquestionably aligned to a specific Gaussian


Figure 5.2. Distribution of top posterior probabilities p(g|xn) obtained from a subset of development data.

mixture (indicated by the high peak near maxg p(g|xn) = 1). In actuality, a majority of the

feature vectors are aligned with more than one mixture, resulting in a top mixture probability

in the region of 0.3 ∼ 0.8. Thus, using the top scoring mixture for hard alignment of feature

vectors to a specific mixture can introduce inaccuracies and should be avoided if possible.
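A hypothetical sketch of the analysis behind Figure 5.2 is given below; `ubm` is assumed to be a fitted sklearn GaussianMixture standing in for the 1024-mixture UBM, and X a (T, d) matrix of MFCC frames.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def top_posterior_histogram(ubm: GaussianMixture, X: np.ndarray, bins=50):
    # For every frame x_n, the largest mixture posterior max_g p(g | x_n);
    # a histogram mass well below 1.0 indicates inherently soft alignments.
    top_post = ubm.predict_proba(X).max(axis=1)
    return np.histogram(top_post, bins=bins, range=(0.0, 1.0), density=True)
```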

5.1.4 Implications of the proposed method

Historically, feature extraction, dimensionality reduction, enhancement and normalization

have always been thought of as processes separate from acoustic modeling. In this chapter,

we propose a new modeling scheme of the acoustic features that bridges the gap between these

two processing domains through integrated feature dimensionality reduction and modeling.


We demonstrate that the proposed method not only performs dimensionality reduction, but also removes the need for fixed feature clustering to a specific mixture, and does not require retraining of the UBM from the new features, thereby incorporating a built-in feature normalization and enhancement scheme. All this is achieved using a single linear transformation

derived from a pre-trained full covariance matrix UBM and applying this in a probabilistic

fashion to the mixture dependent Baum-Welch statistics.

5.2 Proposed method

In this section, we describe the proposed factor analysis model of acoustic feature vectors,

discuss its formulation and mixture-wise application for dimensionality reduction.

5.2.1 Acoustic Factor Analysis (AFA)

Let X = {xn | n = 1 . . . T} be the collection of all acoustic feature vectors from the development set, obtained from a large corpus of many speakers' recordings in diverse environment/channel conditions. Using a factor analysis model, the d × 1 dimensional feature vector x can be represented by,

$$x = Wy + \mu + \epsilon. \qquad (5.1)$$

Here, W is a d× q low rank factor loading matrix that represents q < d bases spanning the

subspace with important variability in the feature space, and µ is the d× 1 mean vector of

x. We denote the q × 1 latent variable vector, or latent factors, y ∼ N(0, I) as the acoustic factors. We assume that the remaining noise component ε ∼ N(0, σ^2 I) is isotropic, and therefore the model is equivalent to Probabilistic Principal Component Analysis (PPCA) [77]. In this model, the feature vectors are also normally distributed such that x ∼ N(µ, σ^2 I + WW^T).

The advantage of this model is that the acoustic factors y, defining the weights of the

factor loadings, explain the correlation between the feature coefficients x, which we believe


are more speaker dependent [115], while the noise component ε incorporates the residual

variance of the data. It should be emphasized that even though we denote the term ε as

“noise”, when modeling cepstral features, this term actually represents convolutional channel

distortion [52]. A mixture of these models [77] can be used to incorporate the variations

caused by different phonemes uttered by multiple speakers in distinct noisy/channel degraded

conditions, given by,

$$p(x) = \sum_{g} \pi_g\, p(x|g), \qquad (5.2)$$

where for the g-th mixture,

$$p(x|g) = \mathcal{N}\left(\mu_g,\ \sigma_g^2 I + W_g W_g^T\right). \qquad (5.3)$$

Here, µg, πg, Wg and σ_g^2 represent the mean vector, mixture weight, factor loading matrix, and noise variance for the g-th AFA model, respectively.
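A minimal generative sketch of the per-mixture model of Eqs. (5.1)-(5.3), with illustrative (assumed) dimensions and parameter values, is:

```python
import numpy as np

# x = W y + mu + eps, with y ~ N(0, I_q) and isotropic eps ~ N(0, sigma2 I_d).
d, q, sigma2 = 60, 42, 0.1                 # illustrative values
rng = np.random.default_rng(0)
W = rng.standard_normal((d, q))            # factor loading matrix (d x q)
mu = rng.standard_normal(d)                # mixture mean vector
y = rng.standard_normal(q)                 # acoustic factors
eps = np.sqrt(sigma2) * rng.standard_normal(d)
x = W @ y + mu + eps                       # x ~ N(mu, sigma2*I + W W^T)
```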

5.2.2 Mixture dependent transformation

One advantage of using the mixture of PPCA for acoustic factor analysis is that its parameters can be conveniently extracted from a GMM trained using the Expectation-Maximization (EM) algorithm [77]. Thus, we utilize a full covariance UBM to derive the AFA model parameters. The proposed feature transformation and dimensionality reduction procedure is

presented below:

Step 1: UBM training

A full covariance UBM model Λ0 is trained on the development dataset acoustic features X = {xn | n = 1 . . . T}, and is given by,

$$p(x|\Lambda_0) = \sum_{g=1}^{M} \pi_g\, \mathcal{N}(x\,|\,\mu_g, \Sigma_g), \qquad (5.4)$$


where πg represents the mixture weights, M is the total number of mixtures, µg are the mean

vectors and Σg are the full covariance matrices. The mixture mean and weight parameters

of the UBM will be identical to the mixture model of Eq. (5.2).

Step 2: Noise subspace selection

We require a pre-set value for q, which defines the number of principal dimensions which

should be retained. In other words, we assume the lower d−q dimensions of the features will

actually represent the noise subspace [112]. Using this value of q, we find the noise variance

for the g-th mixture as,

$$\sigma_g^2 = \frac{1}{d-q} \sum_{i=q+1}^{d} \lambda_{g,i}, \qquad (5.5)$$

where $\lambda_{g,q+1}, \ldots, \lambda_{g,d}$ are the smallest eigenvalues of the covariance matrix Σg. Thus, σ_g^2 is

essentially the average variance lost per discarded dimension. It may be noted that the

model allows the use of different values of q for each mixture. This has been investigated in

[29].

Step 3: Compute the factor loading matrix

The maximum likelihood estimate of the factor loading matrix Wg for the g-th mixture of the AFA model in Eq. (5.2) is given by,

$$W_g = U_{qg}\left(\Lambda_{qg} - \sigma_g^2 I\right)^{1/2} R_g, \qquad (5.6)$$

where U_{qg} is a d × q matrix whose columns are the q leading eigenvectors of Σg, Λ_{qg} is a diagonal matrix containing the corresponding q eigenvalues, and Rg is a q × q arbitrary orthogonal rotation matrix. In this work, we set Rg = I.


Step 4: Feature transformation

The posterior mean of the acoustic factors yn can be used as the transformed and dimensionality reduced version of xn for the g-th component of the AFA model. This can be shown to be,

$$E\{y_n|x_n, g\} = \langle y_n|x_n, g\rangle = A_g^T (x_n - \mu_g) \triangleq z_{n,g}, \qquad (5.7)$$

where

$$A_g = W_g M_g^{-T} \quad \text{and} \qquad (5.8)$$

$$M_g = \sigma_g^2 I + W_g^T W_g. \qquad (5.9)$$

We term the matrix Ag as the g-th AFA transform. In this operation, we are essentially

replacing the original feature vectors xn by the mixture dependent transformed acoustic

feature zn,g. Each feature vector xn can be transformed by the Ag corresponding to the mixture component it is aligned with, and a new set of features can then be obtained. However, as noted earlier, we will not regenerate the acoustic features, and instead use a probabilistic soft-alignment in our system. This is described in Section 5.4, where we discuss the integration

of AFA within an i-Vector system.
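Steps 2-4 amount to a simple eigen-decomposition per mixture; a minimal numpy sketch (assuming a given full covariance matrix Sigma_g and retained dimension q) is:

```python
import numpy as np

def afa_transform(Sigma_g, q):
    # Noise variance from the discarded eigenvalues (Eq. 5.5), factor loading
    # matrix with R_g = I (Eq. 5.6), and the AFA transform of Eqs. (5.8)-(5.9).
    eigvals, eigvecs = np.linalg.eigh(Sigma_g)
    order = np.argsort(eigvals)[::-1]              # descending eigenvalues
    lam, U = eigvals[order], eigvecs[:, order]
    sigma2 = lam[q:].mean()                        # Eq. (5.5)
    U_q, lam_q = U[:, :q], lam[:q]
    W = U_q * np.sqrt(lam_q - sigma2)              # Eq. (5.6) with R_g = I
    M = sigma2 * np.eye(q) + W.T @ W               # Eq. (5.9); equals diag(lam_q)
    A = W @ np.linalg.inv(M).T                     # Eq. (5.8)
    return A, sigma2

# Eq. (5.7): z_{n,g} = A.T @ (x_n - mu_g) for the g-th mixture.
```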

5.3 Properties of the AFA transform

In this section, we discuss the general properties and advantages of the proposed acoustic

feature model, the resulting transformation and the transformed features.

5.3.1 Probability distribution of the transformed features

Here, we derive the probability distribution of the transformed acoustic features and show

how AFA performs feature de-correlation. Let z_{n,g} = 〈yn|xn, g〉 indicate the AFA transformed feature vector for the g-th mixture. We have the following mean vector of z_{n,g}:

$$\mu_{z_g} = E\{\langle y_n|x_n, g\rangle\} = E\{A_g^T (x_n - \mu_g)\} = 0, \qquad (5.10)$$

and its corresponding covariance matrix,

$$\Sigma_{z_g} = E\{z_{n,g} z_{n,g}^T\} - \mu_{z_g}\mu_{z_g}^T = A_g^T E\{(x_n - \mu_g)(x_n - \mu_g)^T\} A_g = A_g^T \Sigma_g A_g. \qquad (5.11)$$

For further simplification, we first substitute the value of Wg from Eq. (5.6) into Eq. (5.9) and use Rg = I to obtain,

$$M_g = \sigma_g^2 I + W_g^T W_g = \sigma_g^2 I + \left(\Lambda_{qg} - \sigma_g^2 I\right)^{T/2} U_{qg}^T U_{qg} \left(\Lambda_{qg} - \sigma_g^2 I\right)^{1/2} = \Lambda_{qg}. \qquad (5.12)$$

Next, substituting the values of Wg and Mg from Eq. (5.6) and Eq. (5.12) into Eq. (5.8), we have,

$$A_g^T = \Lambda_{qg}^{-1}\left(\Lambda_{qg} - \sigma_g^2 I\right)^{T/2} U_{qg}^T. \qquad (5.13)$$

Using this expression of A_g^T in Eq. (5.11) we obtain,

$$\Sigma_{z_g} = \Lambda_{qg}^{-1}\left(\Lambda_{qg} - \sigma_g^2 I\right)^{T/2} \Lambda_{qg} \left(\Lambda_{qg} - \sigma_g^2 I\right)^{1/2} \Lambda_{qg}^{-T} = \left(\Lambda_{qg} - \sigma_g^2 I\right)\Lambda_{qg}^{-T} = I - \sigma_g^2 \Lambda_{qg}^{-1}. \qquad (5.14)$$

Here, we utilize the expression U_{qg}^T Σg U_{qg} = Λ_{qg} and take advantage of the fact that all the matrices involved are diagonal. Thus, we show that for a given mixture alignment g, the posterior mean of the acoustic factors, i.e., the transformed feature vectors z_{n,g}, follows a Gaussian distribution with zero mean and a diagonal covariance matrix given by I − σ_g^2 Λ_{qg}^{-1}. Thus, the AFA transformation de-correlates the mean normalized acoustic features in each mixture.
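This de-correlation property is straightforward to verify numerically with the afa_transform sketch given earlier; here, a synthetic SPD matrix stands in for a mixture covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((60, 60))
Sigma = B @ B.T / 60 + 1e-3 * np.eye(60)    # synthetic full covariance matrix
q = 42
A, sigma2 = afa_transform(Sigma, q)
lam_q = np.sort(np.linalg.eigvalsh(Sigma))[::-1][:q]
Sigma_z = A.T @ Sigma @ A                   # Eq. (5.11)
assert np.allclose(Sigma_z, np.diag(1.0 - sigma2 / lam_q))   # Eq. (5.14)
```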


Figure 5.3. Input SNR [dB] (ξ) vs. Wiener gains. The Wiener gain and square-root Wiener gain are shown with a solid (-) and dashed (- -) line, respectively.

5.3.2 Acoustic feature enhancement

In the g-th mixture, the AFA transformation matrix A_g^T given in Eq. (5.13) can be rewritten as:

$$A_g^T = \Lambda_{qg}^{-1}\left(\Lambda_{qg} - \sigma_g^2 I\right)^{T/2} U_{qg}^T = \Lambda_{qg}^{-1/2}\, G_g\, U_{qg}^T, \qquad (5.15)$$

where we have introduced a diagonal gain matrix given by:

$$G_g = \Lambda_{qg}^{-1/2}\left(\Lambda_{qg} - \sigma_g^2 I\right)^{T/2}. \qquad (5.16)$$

The i-th diagonal entry of Gg is given by,

$$G_g(i) = \sqrt{\left(\lambda_{g,i} - \sigma_g^2\right)/\lambda_{g,i}}. \qquad (5.17)$$


Setting aside temporarily the term Λ_{qg}^{-1/2} in Eq. (5.15), we observe that the transformation operation performed by A_g^T in Eq. (5.7) first computes the inner product of the mean normalized acoustic feature with the q principal eigenvectors of Σg, and then, for each i-th eigenvector direction, applies the gain function defined by Gg(i). The second term in Eq. (5.17) can be identified as a square-root Wiener gain function [121]. This becomes clearer if we define the classic speech enhancement terminology a priori SNR ξ as [112, 122],

$$\xi = \frac{\lambda_{g,i} - \sigma_g^2}{\sigma_g^2}, \qquad (5.18)$$

and use this to express the gain equations. The Wiener gain G_W and the square-root Wiener gain G_√W are given by:

$$G_W = \frac{\xi}{\xi + 1} \quad \text{and} \qquad (5.19)$$

$$G_{\sqrt{W}} = \left(\frac{\xi}{\xi + 1}\right)^{1/2}. \qquad (5.20)$$

The Wiener and square-root Wiener gain functions are plotted against ξ in Figure 5.3. As discussed in [121] (page 179, Sec. 6.6.3), in the case of additive noise, the square-root Wiener filter is applied when the power spectra, rather than the magnitude spectra, of the filtered signal and the clean signal are desired to be equal. The operation performed by the AFA transformation in Eq. (5.15) can be interpreted as a gain function operating on a transformed space defined by the i-th eigenvector to obtain a clean eigenvalue λ_{g,i} − σ_g^2 from the noisy eigenvalue λ_{g,i} [29]. Since the eigenvalues can be interpreted as a power spectrum obtained from the principal components [123], it is understandable why G_√W arises in this scenario instead of G_W. Due to this square-root operation on the gain function, the square-root Wiener gain shows lower attenuation characteristics compared to the standard Wiener filter, as depicted in Figure 5.3. It may be noted that conventional factor analysis techniques in the supervector space can also be interpreted using similar Wiener-like gain functions, as discussed in [124].
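For reference, the two gain curves of Figure 5.3 can be generated directly from Eqs. (5.19)-(5.20); the SNR grid below is an arbitrary choice:

```python
import numpy as np

xi = np.logspace(-2, 4, 200)              # a priori SNR of Eq. (5.18), linear scale
G_wiener = xi / (xi + 1.0)                # Eq. (5.19)
G_sqrt_wiener = np.sqrt(xi / (xi + 1.0))  # Eq. (5.20): lower attenuation
```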


In the signal subspace speech enhancement method [112], a similar gain function is obtained by starting from the same model as in Eq. (5.1), except for the standard normal assumption on the latent factors y. In that work, the term Wy + µ ≜ a in Eq. (5.1) was interpreted as the "clean signal", x as the noisy signal, and ε as the additive noise. The goal was to find an estimate of the clean signal a by computing the posterior mean of a given the noisy signal x and the noise variance. However, in the AFA scheme, the goal is to estimate the posterior mean of the latent factors y as an "enhanced" and more compact version of the "noisy" (channel degraded) acoustic features x [77]. This difference between the two approaches yields two different optimization criteria and their resulting gain functions.

Another contrast between the speech enhancement schemes and AFA transformation is

the interpretation of the noise. In conventional speech enhancement methods, the noise

statistics are estimated from silence regions between speech segments [125], and thus for the

signal subspace based method, the noise variance σ_g^2 is assumed to be known in the model

Eq. (5.1). In our case, the noise we are attempting to remove or compensate for is actually

an additive distortion in the cepstral domain, which will not exist in the silence regions. In

addition, even if silence segments were modeled in the UBM, it is highly unlikely that the

mixture components modeling the silences would be useful in determining the noise level in

other components. Thus, even though the AFA dimension q is related to the noise variance,

we resort to setting the value of q arbitrarily and computing the corresponding noise variance for

each mixture using Eq. (5.5).

5.3.3 Acoustic feature variance normalization

Referring back to Eq. (5.15), the term Λ_{qg}^{-1/2} normalizes the variance of the acoustic feature stream in the i-th eigen-direction, since λ_{g,i} is the expected feature variance along this direction [126]. This means the AFA transformation assumes that the features closely aligned with the g-th mixture originate from the same random process, and it performs this normalization in addition to the enhancement mentioned in the previous section. This process is interestingly similar to the cepstral variance normalization frequently performed in the front-end. However, feature domain processing considers the temporal movement of the features in performing these normalizations, assuming that the feature streams are independent, while AFA groups the features together in a mixture irrespective of their time location and performs the normalization along orthogonal axes derived from the corresponding mixture covariance matrix. It would be interesting to see how AFA systems perform if the feature domain normalizations are removed from the front-end. Recent studies [127] show that in the full-covariance UBM based i-Vector scheme, a very basic scale normalization technique outperforms Cepstral Mean and Variance Normalization (CMVN) and feature Gaussianization [56]. This may be due to the assumption of uncorrelated feature coefficients inherently made while applying these normalization schemes.

5.4 AFA integrated i-Vector system

In this section, we describe how the proposed method can be incorporated into a conventional i-Vector system [2]. The fundamentals of this system, including algorithmic details, are

described in Section 3.4.7.

5.4.1 UBM and AFA model training

First, a full covariance UBM model, Λ0 given by Eq. (5.4), is trained on the development

data vectors. Next, the AFA dimension q is set, which defines the number of principal axes

to retain from each mixture component. Using the value of q, we find the noise variance

for the g-th mixture using Eq. (5.5). The factor loading matrix Wg and transformation

matrix Ag are then calculated using Eq. (5.6) and Eq. (5.8), respectively. After applying the

transformation as in Eq. (5.7), the posterior means of the acoustic factors zn,g = 〈yn|xn, g〉

are used as the mixture dependent transformed acoustic features.


5.4.2 UBM transformation

Following the discussion from Section 5.3.1, and using Eq. (5.10) and Eq. (5.14), the AFA transformation would require a new transformed UBM $\bar{\Lambda}_0$ that models z_{n,g} instead of xn. However, this is not a modeling of acoustic features, but rather a transformation of the UBM mean and covariance matrices for the i-Vector system. For the g-th mixture component, this transformation is given by:

$$\mu_g \rightarrow 0, \qquad (5.21)$$

$$\Sigma_g \rightarrow \bar{\Sigma}_g, \qquad (5.22)$$

where

$$\bar{\Sigma}_g = I - \sigma_g^2 \Lambda_{qg}^{-1} = \Sigma_{z_g}.$$

The mixture weights πg remain the same.

5.4.3 Baum-Welch statistics estimation

In this step, the zero and first order Baum-Welch statistics are extracted from each feature

vector with respect to the UBM. Using the AFA transformed features, extraction of the

statistics can be accomplished as follows. The probabilistic alignment of feature xn with the

g-th mixture is given by:

$$\gamma_n(g) = p(g|x_n) = \frac{p(x_n|g,\Lambda_0)\,\pi_g}{p(x_n|\Lambda_0)}. \qquad (5.23)$$

For an utterance u, the zero order statistics are extracted as:

$$N_u(g) = \sum_{n\in u} \gamma_n(g), \qquad (5.24)$$

which follows the standard procedure [80, 2]. Conventionally, the first order statistics are extracted as:

$$F_u(g) = \sum_{n\in u} \gamma_n(g)\, x_n.$$


However, with the present AFA transform, the first order statistics $\tilde{F}_u(g)$ are extracted using the transformed features in the corresponding mixtures instead of the original features,

$$\tilde{F}_u(g) = \sum_{n\in u} \gamma_n(g)\, z_{n,g} = \sum_{n\in u} \gamma_n(g)\, A_g^T (x_n - \mu_g) = A_g^T\left[F_u(g) - N_u(g)\,\mu_g\right] = A_g^T \bar{F}_u(g),$$

where $\bar{F}_u(g)$ denotes the centralized first order statistics [4]. This transformation of statistics is

somewhat similar to the approach in [128], where it was performed to normalize the UBM

parameters to zero means and identity covariance matrices. However, in [128] the goal

was to simplify the i-Vector system algorithm, theoretically preserving the procedure with

added computational benefits; whereas in the proposed method, we are performing feature

transformation and dimensionality reduction for possible improvement of the i-Vector system

performance.
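A compact sketch of this statistics extraction (not the production implementation) is given below for a single utterance, with UBM parameters and per-mixture AFA transforms passed in as plain arrays:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def afa_baum_welch_stats(X, pis, mus, Sigmas, A_list):
    # Soft alignments of Eq. (5.23), zero order statistics of Eq. (5.24), and
    # transformed first order statistics A_g^T [F_u(g) - N_u(g) mu_g].
    M = len(pis)
    log_w = np.stack([np.log(pis[g]) + mvn.logpdf(X, mus[g], Sigmas[g])
                      for g in range(M)], axis=1)       # (T, M)
    log_w -= log_w.max(axis=1, keepdims=True)           # numerical stability
    gamma = np.exp(log_w)
    gamma /= gamma.sum(axis=1, keepdims=True)           # Eq. (5.23)
    N = gamma.sum(axis=0)                               # Eq. (5.24)
    F_tilde = [A_list[g].T @ (gamma[:, g] @ X - N[g] * mus[g])
               for g in range(M)]
    return N, F_tilde
```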

5.4.4 Hyper-parameter estimation

Training of the Total Variability (TV) matrix T for the i-Vector system follows a procedure very similar to that discussed in [2]. In this system, an utterance dependent supervector mu is expressed as,

$$m_u = m_0 + T w_u, \qquad (5.25)$$

where the Md-dimensional vector m0 denotes the speaker independent mean supervector (i.e., the concatenation of the UBM means µg = m0[g]), T is an Md × R low rank matrix (R < Md) whose columns span the total variability space, and wu is a normally distributed random vector of size R, known as the total factors. The posterior mean vector of wu given the utterance data is known as an i-Vector.

Initialization

Depending on the AFA parameter q, the size of the matrix T needs to be defined. In the

AFA based i-Vector system, the supervector dimension becomes K = Mq instead of Md.


Thus, the T matrix size needs to be set to K × R, and randomly initialized. We define a

parameter, supervector compression (SVC) ratio α = K/Md = q/d, measuring compaction

obtained through the AFA transformation.

EM iterations

For each utterance u ∈ S, an R × R precision matrix Lu and R × 1 vector Bu are estimated as [86]:

$$L_u = I + \sum_{g=1}^{M} N_u(g)\, T_{[g]}^T \bar{\Sigma}_g^{-1} T_{[g]} \quad \text{and} \qquad (5.26)$$

$$B_u = \sum_{g=1}^{M} T_{[g]}^T \bar{\Sigma}_g^{-1} \tilde{F}_u(g), \qquad (5.27)$$

respectively, where $T_{[g]}$ is the g-th sub-matrix of T of dimension q × R, and $\bar{\Sigma}_g$ is the q × q AFA transformed UBM covariance matrix. The total factors for the utterance u are estimated as:

$$w_u = L_u^{-1} B_u. \qquad (5.28)$$

In each iteration, the g-th block of the T matrix is updated using the following equation:

$$T_{[g]} = \left[\sum_{u\in S} \tilde{F}_u(g)\, w_u^T\right] \left[\sum_{u\in S} \left(L_u^{-1} + w_u w_u^T\right) N_u(g)\right]^{-1}, \qquad (5.29)$$

which follows the same procedure as a conventional i-Vector system [2, 86].
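For a single utterance, the i-Vector of Eqs. (5.26)-(5.28) then follows from the transformed statistics; a minimal sketch under the AFA model (with the transformed, diagonal covariances) is:

```python
import numpy as np

def extract_ivector(N_u, F_tilde_u, T_blocks, Sigma_bar):
    # Posterior mean of the total factors w_u, Eqs. (5.26)-(5.28); T_blocks[g]
    # is the q x R block T_[g] and Sigma_bar[g] the AFA transformed covariance.
    R = T_blocks[0].shape[1]
    L = np.eye(R)
    B = np.zeros(R)
    for g, Tg in enumerate(T_blocks):
        S_inv = np.linalg.inv(Sigma_bar[g])
        L += N_u[g] * (Tg.T @ S_inv @ Tg)   # Eq. (5.26)
        B += Tg.T @ (S_inv @ F_tilde_u[g])  # Eq. (5.27)
    return np.linalg.solve(L, B)            # Eq. (5.28)
```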

5.5 System description

We perform our experiments on the male trials of the NIST SRE 2010 telephone and microphone conditions (core conditions 1-5, extended trials). A standard i-Vector system [2]

with a Gaussian Probabilistic Linear Discriminant Analysis (PLDA) [87] back-end is used for

evaluation. Specific blocks of the baseline system implementation and details of the proposed

scheme are described below. An overall block diagram of the proposed system is included in

Figure 5.4.


5.5.1 Feature extraction

In order to remove the silence frames, an independent Hungarian phoneme recognizer [101] combined with an energy based voice activity detection (VAD) scheme is used. A 60-dimensional feature vector (19 MFCC + Energy + ∆ + ∆∆) is extracted using a 25 ms analysis window with subsequent 10 ms shifts, and then Gaussianized utilizing a 3-s sliding window [56].
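A rough librosa-based approximation of this front-end is sketched below; the VAD and sliding-window Gaussianization steps are omitted, the file name is a placeholder, and c0 is used as a stand-in for the energy term:

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=8000)           # placeholder input
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,       # 20 static coefficients
                            n_fft=int(0.025 * sr),       # 25 ms analysis window
                            hop_length=int(0.010 * sr))  # 10 ms shift
feat = np.vstack([mfcc,
                  librosa.feature.delta(mfcc),
                  librosa.feature.delta(mfcc, order=2)])  # (60, T) features
```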

5.5.2 UBM training

Gender dependent UBMs having full and diagonal-covariance matrices with 1024 mixtures are trained on telephone utterances selected from the Switchboard II Phase 2 and 3, Switchboard Cellular Part 1 and 2, and the NIST 2004, 2005, 2006 SRE enrollment data. We use the HTK toolkit for training, with 15 iterations per mixture split. The UBM full covariance values were floored to 10^-5 using the -v option of the HTK HERest tool [102].

5.5.3 Total variability modeling

For the TV matrix training, the UBM training dataset is utilized. Five iterations are used

for the EM training. We use 400 total factors (i.e., our i-Vector size was 400). All i-Vectors

are first whitened and then length normalized using radial Gaussianization [87].

5.5.4 Session variability compensation and scoring

A Gaussian probabilistic linear discriminant analysis (PLDA) model with a full-covariance

noise process is used for session variability compensation and scoring [87]. As discussed in

Section 3.4.9, the eigenchannel component is not required when the full-covariance noise

model is assumed. In this generative model, an R dimensional i-Vector wu extracted from a

speech utterance u is expressed as:

wu = w0 + Φβ + n (5.30)


Figure 5.4. A block diagram of the proposed AFA integrated i-Vector system. The system is shown in two phases: (a) development and (b) evaluation. In the evaluation phase, only the i-Vector extraction procedure is depicted, assuming an arbitrary classifier. For details on the PLDA classifier used, refer to Section 5.5.4.

where w0 is an R×1 speaker independent mean vector, Φ is the R×Nev rectangular matrix

representing a basis for the speaker-specific subspace/eigenvoices, β is an Nev×1 latent vector

having a standard normal distribution, and n is the R×1 random vector representing the full

covariance residual noise. The only model parameter here is the number of eigenvoices Nev,

that is the number of columns in the matrix Φ. i-Vectors extracted from the UBM training

dataset and additional microphone data selected from SRE 2004 and 2005, are utilized to

train this PLDA model.4

4We would like to thank Dr. Daniel Garcia-Romero for providing the Gaussian PLDA software that is used in this experiment (https://sites.google.com/site/dgromeroweb/software).


5.6 Evaluation results

5.6.1 Performance evaluation of AFA systems

In this evaluation, experiments were performed with the AFA dimension set to q = 36, 42 and 48 (retaining q coefficients from the d = 60 dimensional features) using the proposed method. We vary the number of eigenvoices Nev in the PLDA model from 50 to 400 in steps of 50. The performance metrics used are the % Equal Error Rate (EER) and the minimum Detection Cost Functions (minDCF) defined in NIST SRE 2008 [14] (minDCF'08) and NIST SRE 2010 [15] (minDCF'10). These performance metrics are described in Section 2.5. The results are summarized in the plot shown in Figure 5.5, and a subset of these results, organized by performance metric, is also shown in Table 5.1. The proposed systems are compared against our baseline full-covariance and diagonal covariance UBM based i-Vector systems, referred to as "Baseline full-cov" and "Baseline diag-cov", respectively.

From Figures 5.5(a)-(c), we observe that for q = 42 and for almost all values of Nev,

the proposed AFA system performs better than both baseline systems with respect to all

three performance metrics. For q = 48, the AFA system is superior to the baselines in

minDCF’10, but very close with respect to the other performance measures. For q = 42 and

Nev = 200, we achieve the best EER performance of 1.73%, which is 11.28% lower relative to the corresponding Baseline full-cov system EER. The results in Figure 5.5 and Table 5.1 indicate that the proposed AFA transformation of the acoustic features is successfully able

to reduce nuisance directions in the feature space, producing i-Vectors with better speaker

discriminating ability. We also note that our full-covariance baseline system and AFA based

systems perform significantly better than the diagonal-covariance system.


Figure 5.5. Performance comparison between proposed AFA and baseline i-Vector systems with respect to (a) %EER, (b) minDCF'08 and (c) minDCF'10 for different eigenvoice sizes Nev of the PLDA model. Evaluation is performed on NIST SRE 2010 core condition-5 using the extended trials.


Table 5.1. Performance comparison between baseline i-Vector and proposed AFA systems for different values of Nev and q. Evaluation performed on NIST SRE 2010 core condition-5 extended trials.

PLDA    Baseline system          AFA system
Nev     full-cov   diag-cov      q = 36    q = 42    q = 48

% Equal Error Rate (EER)
100     2.0274     2.4896        2.1706    2.0115    1.9117
150     2.0396     2.5548        2.0632    1.7944    1.9554
200     1.9535     2.4750        1.9756    1.7322    1.9706
250     1.9551     2.4854        2.0216    1.8233    1.9183
300     1.9467     2.5343        2.0980    1.8497    1.9422

minDCF'08 (NIST SRE 2008)
100     0.1145     0.1348        0.1110    0.1073    0.1120
150     0.1124     0.1285        0.1039    0.1014    0.1037
200     0.1033     0.1229        0.1071    0.1015    0.1040
250     0.1053     0.1237        0.1090    0.1017    0.1035
300     0.1061     0.1247        0.1092    0.1009    0.1024

minDCF'10 (NIST SRE 2010)
100     0.4050     0.4526        0.4103    0.4056    0.4057
150     0.4056     0.4365        0.3928    0.3869    0.3635
200     0.4093     0.4444        0.3678    0.3732    0.3468
250     0.4251     0.4501        0.3639    0.3765    0.3620
300     0.4234     0.4428        0.3844    0.3750    0.3473

5.6.2 Effect of different AFA dimension

In Figure 5.6, AFA system performance is compared with the Baseline full-cov system for

different values of q, keeping the parameter Nev fixed at 150. Here we use q = 24, 30, 36, 42, 48

and 54, yielding supervector compression (SVC) ratios of α = 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9,

respectively. From this figure, we observe that system performance is quite sensitive to

the q parameter of the proposed AFA method, though performance improvement is achieved

compared to the baseline system in almost all cases. If the value of q is too low, some speaker

dependent information is removed by the AFA transform and system performance degrades.

Values of q close to the feature dimension d yield performance similar to the baseline system.


Figure 5.6. Performance comparison of the AFA system for different values of q with respect to % Relative Improvement (RI) in %EER, minDCF'08 and minDCF'10, compared to the corresponding baseline system performance metric. Evaluation is performed on NIST SRE 2010 core condition-5 using the extended trials. The figure clearly reveals that system performance drastically degrades as the value of q is reduced.

We observe consistent improvements in system performance by setting q close to 42 ∼ 48

for the AFA systems. In this region, relative improvement values of all three performance

metrics are in the range of 4 ∼ 12%. We believe the fluctuation of performance is due to

the fact that a different value of q is suitable for each mixture component. Thus, methods

of selecting the optimal AFA dimension can be viable, especially since the model allows

different values of q for each mixture. Preliminary work on variable dimension AFA was

published in [29].

5.6.3 Effect of UBM variance flooring

It is known that full covariance UBM based speaker recognition systems can be very sensitive

to small values in the UBM covariance matrices [4]. In [4], a variance flooring algorithm [129]


Table 5.2. UBM covariance matrix flooring function (vFloor-2) [4].

Function: S̄ = floor(S, F)
1. Cholesky decomposition: F = LL^T
2. Normalize target matrix: Q ← L^{-1} S L^{-T}
3. Eigenvalue decomposition: Q = UDU^T
4. Obtain the diagonal matrix D̄ by flooring D to 1: d̄_ii = max(d_ii, 1)
5. Return to full matrix: Q̄ ← U D̄ U^T
6. De-normalization: S̄ ← L Q̄ L^T

was used to tackle this issue. As mentioned in Section 5.5.2, we performed UBM variance

flooring by limiting the minimum value of a covariance matrix component to 10^-5 using

HTK. We refer to this flooring method as “vFloor-1”. To observe the effect of an alternate

variance flooring on the AFA systems, we trained the UBM as described in [4]. In each EM

iteration, the full covariance matrices were processed using the flooring function described

in Table 5.2 [129, 4]. We used the floor matrix $F = f\bar{\Sigma}$, where

$$\bar{\Sigma} = \frac{1}{M}\sum_{g=1}^{M} \Sigma_g \qquad (5.31)$$

is the average covariance matrix,5 and f = 0.1 is set as in [4]. We refer to this flooring

method as “vFloor-2”. Baseline and AFA system results using these two different UBM

flooring methods are summarized in Table 5.3. In this experiment, PLDA size Nev was set

to 150.
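A minimal numpy/scipy sketch of one application of this flooring function (per EM iteration, per mixture) is:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def vfloor2(S, F):
    # Flooring function of Table 5.2: floors the eigenvalues of the covariance
    # S relative to the floor matrix F (here F = f * average covariance).
    L = cholesky(F, lower=True)                 # 1. F = L L^T
    Q = solve_triangular(L, S, lower=True)      # L^{-1} S
    Q = solve_triangular(L, Q.T, lower=True).T  # 2. Q = L^{-1} S L^{-T}
    d, U = np.linalg.eigh(Q)                    # 3. Q = U D U^T
    d = np.maximum(d, 1.0)                      # 4. floor eigenvalues to 1
    Q_bar = (U * d) @ U.T                       # 5. Q_bar = U D_bar U^T
    return L @ Q_bar @ L.T                      # 6. S_bar = L Q_bar L^T
```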

From the results, we observe that the variance flooring vFloor-2 [4] provides slightly improved baseline system performance compared to vFloor-1 with respect to %EER and minDCF'08, but degrades with respect to the minDCF'10 measure. The proposed AFA transformation achieves much better performance over the baseline system when using vFloor-1. AFA provides improvement over the baseline system using vFloor-2 only for q = 54, whereas performance

5Note that this is not the Average Weighted Variance (AWV) measure discussed in the previous chapter, given in Eq. (4.14).


Table 5.3. Performance comparison between baseline i-Vector and different AFA systems using alternate UBM flooring. Evaluations performed on NIST SRE 2010 core condition-5 extended trials.

System              EER       minDCF'08   minDCF'10

UBM variance flooring using vFloor-1
Baseline full-cov   2.03961   0.11236     0.40556
AFA, q = 36         2.04234   0.10417     0.35646
AFA, q = 42         1.79444   0.10143     0.38688
AFA, q = 48         1.95537   0.10375     0.36349
AFA, q = 54         1.90054   0.09884     0.39321

UBM variance flooring using vFloor-2 [4]
Baseline full-cov   1.93923   0.10315     0.41917
AFA, q = 36         1.90352   0.11016     0.39275
AFA, q = 42         2.05354   0.10328     0.38583
AFA, q = 48         2.00150   0.10168     0.38034
AFA, q = 54         1.92451   0.10138     0.39755

improvement is observed for q = 42, 48 and 54 when vFloor-1 is used. This deterioration of AFA system performance can be expected, since the vFloor-2 algorithm modifies the eigenvalues of the covariance matrices on which the AFA approach directly relies. Noting that AFA with vFloor-1 provides the best overall performance and vFloor-2 does not provide sufficient advantage over vFloor-1, we use the vFloor-1 method in all subsequent experiments.

Table 5.4. Common evaluation conditions in NIST SRE 2010.

No.   Train              Test
1     Interview speech   Interview speech from the same microphone
2     Interview speech   Interview speech from a different microphone
3     Interview speech   Telephone speech
4     Interview speech   Telephone speech recorded over a room microphone channel
5     Telephone speech   Telephone speech

Page 127: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

106

Table 5.5. Performance comparison between baseline i-Vector and different AFA systems.Evaluation performed in NIST SRE 2010 core condition-1 extended trials

System EER minDCF’08 minDCF’10Baseline full-cov 2.09767 0.08539 0.31712

AFA

q = 36 2.26966 0.08024 0.28560q = 42 2.07210 0.07921 0.28063q = 48 1.93024 0.07849 0.28756q = 54 2.01850 0.07914 0.33058

Table 5.6. Performance comparison between baseline i-Vector and different AFA systems.Evaluation performed in NIST SRE 2010 core condition-2 extended trials

System EER minDCF’08 minDCF’10Baseline full-cov 3.75464 0.16353 0.53167

AFA

q = 36 3.78150 0.16862 0.48979q = 42 3.58186 0.15783 0.48376q = 48 3.67975 0.16186 0.50176q = 54 3.80477 0.16084 0.50948

Table 5.7. Performance comparison between baseline i-Vector and different AFA systems.Evaluation performed in NIST SRE 2010 core condition-3 extended trials

System EER minDCF’08 minDCF’10Baseline full-cov 3.17154 0.15207 0.45750

AFA

q = 36 3.45395 0.16002 0.48452q = 42 3.15838 0.14754 0.44873q = 48 3.10171 0.14656 0.42633q = 54 2.89653 0.14827 0.43774

Table 5.8. Performance comparison between baseline i-Vector and different AFA systems.Evaluation performed in NIST SRE 2010 core condition-4 extended trials

System EER minDCF’08 minDCF’10Baseline full-cov 2.05830 0.09356 0.26975

AFA

q = 36 2.01237 0.09314 0.30255q = 42 2.01237 0.09314 0.30255q = 48 1.80459 0.09456 0.28637q = 54 1.82594 0.08728 0.27816

Page 128: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

107

5.6.4 Performance in microphone conditions

In this section, we present evaluation results of the proposed systems on the NIST SRE 2010

core conditions 1-4 using the extended trials. In these experiments, additional microphone

data from SRE 2005 and 2006 corpora were included for UBM and TV matrix training. The

PLDA model was trained using both telephone and microphone data as before. The results

are given in Tables 5.5-5.8. We compare the following systems: Baseline full-cov, and AFA

with q = 36, 42, 48 and 54. The PLDA parameter Nev was set to 150. We did not evaluate

the diagonal UBM system in these conditions. The definitions of the common conditions

defined in NIST SRE 2010 are provided in Table 5.4.

From the results, again we observe that the proposed AFA systems consistently outper-

form the baseline system, especially for conditions 1-3. However, it seems a single parameter

setting of q does not always provide the best performance across all the performance metrics.

Considering the best %EER values, the proposed systems achieved 8.14%, 6.43%, 8.67% and

12.33% relative improvements in conditions 1, 2, 3 and 4, respectively. These results demon-

strate the effectiveness of the proposed scheme in the microphone mismatched conditions as

well.

5.6.5 Fusion of multiple systems

We select three of our systems for fusion: (i) Baseline full-cov, (ii) AFA (q = 42) and (iii)

AFA (q = 48). The PLDA Nev parameter was set to 150 for all systems. Simple equal-weight

linear fusion was used with mean and variance normalization of individual system scores to

(0, 1) for calibration. Results are shown for NIST SRE 2010 core condition 5 and pooled

condition (combining all trials from condition 1-5) in Table 5.9 and 5.10, respectively.

From the results, fusion performance of systems (i) and (ii) clearly reveal that AFA

and baseline system have complementary information, since %EER and the minDCF val-

ues improve. This is observed for both telephone and pooled condition. The best result

Page 129: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

108

is achieved by fusing systems (i)-(iii), to obtain 16.52%, 14.47% and 14.09% relative im-

provement in %EER, minDCF’08 and minDCF’10, respectively, compared to the baseline

system in condition-5. In the pooled condition, this fusion provides 13.75%, 14.0% and

11.80% relative improvement in %EER, minDCF’08 and minDCF’10, respectively. Perfor-

mance comparison of the systems (i), (ii) and their fusion for the pooled condition is shown

in Figure 5.7 using Detection Error Trade-off (DET) curves. Here, again we observe the

superiority of the proposed AFA system over the baseline system while the fusion of these

systems consistently provide further improvement in the full DET range.

5.6.6 Computational advantages

In our experiments, we observe that the TV matrix training process using the AFA transform

is computationally less expensive compared to the conventional process. This is expected

since the computational complexity of an i-Vector system is proportional to the supervector

size Md [128], which is reduced to Mq for an AFA based system. Thus, the computational

complexity of the proposed system is theoretically reduced by a factor of 1/α (0 < α < 1)

compared to the baseline system.

5.7 Conclusions

In this chapter, we have proposed an alternate modeling technique to address and compensate

for transmission channel mismatch in speaker recognition. Motivated by the covariance

structure of conventional acoustic features, we developed a factor analysis technique which

operates within the acoustic feature domain utilizing a well trained UBM with full covariance

matrices. We advocated that conventional supervector domain factor analysis methods fail

to take advantage of the observation that speech features reside in a lower dimensional

manifold in the acoustic space. The proposed acoustic factor analysis scheme was utilized to

develop a mixture-dependent feature transformation that performs dimensionality reduction,

Page 130: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

109

Table 5.9. Linear equal-weight score fusion performance of Baseline i-Vector and proposedsystems for NIST SRE 2010 core Condition-5

Individual system performances

System %EER minDCF’08 minDCF’10(i) Baseline full-cov 2.03961 0.11236 0.40556(ii) AFA (q = 42) 1.79444 0.10143 0.38688(iii) AFA (q = 48) 1.95537 0.10375 0.36349

Fusion system performances

1 Fusion of (i) & (ii) 1.77162 0.09704 0.366102 Fusion of (i) - (iii) 1.70258 0.09610 0.34839

Table 5.10. Linear equal-weight score fusion performance of Baseline i-Vector and proposedsystems for NIST SRE 2010 Core Conditions 1-5 pooled

Individual system performances

System %EER minDCF’08 minDCF’10(i) Baseline full-cov 3.02720 0.13995 0.46022(ii) AFA (q = 42) 2.86091 0.13316 0.43030(iii) AFA (q = 48) 2.88596 0.13615 0.43086

Fusion system performances

1 Fusion of (i) & (ii) 2.69742 0.12199 0.414592 Fusion of (i) - (iii) 2.61094 0.12035 0.40591

de-correlation, normalization and enhancement at the same time. Finally, the transformation

was effectively integrated within a standard i-Vector-PLDA based speaker recognition system

using a probabilistic feature alignment technique. The superiority of the proposed method

was demonstrated by experiments performed using the NIST SRE 2010 extended trials across

five core conditions. Measurable improvements over two baseline systems were shown in

terms of EER, min minDCFs and DET curves.

The observations in this chapter suggest that a linear transformation of acoustic features

in different mixture components of the UBM can be an effective way of session variability

compensation. In the next chapter, we study the prospects of applying various traditional

linear transformations in this manner. In essence, we treat the AFA strategy as a frame-work

Page 131: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

110

0.1 0.2 0.5 1 2 5 10

0.5

1

2

5

10

20

False Alarm probability (in %)

Mis

s pr

obab

ility

(in

%)

Baseline full-cov

AFA (q = 42)

Fusion

Figure 5.7. Performance comparison of baseline, AFA and fusion systems using DET curves.Evaluation is performed by pooling results of the core conditions 1-5 of NIST SRE 2010extended trials. (i) Baseline i-Vector system using Full Covariance UBM (Baseline full-cov),(ii) AFA i-Vector system (q = 42), and (iii) Equal-weight linear fusion of systems (i) & (ii).

for mixture-dependent transforms, and aim to investigate different transformation matrices

that do not originate from a factor analysis model.

Page 132: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

CHAPTER 6

MIXTURE-DEPENDENT FEATURE TRANSFORMATIONS

State-of-the-art session variability compensation for speaker recognition are generally based

on various linear statistical models of the Gaussian Mixture Model (GMM) mean super-

vectors, while front-end features are only processed by standard normalization techniques.

Motivated by the Acoustic Factor Analysis (AFA) framework discussed in Chapter 5, we

propose a front-end channel compensation frame-work using mixture-localized linear trans-

forms that operate before the supervector domain modeling begins. In this approach, local

linear transforms are trained for each Gaussian component of a Universal Background Model

(UBM), and then applied to the acoustic feature dimensions according to their mixture-

wise probabilistic alignment, yielding an operation that is globally non-linear. We examine

Principal Component Analysis (PCA), whitening, Linear Discriminant Analysis (LDA) and

Nuisance Attribute Projection (NAP) as front-end feature transformations. We also propose

a method, Nuisance Attribute Elimination (NAE), which is similar to NAP but performs

dimensionality reduction in addition to channel compensation. All of these techniques are

known to work in the supervector/i-Vector domain, as previously discussed as background

in Chapter 3. We show that the proposed frame-work can be readily integrated with a stan-

dard i-Vector system by simply applying the transformations on the first order Baum-Welch

statistics and transforming the UBM. Experiments performed on the telephone trials of the

NIST SRE 2010 demonstrate significant performance gain from the proposed frame-work,

especially using LDA as the front-end transformation. The proposed work in this chapter is

published in [31].

111

Page 133: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

112

6.1 Motivation

Despite the success of linear statistical models in speaker recognition and in other pattern

classification tasks in general [130], acoustic features are not generally compensated using

these techniques. Popular front-end channel compensation methods such as Mean and Vari-

ance Normalization (MVN) and Gaussianization [56] rely on normalizing the coefficients

based on their temporal statistics alone. Linear statistical methods such as LDA and PCA

have also been applied on acoustic features [119, 118, 131] in speech and speaker recognition.

However, when advanced supervector domain compensation techniques are considered, the

impact many feature domain normalization techniques become insignificant [116, 127].

In this chapter, we propose an effective frame-work for utilizing linear statistical methods

on acoustic features as a pre-processing stage, before the supervector domain modeling be-

gins. This frame-work originates from the AFA approach presented in the previous chapter.

We first train a UBM on the development dataset and derive PCA, LDA, NAP and whiten-

ing transformation matrices for each GMM mixture. We also propose a new dimensionality

reduction transformation similar to NAP, termed Nuisance Attribute Elimination (NAE).

Conventionally, when GMMs are used for feature clustering and transformation, the most

likely mixture component is obtained given the input feature vector and the corresponding

transform is used [116]. This approach assumes that one feature vector aligns with a single

Gaussian mixture only, and new models need to be trained from the transformed feature

set. In the proposed frame-work, instead of returning to the feature space and retraining

the UBM, the transformations are applied to the first order Baum-Welch statistics and the

UBM itself. In this way, the front-end processing can be effectively integrated into a standard

i-Vector PLDA based system.

Page 134: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

113

6.2 Proposed method

6.2.1 Mixture-wise feature transformation

Let X = {xn|n = 1 . . . T} be the collection of all d dimensional feature vectors from the

development dataset. Let us define a transformation matrix A, and transformed feature

vectors zn, so that,

zn = A(xn − µ). (6.1)

Here, xn represents the d × 1 dimensional feature vector obtained from X , A is a d × q

transformation matrix where q ≤ d, and µ is the d × 1 mean vector of xn. The matrix

A could be obtained from any linear statistical method, such as PCA, LDA, NAP, etc. If

q < d, this transformation performs dimensionality reduction. Considering the variability

of acoustic features in various environmental conditions and phonetic context, we presume

that different regions of the feature space should have a unique transform. Thus, we utilize

a UBM Λ0 for clustering the acoustic features, given by,

p(xn|Λ0) =M∑

g=1

πgexp

[−1

2(xn − µg)TΣ−1

g (xn − µg)]

(2π)d/2|Σg|1/2

where πg is the mixture weight, µg and Σg represent the mixture mean vector and covariance

matrix, and M is the number of mixtures. The transformed feature vector in the g-th mixture

is:

zn,g = Ag(xn − µg) (6.2)

where Ag is now a mixture dependent transformation. It can be shown that zn,g has a zero

mean and a covariance matrix given by:

Σzg = AgΣgATg . (6.3)

Page 135: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

114

Thus, after this mixture dependent transformation is applied, the UBM Λ0 is replaced by a

transformed UBM model Λ0, given by,

p(z|Λ0) =M∑

g=1

πgN (0,Σzg). (6.4)

6.2.2 Integration within the i-Vector system

After feature extraction and UBM training, the first step of training a total variability

matrix/i-Vector extraction is estimating the zero and first order Baum-Welch statistics.

These statistics are computed from acoustic features with respect to the UBM model. For

an utterance u, the zero order statistics, also known as the probabilistic count for each

mixture, and are extracted as,

Nu(g) =∑

n∈u

γn(g), where γn(g) = p(g|xn,Λ0). (6.5)

In the proposed frame-work, the first order statistics Fu(g) are extracted using the trans-

formed feature vectors in the corresponding mixtures, instead of the original acoustic features

xn.

Fu(g) =∑

n∈u

γn(g)zg,n = Ag

n∈u

γn(g)(xn − µg) (6.6)

As expected, this is simply a transformed version of the centralized first order statistics [23].

Each feature vector is thus transformed according to its alignment with different mixtures

that are locally effective in performing channel compensation. This process is similar to

a mixture of experts model [132] for front-end channel compensation. The rest of the i-

Vector system procedure follows the conventional approach, with acoustic feature dimension

q and UBM model Λ0. Also the supervector dimension reduces to K = Mq from Md, and

the TV matrix size becomes K × R. We define a parameter supervector compression ratio

represented as α = K/Md, measuring overall dimension reduction in the system.

Page 136: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

115

6.2.3 Mixture-wise PCA (m-PCA)

Here, we describe how a mixture-wise PCA [77] is implemented in the proposed frame-work.

First, a full covariance UBM Λ0 is trained on the development data. Next, for each mixture

covariance matrix Σg, the eigenvalue decomposition is performed as:

Σg = UTg ΛgUg (6.7)

where the columns of Ug contain the eigenvectors of Σg, and Λg contains the corresponding

eigenvalues in its main diagonal. Retaining the first q principal directions in the feature

space within this mixture, the g-th transformation matrix is defined as:

APCA-q[g] = UT[q]g (6.8)

where U[q]g contains the first q columns of Ug corresponding to the largest eigenvalues. The

transformed covariance matrix is:

Σzg = U[q]gΣgUT[q]g = Λ[q]g (6.9)

where Λ[q]g is a q× q diagonal matrix containing the first q largest eigenvalues of Σg. In this

case, the modified UBM is given by,

p(z|ΛPCA) =M∑

g=1

πgN (0,Λqg). (6.10)

Thus, using this transformation, the acoustic features are de-correlated, and when q < d is

set, the least important dimensions in the acoustic space are also suppressed.

6.2.4 Mixture-wise Whitening (m-WHT)

This is very similar to PCA, except that the transformation whitens the features in each

mixture in addition to de-correlating them. The least important dimensions can be removed

Page 137: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

116

using this transform in the same way as PCA. The whitening transformation for the g-th

mixture retaining q ≤ d components, is given by:

AWHT-q[g] = Λ− 1

2

[q]gUT

[q]g . (6.11)

In this case, the new UBM ΛWHT has an identity covariance matrix.

p(z|ΛWHT) =M∑

g=1

πgN (0, I). (6.12)

6.2.5 Mixture-wise LDA (m-LDA)

To implement a mixture-wise LDA transformation, we need the mixture dependent within

class and between class scatter matrices, Swg and Sbg, respectively. We also need develop-

ment data speaker labels. We proceed as follows. For each speaker s ∈ S, we compute the

speaker dependent mean vector for the g-th mixture as:

xgs =1

ns

n∈s

γn(g)xn. (6.13)

Here, ns is the total number of feature frames belonging to the speaker s. From total S

speakers’ data, the between class and within class scatter matrices for each mixture is then

computed as:

Sbg =1

S

S∑

s=1

Ns(g)(xgs − µg)(xgs − µg)T , (6.14)

Swg =1

S

S∑

s=1

n∈s

γn(g)(xn − xgs)(xn − xgs)T (6.15)

whereNs(g) =∑

n∈s γn(g) is the probabilistic count of mixture g for speaker s. Next, the g-th

LDA transformation matrix is computed through the following eigenvalue decomposition:

Sw−1g Sbg = VT

g DgVg. (6.16)

Page 138: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

117

Here, Vg contains the eigenvectors as its columns and Dg contains the corresponding eigen-

values as its main diagonal. If V[q]g denotes the matrix containing q ≤ d columns of Vg

corresponding to the q largest eigenvalues, the LDA transform matrix is given by:

ALDA-qg = VT[q]g . (6.17)

Using this transformation, the transformed UBM covariance matrices can be computed using

Eq. (6.3).

One common problem in implementing multi-class LDA occurs when the feature dimen-

sion becomes larger than the number of classes, leading to singular between/within-class

covariance matrices [130]. For this reason, LDA is generally applied to the lower dimen-

sional i-Vectors [2] instead of the GMM mean supervectors. When applying LDA on the

acoustic features in the proposed frame-work, we observe that if the number of speakers S

is larger than the acoustic feature dimension d (a condition which can be easily met) the

between class covariance matrices in Eq. (6.14) should always be full rank. However, if in

a given mixture, Ns(g) is zero for a large number of speakers, Sbg can be low rank. Such

cases are very rare, since the same corpus will be used to train the UBM and estimate these

matrices. Similarly, to ensure that the Sbg matrices in Eq. (6.15) are non-singular, most of

the posterior probabilities for each speaker and mixture should be greater than zero.

In order to verify if these conditions are met in our system, we train a 1024 mixture

UBM using 60-dimensional MFCC features extracted from our development data set. The

full dataset X contains a total of 162, 093, 376 frames obtained from 984 speakers. The

probabilistic counts NX (g) =∑

n∈X p(g|xn,Λ0) are calculated for each mixture across the

entire dataset, and Ns(g) values for each speaker s ∈ S and mixture g is computed. The

probability distributions of NX (g) and Ns(g) are then estimated using normalized histograms

and shown in Figure 6.1. We obtain the distributions p(NX (g)) and p(Ns(g)) shown from

M = 1024 and MS = 1024× 984 data points, respectively. Here, we observe that for most

Page 139: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

118

10−10 10−8 10−6 10−4 10−2 100 102 104 106 108

0

0.05

0.1

Mixture-wise probabilistic count

Pro

bab

ility

p(NX (g))

p(Ns(g))

Figure 6.1. Distribution of mixture-wise probabilistic feature count. The distributionsp(NX (g)) and p(Ns(g)) are obtained from 1024 mixture counts for all data, and comput-ing the same for each 984 speakers, respectively.

cases Ns(g) > 10−2 and NX (g) > 102. Since we have S = 984 speakers, Ns(g) ∼ 0 for some

mixtures and therefore, a few speakers cannot make Sbg low-rank. However, if NX (g) is

close to zero for a mixture, it can lead to a singular Swg matrix. If this occurs, we do not

perform the LDA transformation in that mixture and use an identity matrix instead.

6.2.6 Mixture-wise NAP (m-NAP)

The NAP algorithm was originally proposed in [21]. In this method, the feature space is

transformed using an orthogonal projection in the channel’s complementary space, which

depends only on the speaker. The projection is calculated using the within-class covariance

matrix. To apply NAP on acoustic features, we define a d × d projection matrix [21] of

co-rank k < d for the g-th mixture as:

Pg = I−W[k]gWT[k]g , ANAP-kg (6.18)

Page 140: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

119

where W[k]g is a rectangular matrix of low rank whose columns are the k principal eigenvec-

tors of the matrix Swg in Eq. (6.15). The transformed UBM covariance matrices are found

using Eq. (6.3):

Σzg = PgΣgPTg . (6.19)

Since NAP removes some nuisance directions from the feature space in each mixture, the

operation in Eq. (6.19) on the mixture covariance matrices Σg, results in rank-deficient, and

thus non-invertible transformed matrices Σzg . To avoid inverting Σzg , we use its pseudo-

inverse in our solution, which is calculated using the Singular Value Decomposition (SVD)

method presented in [133]. We note that, NAP does not reduce the feature dimension. Thus,

the supervector compression ratio α = 1 in this case.

6.2.7 Mixture-wise Nuisance Attribute Elimination (m-NAE)

We propose a dimensionality reduction transformation that uses the same principles as NAP,

but eliminates the nuisance directions from the feature space instead of projecting them out.

In this way, the transformed UBM covariance matrices are smaller in size, but are still full

rank and invertible. For the proposed method, we transform the features using the first

q = (d − k) eigenvectors corresponding to the largest eigenvalues of Swg denoted by W[q]g .

Here, k is the number of dimensions eliminated. The NAE transform is given by:

ANAE-qg = WT[q]g . (6.20)

Here, the acoustic features are dimensionality reduced from d to q.

6.3 Experiments

We perform our experiments on the male trials of NIST SRE 2010 telephone train/test

condition (condition 5, normal vocal effort). A standard i-Vector system with a Gaussian

Probabilistic Linear Discriminant Analysis (PLDA) back-end is used for the evaluation.

Different blocks of the system is described below.

Page 141: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

120

6.3.1 Feature extraction

For voice activity detection (VAD), a phoneme recognizer [101] and energy based scheme is

used. A 60-dimensional feature vector (19 MFCC +Energy + ∆ + ∆∆) is extracted, using

a 25 ms analysis window with 10 ms shift and filtered by feature warping using a 3-s sliding

window [56].

6.3.2 UBM training

A gender dependent full-covariance UBM with 1024 mixtures is trained on utterances selected

from Switchboard II Phase 2 and 3, Switchboard Cellular Part 1 and 2, and the NIST 2004,

2005, 2006 SRE enrollment data. For training, we used the HTK toolkit with up to 15

iterations per mixture split.

6.3.3 Total variability modeling

For the total variability matrix training, the UBM training dataset is utilized. The i-Vector

dimension was set to 400, with all i-Vectors first whitened and then length normalized [87].

6.3.4 Back-end channel compensation and scoring

A Gaussian PLDA with full-covariance noise model is used for both session variability com-

pensation and scoring. In this model, the only free parameter is the number of Eigenvoices

Nev, which was set to 150. This approach is the same as utilized in Section 5.5.4.

6.4 Evaluation results

The results from our experiments are summarized in Table 6.1. The “Baseline” system refers

to the traditional i-Vector PLDA system. For the proposed front-end channel compensation

methods m-PCA, m-WHT, m-LDA, m-NAP and m-NAE, various parameter values shown

Page 142: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

121

Table 6.1. Comparison between baseline i-Vector and proposed systems with respect to%EER, minDCF’08 and minDCF’10 for Nev = 150. Percent relative improvement (%r) andsupervector compression ratio (α) are also shown.

System α %EER/%r minDCF’08/%r minDCF’10/%rBaseline 1.00 2.13284 0.11308 0.39845

Method Parameter

m-PCAq = 42 0.70 1.827/14.35 0.106/6.45 0.397/0.32q = 48 0.80 2.030/4.82 0.108/4.56 0.363/8.92q = 60 1.00 1.899/10.98 0.105/7.40 0.387/2.90

m-WHTq = 42 0.70 1.887/11.55 0.105/7.29 0.379/4.79q = 48 0.80 1.908/10.54 0.109/3.56 0.372/6.60q = 60 0.80 1.920/9.96 0.108/4.79 0.381/4.38

m-LDAq = 36 0.60 2.065/3.20 0.105/6.86 0.384/3.71q = 40 0.66 1.718/19.43 0.096/15.05 0.40/-0.47q = 48 0.80 1.857/12.94 0.107/5.70 0.389/2.49

m-NAPk = 5 1.00 2.011/5.71 0.113/0.22 0.411/-3.2k = 10 1.00 2.130/0.11 0.115/-1.3 0.413/-3.7k = 20 1.00 2.108/1.16 0.117/-3.7 0.44/-10.3

m-NAEk = 5 0.92 1.982/7.05 0.112/0.79 0.416/-4.4k = 10 0.83 2.079/2.54 0.106/6.55 0.418/-4.8k = 20 0.66 2.120/0.61 0.122/-7.8 0.45/-11.6

in Table 6.1 are used. The performance metrics used are: Equal Error Rate (%EER), and

minimum Detection Cost Functions of NIST SRE 2008 (minDCF’08) and 2010 (minDCF’10)

[15]. From the results, we observe that m-PCA and m-WHT can generally improve the

system performance up to ∼ 10% relative to the baseline. Improvements are observed

for both with and without dimensionality reduction (i.e., 60 dimensions correspond to no

dimensionality reduction, α = 1.00). The m-LDA method provides the best performance of

all the transforms. An EER of 1.718% is obtained, yielding a 19.4% relative improvement

compared to the baseline system when q = 40 is used. The techniques m-NAP and m-

NAE performed worse compared to m-PCA, m-WHT and m-LDA, with the proposed m-

NAE technique generally outperforming m-NAP. Given the simplicity of the transforms

Page 143: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

122

Table 6.2. Linear score fusion of baseline and proposed systems

Individual system performances

System %EER minDCF’08 minDCF’10(i) Baseline 2.13284 0.11308 0.39845(ii) m-PCA42-i-Vector 1.82672 0.10579 0.39719(iii) m-WHT42-i-Vector 1.88659 0.10484 0.37935(iv) m-LDA40-i-Vector 1.71848 0.09606 0.40034

Fusion system performances

1 Fusion of (i) & (ii) 1.81949 0.09845 0.356952 Fusion of (i) & (iii) 1.77720 0.09817 0.364363 Fusion of (i) - (iv) 1.68627 0.09307 0.35549

used, the performance gains clearly demonstrate the effectiveness of the proposed channel

compensation scheme.

In Table 6.2, fusion performance of the following four systems are presented: (i) Base-

line, (ii) m-PCA42, (iii) m-WHT42, and (iv) m-LDA40. From these results, it is clear that

the proposed systems provide complimentary information compared to the baseline system.

The best performance is attained by fusing all four systems to reach a performance of,

EER = 1.686%, minDCF’08 = 0.093 and minDCF’10 = 0.355. Finally, in Figure 6.2, the

performance comparison of these systems are shown using a Detection Error Tradeoff (DET)

curve.

6.5 Conclusions

In this chapter, we have shown that the AFA frame-work can be extended to develop a

channel compensation strategy utilizing various linear statistical methods operating in each

mixture component of a UBM. Mixture-localized formulations of PCA, LDA, whitening and

NAP were described in the proposed frame-work. A new transformation termed nuisance at-

tribute elimination was also presented. Instead of regenerating the acoustic features, mixture-

localized transforms were applied to the UBM and the first-order Baum-Welch statistics, and

thus, were integrated within a standard i-Vector PLDA speaker recognition system. Exper-

Page 144: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

123

0.1 0.2 0.5 1 2 5 10 20 40

0.1

0.2

0.5

1

2

5

10

20

False Alarm probability (in %)

Mis

s pr

obab

ility

(in

%)

Baselinem−WHT−42m−PCA−42m−LDA−40Fusion

Figure 6.2. Performance comparison between proposed, baseline and fusion systems demon-strated using Detection Error Trade-off (DET) curves.

iments were performed on NIST SRE 2010 telephone trials demonstrating the effectiveness

of the proposed channel compensation frame-work. Significant performance improvements

compared to the baseline system were obtained when using LDA as a front-end transforma-

tion.

Page 145: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

CHAPTER 7

MAXIMUM-LIKELIHOOD ACOUSTIC FACTOR ANALYSIS

Motivated by our AFA method presented in Chapter 5, in this chapter, we utilize a mixture

of Acoustic Factor Analyzers (AFA) to model the acoustic features instead of a GMM-

UBM. Following the AFA technique, this model is based on the assumption that the speaker

relevant information lies in a lower dimensional subspace in the multi-dimensional feature

space localized by the mixture components. Unlike the previous AFA method, here we train

the AFA-UBM model directly from data using an Expectation-Maximization (EM) algorithm

instead of transforming a previously trained UBM as discussed in Section 5.4.2. This method

will show robustness to noise as the nuisance dimensions are removed in each EM iteration.

Two variants of the AFA model will be considered utilizing an (i) isotropic and (ii) diagonal

covariance residual term. The method will be integrated within a standard i-Vector system

where the hidden variables of the model, termed acoustic factors, are used as the input for

total variability modeling. Experimental results are obtained on the 2012 National Institute

of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) core-extended

trials, where the proposed strategy will be assessed in both clean and noisy conditions. A

preliminary version of the work presented in this chapter is published in [32].

7.1 Motivation

The issue of channel variability has been carefully studied in recent years [2, 23] leading to

several breakthroughs in this area. Various compensation strategies have been proposed in

the past to reduce unwanted variability between training and test utterances, while retaining

the speaker identity information. To address issues related to noisy and channel degraded

124

Page 146: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

125

conditions, the most effective techniques operate on the utterance models, including GMM

supervectors [20] and various factor analysis schemes built in this domain [80, 23], as well as i-

Vectors with Probabilistic Linear Discriminant Analysis (PLDA) based classifiers along with

various pre-processing techniques [100, 4, 87]. Robust feature development [134, 135, 136],

enhancement [122, 112], effective front-end compensation methods [56, 58, 137] and score

domain techniques have also been considered [6, 138] for mismatch compensation. Many

techniques have evolved and are being replaced by new variants over the last decade, but for

short-term spectrum based systems, a GMM is generally used as the background model.

Since the advent of i-Vectors, the most effective and convenient way of dealing with

mismatched conditions has been to include degraded data similar to the test utterances in the

PLDA training. Such utterances can also be included during the UBM and i-Vector extractor

training. We have observed this during the recent NIST SRE 2012, where additive noise and

mixed duration utterances were introduced in test conditions. One straightforward solution is

to add noisy and mixed duration data into the PLDA training phase [139, 8]. Even though

PLDA is a linear model, it seems to be quite effective for additive noise, convoluational

channel and duration variability. This work, however, is motivated by the presumption that

improved solutions to noise robustness can lie in earlier stages of the system, especially where

the degraded features are being modeled for the first time.

7.1.1 Connection with AFA

In Chapter 5, we proposed a factor analysis scheme for front-end features that operates on

different mixtures of the UBM. The principal motivation of that approach was the assumption

that acoustic features reside in a lower dimensional subspace, similar to the assumption made

on the GMM supervectors. The technique operates on the first order Baum-Welch statistics

in each mixture with a transformation matrix, effectively reducing the feature dimension

Page 147: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

126

within the model. Integrated within an i-Vector system, this method led towards a two-

stage factor analysis scheme for speaker recognition. We also showed the similarity of the

AFA technique between signal sub-space based speech enhancement schemes [28].

In this chapter, we take the AFA concept further by completely replacing the traditional

UBM with a Mixture of Factor Analyzers (MFA) model and propose an i-Vector extraction

strategy that utilizes the first order statistics of the hidden variables (i.e., the acoustic fac-

tors), instead of the acoustic features. In our past studies [28, 30], we derived an AFA model

from a full-covariance UBM which utilized an isotropic residual term, making it equivalent

to a Probabilistic Principal Component Analyzer (PPCA) [77] model. The method was in-

terpreted as a transformation of acoustic features in different mixture components, which in

effect would also transform the UBM.

7.1.2 AFA in noisy conditions

In our experiments during the NIST SRE 2012 evaluation [24], we observed that extracting

the AFA model parameters from the full-covariance UBM would degrade system performance

in noisy conditions. Our UBM dataset was clean, which led us to believe that the sub-

spaces learned from the eigen-decomposition of the full-covariances are not as useful in the

separation of the signal from the noisy sub-space. Later, we added noisy data into the UBM,

but that by itself was not sufficient for the method to be effective. Next, we hypothesize

that this could be due to the full-covariance model training which considered the full feature

space in each iteration, leading to a mixture model which is already affected by the noisy

directions. This implies that the noisy directions in each mixture should have been removed

in each iteration.

Motivated by these observations made during the NIST SRE 2012 preparation, in this

chapter we propose to utilize a mixture of AFA model in place of a UBM to develop an i-

Vector system. We consider the scenarios where the model residual is isotropic and diagonal,

Page 148: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

127

NM

Ψg

Wg

µgxn

yn g

N (I,0)πg

Figure 7.1. Probabilistic graphical model of a Mixture of Factor Analyzer (MFA) modelfor acoustic features. The box on the right denotes a ‘plate’ representing a dataset of Nindependent observations of acoustic features xn. Here, yn are the hidden variables, oracoustic factors, and g indicate the responsible mixture component in the model. The boxon the left represent the parameters of the g-th model component out of a total of Mmixtures.

leading to a mixture of PPCA model [77] and diagonal MFA model [140], respectively. These

models are iteratively trained using an EM algorithm. The advantage of using these models

when training a UBM with noisy data is that they only consider the dominant directions

of the feature space in each mixture, providing more robustness to the noisy test data.

However, as will be demonstrated shortly, significant improvement can be obtained through

the method if only the posterior statistics of the hidden variables are utilized for the i-Vector

extraction. This confirms the original motivation of the earlier AFA method presented in

[28], that speaker dependent information resides within the first few dominant directions in

the feature space. It should be noted that this observation was previously revealed for model

adaptation for speech recognition in [115].

7.2 Proposed method

In this section, we describe the proposed model of acoustic features, discuss its formulation

and EM-training steps and application in an i-Vector based speaker verification system.

Page 149: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

128

−8 −6 −4 −2 0 2 4−20

−15

−10

−5

0

5

10

x

y

(a)

−8 −6 −4 −2 0 2 4−20

−15

−10

−5

0

5

10

x

y

(b)

−8 −6 −4 −2 0 2 4−20

−15

−10

−5

0

5

10

x

y

(c)

−8 −6 −4 −2 0 2 4−20

−15

−10

−5

0

5

10

x

y

(d)

Figure 7.2. Scatter plot of synthetic 2D Gaussian data with four clusters and trained mixturemodels. Means are shown as blue points while ellipses depict the covariance matrices. (a)Diagonal covariance GMM, (b) full covariance GMM, and (c) mixture of PPCA model and(d) mixture of factor analyzers (MFA) model showing a single dominant direction (of twodimensions) in each mixture component.

7.2.1 Maximum Likelihood - Acoustic Factor Analysis (ML-AFA)

The basic formulation of ML-AFA is the same as AFA as discussed in Section 5.2.1. Let

x ∈ Rd represent the acoustic feature vectors and X = {xn|n = 1 . . . T} denote the collection

of development data. Using a standard factor analysis model [141, 142], the feature vector

x can be represented by,

x = Wy + µ+ ε. (7.1)

Page 150: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

129

Here, W is a d× q factor loading matrix that represents q < d bases spanning the sub-space

corresponding to the important variability in the feature space, and µ is the d × 1 mean

vector. Following our terminology in [28, 29, 30], we denote the latent variable vector or

latent factors y ∼ N (0, I), as acoustic factors, which is of dimension q × 1. The remaining

variability in the data is modeled by the noise component ε ∼ N (0,Ψ). In this model, the

feature vectors are normally distributed such that, x ∼ N (µ,Ψ + WWT).

Naturally, acoustic features extracted from speech data containing many different chan-

nel/noise variations are better modeled using clusters in the feature space. Thus, we utilize a

mixture of AFA models [28] similar to a traditional GMM-UBM. In this case, the probability

density function of xn is given by,

p(xn) =M∑

g=1

πgp(xn|g), (7.2)

where πg are the weights corresponding to the g-th mixture component, M is the total

number of mixtures, and p(xn|g) ∼ N (µg,Cg). Here, the model covariance matrix for each

component is given by,

Cg = Ψg + WgWTg . (7.3)

Figure 7.1 shows a probabilistic graphical model of this mixture model. In Chapter 5, we

assumed ε to be isotropic, that is Ψg = σ2gI where σ2

g denotes the average noise power, and the

AFA model parameters were derived from a full-covariance GMM-UBM. In this chapter, we

obtain the Maximum-Likelihood (ML) formulations of the mixture of AFA model assuming

Ψg to be isotropic and diagonal. This model, trained similar to a GMM, essentially replaces

the UBM model of the speaker verification system and leads to a new method for extracting

i-Vectors.

The learning behavior of these models is illustrated in Figure 7.2. Synthetic 2-dimensional

Gaussian data points distributed in four clusters are used for this example. The diagonal-

covariance GMM, full-covariance GMM and a mixture of PPCA model is used with four

Page 151: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

130

mixtures, and the mean and covariances of the models are shown in Figures 7.2 (a-d) as points

and ellipsoids, respectively. As expected, the diagonal model is insensitive to the dominant

direction of the data, while the full covariance model is able to take this into account. The

PPCA model with q = 1 finds the dominant direction of the data and considers the other

direction as noise. The covariance shown here in Figure 7.2 (c) corresponds to the model

covariance Cg. In the proposed approach, we only consider data variation in the dominant

directions as detected by the model.

7.2.2 Isotropic residual noise

In this scenario, we assume that the noise covariance matrix in each mixture Ψg = σ2gI is

isotropic. This leads to the standard PPCA model as derived in [77]. The EM algorithm pro-

cedure for a mixture of PPCA model is as follows. In the first step, the following parameters

are computed given the initial or old parameter estimates Λ = {πg,µg,Cg}:

γn(g) = p(g|xn,Λ) =p(xn|g,Λ)πgp(xn|Λ)

, (7.4)

πg =1

N

N∑

n=1

γn(g), (7.5)

µg =

∑Nn=1 γn(g)xn∑Nn=1 γn(g)

, and (7.6)

Sg =1

Nπg

N∑

n=1

γn(g)(xn − µg)(xn − µg)T. (7.7)

Here, πg and µg are the new estimates for the weights and mean vectors, respectively. Next,

the new values, Wg and σ2g can be obtained by:

Wg = SgWg(σ2gI + M−1

g WTg SgWg)

−1, and (7.8)

σ2g =

1

dtr(Sg − SgWgM

−1g WT

g ), (7.9)

where Mg = σ2gI + WT

g Wg. The posterior covariance matrix of the distribution p(yn|xn, g)

is given by σ2gM

−1g . The posterior distribution of the acoustic factors for the g-th mixture is

Page 152: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

131

given by:

p(yn|xn, g) = N(yn|M−1

g WTg (xn − µg), σ2

gM−1g

). (7.10)

The updated model covariance is obtained using Eq. (7.3). Using the updated parameters,

Eqs. (7.4)–(7.7) are utilized in the next iteration. We denote this model by: ML-AFAiso.

The posterior expected mean and covariance matrix of the acoustic factors are yn given by:

〈yn|g〉 = M−1g WT

g (xn − µg) and (7.11)

〈ynyTn |g〉 = σ2gM

−1g + 〈yn|g〉〈yn|g〉T, (7.12)

respectively.

7.2.3 Diagonal residual noise

Here, we assume that Ψg is diagonal. In this case, the q dominant directions represented

by the factor loading matrix Wg are no longer the principal components. Similar to the

PPCA case, the update equations for the diagonal covariance AFA model can be obtained

through maximization of the complete data likelihood function. Details of this derivation

are provided in Appendix A. The new values of πg and µg are obtained through equations

Eqs. (7.4)–(7.7) as before. Update equations for Wg and Ψg are as follows:

Wg = SgΨ−1g Wg

[I + M−1

g WTg Ψ−1

g SgΨ−1g Wg

]−1and (7.13)

Ψg = diag(Sg − SgΨ

−1g WgM

−1g WT

g

), (7.14)

where

Mg = Iq + WTg Ψ−1

g Wg. (7.15)

The diag(·) operation in Eq. (7.14) retains only the diagonal elements of the matrix. In this

case, the posterior distribution of the acoustic factors for the g-th mixture is given by:

p(yn|xn, g) = N(yn|M−1

g WTg Ψ−1

g (xn − µg),M−1g

). (7.16)

Page 153: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

132

(a)

0

10

20

30

40

50

60

0

10

20

30

40

50

60

−2000

−1000

0

1000

2000

3000

4000

x

Patrial super-vector covariance matrix for a mixture (FULL)

y

(b)

0

10

20

30

40

50

0

10

20

30

40

50

−8000

−6000

−4000

−2000

0

2000

4000

6000

8000

x

Patrial super-vector covariance matrix for a mixture (PPCA)

y

(c)

010

2030

4050

60

0

10

20

30

40

50

60−1

−0.5

0

0.5

1

1.5

x

Covariance matrix of a UBM mixture (FULL)

y

Figure 7.3. Partial super-covariance matrices and a UBM covariance matrix obtained from aGMM and AFA model. The super-covariance is estimated using the total variability matrixT. (a) Partial super-covariance matrix of mixture-1 for a full covariance GMM-UBM. (b)Partial super-covariance matrix of mixture-1 for an AFAiso UBM model (q = 42). (c) Thefull-covariance matrix of the GMM-UBM obtained from mixture-1.

As before, the updated model covariance is obtained using Eq. (7.3) and Eqs. (7.4)–(7.7) are

utilized in the next iteration with the resulting new parameters. We denote this variant of

the model as: ML-AFAdiag.

7.2.4 i-Vector extraction

Conventionally, the i-Vectors are extracted using the zero and first order statistics calculated

from the features with respect to the UBM model. Next, as we replace the UBM model with

the AFA model (isotropic/diagonal), it is still possible to proceed as before by computing

the statistics in the traditional way considering the model as a GMM with parameters Λ =

{πg,µg,Cg}. In this case, the model covariance matrices Cg are restricted depending on the

type of model used (isotropic/diagonal). As an alternative, we propose to model the acoustic

factors for each of the mixtures as input to the next stage of the factor analyzer (i.e., the

i-Vector extractor). This is motivated by the assumption that the variation in the acoustic

factors contain the most important speaker dependent information. In this approach, we

essentially develop a two stage factor analysis scheme for speaker verification, where the

second stage (i-Vector extractor) utilizes the posterior mean and covariance matrices of

Page 154: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

133

the hidden variables (acoustic factors) of the first stage model. Later in Section 7.4.1,

we experimentally show that modeling the acoustic factors in this way provides superior

performance.

The proposed strategy is somewhat similar to the Deep Mixture of Factor Analyzers

(DMFA) approach [141], where the later stage of factor analyzer uses the posterior mean

of the latent factors obtained from the earlier stage as features. However, in the current

scenario, the second stage of the factor analyzer is trained at the utterance level, whereas

the first stage is trained at the frame level.

Proceeding with the above method, for an utterance u, the zero order statistics are

extracted as:

Nu(g) =∑

n∈s

γn(g), (7.17)

which follows the standard procedure [2]. Here, γn(g) is extracted as in Eq. (7.4) utilizing

model parameters Λ. Conventionally, the first order statistics are extracted as:

Fu(g) =∑

n∈s

γn(g)xn.

Using the proposed model, the first order statistics are extracted as:

Fu(g) =∑

n∈s

γn(g)ATg (xn − µg)

= ATg [Fu(g)−Nu(g)µg] = AT

g Fu(g), (7.18)

where ATg = M−1

g WTg for the isotropic model and M−1

g WTg Ψ−1

g for the diagonal model (using

appropriate definitions of Mg in each case). Also, Fu(g) represents the centralized first order

statistics computed using the model parameters Λ.

The remaining procedure for the i-Vector extractor/total variability matrix training fol-

lows the exact same principles as outlined in [28]. However, when the acoustic factors yn are

used as features for the i-Vector extractor, the mean and covariance for the UBM parameter

is set to (0, I), following the original definition of the term.

Page 155: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

134

7.2.5 Model interpretation

In order to gain further insight towards understanding the mechanisms of the proposed

method, we aim to compare the super-covariance matrices (covariance matrices obtained

from GMM mean super-vectors) in an i-Vector system using the conventional approach and

the proposed AFA integrated approach. This will illustrate the effect of using the acoustic

factors as features in the total variability model.

Using the total variability model, for a randomly chosen utterance u, the GMM super-

vector mu can be represented by,

mu = m0 + Twu, (7.19)

where m0 ∈ RMd is the speaker independent mean supervector (i.e., concatenated UBM

mean vectors µg), T is an Md×R rectangular matrix (R < Md) of low rank whose columns

span the so called total variability space [2], and wu ∈ RR is a standard normal random

vector, known as the total factors. The posterior mean vector of wu given an utterance

data is considered as an i-Vector. In this model, the covariance matrix of ms is given by,

B = TTT. Since, it is known that i-Vectors are effective lower dimensional representations

of the GMM supervectors, we are interested in observing this approximate super-covariance

matrix for specific mixture components. The training data and algorithms used for the

UBM and T matrix is provided in Section 7.3.4 and 7.3.5, respectively. A full-covariance

GMM-UBM is used in this analysis.

In Figure 7.3(a) the estimated super-covariance matrix for the first mixture is shown. In

other words, this is the first sub-matrix of B including d×d components from the upper left

corner. From Figure 7.3(a), we observe the following: i) the covariance matrix in this part

and the covariance of the UBM in Figure 7.3(c), are not the same, and thus a factor analysis

model in these two domains are not equivalent [28]; ii) a strong peak is observed near the

component (20, 20) of the matrix (this pattern is observed in other mixture blocks of the

Page 156: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

135

Table 7.1. UBM training list description for NIST SRE 2012. Number of files used indifferent categories are presented for both genders

Male Femaleknown/unknown 6000/4323 6000/4998

Tel/Int/Mic 6000/2967/1356 6000/2940/2058Clean/HVAC/crowd 4941/2687/2695 5166/2910/2922

Total 10323 10998

matrix B as well). This indicates that strong correlations are present in specific feature

components of the GMM supervectors, ms, collected over a large number of utterances.

When an AFA model (isotropic noise with q = 42) is utilized and the acoustic factors are

the inputs to the total variability model, the partial super-covariance matrix B is then shown

in Figure 7.3(b). Interestingly, this matrix does not contain any dominant peaks as observed

in Figure 7.3(a). This further justifies the inclusion of the first stage factor analyzer which

takes into account the correlation among feature coefficients (independent of the utterance)

and provides a de-correlated input (acoustic factors) to the second stage. These two stages

are thus complimentary in nature, and can be expected to provide superior results.

7.3 System description

The experiments performed in this work are based on the male trials of the NIST SRE 2012

evaluation. A standard i-Vector system [2] with a Gaussian PLDA [87] similar to our NIST

SRE 2012 submission [139] is used as a baseline system. Specific blocks of the baseline

system implementation and details of the proposed scheme are described below.

7.3.1 Voice activity detection

The VAD algorithm follows the method in [53], available through the open-source Voicebox

toolkit [143]. For interview recordings, VAD is performed on both interviewee (A) and

interviewer (B) channels, and speech segments detected in channel B are removed from

Page 157: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

136

channel A. Since channel B is usually corrupted by a noise floor to mask the interviewee

speech, spectral subtraction [122] is always performed before VAD on channel B. For channel

A, first the Signal to Noise Ratio (SNR) is estimated using a 2-mixture GMM trained on

segment energy. If the SNR is less than 18 dB, the audio channel is enhanced using spectral

subtraction [122] before application of VAD. Here, the noise power was estimated using the

method outlined in [125].

7.3.2 Feature extraction

We use 60 dimensional Mel-Frequency Cepstral Coefficients (MFCC) features. At first,

digital zeros are replaced by a uniformly distributed noise floor having a mean zero and am-

plitude 1.75−5. A 24 channel Mel-spaced filterbank is used and 19 components are retained.

The 60 dimensional features are obtained by including log-energy, delta and acceleration

coefficients using a 25 ms analysis window with 10 ms frame shift. Finally, the features are

processed through Cepstral Mean and Variance Normalization (CMVN) utilizing a 3-sec

sliding window.

7.3.3 Noisy file generation

Since our experiments are performed on the NIST SRE 2012 tasks, we artificially noised

our development dataset. We collected 10 HVAC noise files from www.freesound.org and

generated 10 crowd noise files by summing 500–800 NIST SRE utterances from both male

and female speakers. The noise partitioning is described in [139]. We employ our in-house

tools to generate the noisy files with a psophometric weighting1 method as suggested by

NIST. The active speech level is measured according to the ITU-T Recommendation P.56.

These noisy files are used for speaker enrollment, hyper-parameter estimation, and PLDA

training.

¹ Based on ITU-T Recommendation O.41. See: http://www.itu.int/rec/T-REC-O.41-198811-S.
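The following is a hedged sketch of additive noise mixing at a target SNR. The actual tools apply the psophometric (ITU-T O.41) weighting and the P.56 active speech level measurement noted above; here plain RMS power stands in for both, purely for illustration.

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Loop/trim the noise to the speech length, then scale it so the
    # speech-to-noise power ratio matches the requested SNR (in dB).
    noise = np.resize(np.asarray(noise, dtype=float), speech.shape)
    p_s = np.mean(np.asarray(speech, dtype=float) ** 2)  # stand-in for P.56 level
    p_n = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise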


7.3.4 UBM and AFA model training

Gender dependent 1024-mixture UBMs with full-covariance and the proposed AFA mod-

els are trained on telephone utterances selected from the Switchboard-II Phase 2 and 3,

Switchboard Cellular Part 1 and 2, and the NIST SRE 2004–06 enrollment data. Noisy files

containing HVAC and crowd noise, and SRE 2012 enrollment speaker data are also included

in the UBM. The UBM utterances are approximately balanced across: (i) clean vs. noisy, (ii)

telephone vs. interview/microphone, and (iii) known vs. unknown speakers. The number of

utterances used in UBM training from various data types is summarized in Table 7.1. We

employed data sub-sampling for fast UBM training [27, 26] to perform the experiments. From every block of 30 frames, 3 consecutive frames are selected and the remaining 27 are skipped, resulting in use of only 10% of the original dataset. In this way, the correlation among successive frames is retained. For EM training, the number of EM iterations per mixture-splitting stage starts at four and is gradually increased to 15 for the higher order mixtures.
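A minimal sketch of this frame sub-sampling scheme follows (the function name is an illustrative assumption): from every block of 30 frames, 3 consecutive frames are kept, so 10% of the data is used while short-time correlation among the retained frames is preserved.

import numpy as np

def subsample_frames(feats, block=30, keep=3):
    # Keep the first `keep` consecutive frames out of every `block` frames.
    idx = [i + j for i in range(0, feats.shape[0] - block + 1, block)
           for j in range(keep)]
    return feats[np.asarray(idx, dtype=int)]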

7.3.5 i-Vector extractor training

For training the i-Vector extractor, the UBM training dataset and additional SRE 2012

target speakers’ data are used (both clean and noisy versions). This corresponds to our

NIST SRE 2012 system [139]. Here, 600-dimensional i-Vectors are extracted using 5 EM

iterations. The i-Vectors are first mean normalized and then length normalized using radial

Gaussianization [87]. Linear Discriminant Analysis (LDA) projection is performed to further

reduce the i-Vector dimension to 150 before PLDA scoring.
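A minimal sketch of this i-Vector post-processing chain is given below, assuming the global mean vector and the LDA projection matrix (600 × 150) have already been estimated on the training data; names are illustrative.

import numpy as np

def postprocess_ivectors(w, mean, lda):
    # w: (N, 600) raw i-Vectors; mean: (600,); lda: (600, 150).
    w = w - mean                                       # mean normalization
    w = w / np.linalg.norm(w, axis=1, keepdims=True)   # length normalization [87]
    return w @ lda                                     # reduce to 150 dimensions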

7.3.6 PLDA classifier

Model training

In this work, we use a Gaussian PLDA model with a full-covariance residual noise [87] as

described in Section 3.4.9. According to this model, an $R$-dimensional i-Vector $\mathbf{w}_u$ extracted


from utterance u is expressed as:

$$\mathbf{w}_u = \mathbf{w}_0 + \boldsymbol{\Phi}\boldsymbol{\beta} + \mathbf{n}. \tag{7.20}$$

Here, $\mathbf{w}_0 \in \mathbb{R}^R$ is the speaker independent mean vector, $\boldsymbol{\Phi}$ is the $R \times N_{ev}$ low-rank matrix representing the speaker dependent basis functions or eigenvoices, $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is an $N_{ev} \times 1$ hidden variable, and $\mathbf{n} \in \mathbb{R}^R$ is a random vector representing the full-covariance residual noise. The model parameter $N_{ev}$ was set to 150. The data used for i-Vector extractor

training are utilized to train this PLDA model. No short duration utterances are included

in PLDA training as was the case in [139].

Scoring

The i-Vectors obtained from each enrollment speaker are first averaged so that one i-Vector

per speaker is obtained. The scoring is then performed as described in [17]. To determine

if i-Vectors $\mathbf{w}_i$ and $\mathbf{w}_j$ are obtained from the same speaker or not, we evaluate the following

likelihood ratio:

$$L_{i,j} = \frac{P(\mathbf{w}_i, \mathbf{w}_j \mid H_1)}{P(\mathbf{w}_i \mid H_0)\,P(\mathbf{w}_j \mid H_0)}. \tag{7.21}$$

The obtained scores are transformed using a compound log-likelihood ratio (LLR) trans-

formation as described in [144]. For this purpose, we set the target prior $P_{known} = 0.5$

and assume that all speakers are equally likely. We note that the compound LLR is used

only when individual system performances are reported. For fusion of multiple systems, the

compound LLR transformation is applied on the final fused scores.
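As a sketch of how Eq. (7.21) can be evaluated in closed form for a Gaussian PLDA model, the following uses the well-known two-covariance formulation with between-class covariance B = ΦΦᵀ and total covariance T = B + N, where N is the full residual noise covariance; the additive constant is dropped since it is common to all trials and absorbed by calibration. This is an illustrative sketch under these assumptions, not the exact system implementation.

import numpy as np

def plda_llr(wi, wj, Phi, N):
    # wi, wj: post-processed i-Vectors; Phi: eigenvoice matrix; N: residual cov.
    B = Phi @ Phi.T                       # between-speaker covariance
    T = B + N                             # total covariance
    Tinv = np.linalg.inv(T)
    S = np.linalg.inv(T - B @ Tinv @ B)
    Q = Tinv - S                          # same-trial quadratic term
    P = Tinv @ B @ S                      # cross-trial term
    return 0.5 * float(wi @ Q @ wi + wj @ Q @ wj + 2.0 * wi @ P @ wj)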

7.4 Evaluation results

The experiments performed in this chapter are based on the male portion of the NIST SRE

2012 core-extended trials. We use the SRE 2012 detection cost functions (DCF): $C_{primary}$ and $\min C_{primary}$ ($P_{known} = 0.5$) [16], and the % Equal Error Rate (EER) metric for evaluating the


Table 7.2. Common evaluation conditions in NIST SRE 2012

No.   Train                       Test
1     Multiple speech segments    Clean interview speech
2     Multiple speech segments    Clean phone call speech
3     Multiple speech segments    Noisy interview speech
4     Multiple speech segments    Noisy phone call speech
5     Multiple speech segments    Phone call speech collected in noise

Table 7.3. Comparison of system performance when the proposed models are used as GMMs vs. AFAs for the i-Vector system. Results are shown for five NIST SRE 2012 common conditions of the extended trials (male)

Model               Method   %EER                                  minC_primary
                             c-1    c-2    c-3    c-4    c-5       c-1    c-2    c-3    c-4    c-5
GMM-diag            GMM      3.244  2.817  3.127  3.113  3.228     0.264  0.312  0.129  0.271  0.307
GMM-full            GMM      3.302  3.714  3.328  3.770  4.142     0.273  0.378  0.137  0.318  0.354
ML-AFA_iso (q=42)   GMM      3.375  3.924  3.261  3.948  4.348     0.270  0.392  0.130  0.327  0.381
ML-AFA_diag (q=42)  GMM      3.522  3.717  3.213  3.946  4.143     0.271  0.390  0.125  0.317  0.368
ML-AFA_iso (q=42)   AFA      3.200  2.808  3.190  2.906  3.193     0.240  0.302  0.118  0.259  0.293
ML-AFA_diag (q=42)  AFA      2.993  2.655  3.242  2.928  3.027     0.221  0.291  0.107  0.257  0.282

systems. It has been argued that the EER metric is not meaningful when known non-target

speakers are involved during test [144]. However, we still report this performance metric as

it is a widely known and understood measure in speaker verification. We report these results

on the five common conditions of the NIST SRE 2012 extended trials [16]. Definitions of

these common conditions are provided in Table 7.2. Note that the clean conditions also

contain transmission channel/microphone variability.

7.4.1 Effect of the modeling method

We are interested in analyzing system performance of an AFA-UBM model when it is used

as a GMM with the parameters $\Lambda = \{\pi_g, \boldsymbol{\mu}_g, \mathbf{C}_g\}$. This means that the effect of the AFA

modeling will only be observed in the way the covariance matrix is restricted in Eq. (7.3).

The results of these experiments are summarized in Table 7.3. We note here that the


Table 7.4. Performance comparison between baseline and the proposed systems in NIST SRE 2012 extended trials condition-1

UBM model    %EER     minC_primary   C_primary
GMM-diag     3.2428   0.2642         0.3385
GMM-full     3.3020   0.2729         0.3553

Method         q    Absolute/%relative performance
                    %EER          minC_primary   C_primary
ML-AFA_iso     42   3.298/-1.7    0.245/7.3      0.336/0.7
               48   2.779/14.3    0.241/8.6      0.334/1.4
               54   2.874/11.4    0.236/10.8     0.326/3.8
ML-AFA_diag    42   2.993/7.7     0.221/16.5     0.316/6.6
               48   3.008/7.2     0.242/8.3      0.339/-0.2
               54   2.951/9.0     0.237/10.4     0.331/2.2

Table 7.5. Performance comparison between baseline and the proposed systems in NIST SRE 2012 extended trials condition-2

UBM model    %EER     minC_primary   C_primary
GMM-diag     2.8190   0.3122         0.5482
GMM-full     3.7135   0.3776         0.5863

Method         q    Absolute/%relative performance
                    %EER          minC_primary   C_primary
ML-AFA_iso     42   2.642/6.3     0.304/2.5      0.541/1.4
               48   2.469/12.4    0.286/8.3      0.529/3.5
               54   2.596/7.9     0.285/8.7      0.529/3.5
ML-AFA_diag    42   2.655/5.8     0.291/6.7      0.536/2.2
               48   2.632/6.6     0.289/7.3      0.530/3.4
               54   2.553/9.4     0.278/10.9     0.532/3.0

full-covariance GMM-UBM based system does not perform as well as the diagonal GMM-

UBM. The AFA-UBM models utilized as GMMs (rows 4–5 in Table 7.3) are seen to perform close to the full-covariance GMM-UBM. However, when the acoustic factors are utilized for the i-Vector modeling (noted as the AFA method in Table 7.3), we observe a significant

improvement in system performance. This confirms our original motivation for using the

acoustic factors as inputs to the i-Vector extractor.


Table 7.6. Performance comparison between baseline and the proposed systems in NIST SRE 2012 extended trials condition-3

UBM model    %EER     minC_primary   C_primary
GMM-diag     3.1273   0.1299         0.1421
GMM-full     3.3280   0.1367         0.1460

Method         q    Absolute/%relative performance
                    %EER          minC_primary   C_primary
ML-AFA_iso     42   3.118/0.3     0.123/5.5      0.134/5.7
               48   3.173/-1.5    0.113/13.4     0.123/13.4
               54   3.178/-1.6    0.114/12.4     0.124/13.0
ML-AFA_diag    42   3.242/-3.7    0.107/17.9     0.126/11.3
               48   3.252/-4.0    0.128/1.7      0.145/-2.3
               54   3.122/0.2     0.112/13.6     0.125/11.8

Table 7.7. Performance comparison between baseline and the proposed systems in NIST SRE 2012 extended trials condition-4

UBM model    %EER     minC_primary   C_primary
GMM-diag     3.1130   0.2705         0.4488
GMM-full     3.7704   0.3175         0.4841

Method         q    Absolute/%relative performance
                    %EER          minC_primary   C_primary
ML-AFA_iso     42   3.007/3.4     0.260/4.0      0.445/0.9
               48   2.952/5.2     0.265/2.2      0.443/1.4
               54   3.007/3.4     0.266/1.8      0.452/-0.6
ML-AFA_diag    42   2.928/6.0     0.257/5.1      0.450/-0.2
               48   3.119/-0.2    0.256/5.4      0.439/2.2
               54   2.757/11.4    0.247/8.6      0.437/2.6

7.4.2 Variation of acoustic factor dimension

In this experiment, we intend to observe the effect of changing the acoustic factor dimension

on overall system performance. For both model types (ML-AFA_iso and ML-AFA_diag), we consider acoustic factor dimensions of q = 42, 48 and 54. These parameter values correspond to 70, 80 and 90% of the original 60 feature dimensions, respectively. The results obtained from both the baseline and the proposed systems across the five NIST SRE 2012 common conditions are provided in Tables 7.4–7.8.


Table 7.8. Performance comparison between baseline and the proposed systems in NIST SRE 2012 extended trials condition-5

UBM model    %EER     minC_primary   C_primary
GMM-diag     3.2276   0.3072         0.5941
GMM-full     4.1415   0.3537         0.6243

Method         q    Absolute/%relative performance
                    %EER          minC_primary   C_primary
ML-AFA_iso     42   3.080/4.6     0.294/4.4      0.582/2.0
               48   2.848/11.8    0.275/10.6     0.571/3.9
               54   3.105/3.8     0.263/14.3     0.575/3.3
ML-AFA_diag    42   3.027/6.2     0.282/8.1      0.584/1.7
               48   3.039/5.8     0.281/8.6      0.574/3.4
               54   2.850/11.7    0.269/12.4     0.578/2.7

The results in Tables 7.4–7.8 clearly demonstrate that the proposed technique, utilizing an AFA-UBM instead of the conventional GMM-UBM, provides more robust speaker recognition performance across conditions including clean and noisy test utterances. Except for condition-3 (i.e., the noisy interview case), the proposed methods provide significantly superior performance compared to the baseline system in all three performance metrics. In general, relative improvements on the order of 5–10% are obtained using the proposed methods. This improved robustness in both clean and noisy conditions justifies our motivation for utilizing the ML-AFA models in place of conventional GMM-UBMs, especially since the proposed models attempt to remove the noise in an earlier stage of the system (i.e., within acoustic feature models rather than utterance models).

From the performance evaluations of Tables 7.4–7.8, it is apparent that a single AFA model parameter (acoustic factor dimension q) or model type (isotropic or diagonal) does not always provide the best result in all conditions. This indicates that an optimal selection of the parameter q in each mixture can provide further benefits [145, 146]. In our previous work [29], we attempted to derive an automatic selection of the parameter q using the AFA framework proposed in [28].


Table 7.9. Fusion performance of baseline and proposed systems. Absolute and %relative performance is shown for fusion systems

ID   System                 minC_primary                        C_primary
                            c-1    c-2    c-3    c-4    c-5     c-1    c-2    c-3    c-4    c-5
1    GMM-diag               0.269  0.327  0.132  0.301  0.333   0.345  0.558  0.143  0.456  0.603
2    GMM-full               0.280  0.389  0.139  0.354  0.378   0.363  0.600  0.148  0.496  0.640
3    ML-AFA_iso (q=48)      0.244  0.304  0.115  0.298  0.300   0.344  0.540  0.129  0.452  0.582
4    ML-AFA_diag (q=48)     0.245  0.301  0.130  0.282  0.305   0.348  0.540  0.149  0.450  0.585
Fusion    LR (Abs.)         0.238  0.276  0.117  0.274  0.276   0.298  0.471  0.121  0.393  0.516
(1–4)     CLR* (Abs.)       0.231  0.257  0.109  0.236  0.240   0.258  0.416  0.116  0.346  0.459
          CLR (% Rel.)      14.1   21.4   17.4   21.6   27.9    25.2   25.4   18.9   24.1   23.9

* CLR indicates compound likelihood ratio transformed scores.

7.4.3 System fusion and calibration

In order to test if the proposed systems can provide complementary information, we perform

fusion of several systems using a linear logistic regression method obtained from the Bosaris

toolkit [147]. An independent development test set is utilized for training the calibration

and fusion parameters. The data-set referred to as the Eval-Test in [24] is used here as

the development test set. This data-set contains utterances from the enrollment speakers so

that target and known non-target trials are present. Also, held out speaker data is included

to provide unknown non-target trials. Clean and noisy versions of telephone, interview

and microphone recordings are included in this data-set (noise types are HVAC and crowd,

following SRE 2012 test data). For training fusion and calibration, we used 15 iterations

and an effective prior of 0.001. We select the following systems for fusion: i) baseline

with diagonal covariance UBM, ii) baseline with full-covariance UBM, iii) ML-AFA_iso (q = 48), and iv) ML-AFA_diag (q = 48). The fusion performance is summarized in Table 7.9.
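A hedged sketch of linear logistic-regression score fusion in the spirit of the Bosaris toolkit is shown below: one weight per system plus an offset is learned on the development set, and the fused score is a weighted sum of the individual system scores. Unlike the actual toolkit, this sketch does not apply the effective-prior weighting used in our system, so it is illustrative only.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_scores, labels):
    # dev_scores: (n_trials, n_systems); labels: 1 for target, 0 for non-target.
    lr = LogisticRegression(C=1e4).fit(dev_scores, labels)
    return lr.intercept_[0], lr.coef_[0]

def fuse(scores, bias, weights):
    # The fused score behaves like a calibrated log-likelihood ratio.
    return bias + scores @ weights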

From these results, we observe the complementary nature of the proposed and baseline

systems, yielding significant relative improvements of 20–25% over the baseline system performance with respect to the primary cost metric $C_{primary}$. For the metric $\min C_{primary}$,


very similar improvements are observed for all five conditions. Since the $C_{primary}$ cost function is the most important metric for the NIST SRE 2012, the results obtained are quite encouraging. Fusion of systems 1–3 and of systems 1, 2, 4 in general provided better results compared to fusing all four systems. This indicates that the systems ML-AFA_iso (q = 48) and ML-AFA_diag (q = 48) may not fuse well with each other in all conditions, but can provide significant improvement when individually fused with the baseline systems. However, improvements obtained by fusing all four systems are more uniform. For example, in common condition 3, relative improvements in $\min C_{primary}$ from Fusion(1–3) and Fusion(1,2,4) are 19.8% and 10.6%, respectively, while Fusion(1–4) yields 17.4%. Fusion of all four systems always improves the performance by at least 14%.

7.5 Conclusions

In this chapter, we have developed an AFA framework towards a generative mixture model as an alternative to a conventional GMM based UBM for speaker verification. The proposed modeling scheme was designed to iteratively learn a limited number of dominant feature sub-spaces in different mixture components using clean and noisy training data. Two variations of the proposed model were investigated, one with an isotropic and the other with a diagonal residual noise assumption. The method was integrated within an i-Vector system framework where the hidden variables of the proposed model (i.e., acoustic factors) were used as input for total variability modeling. The interpretation and implications of the proposed method were discussed and analyzed. Extensive experiments were performed on both clean and noisy test conditions from the NIST SRE 2012 extended trials. The proposed methods were found to be superior in multiple noisy conditions in SRE 2012, providing a significant gain in performance when fusion of multiple systems was considered.


CHAPTER 8

CONCLUSIONS

The work presented in this dissertation has focused on effective acoustic modeling for robust

speaker verification. Particular attention was given to robustness in transmission channel and

noise mismatched conditions. As discussed in Chapters 3 and 4, acoustic modeling begins

with UBM training, which holds a central role in speaker characterization in the GMM

supervector space. Thus, our work in this dissertation has been related to the UBM in a

number of specific ways. The contributions of this research are summarized in the following

section, followed by a discussion of possible directions for future work. Throughout this

work, the standardized datasets and tasks provided by NIST evaluations have been utilized

for experimental evaluations.

8.1 Dissertation contributions

8.1.1 Study on UBM training

This part of the work was detailed in Chapter 4 and published in [27, 26]. Here, an organized

approach was taken to determine how to effectively select the training materials for UBM

training. Most research groups utilize a very large amount of data which requires significant

computational resources. In this chapter, elaborate experiments have been performed to an-

alyze the relationship between UBM training data variability and overall speaker verification

performance in the context of a GMM-UBM system [19]. In order to reduce the dataset size,

four feature/frame sub-sampling schemes were presented along with their trade-offs in com-

putational benefits. Using only a fraction of the original data amount, these sub-sampling

methods were shown to reduce the computation time to as low as 15%. Among the novel


techniques, an intelligent feature frame sub-sampling algorithm was proposed that automati-

cally selects a feature frame based on its variability compared to previously observed frames.

The idea here was to extract the most versatile samples of the feature vectors from an utter-

ance. This method was experimentally shown to outperform the baseline system that uses

the full dataset. Discussion on these sub-sampling methods can be found in Section 4.5.

This work also explored the open question on selectively using speaker data for UBM

construction. The underlying assumption was that similar speakers in the UBM training

data can create a speaker/demographic dependent bias. Since the UBM requires that the

model be a “universal” model, the motivation was to remove such imperfections. In this

line of work, two effective speaker selection methods were proposed and evaluated. These

methods were described in Sections 4.7.1 and 4.7.2. The data selection methods presented in

this chapter were evaluated on the telephone tasks of the NIST SRE 2008 corpora [14]. The

evaluation results show that selectively limiting the speech dataset size and speaker count

during UBM training can achieve effective speaker verification performance while providing

a significant gain in computational processing.

8.1.2 Acoustic Factor Analysis (AFA): UBM based linear transforms

In this section of the dissertation, alternate acoustic modeling strategies were considered,

with the aim to improve robustness to channel degradation. The fundamentals of this model-

ing method were detailed in Chapter 5 and are published in [30, 29, 28]. The proposed method

was based on the observation that traditional cepstral features are somewhat confined to a

lower dimensional manifold in the acoustic space, with high correlation and moderate cross-

correlation among the coefficients. This covariance structure supported the use of a factor

analysis of acoustic features derived from each mixture component of a well trained full-

covariance UBM. The proposed model was shown to yield linear mixture-dependent feature

transformation matrices that performed dimensionality reduction, feature de-correlation,


variance normalization and enhancement, concurrently. These concepts were discussed in

Sections 5.3.2 and 5.3.3. Finally, the AFA transformation strategy was effectively integrated

within a conventional i-Vector and PLDA based speaker verification system. This integration

procedure was shown to affect only the Baum-Welch statistics extraction procedure. The

algorithmic aspects of this technique were discussed in Section 5.4. The superiority of the

proposed method was demonstrated by the experiments performed using the NIST SRE 2010

extended trials. Five core conditions were used including various microphone and telephone

train/test utterances. Measurable improvements over the baseline system were demonstrated

in terms of EER, minDCF performance metrics and DET curves (See Section 5.6).

8.1.3 Feature domain channel compensation within UBM mixtures

The work in this part, detailed in Chapter 6, stems from the AFA framework that utilizes

mixture dependent linear transformations as a means to accomplish channel compensation.

Various traditional linear statistical methods were utilized to operate in different mixture

components of a UBM in order to improve robustness to channel degradation. Mixture-

localized formulations of PCA, LDA, whitening and NAP were described in this chapter.

A new transformation termed Nuisance Attribute Elimination (NAE), discussed in Sec-

tion 6.2.7, was also proposed. The NAE technique was inspired by NAP with the goal of

achieving feature dimensionality reduction.

The key aspect of the proposed framework was that it provides integrated feature processing within the modeling, similar to AFA. Instead of regenerating the acoustic features,

mixture-localized transforms were applied to the UBM and the first-order Baum-Welch statis-

tics, and thus, integrated within a standard i-Vector and PLDA based speaker verification

system. Similar to Chapter 5, the experimental evaluations were performed on NIST SRE

2010. The telephone trials were used. The results demonstrated the effectiveness of the


proposed front-end mixture-dependent channel compensation framework. Notable perfor-

mance improvements compared to the baseline system were obtained using an LDA based

transformation.

8.1.4 ML - Acoustic Factor Analysis: An alternative to the UBM

In the final part of the dissertation, described in Chapter 7, the Acoustic Factor Analysis

(AFA) framework was further developed towards a generative mixture model that can be used as an alternative to a conventional GMM-UBM. With the aim of providing improved performance in additive noise scenarios, this mixture modeling method was designed to iteratively learn a limited number of dominant feature sub-spaces in different mixture

components using clean and noisy training data. Two variations of the proposed model were

investigated: one employed an isotropic residual noise assumption, while the other employed a diagonal one. The method was incorporated within an i-Vector framework such that the hidden variables of the proposed model (i.e., acoustic factors) were used as the input features for total variability modeling. The formulation and

implementation details of the proposed techniques were described in Section 7.2. In addition,

the interpretation and implications of the proposed method were discussed and analyzed in

Section 7.2.5. Rigorous experimental evaluations were performed on both clean and noisy

test conditions from the NIST SRE 2012 extended trials. The proposed models were found

to be superior compared to a traditional UBM in multiple noisy test conditions provided

in SRE 2012. A significant gain in performance was also obtained when fusion of multiple systems was considered.

This dissertation has therefore contributed meaningful advancements in speaker modeling, compensation, and overall system design for robust speaker recognition and verification.


8.2 Directions for future work

The methods presented in this dissertation can be extended in a number of directions.

Further analysis can be conducted on effective data selection and parameter optimization

for UBM training. The exact relationship between the number of mixtures used and data

amount/variability is still unknown. Thus, finding the optimal number and best location

of the mixture components could also be an attractive direction to consider. Also, data

sub-sampling methods can be studied for training the i-Vector extractor (total variability

matrix [2]) or other hyper-parameter training (e.g., JFA [22]).

As mentioned earlier, the AFA strategy leads the way to a two-stage modeling approach for speaker recognition. Also, the ML-AFA strategy can be connected to deep modeling methods such as Deep Mixtures of Factor Analyzers (DMFA) [141]. Methods based on deep learning have been shown to provide superior results in speech recognition [148], and initial work on speaker recognition using such models is already under way [149]. The AFA methods

demonstrate that a multi-layered model can be beneficial for robust speaker recognition and

different forms of multi-stage models could also be studied. Also, larger dimensional acoustic

features (e.g., filter-bank energy) may be used, possibly extracting more speaker dependent

information from a speech segment. In the current systems, such large dimensional features

will increase the computational load on the i-Vector extraction process, while using mixture-

wise dimensionality reduction as in the AFA based methods can mitigate this issue. The computational benefits achieved by the standard AFA method were discussed in Section 5.6.6,

where further research is possible.

The ML-AFA strategy could also be modified to be a supervised model, incorporating

speaker identity information leading to a discriminative AFA. We already demonstrated in

Chapter 6 that mixture-dependent LDA could be effective for channel compensation. In a

modified ML-AFA approach, the acoustic features can be processed by each mixture of the

AFA-UBM and the most speaker discriminative coefficients can be retained before forming


the supervector. In the next stages, an i-Vector can be extracted and channel compensation

performed again. Larger dimensional features again can provide benefit in this scenario.


APPENDIX A

CLASSICAL MAP AND THE GMM-UBM APPROACH

The generalized MAP adaptation framework for the GMM supervector was given in Eq. (3.28). The equation is repeated below for convenience:

$$\mathbf{m}_s = \mathbf{m}_0 + \mathbf{D}\mathbf{z}_s. \tag{A.1}$$

We will now relate this generalized model to the MAP adaptation method proposed in [19],

and described as the GMM-UBM approach in Chapter 3. For an utterance obtained from

speaker s, we define the centralized first order statistic for the g-th mixture as:

$$\tilde{\mathbf{F}}_s(g) = \sum_{n=1}^{T} \gamma_n(g)\,(\mathbf{x}_n - \boldsymbol{\mu}_g) = \mathbf{F}_s(g) - N_s(g)\,\boldsymbol{\mu}_g. \tag{A.2}$$

The original zeroth order statistic $N_s(g)$ and first order statistic $\mathbf{F}_s(g)$ were defined by Eq. (3.13) and Eq. (3.14), respectively. As shown in Proposition 1 of [80], the posterior distribution of the hidden variables $\mathbf{z}_s$ is Gaussian with the mean vector

$$E[\mathbf{z}_s|\mathcal{X}] = \mathbf{L}_s^{-1}\mathbf{D}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\tilde{\mathbf{F}}_s, \tag{A.3}$$

and covariance matrix $\mathbf{L}_s^{-1}$, where

$$\mathbf{L}_s = \mathbf{I} + \mathbf{D}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\mathbf{N}_s\mathbf{D}, \tag{A.4}$$

where $\boldsymbol{\Sigma}$, $\tilde{\mathbf{F}}_s$ and $\mathbf{N}_s$ are the concatenations of the (diagonal) UBM covariance matrices $\boldsymbol{\Sigma}_g$, the centralized first order statistics $\tilde{\mathbf{F}}_s(g)$, and the zeroth order statistics $N_s(g)$, given by:

$$\boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_1 & & \mathbf{0} \\ & \ddots & \\ \mathbf{0} & & \boldsymbol{\Sigma}_M \end{bmatrix}, \quad \tilde{\mathbf{F}}_s = \begin{bmatrix} \tilde{\mathbf{F}}_s(1) \\ \vdots \\ \tilde{\mathbf{F}}_s(M) \end{bmatrix} \quad \text{and} \quad \mathbf{N}_s = \begin{bmatrix} N_s(1)\,\mathbf{I} & & \mathbf{0} \\ & \ddots & \\ \mathbf{0} & & N_s(M)\,\mathbf{I} \end{bmatrix}.$$


Thus, given the training data $\mathcal{X}$, the adapted mean supervector in this method can be obtained by

$$\mathbf{m}_s = \mathbf{m}_0 + \mathbf{D}\,E[\mathbf{z}_s|\mathcal{X}]. \tag{A.5}$$

As discussed in [22], instead of training $\mathbf{D}$, if we set

$$\mathbf{D}^2 = \frac{1}{r}\boldsymbol{\Sigma}, \tag{A.6}$$

the MAP adaptation formulation given in Eq. (3.19) [19] arises, where r is the relevance

factor used in computing αg in Eq. (3.21). This can be shown as follows. In this special

case, we have

$$\mathbf{L}_s = \mathbf{I} + \mathbf{D}^2\boldsymbol{\Sigma}^{-1}\mathbf{N}_s = \mathbf{I} + \frac{1}{r}\mathbf{N}_s. \tag{A.7}$$

Thus, the adapted mean supervector is given by

$$\begin{aligned}
\mathbf{m}_s &= \mathbf{m}_0 + \mathbf{D}\,E[\mathbf{z}_s|\mathcal{X}] \\
&= \mathbf{m}_0 + \left(\mathbf{I} + \frac{1}{r}\mathbf{N}_s\right)^{-1}\mathbf{D}^2\boldsymbol{\Sigma}^{-1}\tilde{\mathbf{F}}_s \\
&= \mathbf{m}_0 + \left(\mathbf{I} + \frac{1}{r}\mathbf{N}_s\right)^{-1}\left(\frac{1}{r}\mathbf{I}\right)\tilde{\mathbf{F}}_s \\
&= \mathbf{m}_0 + \frac{1}{r}\left(\mathbf{I} + \frac{1}{r}\mathbf{N}_s\right)^{-1}\tilde{\mathbf{F}}_s. \tag{A.8}
\end{aligned}$$

Since $\mathbf{N}_s$ is diagonal and has a fixed value for each mixture, this equation can be written for the g-th mixture component as

$$\mathbf{m}_s[g] = \hat{\boldsymbol{\mu}}_g = \mathbf{m}_0[g] + \frac{1}{r}\left(1 + \frac{N_s(g)}{r}\right)^{-1}\tilde{\mathbf{F}}_s(g). \tag{A.9}$$

Substituting the value of $\tilde{\mathbf{F}}_s(g)$ and replacing $\mathbf{m}_0[g]$ by $\boldsymbol{\mu}_g$, we have

$$\begin{aligned}
\hat{\boldsymbol{\mu}}_g &= \boldsymbol{\mu}_g + \frac{1}{N_s(g) + r}\sum_{n=1}^{T}\gamma_n(g)\,(\mathbf{x}_n - \boldsymbol{\mu}_g) \\
&= \boldsymbol{\mu}_g + \frac{1}{N_s(g) + r}\left(\sum_{n=1}^{T}\gamma_n(g)\,\mathbf{x}_n - N_s(g)\,\boldsymbol{\mu}_g\right) \\
&= \frac{N_s(g)}{N_s(g) + r}\left(\frac{1}{N_s(g)}\sum_{n=1}^{T}\gamma_n(g)\,\mathbf{x}_n\right) + \frac{r}{N_s(g) + r}\,\boldsymbol{\mu}_g \\
&= \frac{N_s(g)}{N_s(g) + r}\left(\frac{1}{N_s(g)}\sum_{n=1}^{T}\gamma_n(g)\,\mathbf{x}_n\right) + \left(1 - \frac{N_s(g)}{N_s(g) + r}\right)\boldsymbol{\mu}_g \\
&= \alpha_g E_g[\mathbf{x}_n] + (1 - \alpha_g)\,\boldsymbol{\mu}_g. \tag{A.10}
\end{aligned}$$

Thus, we obtain the same MAP adaptation formulation as in Eq. (3.19) using the linear

model of Eq. (3.28).
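As a numeric illustration of the result in Eq. (A.10), a minimal sketch of the relevance-MAP mean update follows; the adapted mean interpolates between the posterior-weighted data mean and the UBM mean with weight alpha_g = N_s(g)/(N_s(g) + r). The function name and the default relevance factor are illustrative assumptions.

import numpy as np

def map_adapt_mean(mu_ubm, gamma, X, r=16.0):
    # gamma: (T,) posterior probabilities of this mixture; X: (T, d) features.
    N = gamma.sum()                               # zeroth-order statistic N_s(g)
    Ex = (gamma[:, None] * X).sum(axis=0) / N     # posterior-weighted data mean
    alpha = N / (N + r)                           # adaptation coefficient alpha_g
    return alpha * Ex + (1.0 - alpha) * mu_ubm    # Eq. (A.10)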


APPENDIX B

EM ALGORITHM FOR AFA WITH UNCORRELATED NOISE

Given the mixture AFA model, or more generally, the mixture of factor analyzers model

from Eq. (5.1), we first obtain the conditional probability density function (PDF) of $\mathbf{x}_n$ given the latent variables $\mathbf{y}_n$ and the prior PDF of $\mathbf{y}_n$. Here, we assume only one mixture component at this stage.

$$p(\mathbf{x}_n|\mathbf{y}_n) = \frac{\exp\left[-\frac{1}{2}(\mathbf{x}_n - \mathbf{W}\mathbf{y}_n - \boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Psi}^{-1}(\mathbf{x}_n - \mathbf{W}\mathbf{y}_n - \boldsymbol{\mu})\right]}{(2\pi)^{d/2}\,|\boldsymbol{\Psi}|^{\frac{1}{2}}},$$

$$p(\mathbf{y}_n) = (2\pi)^{-q/2}\exp\left(-\frac{1}{2}\mathbf{y}_n^{\mathrm{T}}\mathbf{y}_n\right).$$

The model covariance matrix is given by $\mathbf{C} = \boldsymbol{\Psi} + \mathbf{W}\mathbf{W}^{\mathrm{T}}$. Thus, the data model is:

$$p(\mathbf{x}_n) = \int p(\mathbf{x}_n|\mathbf{y}_n)\,p(\mathbf{y}_n)\,d\mathbf{y}_n = \frac{\exp\left[-\frac{1}{2}(\mathbf{x}_n - \boldsymbol{\mu})^{\mathrm{T}}\mathbf{C}^{-1}(\mathbf{x}_n - \boldsymbol{\mu})\right]}{(2\pi)^{d/2}\,|\mathbf{C}|^{\frac{1}{2}}}. \tag{B.1}$$

Now, the posterior probability of the hidden variables $\mathbf{y}_n$ is:

$$p(\mathbf{y}_n|\mathbf{x}_n) = \frac{p(\mathbf{x}_n|\mathbf{y}_n)\,p(\mathbf{y}_n)}{p(\mathbf{x}_n)} = \frac{(2\pi)^{-q/2}}{|\mathbf{C}|^{-\frac{1}{2}}\,|\boldsymbol{\Psi}|^{\frac{1}{2}}}\exp\left[-\frac{1}{2}\left(\mathbf{y}_n - \mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}\boldsymbol{\Psi}^{-1}(\mathbf{x}_n - \boldsymbol{\mu})\right)^{\mathrm{T}}\mathbf{M}\left(\mathbf{y}_n - \mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}\boldsymbol{\Psi}^{-1}(\mathbf{x}_n - \boldsymbol{\mu})\right)\right], \tag{B.2}$$

where

$$\mathbf{M} = \mathbf{I}_q + \mathbf{W}^{\mathrm{T}}\boldsymbol{\Psi}^{-1}\mathbf{W}. \tag{B.3}$$

It can be shown that $|\mathbf{M}| = |\boldsymbol{\Psi}^{-1}\mathbf{C}| = |\mathbf{I}_d + \boldsymbol{\Psi}^{-1}\mathbf{W}\mathbf{W}^{\mathrm{T}}|$. Thus, from Eq. (B.2) we observe that $\mathbf{M}^{-1}$ is the posterior covariance of $\mathbf{y}_n$, providing its Gaussian PDF:

$$p(\mathbf{y}_n|\mathbf{x}_n) = \mathcal{N}\left(\mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}\boldsymbol{\Psi}^{-1}(\mathbf{x}_n - \boldsymbol{\mu}),\; \mathbf{M}^{-1}\right). \tag{B.4}$$


Thus, the first and second order moments of $\mathbf{y}_n$ given $\mathbf{x}_n$ are given by:

$$\langle \mathbf{y}_n \rangle = \mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}\boldsymbol{\Psi}^{-1}(\mathbf{x}_n - \boldsymbol{\mu}), \tag{B.5}$$

$$\langle \mathbf{y}_n\mathbf{y}_n^{\mathrm{T}} \rangle = \mathbf{M}^{-1} + \langle \mathbf{y}_n \rangle \langle \mathbf{y}_n \rangle^{\mathrm{T}}. \tag{B.6}$$

For a mixture AFA model, these moments will have different values for each mixture component. We denote these as $\langle \mathbf{y}_n|g \rangle$ and $\langle \mathbf{y}_n\mathbf{y}_n^{\mathrm{T}}|g \rangle$, respectively. For a mixture of factor analyzers, the complete-data log likelihood plus the Lagrangian term is given by

$$\begin{aligned}
\langle \mathcal{L}_C \rangle = \sum_{n=1}^{N}\sum_{g=1}^{M}\gamma_n(g)\Big[ &\ln \pi_g - \frac{1}{2}\ln|\boldsymbol{\Psi}_g| - \frac{1}{2}\mathrm{tr}\big(\langle \mathbf{y}_n\mathbf{y}_n^{\mathrm{T}}|g \rangle\big) - \frac{1}{2}\mathrm{tr}\big(\boldsymbol{\Psi}_g^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_g)(\mathbf{x}_n - \boldsymbol{\mu}_g)^{\mathrm{T}}\big) \\
&+ \langle \mathbf{y}_n|g \rangle^{\mathrm{T}}\mathbf{W}_g^{\mathrm{T}}\boldsymbol{\Psi}_g^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_g) - \frac{1}{2}\mathrm{tr}\big(\mathbf{W}_g^{\mathrm{T}}\boldsymbol{\Psi}_g^{-1}\mathbf{W}_g\langle \mathbf{y}_n\mathbf{y}_n^{\mathrm{T}}|g \rangle\big)\Big] + \lambda\left(\sum_{g=1}^{M}\pi_g - 1\right). \tag{B.7}
\end{aligned}$$

We note that the last term, involving the Lagrange multiplier $\lambda$, is required to constrain the mixture weights $\pi_g$ to sum to unity. Differentiating Eq. (B.7) with respect to $\pi_g$ and $\lambda$, setting the derivatives to zero, and solving the resulting equations provides the new value of the model parameter $\pi_g$:

$$\pi_g = \frac{1}{N}\sum_{n=1}^{N}\gamma_n(g). \tag{B.8}$$

Next, maximizing $\langle \mathcal{L}_C \rangle$ with respect to $\boldsymbol{\mu}_g$, we obtain

$$\boldsymbol{\mu}_g = \frac{\sum_{n=1}^{N}\gamma_n(g)\left(\mathbf{x}_n - \mathbf{W}_g\langle \mathbf{y}_n|g \rangle\right)}{\sum_{n=1}^{N}\gamma_n(g)}. \tag{B.9}$$

Maximizing with respect to $\mathbf{W}_g$, we obtain its update equation:

$$\mathbf{W}_g = \left[\sum_{n=1}^{N}\gamma_n(g)\,(\mathbf{x}_n - \boldsymbol{\mu}_g)\langle \mathbf{y}_n|g \rangle^{\mathrm{T}}\right]\left[\sum_{n=1}^{N}\gamma_n(g)\,\langle \mathbf{y}_n\mathbf{y}_n^{\mathrm{T}}|g \rangle\right]^{-1}. \tag{B.10}$$


Finally, differentiating $\langle \mathcal{L}_C \rangle$ with respect to $\boldsymbol{\Psi}_g^{-1}$ gives

$$\boldsymbol{\Psi}_g = \frac{1}{N\pi_g}\sum_{n=1}^{N}\gamma_n(g)\Big[(\mathbf{x}_n - \boldsymbol{\mu}_g)(\mathbf{x}_n - \boldsymbol{\mu}_g)^{\mathrm{T}} - 2(\mathbf{x}_n - \boldsymbol{\mu}_g)\langle \mathbf{y}_n|g \rangle^{\mathrm{T}}\mathbf{W}_g^{\mathrm{T}} + \mathbf{W}_g\langle \mathbf{y}_n\mathbf{y}_n^{\mathrm{T}}|g \rangle\mathbf{W}_g^{\mathrm{T}}\Big]. \tag{B.11}$$

These solutions are very similar to those obtained in the PPCA case, except for the noise covariance term $\boldsymbol{\Psi}_g$, which is now constrained to be a diagonal matrix for optimization.¹

Since the M-step update equations of $\mathbf{W}_g$ and $\boldsymbol{\Psi}_g$ (i.e., Eqs. (B.10) and (B.11)) are

coupled, we proceed in the same way as in [77]. First, we ignore the latent variables yn and

maximize the likelihood function for µg and πg. This gives us the update Eqs. (7.5) and

(7.6) as in the isotropic noise case. Next, to update Wg and Ψg, we only seek to increase the

likelihood function instead of maximizing it, which is in principle similar to the Generalized

Expectation Maximization (GEM) method. The parameters µg and πg are assumed to be

fixed. Also, the statistics $\langle \mathbf{y}_n|g \rangle$ and $\langle \mathbf{y}_n\mathbf{y}_n^{\mathrm{T}}|g \rangle$ are obtained from the estimated parameters in the first step using Eqs. (B.5) and (B.6) for each mixture. In this case, the parameters $\mathbf{W}$, $\boldsymbol{\Psi}$ and $\mathbf{M}$ are also considered mixture dependent. Now, when the maximization is carried

out assuming these parameters as pre-computed constants, we obtain a new set of simplified

update equations for $\mathbf{W}_g$ and $\boldsymbol{\Psi}_g$, given by:

$$\mathbf{W}_g = \mathbf{S}_g\boldsymbol{\Psi}_g^{-1}\mathbf{W}_g\left[\mathbf{I} + \mathbf{M}_g^{-1}\mathbf{W}_g^{\mathrm{T}}\boldsymbol{\Psi}_g^{-1}\mathbf{S}_g\boldsymbol{\Psi}_g^{-1}\mathbf{W}_g\right]^{-1} \quad \text{and} \tag{B.12}$$

$$\boldsymbol{\Psi}_g = \mathrm{diag}\left(\mathbf{S}_g - \mathbf{S}_g\boldsymbol{\Psi}_g^{-1}\mathbf{W}_g\mathbf{M}_g^{-1}\mathbf{W}_g^{\mathrm{T}}\right). \tag{B.13}$$

Here, the $\mathrm{diag}(\cdot)$ operation retains only the diagonal elements of the matrix that it operates on. The value of $\mathbf{S}_g$ is obtained from Eq. (7.7).

¹ The approaches presented in [142] and [140] could also be followed in this procedure. However, we chose to utilize the methods in [77] for the EM formulation to obtain a set of compact M-step equations.
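For illustration, a minimal sketch of the per-mixture GEM updates in Eqs. (B.12) and (B.13) follows, assuming the weighted sample covariance S_g from Eq. (7.7) and the current W_g and diagonal Psi_g are given; function and variable names are illustrative assumptions.

import numpy as np

def gem_update(S, W, Psi):
    # S: (d, d) weighted sample covariance; W: (d, q); Psi: (d, d) diagonal.
    q = W.shape[1]
    Psi_inv = np.diag(1.0 / np.diag(Psi))
    M = np.eye(q) + W.T @ Psi_inv @ W                 # Eq. (B.3), per mixture
    M_inv = np.linalg.inv(M)
    A = S @ Psi_inv @ W                               # shared intermediate term
    W_new = A @ np.linalg.inv(np.eye(q) + M_inv @ W.T @ Psi_inv @ A)  # Eq. (B.12)
    Psi_new = np.diag(np.diag(S - A @ M_inv @ W.T))   # Eq. (B.13), diagonal kept
    return W_new, Psi_new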


REFERENCES

[1] B. Delgutte, D. N. Caplan, F. H. Guenther, J. R. Melcher, J. C. Adams, J. S. Perkell, K. E. Hancock, and M. C. Brown. (2005) Brain mechanisms for hearing and speech. [Online] http://ocw.mit.edu/courses/health-sciences-and-technology/hst-722j-brain-mechanisms-for-hearing-and-speech-fall-2005/index.htm.

[2] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 99, pp. 788–798, May 2010.

[3] E. Adar, “GUESS: A language and interface for graph exploration,” in Proc. SIGCHI. ACM, 2006, pp. 791–800.

[4] P. Matejka, O. Glembek, F. Castaldo, M. Alam, O. Plchot, P. Kenny, L. Burget, and J. Cernocky, “Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification,” in Proc. IEEE ICASSP, Prague, Czech Republic, May 2011, pp. 4828–4831.

[5] D. Reynolds, M. Zissman, T. Quatieri, G. O’Leary, and B. Carlson, “The effects of telephone transmission degradations on speaker recognition performance,” in Proc. IEEE ICASSP, 1995, pp. 329–332.

[6] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, “Score normalization for text-independent speaker verification systems,” Digital Signal Process., vol. 10, no. 1-3, pp. 42–54, Jan. 2000.

[7] R. Rose, E. Hofstetter, and D. Reynolds, “Integrated models of signal and background with application to speaker identification in noise,” IEEE Trans. Speech, Audio Process., vol. 2, no. 2, pp. 245–257, 1994.

[8] Y. Lei, L. Burget, L. Ferrer, M. Graciarena, and N. Scheffer, “Towards noise-robust speaker recognition using probabilistic linear discriminant analysis,” in Proc. IEEE ICASSP, Mar. 2012, pp. 4253–4256.

[9] Q. Jin, T. Schultz, and A. Waibel, “Far-field speaker recognition,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 7, pp. 2023–2032, 2007.

[10] J. H. L. Hansen, “Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition,” Speech Comm., vol. 20, no. 1-2, pp. 151–173, Nov. 1996.


[11] X. Fan and J. H. L. Hansen, “Speaker identification within whispered speech audio streams,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 5, pp. 1408–1421, July 2011.

[12] C. Zhang and J. H. L. Hansen, “Whisper-island detection based on unsupervised segmentation with entropy-based speech feature processing,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 4, pp. 883–894, May 2011.

[13] J. H. L. Hansen and V. Varadarajan, “Analysis and compensation of Lombard speech across noise type and levels with application to in-set/out-of-set speaker recognition,” IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 2, pp. 366–378, Feb. 2009.

[14] NIST, “The NIST year 2008 speaker recognition evaluation plan,” 2008, [Online] http://www.itl.nist.gov/iad/mig/tests/spk/2008/sre08_evalplan_release4.pdf.

[15] ——, “The NIST year 2010 speaker recognition evaluation plan,” 2010, [Online] http://www.itl.nist.gov/iad/mig/tests/spk/2010/NIST_SRE10_evalplan.r6.pdf.

[16] ——, “The NIST year 2012 speaker recognition evaluation plan,” 2012, [Online] http://www.nist.gov/itl/iad/mig/upload/NIST_SRE12_evalplan-v17-r1.pdf.

[17] P. Kenny, “Bayesian speaker verification with heavy tailed priors,” in Proc. Odyssey, Brno, Czech Republic, 2010.

[18] D. Reynolds and R. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. Speech, Audio Process., vol. 3, no. 1, pp. 72–83, 1995.

[19] D. Reynolds, T. Quatieri, and R. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Process., vol. 10, no. 1–3, pp. 19–41, 2000.

[20] W. Campbell, D. Sturim, D. Reynolds, and A. Solomonoff, “SVM based speaker verification using a GMM supervector kernel and NAP variability compensation,” in Proc. IEEE ICASSP, May 2006, pp. 97–100.

[21] A. Solomonoff, W. Campbell, and I. Boardman, “Advances in channel compensation for SVM speaker recognition,” in Proc. IEEE ICASSP, vol. 1, 2005, pp. 629–632.

[22] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, “A study of interspeaker variability in speaker verification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 5, pp. 980–988, July 2008.

[23] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1435–1447, May 2007.


[24] T. Hasan, S. O. Sadjadi, G. Liu, N. Shokouhi, H. Boril, and J. H. Hansen, “CRSS systems for 2012 NIST speaker recognition evaluation,” in Proc. IEEE ICASSP, Vancouver, Canada, May 2013.

[25] A. Chandrasekaran, “Efficient methods for rapid UBM training (RUT) for robust speaker verification,” Master’s thesis, The University of Texas at Dallas, Aug. 2008.

[26] T. Hasan and J. H. L. Hansen, “A study on universal background model training in speaker verification,” IEEE Trans. Audio, Speech, Lang. Process., pp. 1890–1899, Sep. 2011.

[27] T. Hasan, Y. Lei, A. Chandrasekaran, and J. H. L. Hansen, “A novel feature sub-sampling method for efficient universal background model training in speaker verification,” in Proc. IEEE ICASSP, Dallas, TX, March 2010, pp. 4494–4497.

[28] T. Hasan and J. H. L. Hansen, “Acoustic factor analysis for robust speaker verification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 4, pp. 842–853, Oct. 2012.

[29] ——, “Integrated feature normalization and enhancement for robust speaker recognition using acoustic factor analysis,” in Proc. InterSpeech, Portland, OR, Sept. 2012, pp. 1568–1571.

[30] ——, “Factor analysis of acoustic features using a mixture of probabilistic principal component analyzers for robust speaker verification,” in Proc. Odyssey, Singapore, June 2012.

[31] ——, “Front-end channel compensation using mixture-dependent feature transformations for i-vector speaker recognition,” in Proc. InterSpeech, Portland, OR, Sept. 2012, pp. 1091–1094.

[32] ——, “Acoustic factor analysis based universal background model for robust speaker verification in noise,” in Proc. InterSpeech, Lyon, France, Aug. 2013.

[33] A. E. Rosenberg, “Automatic speaker verification: A review,” Proc. of IEEE, vol. 64, no. 4, pp. 475–487, 1976.

[34] B. S. Atal, “Automatic recognition of speakers from their voices,” Proc. of the IEEE, vol. 64, no. 4, pp. 460–475, 1976.

[35] G. R. Doddington, “Speaker recognition: identifying people by their voices,” Proc. of IEEE, vol. 73, no. 11, pp. 1651–1664, 1985.

[36] J. Naik, “Speaker verification: A tutorial,” IEEE Communications Magazine, vol. 28, no. 1, pp. 42–48, 1990.


[37] S. Furui, “Speaker-dependent-feature extraction, recognition and processing techniques,” Speech Comm., vol. 10, no. 5, pp. 505–520, 1991.

[38] H. Gish and M. Schmidt, “Text-independent speaker identification,” IEEE Signal Processing Magazine, vol. 11, no. 4, pp. 18–32, 1994.

[39] R. Mammone, X. Zhang, and R. Ramachandran, “Robust speaker recognition: A feature-based approach,” IEEE Signal Processing Magazine, vol. 13, no. 5, p. 58, 1996.

[40] S. Furui, “Recent advances in speaker recognition,” Pattern Recognition Lett., vol. 18, no. 9, pp. 859–872, 1997.

[41] J. Campbell, “Speaker recognition: A tutorial,” Proc. of the IEEE, vol. 85, no. 9, pp. 1437–1462, Sep. 1997.

[42] F. Bimbot, J. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacretaz, and D. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP J. on Applied Signal Process., vol. 2004, pp. 430–451, 2004.

[43] J. Flanagan, Speech Analysis Synthesis and Perception, 2nd ed. New York and Berlin: Springer-Verlag, 1972.

[44] J. Deller, J. H. L. Hansen, and J. Proakis, Discrete Time Processing of Speech Signals, 2nd ed. IEEE Press, 2000.

[45] J. Markel, B. Oshika, and A. Gray Jr, “Long-term feature averaging for speaker recognition,” IEEE Trans. Acoust., Speech, Signal Process., vol. 25, no. 4, pp. 330–337, 1977.

[46] K. Li and E. Wrench Jr, “An approach to text-independent speaker recognition with short utterances,” in Proc. IEEE ICASSP, vol. 8, 1983, pp. 555–558.

[47] J. J. Wolf, “Efficient acoustic parameters for speaker recognition,” J. of the Acoust. Soc. of America, vol. 51, p. 2044, 1972.

[48] D. Reynolds, W. Andrews, J. Campbell, J. Navratil, B. Peskin, A. Adami, Q. Jin, D. Klusacek, J. Abramson, R. Mihaescu et al., “The SuperSID project: Exploiting high-level information for high-accuracy speaker recognition,” in Proc. IEEE ICASSP, vol. 4, 2003, pp. IV–784.

[49] E. Shriberg, L. Ferrer, S. Kajarekar, A. Venkataraman, and A. Stolcke, “Modeling prosodic feature sequences for speaker recognition,” Speech Comm., vol. 46, no. 3, pp. 455–472, 2005.


[50] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoustics, Speech, Signal Processing, vol. 28, no. 4, pp. 357–366, Aug. 1980.

[51] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738–1752, 1990.

[52] A. Oppenheim and R. Schafer, “From frequency to quefrency: A history of the cepstrum,” IEEE Signal Processing Magazine, vol. 21, no. 5, pp. 95–106, 2004.

[53] J. Sohn, N. Kim, and W. Sung, “A statistical model-based voice activity detection,” IEEE Signal Process. Lett., vol. 6, no. 1, pp. 1–3, 1999.

[54] F. Beritelli and A. Spadaccini, “The role of Voice Activity Detection in forensic speaker verification,” in Proc. DSP. IEEE, 2011, pp. 1–6.

[55] S. Furui, “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Speech, Signal Processing, vol. 29, no. 2, pp. 254–272, 2003.

[56] J. Pelecanos and S. Sridharan, “Feature warping for robust speaker verification,” in Proc. Odyssey, Crete, Greece, 2001, pp. 213–218.

[57] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Trans. on SAP, vol. 2, no. 4, pp. 578–589, Oct. 1994.

[58] H. Boril and J. H. L. Hansen, “Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 6, pp. 1379–1393, Sep. 2010.

[59] G. R. Doddington, M. A. Przybocki, A. F. Martin, and D. A. Reynolds, “The NIST speaker recognition evaluation – overview, methodology, systems, results, perspective,” Speech Communication, vol. 31, no. 2, pp. 225–254, 2000.

[60] N. Brummer, “Measuring, refining and calibrating speaker and language information extracted from speech,” Ph.D. dissertation, Stellenbosch: University of Stellenbosch, 2010.

[61] F. Soong, A. Rosenberg, L. Rabiner, and B. Juang, “A vector quantization approach to speaker recognition,” in Proc. IEEE ICASSP, vol. 10. IEEE, 1985, pp. 387–390.

[62] D. Burton, “Text-dependent speaker verification using vector quantization source coding,” IEEE Trans. Acoust., Speech, Signal Process., vol. 35, no. 2, pp. 133–143, 1987.

[63] F. Soong, A. Rosenberg, B. Juang, and L. Rabiner, “A vector quantization approach to speaker recognition,” AT&T Tech. J., vol. 66, no. 2, pp. 14–26, 1987.


[64] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. of The Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.

[65] D. A. Reynolds, “Comparison of background normalization methods for text-independent speaker verification,” in Proc. InterSpeech, Rhodes, Greece, 1997.

[66] J.-L. Gauvain and C.-H. Lee, “Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains,” IEEE Trans. Speech, Audio Process., vol. 2, no. 2, pp. 291–298, 1994.

[67] R. Kuhn, P. Nguyen, J.-C. Junqua, L. Goldwasser, N. Niedzielski, S. Fincke, K. Field, and M. Contolini, “Eigenvoices for speaker adaptation,” in Proc. ICSLP, vol. 98, Sydney, Australia, 1998, pp. 1771–1774.

[68] P. Kenny, M. Mihoubi, and P. Dumouchel, “New MAP estimators for speaker recognition,” in Proc. Eurospeech, vol. 3, Geneva, Switzerland, 2003, pp. 2964–2967.

[69] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[70] W. Campbell, D. Sturim, and D. Reynolds, “Support vector machines using GMM supervectors for speaker verification,” IEEE Signal Process. Lett., vol. 13, no. 5, pp. 308–311, 2006.

[71] A. Hatch, S. Kajarekar, and A. Stolcke, “Within-class covariance normalization for SVM-based speaker recognition,” in Proc. InterSpeech, Pittsburgh, Pennsylvania, 2006.

[72] W. Campbell, “Generalized linear discriminant sequence kernels for speaker recognition,” in Proc. IEEE ICASSP, vol. 1, Orlando, FL, 2002, pp. 161–164.

[73] C. M. Bishop, Pattern Recognition and Machine Learning. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.

[74] P. Kenny and P. Dumouchel, “Disentangling speaker and channel effects in speaker verification,” in Proc. IEEE ICASSP, vol. 1, May 2004, pp. I–37.

[75] M. Knott and D. J. Bartholomew, Latent Variable Models and Factor Analysis. Edward Arnold, 1999, no. 7.

[76] T. K. Moon, “The expectation-maximization algorithm,” IEEE Signal Processing Magazine, vol. 13, no. 6, pp. 47–60, 1996.

[77] M. Tipping and C. Bishop, “Mixtures of probabilistic principal component analyzers,” Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.


[78] S. Roweis, “EM algorithms for PCA and SPCA,” Advances in Neural Info. Process. Sys., pp. 626–632, 1998.

[79] R. Kuhn, J. Junqua, P. Nguyen, and N. Niedzielski, “Rapid speaker adaptation in eigenvoice space,” IEEE Trans. Audio, Speech, Lang. Process., vol. 8, no. 6, pp. 695–707, 2000.

[80] P. Kenny, G. Boulianne, and P. Dumouchel, “Eigenvoice modeling with sparse training data,” IEEE Trans. Speech, Audio Process., vol. 13, no. 3, pp. 345–354, May 2005.

[81] P. Kenny, “Joint factor analysis of speaker and session variability: Theory and algorithms,” CRIM, Montreal, (Technical Report) CRIM-06/08-13, 2005.

[82] S. J. Prince and J. H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in Proc. ICCV 2007, Rio de Janeiro, Oct. 2007, pp. 1–8.

[83] N. Dehak, P. Kenny, R. Dehak, O. Glembek, P. Dumouchel, L. Burget, V. Hubeika, and F. Castaldo, “Support vector machines and joint factor analysis for speaker verification,” in Proc. IEEE ICASSP 2009, Taipei, Taiwan, Apr. 2009, pp. 4237–4240.

[84] N. Dehak, R. Dehak, P. Kenny, N. Brummer, P. Ouellet, and P. Dumouchel, “Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification,” in Proc. InterSpeech, Brighton, UK, Sep. 2009, pp. 1559–1562.

[85] N. Dehak and S. Shum, “Low-dimensional speech representation based on factor analysis and its applications,” 2011.

[86] D. Matrouf, N. Scheffer, B. Fauve, and J. Bonastre, “A straightforward and efficient implementation of the factor analysis model for speaker verification,” in Proc. InterSpeech, Antwerp, Belgium, Aug. 2007, pp. 1242–1245.

[87] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Proc. InterSpeech, Florence, Italy, Oct. 2011, pp. 249–252.

[88] G. Doddington, “Speaker recognition based on idiolectal differences between speakers,” in Proc. Eurospeech, vol. 1, 2001, pp. 2521–2524.

[89] D. Sturim, D. Reynolds, R. Dunn, and T. Quatieri, “Speaker verification using text-constrained Gaussian mixture models,” in Proc. IEEE ICASSP, vol. 1. IEEE, 2002, pp. 677–680.

[90] D. Reynolds, “Channel robust speaker verification via feature mapping,” in Proc. IEEE ICASSP, vol. 2. IEEE, 2003, pp. 53–56.


[91] W. Campbell, J. Campbell, D. Reynolds, D. Jones, and T. Leek, “Phonetic speaker recognition with support vector machines,” Advances in Neural Info. Process. Sys., vol. 16, pp. 1377–1384, 2003.

[92] D. Sturim and D. Reynolds, “Speaker adaptive cohort selection for tnorm in text-independent speaker verification,” in Proc. IEEE ICASSP, vol. 1, 2005, pp. 741–744.

[93] A. O. Hatch, B. Peskin, and A. Stolcke, “Improved phonetic speaker recognition using lattice decoding,” pp. 169–172, 2005.

[94] A. Stolcke, L. Ferrer, S. Kajarekar, E. Shriberg, and A. Venkataraman, “MLLR transforms as features in speaker recognition,” in Proc. InterSpeech, 2005, pp. 2425–2428.

[95] Y. Bar-Yosef and Y. Bistritz, “Adaptive individual background model for speaker verification,” in Proc. InterSpeech, Brighton, U.K., 2009, pp. 1271–1274.

[96] A. Sarkar, S. Umesh, and S. P. Rath, “Text-independent speaker identification using vocal tract length normalization for building universal background model,” in Proc. InterSpeech, Brighton, U.K., 2009.

[97] D. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,” Speech Commun., vol. 17, no. 1-2, pp. 91–108, 1995.

[98] A. E. Rosenberg, J. DeLong, C. H. Lee, B. H. Juang, and F. K. Soong, “The use of cohort normalized scores for speaker verification,” in Proc. ICSLP. Banff, Alberta, Canada: ISCA, 1992, pp. 599–602.

[99] T. Matsui and S. Furui, “Similarity normalization method for speaker verification based on a posteriori probability,” in ISCA Workshop on Automatic Speaker Recognition, Identification and Verification, 1994.

[100] J. Villalba and N. Brummer, “Towards fully Bayesian speaker recognition: Integrating out the between-speaker covariance,” in Proc. InterSpeech, Florence, Italy, Oct. 2011, pp. 505–508.

[101] P. Schwarz, P. Matejka, and J. Cernocky, “Hierarchical structures of neural networks for phoneme recognition,” in Proc. IEEE ICASSP, vol. 1, Toulouse, France, May 2006, pp. 325–328.

[102] S. Young, “HTK reference manual,” Cambridge University Engineering Department, 1993.

[103] L. Burget, M. Fapso, V. Hubeika, O. Glembek, M. Karafiat, M. Kockmann, P. Matejka, P. Schwarz, and J. Cernocky, “BUT system description: NIST SRE 2008,” in Proc. 2008 NIST Speaker Recognition Evaluation Workshop. Montreal, CA: National Institute of Standards and Technology, 2008, pp. 1–4.


[104] H. Li, B. Ma, K.-A. Lee, H. Sun, D. Zhu, K. C. Sim, C. You, R. Tong, I. Karkkainen, C.-L. Huang, V. Pervouchine, W. Guo, Y. Li, L. Dai, M. Nosratighods, T. Tharmarajah, J. Epps, E. Ambikairajah, E. S. Chng, T. Schultz, and Q. Jin, “The I4U system in NIST 2008 speaker recognition evaluation,” in Proc. IEEE ICASSP, Taipei, Taiwan, April 2009, pp. 4201–4204.

[105] W. Guo, Y. Long, Y. Li, L. Pan, E. Wang, and L. Dai, “iFLY system for the NIST 2008 speaker recognition evaluation,” in Proc. IEEE ICASSP, Taipei, Taiwan, April 2009, pp. 4209–4212.

[106] E. Dalmasso, F. Castaldo, P. Laface, D. Colibro, and C. Vair, “Loquendo - Politecnico di Torino’s 2008 NIST speaker recognition evaluation system,” in Proc. IEEE ICASSP, Taipei, Taiwan, April 2009, pp. 4213–4216.

[107] C. Barras, X. Zhu, J.-L. Gauvain, and L. Lamel, “The CLEAR’06 LIMSI acoustic speaker identification system for CHIL seminars,” in Proc. CLEAR. Southampton, UK: Springer-Verlag, 2007, pp. 233–240.

[108] S. Furui, Digital Speech Processing, Synthesis, and Recognition. Marcel Dekker Inc, 1989.

[109] M. M. Bruce, Estimation of Variance by a Recursive Equation. NASA Technical Note, 1969.

[110] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951. [Online]. Available: http://www.jstor.org/stable/2236703

[111] M. Ben and F. Bimbot, “D-MAP: A distance-normalized MAP estimation of speaker models for automatic speaker verification,” in Proc. IEEE ICASSP, Hong Kong, China, April 2003, pp. II–69–72.

[112] Y. Ephraim and H. L. Van Trees, “A signal subspace approach for speech enhancement,” IEEE Trans. Speech, Audio Process., vol. 3, no. 4, pp. 251–266, Jul. 1995.

[113] A. Rezayee and S. Gazor, “An adaptive KLT approach for speech enhancement,” IEEE Trans. Speech, Audio Process., vol. 9, no. 2, pp. 87–95, Feb. 2001.

[114] M. Kuhne, D. Pullella, R. Togneri, and S. Nordholm, “Towards the use of full covariance models for missing data speaker recognition,” in Proc. IEEE ICASSP, Las Vegas, Nevada, April 2008, pp. 4537–4540.

[115] B. Zhou and J. H. L. Hansen, “Rapid discriminative acoustic model based on eigenspace mapping for fast speaker adaptation,” IEEE Trans. Speech, Audio Process., vol. 13, no. 4, pp. 554–564, July 2005.

Page 187: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

166

[116] L. Burget et. al., “Analysis of feature extraction and channel compensation in a GMMspeaker recognition system,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 7,pp. 1979–1986, Sept. 2007.

[117] E. Batlle, C. Nadeu, and J. Fonollosa, “Feature decorrelation methods in speech recog-nition. A comparative study,” in Proc. ICSLP, vol. 7, Sydney, Australia, 1998, pp. 2907– 2910.

[118] T. Eisele, R. Haeb-Umbach, and D. Langmann, “A comparative study of linear featuretransformation techniques for automatic speech recognition,” in Proc. ICSLP, vol. 1,1996, pp. 252–255.

[119] K. Y. Lee, “Local fuzzy PCA based GMM with dimension reduction on speaker iden-tification,” Pattern Recogn. Lett., vol. 25, pp. 1811–1817, December 2004.

[120] T. Kinnunen, I. Karkkainen, and P. Franti, “Is speech data clustered? - statisticalanalysis of cepstral features,” in Proc. InterSpeech, Aalborg, Denmark, Sept. 2001, pp.2627–2630.

[121] S. Vaseghi, Advanced signal processing and digital noise reduction. Wiley, 1996.

[122] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEETrans. Acoust. Speech Signal Process., vol. 27, no. 2, pp. 113–120, Apr 1979.

[123] M. Hayes, Statistical digital signal processing and modeling. Wiley, 2009.

[124] A. McCree, D. Sturim, and D. Reynolds, “A new perspective on GMM subspace com-pensation based on PPCA and Wiener filtering,” in Proc. InterSpeech, Florence, Italy,Oct. 2011, pp. 145 – 148.

[125] R. Martin, “Noise power spectral density estimation based on optimal smoothing andminimum statistics,” IEEE Trans. Speech, Audio Process., vol. 9, no. 5, pp. 504–512,Jul 2001.

[126] T. Anderson, “Asymptotic theory for principal component analysis,” The Annals ofMathematical Statistics, vol. 34, no. 1, pp. 122–148, 1963.

[127] M. Alam, P. Ouellet, P. Kenny, and D. O’Shaughnessy, “Comparative evaluation of fea-ture normalization techniques for speaker verification,” Advances in Nonlinear SpeechProcess., vol. 7015, pp. 246–253, 2011.

[128] O. Glembek, L. Burget, P. Matejka, M. Karafiat, and P. Kenny, “Simplification andoptimization of i-vector extraction,” in Proc. IEEE ICASSP, Florence, Italy, Oct. 2011,pp. 4516 – 4519.

Page 188: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

167

[129] D. Povey, L. Burget, M. Agarwal, P. Akyazi, F. Kai, A. Ghoshal, O. Glembek, N. Goel,M. Karafit, A. Rastrow, R. C. Rose, P. Schwarz, and S. Thomas, “The subspaceGaussian mixture modelA structured model for speech recognition,” Computer Speech& Lang., vol. 25, no. 2, pp. 404 – 439, 2011.

[130] K. Delac, M. Grgic, and S. Grgic, “Independent comparative study of PCA, ICA, andLDA on the FERET data set,” Intl. J. of Imaging Sys. and Tech., vol. 15, no. 5, pp.252–260, 2005.

[131] Q. Jin and A. Waibel, “Application of LDA to speaker recognition,” in Proc. ICSLP,2000, pp. 250–253.

[132] M. Jordan and R. Jacobs, “Hierarchical mixtures of experts and the EM algorithm,”Neural computation, vol. 6, no. 2, pp. 181–214, 1994.

[133] G. Golub and W. Kahan, “Calculating the singular values and pseudo-inverse of amatrix,” Journal of the Society for Industrial and Applied Mathematics: Series B,Numerical Analysis, pp. 205–224, 1965.

[134] E. ETSI, “202 050 v1. 1.3: Speech processing, transmission and quality aspects (stq);distributed speech recognition; advanced front-end feature extraction algorithm; com-pression algorithms,” ETSI standard, 2002.

[135] S. O. Sadjadi, T. Hasan, and J. H. L. Hansen, “Mean Hilbert Envelope Coefficients(MHEC) for Robust Speaker Recognition,” in Proc. InterSpeech, Portland, OR, Sept.2012, pp. 1696–1699.

[136] U. H. Yapanel and J. H. L. Hansen, “A new perceptually motivated MVDR-basedacoustic front-end (PMVDR) for robust automatic speech recognition,” Speech Com-mun., vol. 50, pp. 142–152, Feb. 2008.

[137] H. Boril and J. H. L. Hansen, “UT-scope: Towards LVCSR under Lombard effectinduced by varying types and levels of noisy background,” in Proc. IEEE ICASSP,Prague, Czech Republic, May 2011, pp. 4472 – 4475.

[138] T. Hasan, R. Saeidi, J. H. L. Hansen, and D. A. van Leeuwen, “Duration Mis-match Compensation for I-vector based Speaker Recognition Systems,” in Proc. IEEEICASSP, Vancouver, Canada, May. 2013.

[139] T. Hasan, G. Liu, S. O. Sadjadi, N. Shokouhi, H. Boril, A. Misra, K. W. Godin, andJ. H. Hansen, “UTD-CRSS systems for 2012 NIST speaker recognition evaluation,” inNIST 2012 Speaker Recognition Evaluation Workshop, Orlando, FL, Dec. 2012.

[140] G. J. McLachlan and D. Peel, “Mixtures of factor analyzers,” in Proc. ICML, Jun.2000, pp. 599–606.

Page 189: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

168

[141] Y. Tang, R. Salakhutdinov, and G. Hinton, “Deep mixtures of factor analysers,” inProc. ICML, Edinburgh, Scotland, Jun. 2012.

[142] Z. Ghahramani, G. Hinton et al., “The EM algorithm for mixtures of factor analyzers,”Technical Report CRG-TR-96-1, University of Toronto, Tech. Rep., 1996.

[143] M. Brooks. VOICEBOX: Speech Processing Toolbox for MATLAB. [Online] http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html.

[144] N. Brummer. (2012) SRE’12 - BOSARIS Toolkit. [Online] https://sites.google.com/site/bosaristoolkit/sre12.

[145] C. M. Bishop, “Bayesian PCA,” Advances in neural information processing systems,pp. 382–388, 1999.

[146] S. Nakajima, M. Sugiyama, and D. Babacan, “On Bayesian PCA: Automatic dimen-sionality selection and analytic solution,” in Proc. ICML, Bellevue, WA, 2011, pp.497–504.

[147] N. Brummer and E. de Villiers, “The BOSARIS toolkit: Theory, algorithms and codefor surviving the new DCF,” in NIST SRE Analysis Workshop, Atlanta, GA, Dec.2011.

[148] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neuralnetworks for large-vocabulary speech recognition,” IEEE Trans. Audio, Speech, Lang.Process., vol. 20, no. 1, pp. 30–42, 2012.

[149] M. Senoussaoui, N. Dehak, P. Kenny, R. Dehak, and P. Dumouchel, “First attempt ofboltzmann machines for speaker verification,” in Proc. Odyssey 2012, Singapore, June2012.

Page 190: EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION ... · EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION Publication No. Tau q Hasan Al Banna, PhD The University

VITA

Taufiq Hasan Al Banna received his B.Sc. and M.Sc. degrees in Electrical and Electronic Engineering (EEE) from Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh, in 2006 and 2008, respectively. He was a Lecturer in the EEE Department at United International University (UIU), Dhaka, Bangladesh, from December 2006 to June 2008. He earned his doctorate degree in the Department of Electrical Engineering, Erik Jonsson School of Engineering and Computer Science, The University of Texas at Dallas (UTD). He is a member of the Center for Robust Speech Systems (CRSS). He was the lead person responsible for the CRSS efforts on the National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) 2012. His research interests include acoustic modeling for robust speaker recognition, speech enhancement, front-end processing, feature normalization, and audio/video processing for video summarization.

Journal papers

1. T. Hasan and J. H. L. Hansen, “Acoustic factor analysis for robust speaker verification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 4, pp. 842–853, Oct. 2012.

2. T. Hasan and J. H. L. Hansen, “A study on universal background model training in speaker verification,” IEEE Trans. Audio, Speech, Lang. Process., pp. 1890–1899, Sept. 2011.

3. T. Hasan and M. K. Hasan, “An MMSE estimator for speech enhancement considering the constructive and destructive interference of noise,” Signal Processing, IET, vol. 4, no. 1, pp. 1–11, Feb. 2010.

4. T. Hasan and J. H. L. Hansen, “Suppression of residual noise from speech signals using empirical mode decomposition,” IEEE Signal Process. Lett., vol. 16, no. 1, pp. 2–5, Jan. 2009.

Conference papers

1. T. Hasan and J. H. L. Hansen, “Acoustic factor analysis based universal background model for robust speaker verification in noise,” in Proc. InterSpeech, Lyon, France, Aug. 2013.

2. T. Hasan, S. O. Sadjadi, G. Liu, N. Shokouhi, H. Boril, and J. H. Hansen, “CRSS systems for 2012 NIST speaker recognition evaluation,” in Proc. IEEE ICASSP, Vancouver, Canada, May 2013.

3. T. Hasan, R. Saeidi, J. H. L. Hansen, and D. A. van Leeuwen, “Duration mismatch compensation for i-vector based speaker recognition systems,” in Proc. IEEE ICASSP, Vancouver, Canada, May 2013.

4. G. Liu, T. Hasan, H. Boril, and J. H. Hansen, “An investigation on back-end for speaker recognition in multi-session enrollment,” in Proc. IEEE ICASSP, Vancouver, Canada, May 2013.

5. T. Hasan and J. H. L. Hansen, “Integrated feature normalization and enhancement for robust speaker recognition using acoustic factor analysis,” in Proc. InterSpeech, Portland, OR, Sept. 2012, pp. 1568–1571.

6. T. Hasan and J. H. L. Hansen, “Front-end channel compensation using mixture-dependent feature transformations for i-vector speaker recognition,” in Proc. InterSpeech, Portland, OR, Sept. 2012, pp. 1091–1094.

7. K. W. Godin, T. Hasan, and J. H. L. Hansen, “Glottal waveform analysis of physical task stress speech,” in Proc. InterSpeech, Portland, OR, Sept. 2012.

8. S. O. Sadjadi, T. Hasan, and J. H. L. Hansen, “Mean Hilbert envelope coefficients (MHEC) for robust speaker recognition,” in Proc. InterSpeech, Portland, OR, Sept. 2012, pp. 1696–1699.

9. T. Hasan and J. H. L. Hansen, “Factor analysis of acoustic features using a mixture of probabilistic principal component analyzers for robust speaker verification,” in Proc. Odyssey, Singapore, June 2012.

10. T. Hasan, H. Boril, A. Sangwan, and J. H. L. Hansen, “A multi-modal highlight extraction scheme for sports videos using an information-theoretic excitability measure,” in Proc. IEEE ICASSP, Kyoto, Japan, 2012, pp. 2381–2384.

11. T. Hasan and J. H. L. Hansen, “Robust speaker recognition in non-stationary room environments based on empirical mode decomposition,” in Proc. InterSpeech, Florence, Italy, Oct. 2011, pp. 2733–2736.

12. H. Boril, A. Sangwan, T. Hasan, and J. H. L. Hansen, “Automatic excitement-level detection for sports highlights generation,” in Proc. InterSpeech, Makuhari, Chiba, Japan, Sept. 2010, pp. 2202–2205.

13. T. Hasan, Y. Lei, A. Chandrasekaran, and J. H. L. Hansen, “A novel feature sub-sampling method for efficient universal background model training in speaker verification,” in Proc. IEEE ICASSP, Dallas, TX, March 2010, pp. 4494–4497.

14. M. R. Khan, T. Hasan, and M. R. Khan, “Iterative noise power subtraction technique for improved speech quality,” in Proc. ICECE, 2008.

15. T. Hasan, M. Huq, R. Mitra, and M. K. Hasan, “A two stage speech enhancement method for further improvement of speech quality by extracting signal from residual,” in Proc. ISSPA, Sharjah, UAE, February 12–15, 2007.

16. T. Hasan and M. K. Hasan, “A probabilistic speech enhancement filter utilizing the constructive and destructive interference of noise,” in Proc. EUSIPCO, Poznan, Poland, September 3–7, 2007.

Workshop papers

1. T. Hasan, G. Liu, S. O. Sadjadi, N. Shokouhi, H. Boril, A. Misra, K. W. Godin, and J. H. Hansen, “UTD-CRSS systems for 2012 NIST speaker recognition evaluation,” in NIST 2012 Speaker Recognition Evaluation Workshop, Orlando, FL, Dec. 2012.

2. J.-W. Suh, S. O. Sadjadi, G. Liu, T. Hasan, K. W. Godin, and J. H. Hansen, “Exploring Hilbert envelope based acoustic features in i-vector speaker verification using HT-PLDA,” in NIST 2011 Speaker Recognition Evaluation Workshop, Atlanta, GA, Dec. 2011.

3. G. Liu, S. O. Sadjadi, T. Hasan, J.-W. Suh, and J. H. Hansen, “UTD-CRSS systems for NIST Language Recognition Evaluation 2011,” in NIST 2011 Language Recognition Evaluation Workshop, Atlanta, GA, Dec. 2011.

4. Y. Lei, T. Hasan, J.-W. Suh, A. Sangwan, H. Boril, G. Liu, K. Godin, C. Zhang, and J. H. L. Hansen, “The CRSS systems for the 2010 NIST speaker recognition evaluation,” 2010.

Submitted papers

1. T. Hasan and J. H. L. Hansen, “Maximum likelihood acoustic factor analysis models for robust speaker verification in noise,” IEEE Trans. Audio, Speech, Lang. Process., 2013.

2. T. Hasan, H. Boril, A. Sangwan, and J. H. L. Hansen, “Multi-modal highlight generation for sports videos using an information-theoretic excitability measure,” EURASIP J. of Advanced Signal Process., 2013.