Post on 21-Dec-2015
Acoustic Vector Re-sampling for GMMSVM-Based Speaker Verification
Man-Wai MAK and Wei RAO
The Hong Kong Polytechnic University
enmwmak@polyu.edu.hk
http://www.eie.polyu.edu.hk/~mwmak/
Outline
• GMM-UBM for Speaker Verification
• GMM-SVM for Speaker Verification
• Data-Imbalance Problem in GMM-SVM
• Utterance Partitioning for GMM-SVM
• Experiments on NIST SRE
Speaker Verification
To verify the identity of a claimant based on his/her voice
Is this Mary’s voice?
I am Mary
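Verification Process
[Figure: verification process. The claimant's utterance ("I'm John") goes through feature extraction and is scored against John's model (John's "voiceprint") and an impostor model (impostors' "voiceprints"). The impostor score is subtracted from the speaker score, and the result goes to score normalization and decision making, where it is compared with a decision threshold to accept or reject the claim.]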
Acoustic Features
• Speech is a continuous evolution of the vocal tract
• Need to extract a sequence of spectra or a sequence of spectral coefficients
• Use a sliding window: 25 ms window, 10 ms shift
|X(ω)| → Log → DCT → MFCC
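The sliding-window analysis and the log-spectrum-plus-DCT chain can be sketched as follows. This is a simplified illustration, not the front end used in the experiments: the 8 kHz sampling rate, the Hamming window, the synthetic test tone and the function names are assumptions, and the mel filterbank of a real MFCC front end is omitted.

```python
import cmath
import math


def split_into_frames(signal, sample_rate, win_ms=25.0, shift_ms=10.0):
    """Cut a signal into overlapping frames (25 ms window, 10 ms shift)."""
    win = int(round(sample_rate * win_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    return [signal[start:start + win]
            for start in range(0, len(signal) - win + 1, shift)]


def log_magnitude_spectrum(frame):
    """Log |X(w)| of one Hamming-windowed frame, using a naive DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(math.log(abs(acc) + 1e-10))  # small floor avoids log(0)
    return spectrum


def dct_ii(values, num_coeffs):
    """First num_coeffs DCT-II coefficients of a sequence."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(num_coeffs)]


# Hypothetical input: one second of a 440 Hz tone sampled at 8 kHz (assumed rate).
SAMPLE_RATE = 8000
signal = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
frames = split_into_frames(signal, SAMPLE_RATE)

# 12 cepstral coefficients of the first frame (no mel filterbank in this sketch).
cepstrum = dct_ii(log_magnitude_spectrum(frames[0]), 12)
```

With these assumed settings each frame holds 200 samples and consecutive frames start 80 samples apart, so neighbouring frames overlap heavily.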
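GMM-UBM for Speaker Verification
• The acoustic vectors (MFCCs) of speaker s are modeled by a probability density function parameterized by
$$\Lambda^{(s)} = \{\lambda_j^{(s)}, \boldsymbol{\mu}_j^{(s)}, \boldsymbol{\Sigma}_j^{(s)}\}_{j=1}^{M}$$
• Gaussian mixture model (GMM) for speaker s:
$$p(\mathbf{x}|\Lambda^{(s)}) = \sum_{j=1}^{M} \lambda_j^{(s)}\, p(\mathbf{x}|\boldsymbol{\mu}_j^{(s)}, \boldsymbol{\Sigma}_j^{(s)})$$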
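GMM-UBM for Speaker Verification
• The acoustic vectors of a general population are modeled by another GMM called the universal background model (UBM):
$$p(\mathbf{x}|\Lambda^{(\text{ubm})}) = \sum_{j=1}^{M} \lambda_j^{(\text{ubm})}\, p(\mathbf{x}|\boldsymbol{\mu}_j^{(\text{ubm})}, \boldsymbol{\Sigma}_j^{(\text{ubm})})$$
• Parameters of the UBM:
$$\Lambda^{(\text{ubm})} = \{\lambda_j^{(\text{ubm})}, \boldsymbol{\mu}_j^{(\text{ubm})}, \boldsymbol{\Sigma}_j^{(\text{ubm})}\}_{j=1}^{M}$$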
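GMM-UBM for Speaker Verification
[Figure: the client speaker model Λ^(s) is obtained from the universal background model Λ^(ubm) by MAP adaptation, using the enrollment utterance X^(s) of the client speaker.]
$$\boldsymbol{\mu}_j^{(s)} = \alpha_j E_j(X^{(s)}) + (1-\alpha_j)\,\boldsymbol{\mu}_j^{(\text{ubm})}$$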
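The MAP update of the mixture means can be sketched as below. The slide does not define α_j, so this sketch assumes the usual relevance-factor form α_j = n_j / (n_j + r) from Reynolds et al. (2000), with r = 16 as listed later under Features and Models. Diagonal covariances, the function names and the toy data are likewise assumptions made for illustration.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def posteriors(x, weights, means, variances):
    """Posterior probability of each UBM mixture given one frame x."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """mu_j(s) = alpha_j * E_j(X) + (1 - alpha_j) * mu_j(ubm)."""
    num_mix, dim = len(means), len(means[0])
    counts = [0.0] * num_mix                      # soft counts n_j
    sums = [[0.0] * dim for _ in range(num_mix)]  # posterior-weighted sums
    for x in frames:
        for j, p in enumerate(posteriors(x, weights, means, variances)):
            counts[j] += p
            for d in range(dim):
                sums[j][d] += p * x[d]
    adapted = []
    for j in range(num_mix):
        alpha = counts[j] / (counts[j] + relevance)  # assumed form of alpha_j
        expect = [s / counts[j] for s in sums[j]] if counts[j] > 0 else means[j]
        adapted.append([alpha * e + (1 - alpha) * m
                        for e, m in zip(expect, means[j])])
    return adapted


# Toy UBM with two mixtures in two dimensions (hypothetical values).
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0, 0.0], [10.0, 10.0]]
ubm_vars = [[1.0, 1.0], [1.0, 1.0]]
enrollment = [[1.0, 1.0]] * 32
speaker_means = map_adapt_means(enrollment, ubm_weights, ubm_means, ubm_vars)
```

In this toy case all 32 frames fall on the first mixture, so its mean moves two thirds of the way towards the data (α = 32/48), while the second mixture, which sees essentially no data, stays at its UBM value.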
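GMM-UBM Scoring
• 2-class hypothesis problem:
H0: MFCC sequence X^(c) comes from the true speaker
H1: MFCC sequence X^(c) comes from an impostor
• Verification score is a likelihood ratio:
$$\text{Score} = \log\frac{p(X^{(c)}|H_0)}{p(X^{(c)}|H_1)} = \log p(X^{(c)}|\Lambda^{(s)}) - \log p(X^{(c)}|\Lambda^{(\text{ubm})})$$
[Figure: GMM-UBM scoring. After feature extraction, X^(c) is scored against the speaker model Λ^(s) and the background model Λ^(ubm); the background score is subtracted from the speaker score, and the resulting Score is compared with a threshold to accept or reject.]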
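The likelihood-ratio score above can be sketched in a few lines. This assumes diagonal-covariance GMMs stored as (weights, means, variances) tuples; the function names, the toy models and the zero threshold are hypothetical, and the score is left unnormalized to match the formula.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(frames, weights, means, variances):
    """log p(X | Lambda), summed over frames, using log-sum-exp per frame."""
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gauss_diag(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logs)
        total += top + math.log(sum(math.exp(value - top) for value in logs))
    return total


def llr_score(frames, speaker_gmm, ubm):
    """Score = log p(X | speaker model) - log p(X | UBM)."""
    return (gmm_log_likelihood(frames, *speaker_gmm)
            - gmm_log_likelihood(frames, *ubm))


# Toy one-dimensional models: (weights, means, variances), hypothetical values.
ubm = ([0.5, 0.5], [[0.0], [4.0]], [[1.0], [1.0]])
speaker = ([0.5, 0.5], [[1.0], [4.0]], [[1.0], [1.0]])

claimant_score = llr_score([[1.0], [1.1], [0.9]], speaker, ubm)
impostor_score = llr_score([[-1.0], [-0.8]], speaker, ubm)

THRESHOLD = 0.0  # hypothetical decision threshold
accept_claimant = claimant_score > THRESHOLD
accept_impostor = impostor_score > THRESHOLD
```

Frames close to the adapted mean raise the speaker likelihood relative to the UBM, so the first score is positive and the second is negative under these toy models.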
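GMM-SVM for Speaker Verification
[Figure: mapping an utterance to a GMM-supervector. utt^(s) → Feature Extraction → X^(s) → MAP Adaptation (using the UBM) → Mean Stacking → GMM-supervector.]
$$X^{(s)} \mapsto \boldsymbol{\mu}^{(s)} = \begin{bmatrix} \boldsymbol{\mu}_1^{(s)} \\ \boldsymbol{\mu}_2^{(s)} \\ \vdots \\ \boldsymbol{\mu}_M^{(s)} \end{bmatrix} \in \mathbb{R}^{MD \times 1}$$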
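GMM-SVM Scoring
[Figure: GMM-SVM scoring. The target speaker's utterance utt^(s), the background speakers' utterances utt^(b_1), ..., utt^(b_B) and the claimant's utterance utt^(c) each go through feature extraction, giving X^(s), X^(b_1), ..., X^(b_B) and X^(c). Using the UBM, GMM-supervectors are computed for target speaker s, for the background speakers and for claimant c. SVM scoring combines the kernel evaluations K(X^(c), X^(s)), K(X^(c), X^(b_1)), ..., K(X^(c), X^(b_B)) with the weights α_0^(s), α_1^(s), ..., α_B^(s) and the bias d^(s) to produce S_GMM-SVM(X^(c)).]
$$K(X^{(c)}, X^{(b_i)}) = \sum_{j=1}^{M} \left(\sqrt{\lambda_j}\,\boldsymbol{\Sigma}_j^{-1/2}\boldsymbol{\mu}_j^{(c)}\right)^{T} \left(\sqrt{\lambda_j}\,\boldsymbol{\Sigma}_j^{-1/2}\boldsymbol{\mu}_j^{(b_i)}\right)$$
$$S_{\text{GMM-SVM}}(X^{(c)}) = \alpha_0^{(s)} K(X^{(c)}, X^{(s)}) - \sum_{i \in \text{SV from bkg}} \alpha_i^{(s)} K(X^{(c)}, X^{(b_i)}) + d^{(s)}$$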
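GMM-UBM Scoring vs. GMM-SVM Scoring
GMM-UBM:
$$S_{\text{GMM-UBM}}(X^{(c)}) = \log p(X^{(c)}|\Lambda^{(s)}) - \log p(X^{(c)}|\Lambda^{(\text{ubm})})$$
GMM-SVM:
$$S_{\text{GMM-SVM}}(X^{(c)}) = \alpha_0^{(s)} K(X^{(c)}, X^{(s)}) - \sum_{i \in \text{SV from bkg}} \alpha_i^{(s)} K(X^{(c)}, X^{(b_i)}) + d^{(s)}$$
$$K(X^{(c)}, X^{(s)}) = \sum_{j=1}^{M} \left(\sqrt{\lambda_j}\,\boldsymbol{\Sigma}_j^{-1/2}\boldsymbol{\mu}_j^{(c)}\right)^{T} \left(\sqrt{\lambda_j}\,\boldsymbol{\Sigma}_j^{-1/2}\boldsymbol{\mu}_j^{(s)}\right) = \left(\boldsymbol{\Omega}^{-1/2}\boldsymbol{\mu}^{(c)}\right)^{T} \left(\boldsymbol{\Omega}^{-1/2}\boldsymbol{\mu}^{(s)}\right)$$
where Ω = diag{λ_1^{-1}Σ_1, ..., λ_M^{-1}Σ_M}. The first factor is the normalized GMM-supervector of the claimant's utterance; the second is the normalized GMM-supervector of the target speaker's utterance.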
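The GMM-supervector kernel and the SVM score can be sketched as follows. The sketch assumes diagonal covariances and that the mixture weights and covariances are shared by all supervectors (taken from the UBM); the Lagrange multipliers, the bias and the toy means are hypothetical numbers rather than the output of a trained SVM.

```python
import math


def normalize_supervector(means, weights, variances):
    """Stack sqrt(lambda_j) * Sigma_j^(-1/2) * mu_j over all mixtures."""
    return [math.sqrt(w) * m / math.sqrt(v)
            for w, mean, var in zip(weights, means, variances)
            for m, v in zip(mean, var)]


def kernel(sv_a, sv_b):
    """Linear kernel between two normalized GMM-supervectors."""
    return sum(a * b for a, b in zip(sv_a, sv_b))


def gmm_svm_score(claimant_sv, target_sv, alpha_0, background_svs, alphas, bias):
    """alpha_0 * K(c, s) - sum_i alpha_i * K(c, b_i) + d."""
    score = alpha_0 * kernel(claimant_sv, target_sv)
    score -= sum(a * kernel(claimant_sv, b)
                 for a, b in zip(alphas, background_svs))
    return score + bias


# Shared (UBM) weights and variances, two mixtures in one dimension.
weights = [0.5, 0.5]
variances = [[4.0], [1.0]]

claimant_sv = normalize_supervector([[2.0], [1.0]], weights, variances)
target_sv = normalize_supervector([[2.0], [1.0]], weights, variances)
background_sv = normalize_supervector([[-2.0], [-1.0]], weights, variances)

score = gmm_svm_score(claimant_sv, target_sv, alpha_0=1.0,
                      background_svs=[background_sv], alphas=[0.5], bias=-0.2)
```

Because the normalization is applied once per supervector, the kernel reduces to a plain dot product, which is what lets a linear SVM operate on the normalized supervectors.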
Data Imbalance in GMM-SVM
For each target speaker, we have only one utterance (GMM-supervector) from the target speaker and many utterances from the background speakers.
So, we have a highly imbalanced learning problem.
[Figure: 2-D example (axes x1, x2) of a linear SVM, C=10.0, #SV=3, slope=-1.00, separating the speaker class from the impostor class. Only one training vector comes from the target speaker.]
Data Imbalance in GMM-SVM
[Figure: 2-D example (axes x1, x2) of a linear SVM, C=10.0, #SV=3, slope=-1.44, separating the speaker class from the impostor class.]
Orientation of the decision boundary depends mainly on impostor-class data
Data Imbalance in GMM-SVM
A 3-dim two-class problem illustrating that the SVM decision plane is largely governed by the impostor-class supervectors.
[Figure: impostor class, speaker class, and the region in which the target-speaker vector can be located without changing the orientation of the decision plane.]
Utterance Partitioning
Partition an enrollment utterance of a target speaker into a number of sub-utterances, with each sub-utterance producing one GMM-supervector.
Utterance Partitioning
[Figure: utterance partitioning. The target speaker's enrollment utterance utt^(s) goes through feature extraction, giving the full-length sequence X_0^(s) and the sub-utterances X_1^(s), ..., X_4^(s). Each background speaker's utterance utt^(b_1), utt^(b_2), ..., utt^(b_B) is processed the same way, giving X_0^(b_i), ..., X_4^(b_i) for i = 1, ..., B. MAP adaptation (using the UBM) and mean stacking turn these into the supervectors m_0^(s), ..., m_4^(s) and m_0^(b_1), ..., m_4^(b_B), which are used in SVM training to produce the SVM of target speaker s.]
Length-Representation Trade-off
• When the number of partitions increases, the length of each sub-utterance decreases.
• If the sub-utterances are too short, their supervectors will be almost the same as that of the UBM.
[Figure: 2-D example (axes x1, x2) of a linear SVM, C=10.0, #SV=3, slope=-1.44, with the speaker class, the impostor class, and the supervector corresponding to the UBM marked.]
Utterance Partitioning with Acoustic Vector Resampling (UP-AVR)
Goal: Increase the number of sub-utterances without compromising their representation power
Procedure of UP-AVR:
1. Randomly rearrange the sequence of acoustic vectors in an utterance;
2. Partition the acoustic vectors of the utterance into N segments;
3. Repeat Step 1 and Step 2 R times to obtain RN+1 target-speaker supervectors (RN from the sub-utterances plus one from the full utterance).
[Figure: MFCC sequence before randomization and MFCC sequence after randomization.]
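The UP-AVR procedure can be sketched as below. The function name, the seed parameter and the handling of leftover frames (they go to the last segment) are assumptions for illustration; each returned frame sequence would then go through MAP adaptation and mean stacking to give one supervector.

```python
import random


def up_avr(frames, num_partitions, num_resamplings, seed=0):
    """Return the full utterance plus R*N index-randomized sub-utterances."""
    rng = random.Random(seed)      # seeded only to make the sketch repeatable
    sequences = [list(frames)]     # full-length utterance, kept in original order
    for _ in range(num_resamplings):
        shuffled = list(frames)
        rng.shuffle(shuffled)      # Step 1: randomize the order of the vectors
        seg_len = len(shuffled) // num_partitions
        for n in range(num_partitions):   # Step 2: cut into N segments
            start = n * seg_len
            # The last segment absorbs any leftover frames.
            end = start + seg_len if n < num_partitions - 1 else len(shuffled)
            sequences.append(shuffled[start:end])
    return sequences


# Hypothetical utterance of 100 one-dimensional acoustic vectors.
frames = [[float(t)] for t in range(100)]
sequences = up_avr(frames, num_partitions=4, num_resamplings=5)
```

With N = 4 and R = 5 this yields 21 frame sequences. Because the vectors are shuffled before partitioning, every sub-utterance draws frames from the whole utterance, and repeating the shuffle produces different sub-utterances each time.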
Utterance Partitioning with Acoustic Vector Resampling (UP-AVR)
[Figure: UP-AVR. The target speaker's enrollment utterance utt^(s) and the background speakers' utterances utt^(b_1), utt^(b_2), ..., utt^(b_B) go through feature extraction and index randomization, giving X_0^(s), ..., X_4^(s) and X_0^(b_i), ..., X_4^(b_i) for i = 1, ..., B. MAP adaptation (using the UBM) and mean stacking turn these into the supervectors m_0^(s), ..., m_4^(s) and m_0^(b_1), ..., m_4^(b_B), which are used in SVM training to produce the SVM of target speaker s.]
Utterance Partitioning with Acoustic Vector Resampling (UP-AVR)
• Characteristics of supervectors created by UP-AVR:
  • Average pairwise distance between sub-utt SVs is larger than the average pairwise distance between sub-utt SVs and the full-utt SV.
  • Average pairwise distance between the speaker class's sub-utt SVs and the impostor class's SVs is smaller than the average pairwise distance between the speaker class's full-utt SV and the impostor class's SVs.
[Figure: impostor-class and speaker-class supervectors, with the sub-utt supervectors and the full-utt supervector marked.]
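Nuisance Attribute Projection (NAP) [Solomonoff et al., ICASSP2005]
Goal: To reduce the effect of session variability
Recall the GMM-supervector kernel:
$$K(X^{(c,h)}, X^{(s,h)}) = \left(\boldsymbol{\Omega}^{-1/2}\boldsymbol{\mu}^{(c,h)}\right)^{T} \left(\boldsymbol{\Omega}^{-1/2}\boldsymbol{\mu}^{(s,h)}\right)$$
Define the session- and speaker-dependent supervector as
$$\mathbf{m}^{(s,h)} \equiv \boldsymbol{\Omega}^{-1/2}\boldsymbol{\mu}^{(s,h)}, \quad \text{where } s \text{ stands for speaker and } h \text{ stands for session}$$
Remove the session-dependent part (h) by removing the sub-space that causes the session variability:
$$\mathbf{m}^{(s)} \equiv \mathbf{P}\mathbf{m}^{(s,h)} = (\mathbf{I} - \mathbf{V}\mathbf{V}^{T})\,\mathbf{m}^{(s,h)}$$
The new kernel becomes
$$K(X^{(c)}, X^{(s)}) = \left(\mathbf{P}\mathbf{m}^{(c,h)}\right)^{T} \left(\mathbf{P}\mathbf{m}^{(s,h)}\right) = \mathbf{m}^{(c)T}\mathbf{m}^{(s)}$$
[Figure: the sub-space representing session variability is defined by V. The supervector m^(s,h) is split into VV^T m^(s,h), which lies in that sub-space, and m^(s) = P m^(s,h), which is what remains after the projection.]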
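The projection P = I - VV^T and the resulting kernel can be sketched as follows. The sketch assumes the columns of V are orthonormal and applies the projection column by column, so the full matrix P is never formed; the three-dimensional vectors and the one-column V are toy values.

```python
def nap_project(supervector, nuisance_basis):
    """Apply P = I - V V^T; nuisance_basis lists the orthonormal columns of V."""
    projected = list(supervector)
    for column in nuisance_basis:
        # Component of the supervector along this nuisance direction.
        coeff = sum(c * m for c, m in zip(column, supervector))
        projected = [p - coeff * c for p, c in zip(projected, column)]
    return projected


def nap_kernel(sv_a, sv_b, nuisance_basis):
    """K = (P m_a)^T (P m_b)."""
    pa = nap_project(sv_a, nuisance_basis)
    pb = nap_project(sv_b, nuisance_basis)
    return sum(a * b for a, b in zip(pa, pb))


# Toy example: session variability lies along the first axis only.
V = [[1.0, 0.0, 0.0]]
m_claimant = [5.0, 1.0, 2.0]   # session-dependent supervector m(c,h)
m_target = [-3.0, 1.0, 2.0]    # session-dependent supervector m(s,h)

m_c = nap_project(m_claimant, V)
m_s = nap_project(m_target, V)
k = nap_kernel(m_claimant, m_target, V)
```

The two toy supervectors differ only along the nuisance direction, so after projection they coincide and the kernel no longer depends on the session component.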
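Nuisance Attribute Projection (NAP) [Solomonoff et al., ICASSP2005]
$$\mathbf{P}^{*} = \arg\min_{\mathbf{P}} \sum_{i,j} w_{ij} \left\| \mathbf{P}\mathbf{m}^{(i,h)} - \mathbf{P}\mathbf{m}^{(j,h)} \right\|^{2}$$
$$w_{ij} = \begin{cases} 1 & i \text{ and } j \text{ correspond to the same speaker} \\ 0 & \text{otherwise} \end{cases}$$
[Figure: the sub-space representing session variability is defined by V; m^(s,h) is split into VV^T m^(s,h) and m^(s) = P m^(s,h).]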
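Enrollment Process of GMM-SVM with UP-AVR
[Figure: enrollment. The MFCCs X^(s,h) of an utterance from target-speaker s go through resampling/partitioning, giving X_i^(s,h). MAP adaptation (using the UBM) and mean stacking produce the session-dependent supervectors m_i^(s,h). NAP turns them into the session-independent supervectors m_i^(s), which are used together with the background supervectors m_i^(b_j) in SVM training to produce the SVM of target-speaker s.]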
Verification Process of GMM-SVM with UP-AVR
[Figure: verification. The MFCCs X^(c) of a test utterance from claimant c go through MAP adaptation (using the UBM) and mean stacking, giving the session-dependent supervector m^(c,h). NAP turns it into the session-independent supervector m^(c). SVM scoring with the SVM of target-speaker s gives the score S(X^(c)), and T-Norm (using the Tnorm models) gives the normalized score S~(X^(c)).]
T-Norm (Auckenthaler, 2000)
Goal: To shift and scale the verification scores so that a global decision threshold can be used for all speakers
$$\tilde{S}(X^{(c)}) = \frac{S(X^{(c)}) - \mu(X^{(c)})}{\sigma(X^{(c)})}$$
[Figure: T-Norm. The supervector m^(c) from the test utterance is scored by the target-speaker SVM, giving S(X^(c)), and by T-Norm SVM 1, ..., T-Norm SVM R. The mean μ(X^(c)) and standard deviation σ(X^(c)) of the T-Norm scores are computed and used in a Z-norm step that outputs the normalized score.]
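The T-Norm formula can be sketched in a few lines. The function name and the scores are hypothetical, and the use of the population standard deviation (rather than the sample estimate) is an assumption; with the 200 T-norm models listed under Features and Models the difference would be small.

```python
import statistics


def t_norm(raw_score, tnorm_scores):
    """Shift and scale a score using the test utterance's T-norm model scores."""
    mu = statistics.mean(tnorm_scores)
    sigma = statistics.pstdev(tnorm_scores)  # population std, an assumption
    if sigma == 0.0:
        raise ValueError("T-norm scores have zero variance")
    return (raw_score - mu) / sigma


# Hypothetical scores of one test utterance against four T-norm SVMs.
tnorm_scores = [-1.0, 0.0, 1.0, 2.0]
normalized = t_norm(3.0, tnorm_scores)
```

Because the mean and standard deviation are computed from the test utterance itself, the normalization adapts to each test utterance, which is what allows one global threshold to be used for all speakers.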
Experiments
Speech Data
Evaluations on NIST SRE 2002 and 2004
• NIST SRE 2002:
  • Use NIST’01 for computing the UBMs, impostor-class supervectors of SVMs, Tnorm models, and NAP parameters
  • 2983 true-speaker trials and 36287 impostor attempts
  • 2-min utterances for training and about 1-min utterances for test
• NIST SRE 2004:
  • Use the Fisher corpus for computing UBMs, impostor-class supervectors of SVMs, and Tnorm models
  • NIST’99 and NIST’00 for computing NAP parameters
  • 2386 true-speaker trials and 23838 impostor attempts
  • 5-min utterances for training and testing
Experiments
Features and Models
• 12 MFCC + 12 ΔMFCC with feature warping
• 1024-mixture GMMs for GMM-UBM
• 256-mixture GMMs for GMM-SVM
• MAP relevance factor = 16
• 300 impostor-class supervectors for GMM-SVM
• 200 T-norm models
• 64-dim session variability subspace (NAP corank, rank of V)
Results
No. of mixtures in GMM-SVM (NIST’02)
[Figure: plot with an axis labelled "Normalized", annotated "Large number of features with small variance" and "Threshold below which the variances of features are deemed too small".]
Experiments and Results
Performance on NIST’02
[Figure: performance curves labelled EER=9.05%, EER=9.39% and EER=8.16%.]
Experiments and Results
Performance on NIST’04
[Figure: performance curves labelled EER=9.46%, EER=10.42% and EER=16.05%. Systems shown: GMM-UBM, GMM-SVM, GMM-SVM w/ UP-AVR.]
References
1. S.X. Zhang and M.W. Mak, "Optimized Discriminative Kernel for SVM Scoring and its Application to Speaker Verification," IEEE Trans. on Neural Networks, to appear.
2. M.W. Mak and W. Rao, "Utterance Partitioning with Acoustic Vector Resampling for GMM-SVM Speaker Verification," Speech Communication, vol. 53, no. 1, pp. 119-130, Jan. 2011.
3. M.W. Mak and W. Rao, "Acoustic Vector Resampling for GMMSVM-Based Speaker Verification," Interspeech 2010, Makuhari, Japan, Sept. 2010, pp. 1449-1452.
4. S.Y. Kung, M.W. Mak, and S.H. Lin, Biometric Authentication: A Machine Learning Approach, Prentice Hall, 2005.
5. W. M. Campbell, D. E. Sturim, and D. A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, pp. 308-311, 2006.
6. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, pp. 19-41, 2000.