Speaker Verification The present and future of voiceprint based … Distinguished Lecture... ·...
Transcript of Speaker Verification The present and future of voiceprint based … Distinguished Lecture... ·...
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Asia-Pacific Signal and Information Processing Association
Speaker Verification – The present and future of voiceprint based security
Prof. Eliathamby Ambikairajah Head of School of Electrical Engineering & Telecommunications,
University of New South Wales, Australia
21 Oct 2013
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Outline
• Introduction
• Speaker Verification Applications
• Speaker Verification System
• Performance measure
• NIST Speaker Recognition Evaluation (SRE)
• Discussion
1
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
“How are you?”
Introduction
• Speech conveys several types of information
– Linguistic: message and language information
– Paralinguistic : emotional and physiological characteristics
Speech Recognition
Language Recognition
Speaker Recognition
Emotion Recognition
Accent Recognition
“How are you?” English Hsing Ming Happy Taiwanese
Linguistic Paralinguistic
2
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Introduction
Speech Recognition
Language Recognition
Speaker Recognition
Emotion Recognition
Accent Recognition
“How are you?”
3
Speaker Identification
determines who is speaking given a set of
enrolled speakers
Speaker Verification
determines if the unknown voice is from
the claimed speaker
Speaker Diarization
partition an input audio stream into
homogeneous segments according to the speaker
identity
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia 4
Speaker Identification
determines who is speaking given a set of
enrolled speakers
Speaker Verification
determines if the unknown voice is from
the claimed speaker
Speaker Diarization
partition an input audio stream into
homogeneous segments according to the speaker
identity
3.5 4 4.5 5 5.5
x 104
-0.1
-0.05
0
0.05
0.1
0.15
0.2
Model repositorySpeaker 1
Model
Unknown Speaker
Best Matching Speaker
Speaker 2 Model
Speaker M Model
Model repositorySpeaker 1
Model
Claimed Speaker 2
RejectSpeaker 2
Model
Speaker M Model
3.5 4 4.5 5 5.5
x 104
-0.1
-0.05
0
0.05
0.1
0.15
0.2
3.5 4 4.5 5 5.5
x 104
-0.1
-0.05
0
0.05
0.1
0.15
0.2
3.5 4 4.5 5 5.5
x 104
-0.1
-0.05
0
0.05
0.1
0.15
0.2
Speaker 1
Speaker 2
Speaker 1
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Speaker Verification Applications - Biometrics
5
Physical facilities
Access control
Telephone credit card purchases
Transaction authentication
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Speaker Verification System – Basic Overview
• In automatic speaker verification,
– The front-end converts speech signal into a more convenient representation (typically a set of feature vectors)
– The back-end compares this representation to a model of a speaker to determine how well they match
6
3 . 5 4 4 . 5 5 5 . 5
x 1 04
- 0 . 1
- 0 . 0 5
0
0 . 0 5
0 . 1
0 . 1 5
0 . 2
Feature Extraction
ClassificationSpeech
Speaker Model
Decision MakingAccept/Reject
Front-end Back-end
UBM: represent general, speaker independent model to be compared against a person-specific model when making an accept or reject decision.
Speaker Models
Speaker 1 Model
John’s Model
Speaker Verification System
I am John
Determine level of Match
Determine level of Match
Likelihood of Generic Male
Likelihood of John
Likelihood Ratio Decision Making
NOT JOHN
Universal Background Models (UBM)
Generic Male
Generic Female
Feature Extraction
c0 c1 cnc2
Speaker Verification System – Speaker Enrolment
8
Universal Background Models (UBM)
Generic Male
Generic Female
Speaker 1
Speaker 2
Speaker N
Feature Extraction
ModelTraining
Speaker x1
Speaker x2
Speaker xM
Feature Extraction
Feature Extraction
Feature Extraction
Model Adaptation
Model Adaptation
Model Adaptation
Background male speaker data
Target male speaker data
3 . 5 4 4 . 5 5 5 . 5
x 1 04
- 0 . 1
- 0 . 0 5
0
0 . 0 5
0 . 1
0 . 1 5
0 . 2
3 . 5 4 4 . 5 5 5 . 5
x 1 04
- 0 . 1
- 0 . 0 5
0
0 . 0 5
0 . 1
0 . 1 5
0 . 2
Creating a male UBM
Speaker Models
Speaker x1 Model
Speaker x2 Model
Speaker xM Model
Creating male speaker-specific models
Step 1
Step 2
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Detailed Speaker Verification System
9
Feature
Extraction
Speaker
Modelling
Classification
(Scoring)SpeechAccept/
Reject
Feature
Normalisation
Model
Normalisation
Score
Normalisation
Decision
Making
· Cepstral Mean Subtraction
(CMS)
· RelAtive SpecTrAl (RASTA)
· Feature Warping
· Feature Mapping
· Nuisance Attribute
Projection (NAP)
· Joint Factor Analysis (JFA)
· i-vectors
· Within Class Covariance
Normalisation (WCCN)
· Linear Discriminant Analysis
(LDA)
· Probabilistic Linear
Discriminant Analysis
(PLDA)
· Zero-normalisation
(Z-norm)
· Test-normalisation
(T-Norm)
Front-end: Feature Extraction
10
3.5 4 4.5 5 5.5
x 104
-0.1
-0.05
0
0.05
0.1
0.15
0.2
Frame 1 Frame 2 Frame 3 Frame N
25ms 25ms 25ms 25ms
Feature Vector
Feature Vector
Feature Vector
Feature Vector
Windowing
Feature Extraction
Feature Normalisation
BASIC FEATURES
-10 -5 0 5 100
0.2
0.4
-10 -5 0 5 100
0.1
0.2
0.3
0.4
Windowing
Feature Extraction
Windowing
Feature Extraction
Windowing
Feature Extraction
c0 c1 cnc2 c0 c1 cnc2 c0 c1 cnc2 c0 c1 cnc2
Normalised Feature Vector
c0 c1 cnc2 c0 c1 cnc2 c0 c1 cnc2 c0 c1 cnc2NORMALISED FEATURES
Normalised Feature Vector
Normalised Feature Vector
Normalised Feature Vector
Co distribution
Normalised Co
distribution
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia 11
Temporal Derivative
c0 c1 cnc2 c0 c1 cnc2 c0 c1 cnc2
Normalised Feature vectors
(Frame 1)
Normalised Feature vectors
(Frame 2)
Normalised Feature vectors
(Frame P)
d0d1 dnd2
Delta Feature vectors
(Frame 1)
Delta Feature vectors
(Frame 2)
Delta Feature vectors
(Frame P)
d0d1 dnd2 d0d1 dnd2
Temporal Derivative
a0a1 ana2
Acceleration Feature vectors
(Frame 1)
Acceleration Feature vectors
(Frame 2)
Acceleration Feature vectors
(Frame P)
a0a1 ana2 a0a1 ana2
c0 c1 cnc2 d0d1 dnd2 a0a1 ana2Frame 1 Features:
(e.g: 39 dimensions)
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Detailed Speaker Verification System
12
Feature
Extraction
Speaker
Modelling
Classification
(Scoring)SpeechAccept/
Reject
Feature
Normalisation
Model
Normalisation
Score
Normalisation
Decision
Making
· Cepstral Mean Subtraction
(CMS)
· RelAtive SpecTrAl (RASTA)
· Feature Warping
· Feature Mapping
· Nuisance Attribute
Projection (NAP)
· Joint Factor Analysis (JFA)
· i-vectors
· Within Class Covariance
Normalisation (WCCN)
· Linear Discriminant Analysis
(LDA)
· Probabilistic Linear
Discriminant Analysis
(PLDA)
· Zero-normalisation
(Z-norm)
· Test-normalisation
(T-Norm)
Speaker Modelling
Probability density function approximated by 3-component Gaussian mixture models
Each Gaussian mixture consist of a mean (µ), covariance (Σ) and weight (w)
-8 -6 -4 -2 0 2 4 6 8 100
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
Overall
Weighted
Gaussian 3
Weighted
Gaussian 2
Weighted
Gaussian 1
All Weights
must sum to 1
13
4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 93
4
5
6
7
8
9
45
67
89
3
4
5
6
7
8
9
0
0.2
0.4FEATURE SPACE
MODELLING PROBABILITY DISTRIBUTION
Dim
en
sio
n 1
(C
0)
Dimension 2 (C1)
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Database for creating UBM (example)
• Training set
– 56 male speakers (each speaker consists of 2 minutes of active speech) for creating the UBM
• Target set
– 20 male speakers (each speaker consists of 2 minutes of active speech) for speaker-specific model
• Test set
– 250 male utterances (each speaker has many test utterances) with the known identity
14
-1000 0 1000 2000 3000 4000 5000 6000 7000 8000 9000-4
-3
-2
-1
0
1
2
0 500 1000 1500 2000 2500 3000-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
0 500 1000 1500 2000 2500-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia 15
UBM
Target Speaker Data
Target Model
-6 -4 -2 0 2 4 6 8-0.1
0
0.1
0.2
0.3
0.4
-6 -4 -2 0 2 4 6 8-0.1
-0.05
0
0.05
0.1
0.15
0.2
0.25
0.3
Mean = 0.9
Covariance = 0.9
Mean = 0.8
Covariance = 0.5
Weight = 0.2
Weight = 0.3
Feature Dimension 1
Fe
atu
re D
ime
nsio
n 2
Feature Dimension 1
Fe
atu
re D
ime
nsio
n 2
Universal Background Model (UBM) consists
of 1024 Gaussian mixtures
Target speaker model consists of 1024
Gaussian mixtures
Gaussian mixture consists of a mean (µ), covariance (Σ) and weight (w)
1
2
1024
998
1024
998
Representing GMMs
16
GMM
Mixture 1 Mixture 2 Mixture 1024
1x1
- W
eig
ht
39
x1 M
ean
s
39
x1 C
ova
rian
ces
1x1
- W
eig
ht
39
x1 M
ean
s
39
x1 C
ova
rian
ces
1x1
- W
eig
ht
39
x1 M
ean
s
39
x1 C
ova
rian
ces
1x1024 Weight vector
39x1024 Means matrix
39x1024 Covariances matrix
GMM REPRESENTATION
The UBM and each speaker model is a GMM
Each of them will be represented by a vector of weights, a matrix of means and a matrix of covariances
Universal Background Models
Generic Male
Generic Female
Speaker Models
Speaker 1 Model
John’s Model
Decision Making
Likelihood S came from speaker model Likelihood S did not come from speaker model
Score, L = log
𝑳 ≶ 𝜽
Reject/Accept
Determine level of Match
Determine level of Match
Likelihood of Generic Male
Likelihood of John
17
Feature Extraction
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Score Normalisation
Speaker Verification System 1
Speaker Verification System 2
Speaker Verification System N
Speech
Different Systems perform speaker
verification in parallel
18
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Score Normalisation
Speaker Verification System 1
Speaker Verification System 2
Speaker Verification System N
Speech
Score 1
Score 2
Score N
May not fall in the same range. i.e.,
NOT directly comparable
19
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Score Normalisation
Speaker Verification System 1
Speaker Verification System 2
Speaker Verification System N
Speech
Score 1
Score 2
Score N
Score Normalisation
Score Normalisation
Score Normalisation
NormalisedScore 1
NormalisedScore 2
NormalisedScore N
Normalised scores (L) are comparable
20
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Fusion
21
Final score will be a weighted sum of score from each system
Speaker Verification System 1
Speaker Verification System 2
Speaker Verification System N
Speech
Score Normalisation
Score Normalisation
Score Normalisation
Normalised
Score 1, s1
Normalised
Score 2, s2
Normalised
Score N, sN
Final Score =
w1s1 + w2s2 +…+wNsN
Fusion
Performance measure
• Types of error:
– Misses: valid identity is rejected o Probability of miss: ratio of the number of falsely rejected
speaker tests to the total number of correct speaker trials.
– False alarms: invalid identity is accepted o Probability of false alarm: ratio of the number of falsely accepted
speaker tests to the total number of impostor trials
TRUE SPEAKER
IMPOSTER
ACCEPT CLAIM REJECT CLAIM
CORRECT DECISION
CORRECT DECISION
FALSE ACCEPTANCE
MISS
22
Performance measure - Detection error trade-off (DET) curve
False Acceptance Rate (in %)
Mis
s R
ate
(in
%)
Each point on the curve corresponds to a different 𝜃
23
Performance measure - Detection error trade-off (DET) curve
Equal Error Rate (EER) = 1 %
Wire Transfer:
False acceptance is very costly
Users may tolerate rejections for security
Customization:
False rejections alienate customers
Any customization is beneficial
Application operating point depends on relative costs of the two error types
High Convenience
High Security
Balance
False Acceptance Rate (in %)
Mis
s R
ate
(in
%)
24
NIST Speaker Recognition Evaluation (SRE)
• Ongoing text independent speaker recognition evaluations conducted by NIST (http://www.itl.nist.gov/iad/mig/tests/spk/)
– driving force in advancing the state-of-the-art
– Conditions for different amounts of data o 10 sec.
o 3-5 minutes
o 8 minutes
o Separate channel and summed channel conditions
– English-speakers, non-English speakers, multilingual speakers
25
NIST SRE Trends
• 1996 – First SRE in current series
• 2000 – AHUMADA Spanish data, first non-English speech
• 2001 – Cellular data, Automatic Speech Recognition (ASR) transcripts provided
• 2005 – Multiple languages with bilingual speakers, room mic recordings, cross-channel trials
• 2008 – Interview data
• 2010 – High and low vocal effort, aging, HASR (Human-Assisted Speaker Recognition) Evaluation
• 2012 – Broad range of test conditions, with added noise and reverberation, target speakers defined beforehand
26
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Basic System
Feature
Extraction
MAP adaptation
Log-likelihood
UBM
Sp
ee
ch
27
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Trends
• In 2004’s: Classification
Feature
Extraction
MAP adaptation
Extract Supervectors
UBM
Sp
ee
ch
SVM Scoring
28
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Trends
• In 2005’s: Channel compensation - NAP
Feature
Extraction
MAP adaptation
Extract Supervectors
UBM
Sp
ee
ch
NAP SVM Scoring
29
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Trends
• In 2007’s: Channel compensation - JFA
Feature
Extraction
Baum-Welch statistics
estimation
Factor analysis
Speaker Factor
ExtractionWCCN SVM
UBMJFA
Hyperparameters (V, U , D)
Sp
ee
ch
30
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Trends
• In 2009’s: Channel compensation – i-vector
Feature
Extraction
Baum-Welch statistics
estimation
Factor analysis
i-vector extraction
WCCN LDACosine
Distance Scoring
UBMTotal
Variability Matrix
Sp
ee
ch
31
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Trends
• In 2009’s: Channel compensation – PLDA
Feature
Extraction
Baum-Welch statistics
estimation
Factor analysis
i-vector extraction
PLDA Log-likelihood
UBMTotal
Variability Matrix
Sp
ee
ch
32
APSIPA Asia-Pacific Signal and Information Processing Association
APSIPA Distinguished Lecture Series @ IIU, Malaysia 33