Speaker Verification The present and future of voiceprint based … Distinguished Lecture... ·...

APSIPA Asia-Pacific Signal and Information Processing Association


Speaker Verification – The present and future of voiceprint based security

Prof. Eliathamby Ambikairajah Head of School of Electrical Engineering & Telecommunications,

University of New South Wales, Australia

21 Oct 2013


APSIPA Distinguished Lecture Series @ IIU, Malaysia

Outline

• Introduction

• Speaker Verification Applications

• Speaker Verification System

• Performance measure

• NIST Speaker Recognition Evaluation (SRE)

• Discussion


Introduction

• Speech conveys several types of information

– Linguistic: message and language information

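– Paralinguistic: emotional and physiological characteristics

[Figure: one utterance, “How are you?”, feeds five recognition tasks]

• Speech Recognition (linguistic): “How are you?”
• Language Recognition (linguistic): English
• Speaker Recognition (paralinguistic): Hsing Ming
• Emotion Recognition (paralinguistic): Happy
• Accent Recognition (paralinguistic): Taiwanese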



Introduction


• Speaker Identification: determines who is speaking, given a set of enrolled speakers

• Speaker Verification: determines whether the unknown voice is from the claimed speaker

• Speaker Diarization: partitions an input audio stream into homogeneous segments according to speaker identity


[Figure: the three tasks illustrated on speech waveforms.
Speaker identification: an unknown speaker is compared against every model in the model repository (Speaker 1 Model, Speaker 2 Model, ..., Speaker M Model) and the best matching speaker is returned.
Speaker verification: a claimed identity (Speaker 2) is compared against the Speaker 2 Model only, and the claim is accepted or rejected (Reject in the example).
Speaker diarization: successive segments of the waveform are labelled Speaker 1, Speaker 2, Speaker 1.]



Speaker Verification Applications - Biometrics

• Access control: physical facilities

• Transaction authentication: telephone credit card purchases



Speaker Verification System – Basic Overview

• In automatic speaker verification,

– The front-end converts the speech signal into a more convenient representation (typically a set of feature vectors)

– The back-end compares this representation to a model of a speaker to determine how well they match

[Block diagram: Speech → Feature Extraction (front-end) → Classification against the Speaker Model → Decision Making (back-end) → Accept/Reject]

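Speaker Verification System

• UBM (Universal Background Model): a general, speaker-independent model that is compared against the person-specific model when making an accept or reject decision

[Diagram: a speaker claims “I am John”. Feature extraction turns the speech into feature vectors (c0, c1, c2, ..., cn). The system determines the level of match by computing the likelihood of John, using John's Model from the speaker models, and the likelihood of generic male, using the Generic Male model from the universal background models (Generic Male, Generic Female). Likelihood-ratio decision making then accepts or rejects the claim; in the example the output is NOT JOHN.]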

Speaker Verification System – Speaker Enrolment

Step 1: Creating a male UBM

[Diagram: background male speaker data (Speaker 1, Speaker 2, ..., Speaker N) → Feature Extraction → Model Training → Universal Background Models (Generic Male, Generic Female)]

Step 2: Creating male speaker-specific models

[Diagram: target male speaker data (Speaker x1, Speaker x2, ..., Speaker xM) → Feature Extraction → Model Adaptation (one per speaker, starting from the UBM) → Speaker Models (Speaker x1 Model, Speaker x2 Model, ..., Speaker xM Model)]



Detailed Speaker Verification System

[Block diagram: Speech → Feature Extraction → Speaker Modelling → Classification (Scoring) → Decision Making → Accept/Reject]

• Feature Normalisation
– Cepstral Mean Subtraction (CMS)
– RelAtive SpecTrAl (RASTA)
– Feature Warping
– Feature Mapping

• Model Normalisation
– Nuisance Attribute Projection (NAP)
– Joint Factor Analysis (JFA)
– i-vectors
– Within Class Covariance Normalisation (WCCN)
– Linear Discriminant Analysis (LDA)
– Probabilistic Linear Discriminant Analysis (PLDA)

• Score Normalisation
– Zero-normalisation (Z-norm)
– Test-normalisation (T-norm)

Front-end: Feature Extraction

[Figure: the speech waveform is cut into frames (Frame 1, Frame 2, Frame 3, ..., Frame N), each 25 ms long. Every frame goes through Windowing and Feature Extraction, giving one feature vector (c0, c1, c2, ..., cn) per frame: the BASIC FEATURES. Feature Normalisation then produces one normalised feature vector per frame: the NORMALISED FEATURES. Two histograms compare the c0 distribution with the normalised c0 distribution.]
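The framing and normalisation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `split_into_frames` and `cepstral_mean_subtraction` are hypothetical, the frames are placed back to back as drawn on the slide (many systems overlap them), the 8 kHz sample rate is made up, and Cepstral Mean Subtraction is used as the normalisation because it is the first method listed in the lecture. The windowing and cepstral analysis themselves are not shown.

```python
def split_into_frames(samples, sample_rate, frame_ms=25, hop_ms=25):
    """Split a list of samples into fixed-length frames.

    hop_ms=25 gives back-to-back frames as drawn on the slide; a shorter
    hop (overlapping frames) is a common alternative and an assumption here.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def cepstral_mean_subtraction(vectors):
    """Subtract the per-dimension mean over all frames from every vector."""
    n_frames = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n_frames for d in range(dim)]
    return [[v[d] - means[d] for d in range(dim)] for v in vectors]


# One second of dummy audio at an assumed 8 kHz sample rate.
frames = split_into_frames([0.0] * 8000, sample_rate=8000)

# Toy 2-dimensional "feature vectors" for three frames (made-up values).
features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalised = cepstral_mean_subtraction(features)
```

After the subtraction every feature dimension has zero mean over the utterance, which is the shift visible between the two c0 histograms.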



Temporal Derivative

[Figure: the normalised feature vectors (c0, c1, c2, ..., cn) of Frame 1, Frame 2, ..., Frame P pass through a temporal derivative to give the delta feature vectors (d0, d1, d2, ..., dn). A second temporal derivative gives the acceleration feature vectors (a0, a1, a2, ..., an).]

Frame 1 features: c0 c1 c2 ... cn, d0 d1 d2 ... dn, a0 a1 a2 ... an (e.g. 39 dimensions)
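A minimal sketch of how delta and acceleration coefficients can be appended to the static features. The lecture does not say which derivative estimate is used, so the central difference with repeated edge frames below is an assumption, as are the function names and the choice of 13 static coefficients (which gives 3 x 13 = 39 dimensions).

```python
def temporal_derivative(vectors):
    """Central difference between neighbouring frames, edge frames repeated."""
    last = len(vectors) - 1
    deltas = []
    for t in range(len(vectors)):
        prev_vec = vectors[max(t - 1, 0)]
        next_vec = vectors[min(t + 1, last)]
        deltas.append([(n - p) / 2.0 for n, p in zip(next_vec, prev_vec)])
    return deltas


def add_dynamic_features(static):
    """Append delta and acceleration coefficients to each static vector."""
    delta = temporal_derivative(static)
    accel = temporal_derivative(delta)
    return [c + d + a for c, d, a in zip(static, delta, accel)]


# Four frames of 13 static coefficients (c0..c12); the values are made up.
static = [[float(t * k) for k in range(13)] for t in range(4)]
full = add_dynamic_features(static)
```

Each entry of `full` is one frame's concatenated vector of static, delta and acceleration coefficients.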




Speaker Modelling

• Probability density function approximated by a 3-component Gaussian mixture model

• Each Gaussian mixture component consists of a mean (µ), a covariance (Σ) and a weight (w)

• All weights must sum to 1

[Figure: overall PDF drawn as the sum of Weighted Gaussian 1, Weighted Gaussian 2 and Weighted Gaussian 3]

[Figure: FEATURE SPACE with axes Dimension 1 (C0) and Dimension 2 (C1), next to MODELLING PROBABILITY DISTRIBUTION over the same two dimensions]
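The mixture idea can be checked numerically with a one-dimensional sketch. The weights, means and variances below are illustrative values, not taken from the lecture, and the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian at x."""
    norm = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return norm * math.exp(-((x - mean) ** 2) / (2.0 * variance))


def gmm_pdf(x, weights, means, variances):
    """Overall PDF: the weighted sum of the component Gaussians."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


# Hypothetical 3-component mixture; the numbers are illustrative only.
weights = [0.5, 0.3, 0.2]
means = [-2.0, 1.0, 4.0]
variances = [1.0, 2.0, 0.5]

assert abs(sum(weights) - 1.0) < 1e-12  # all weights must sum to 1

density_at_zero = gmm_pdf(0.0, weights, means, variances)
```

Because the weights sum to 1 and each component integrates to 1, the overall PDF also integrates to 1.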



Database for creating UBM (example)

• Training set

– 56 male speakers (2 minutes of active speech per speaker) for creating the UBM

• Target set

– 20 male speakers (2 minutes of active speech per speaker) for the speaker-specific models

• Test set

– 250 male utterances with known identity (each speaker has many test utterances)

[Figure: adaptation example with three panels, UBM, Target Speaker Data and Target Model, each plotted over Feature Dimension 1 and Feature Dimension 2. Example component parameters annotated on the plots: Mean = 0.9, Covariance = 0.9; Mean = 0.8, Covariance = 0.5; Weight = 0.2; Weight = 0.3. Mixtures are numbered 1, 2, ..., 998, ..., 1024 in both models.]

• The Universal Background Model (UBM) consists of 1024 Gaussian mixture components

• The target speaker model also consists of 1024 Gaussian mixture components

• Each Gaussian mixture component consists of a mean (µ), a covariance (Σ) and a weight (w)

Representing GMMs

[Diagram: a GMM made up of Mixture 1, Mixture 2, ..., Mixture 1024. Each mixture holds a 1x1 weight, 39x1 means and 39x1 covariances. Stacked over all mixtures these give a 1x1024 weight vector, a 39x1024 means matrix and a 39x1024 covariances matrix.]

GMM REPRESENTATION

• The UBM and each speaker model are GMMs

• Each of them is represented by a vector of weights, a matrix of means and a matrix of covariances
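The representation above can be sketched with plain Python lists. Reading the 39x1 covariances per mixture as the diagonal of each covariance matrix is an assumption, as are the dictionary layout, the function names and the placeholder parameter values (equal weights, zero means, unit variances).

```python
import math


def make_gmm(n_mix, dim):
    """Equal-weight GMM with zero means and unit variances (placeholder values)."""
    return {
        "weights": [1.0 / n_mix] * n_mix,              # 1 x n_mix
        "means": [[0.0] * n_mix for _ in range(dim)],  # dim x n_mix
        "covs": [[1.0] * n_mix for _ in range(dim)],   # dim x n_mix (diagonal terms)
    }


def log_likelihood(gmm, x):
    """Log-likelihood of one feature vector x under a diagonal-covariance GMM."""
    dim = len(x)
    log_terms = []
    for k, w in enumerate(gmm["weights"]):
        log_p = math.log(w)
        for d in range(dim):
            var = gmm["covs"][d][k]
            diff = x[d] - gmm["means"][d][k]
            log_p += -0.5 * (math.log(2.0 * math.pi * var) + diff * diff / var)
        log_terms.append(log_p)
    # log-sum-exp keeps the sum over mixtures numerically stable
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


ubm = make_gmm(n_mix=1024, dim=39)
frame = [0.0] * 39
frame_ll = log_likelihood(ubm, frame)
```

A speaker model built with the same layout can be scored with the same `log_likelihood` function, which is what makes the likelihood comparison between the two models straightforward.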

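Decision Making

[Diagram: after Feature Extraction, the test speech S is scored against John's Model (from the speaker models: Speaker 1 Model, John's Model) and against the Generic Male model (from the universal background models: Generic Male, Generic Female), giving the likelihood of John and the likelihood of generic male.]

Score: $L = \log \dfrac{\text{likelihood that } S \text{ came from the speaker model}}{\text{likelihood that } S \text{ did not come from the speaker model}}$

Decision: $L \gtrless \theta$, i.e. accept if $L > \theta$, reject if $L < \theta$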
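A minimal sketch of the likelihood-ratio decision. It assumes that per-frame log-likelihoods under both models are already available, and it averages the ratio over frames, a common convention that the lecture does not state. The function names, the numbers and the threshold of 0.0 are illustrative.

```python
def llr_score(speaker_frame_ll, ubm_frame_ll):
    """Average per-frame log-likelihood ratio between speaker model and UBM."""
    n = len(speaker_frame_ll)
    return sum(s - u for s, u in zip(speaker_frame_ll, ubm_frame_ll)) / n


def decide(score, threshold):
    """Accept the identity claim when the score exceeds the threshold."""
    return "accept" if score > threshold else "reject"


# Hypothetical per-frame log-likelihoods of a test utterance S.
ll_john = [-41.0, -39.5, -40.2]          # under John's model
ll_generic_male = [-40.1, -39.0, -39.8]  # under the generic male UBM

score = llr_score(ll_john, ll_generic_male)
decision = decide(score, threshold=0.0)
```

With these numbers the generic male model explains the speech better than John's model, so the score is negative and the claim is rejected, matching the NOT JOHN outcome in the lecture's example.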



Score Normalisation

[Diagram: the same speech is fed to Speaker Verification System 1, ..., Speaker Verification System N, which output Score 1, Score 2, ..., Score N. Each score then passes through its own Score Normalisation block, giving Normalised Score 1, Normalised Score 2, ..., Normalised Score N.]

• Different systems perform speaker verification in parallel

• Their raw scores may not fall in the same range, i.e. they are NOT directly comparable

• Normalised scores (L) are comparable
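One way to bring scores onto a common scale is zero-normalisation (Z-norm), which the lecture lists but does not define. The sketch below uses the usual formulation, shifting and scaling a raw score by the mean and standard deviation of impostor scores. The function name and all score values are assumptions for illustration.

```python
import statistics


def z_norm(raw_score, impostor_scores):
    """Shift and scale a raw score by impostor-score statistics."""
    mu = statistics.mean(impostor_scores)
    sigma = statistics.pstdev(impostor_scores)
    return (raw_score - mu) / sigma


# Two hypothetical systems whose raw scores live on different scales.
system_a_impostors = [-2.0, -1.0, 0.0, 1.0, 2.0]
system_b_impostors = [80.0, 90.0, 100.0, 110.0, 120.0]

norm_a = z_norm(3.0, system_a_impostors)
norm_b = z_norm(130.0, system_b_impostors)
```

The raw scores 3.0 and 130.0 look unrelated, but both sit the same number of standard deviations above their impostor means, so their normalised values agree.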



Fusion

• The final score is a weighted sum of the normalised scores from each system

[Diagram: Speech → Speaker Verification System 1, ..., N → Score Normalisation → Normalised Score 1 (s1), Normalised Score 2 (s2), ..., Normalised Score N (sN) → Fusion]

Final Score = w1s1 + w2s2 + … + wNsN
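The weighted sum is short enough to write out directly. The scores and weights below are hand-picked for illustration, and `fuse` is a hypothetical name; how the weights are chosen is not covered here.

```python
def fuse(scores, weights):
    """Final score = w1*s1 + w2*s2 + ... + wN*sN."""
    if len(scores) != len(weights):
        raise ValueError("need one weight per system")
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical normalised scores from three systems and hand-picked weights.
normalised_scores = [2.0, 1.0, -1.0]
fusion_weights = [0.5, 0.25, 0.25]

final_score = fuse(normalised_scores, fusion_weights)
```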

Performance measure

• Types of error:

– Misses: a valid identity is rejected
o Probability of miss: ratio of the number of falsely rejected speaker tests to the total number of correct speaker trials

– False alarms: an invalid identity is accepted
o Probability of false alarm: ratio of the number of falsely accepted speaker tests to the total number of impostor trials

                 ACCEPT CLAIM        REJECT CLAIM
TRUE SPEAKER     CORRECT DECISION    MISS
IMPOSTOR         FALSE ACCEPTANCE    CORRECT DECISION
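The two probabilities follow directly from their definitions. In this sketch a claim is accepted when its score is above the threshold; the function name, trial scores and threshold are made up for illustration.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (P_miss, P_false_alarm) when scores above threshold are accepted."""
    misses = sum(1 for s in target_scores if s <= threshold)
    false_alarms = sum(1 for s in impostor_scores if s > threshold)
    return misses / len(target_scores), false_alarms / len(impostor_scores)


# Hypothetical trial scores: true-speaker trials and impostor trials.
target_scores = [2.1, 1.4, 0.3, 1.9, -0.2]
impostor_scores = [-1.5, 0.4, -0.7, -2.0]

p_miss, p_fa = error_rates(target_scores, impostor_scores, threshold=0.35)
```

Here two of the five true-speaker trials are rejected and one of the four impostor trials is accepted.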

Performance measure - Detection error trade-off (DET) curve

[Figure: DET curve plotting Miss Rate (in %) against False Acceptance Rate (in %), with operating regions marked High Security, Balance and High Convenience]

• Each point on the curve corresponds to a different threshold 𝜃

• Equal Error Rate (EER) = 1 % in the example

• The application operating point depends on the relative costs of the two error types
– Wire transfer (high security): false acceptance is very costly; users may tolerate rejections for security
– Customization (high convenience): false rejections alienate customers; any customization is beneficial
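Sweeping the threshold over the observed scores gives the points of a DET curve, and the equal error rate can be approximated as the point where the two error rates are closest. This is a sketch under assumptions: the function names and scores are hypothetical, claims are accepted when the score is above the threshold, and no interpolation between points is done.

```python
def det_points(target_scores, impostor_scores):
    """One (threshold, false-acceptance rate, miss rate) triple per candidate threshold."""
    points = []
    for theta in sorted(set(target_scores) | set(impostor_scores)):
        p_miss = sum(s <= theta for s in target_scores) / len(target_scores)
        p_fa = sum(s > theta for s in impostor_scores) / len(impostor_scores)
        points.append((theta, p_fa, p_miss))
    return points


def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: the point where the two error rates are closest."""
    theta, p_fa, p_miss = min(det_points(target_scores, impostor_scores),
                              key=lambda p: abs(p[1] - p[2]))
    return theta, (p_fa + p_miss) / 2.0


# Hypothetical scores; real evaluations use many thousands of trials.
targets = [0.2, 0.9, 1.1, 1.6, 2.3]
impostors = [-1.8, -0.9, -0.1, 0.4, 1.0]

eer_threshold, eer = equal_error_rate(targets, impostors)
```

Raising the threshold above `eer_threshold` moves the operating point toward the high-security end of the curve, and lowering it moves toward high convenience.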

NIST Speaker Recognition Evaluation (SRE)

• Ongoing text-independent speaker recognition evaluations conducted by NIST (http://www.itl.nist.gov/iad/mig/tests/spk/)

– Driving force in advancing the state of the art

– Conditions for different amounts of data
o 10 sec.
o 3-5 minutes
o 8 minutes
o Separate channel and summed channel conditions

– English speakers, non-English speakers, multilingual speakers

NIST SRE Trends

• 1996 – First SRE in current series

• 2000 – AHUMADA Spanish data, first non-English speech

• 2001 – Cellular data, Automatic Speech Recognition (ASR) transcripts provided

• 2005 – Multiple languages with bilingual speakers, room mic recordings, cross-channel trials

• 2008 – Interview data

• 2010 – High and low vocal effort, aging, HASR (Human-Assisted Speaker Recognition) Evaluation

• 2012 – Broad range of test conditions, with added noise and reverberation, target speakers defined beforehand

Basic System

[Block diagram: Speech → Feature Extraction → MAP adaptation (from the UBM) → Log-likelihood]
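The lecture names MAP adaptation without giving its equations. The sketch below uses the commonly cited mean-only form, in which each UBM mean is pulled toward the target speaker's data in proportion to how many frames that mixture accounts for. It is one-dimensional for brevity, and the relevance factor of 16, the function names and all data values are assumptions.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def map_adapt_means(ubm, data, relevance=16.0):
    """Adapt only the means of a 1-D GMM towards the target speaker's data."""
    weights, means, variances = ubm
    n_mix = len(weights)
    counts = [0.0] * n_mix       # soft count of frames per mixture
    first_order = [0.0] * n_mix  # posterior-weighted sum of the data
    for x in data:
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        for k in range(n_mix):
            posterior = joint[k] / total
            counts[k] += posterior
            first_order[k] += posterior * x
    adapted = []
    for k in range(n_mix):
        alpha = counts[k] / (counts[k] + relevance)  # data-dependent mixing factor
        data_mean = first_order[k] / counts[k] if counts[k] > 0 else means[k]
        adapted.append(alpha * data_mean + (1.0 - alpha) * means[k])
    return weights, adapted, variances


# Hypothetical 2-mixture UBM and target-speaker data clustered near 1.5.
ubm = ([0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
target_data = [1.4, 1.6, 1.5, 1.3, 1.7] * 20
_, speaker_means, _ = map_adapt_means(ubm, target_data)
```

The mixture near the data moves noticeably toward 1.5, while the mixture that sees almost no data stays close to its UBM value. This is why a speaker model adapted from limited speech still covers the whole feature space.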



Trends

• 2004: Classification

[Block diagram: Speech → Feature Extraction → MAP adaptation (from the UBM) → Extract Supervectors → SVM Scoring]



Trends

• 2005: Channel compensation - NAP

[Block diagram: Speech → Feature Extraction → MAP adaptation (from the UBM) → Extract Supervectors → NAP → SVM Scoring]



Trends

• 2007: Channel compensation - JFA

[Block diagram: Speech → Feature Extraction → Baum-Welch statistics estimation (using the UBM) → Factor analysis (using the JFA hyperparameters V, U, D) → Speaker Factor Extraction → WCCN → SVM]



Trends

• 2009: Channel compensation – i-vector

[Block diagram: Speech → Feature Extraction → Baum-Welch statistics estimation (using the UBM) → Factor analysis (using the Total Variability Matrix) → i-vector extraction → WCCN → LDA → Cosine Distance Scoring]
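Cosine distance scoring, the last block of the i-vector pipeline, compares two fixed-length vectors by the angle between them. The sketch below assumes the i-vectors have already been extracted and compensated. The 4-dimensional vectors and the function name are made up; real i-vectors typically have hundreds of dimensions.

```python
import math


def cosine_score(enrol_vec, test_vec):
    """Cosine similarity between two fixed-length vectors (e.g. i-vectors)."""
    dot = sum(a * b for a, b in zip(enrol_vec, test_vec))
    norm_e = math.sqrt(sum(a * a for a in enrol_vec))
    norm_t = math.sqrt(sum(b * b for b in test_vec))
    return dot / (norm_e * norm_t)


# Hypothetical low-dimensional stand-ins for enrolment and test i-vectors.
enrolment = [0.2, -0.5, 0.1, 0.7]
same_speaker_test = [0.25, -0.4, 0.0, 0.8]
other_speaker_test = [-0.6, 0.3, 0.5, -0.1]

same_score = cosine_score(enrolment, same_speaker_test)
other_score = cosine_score(enrolment, other_speaker_test)
```

The score lies between -1 and 1 and is compared with a threshold in the same way as a log-likelihood ratio.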



Trends

• 2009: Channel compensation – PLDA

[Block diagram: Speech → Feature Extraction → Baum-Welch statistics estimation (using the UBM) → Factor analysis (using the Total Variability Matrix) → i-vector extraction → PLDA → Log-likelihood]
