Development of voice password based speaker verification system

Voice Password Based Speaker Verification Using Vowel Region

Under the guidance of Dr. G. Pradhan, NIT Patna (ECE Dept.)

Presented by: Piyush Kumar (1104091), Kamlesh Kalvaniya (1104080), Niranjan Kumar (1104087)

Content

• Introduction
• Motivation for present work
• Issues in speaker verification
• Development of baseline
• Proposed speaker verification system
• Summary
• Conclusion

Introduction

• Speaker verification is the task of validating a person's identity claim from his/her voice.

• Voice password based speaker verification system
– Speaker is free to choose his/her password
– Password remains the same for training and verification

Motivation

• Development of a low complexity speaker verification system with reasonable performance using a few seconds of speech data
– For mobile based applications
– Low security person authentication

Issues in limited Speech Speaker Verification

• Information in human speech
– Message, language, speaker, emotion/health
– Recording environment, channel, sensor, etc.
• Speaker-specific information extracted from speech data varies depending on other factors
• Challenge
– Enhance the speaker-specific information
– Normalize other variabilities in speech data

Baseline System

• Gaussian Mixture Model (text independent)
– Database: NIST-2003
– VAD: energy based VAD (0.6 × average energy)
– Feature vector: 13-dimensional MFCC appended with delta and delta-delta
– Modeling: GMM
– GMM size: 8, 16, 32, 64
– Comparison: log-likelihood score
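The energy based VAD step can be illustrated with a minimal sketch. It assumes that "0.6 × average energy" is a threshold on short-time frame energy. The frame length (160 samples), hop (80 samples) and the function name `energy_vad` are illustrative choices, not values taken from the slides.

```python
def energy_vad(samples, frame_len=160, hop=80, factor=0.6):
    """Return (energies, keep); keep[i] is True for frames judged to be speech.

    Assumption: a frame is kept when its mean squared amplitude exceeds
    factor * (average frame energy of the whole utterance).
    """
    # Split the signal into overlapping frames.
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop)]
    # Short-time energy of each frame.
    energies = [sum(x * x for x in f) / frame_len for f in frames]
    if not energies:
        return [], []
    threshold = factor * (sum(energies) / len(energies))
    keep = [e > threshold for e in energies]
    return energies, keep


# Toy signal: silence followed by a constant-amplitude segment.
toy = [0.0] * 320 + [1.0] * 320
toy_energies, toy_keep = energy_vad(toy)
```

Only the frames marked `True` would be passed on to MFCC extraction in such a setup.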

Flowchart for GMM based SV system

04/15/2023 N.I.T. PATNA ECE, DEPTT. 8

GMM based SV system EER

Equal error rate (%) by Gaussian size and test/train duration

Gaussian size   Test 15 s,    Test 15 s,    Test full,    Test full,
                train 15 s    train full    train 15 s    train full
8               34.90         33.18         34.24         32.70
16              33.05         30.50         32.28         29.67
32              32.46         28.78         32.92         27.77
64              32.82         27.42         33.06         26.05
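The equal error rate is the operating point at which the false acceptance rate and the false rejection rate are equal. A minimal sketch of how it might be estimated from genuine and impostor scores follows. The function name `equal_error_rate` and the threshold sweep over observed scores are assumptions for illustration, not the scoring procedure used for these results.

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER (%) from scores where a higher score means a better match."""
    best_gap, eer = None, None
    # Sweep every observed score as a candidate decision threshold.
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(s < t for s in genuine) / len(genuine)      # genuine trials rejected
        far = sum(s >= t for s in impostor) / len(impostor)   # impostor trials accepted
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, 100.0 * (far + frr) / 2.0
    return eer


example_eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

For distance-type scores, where lower means a better match, the scores can be negated before calling the function.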

Conclusion

• Performance is sensitive to the duration of testing and training data.

• Performance is more sensitive to the duration of training data than to that of testing data.

• GMM based SV system may not be suitable for limited data.

Baseline system for Voice password based system

• Data collection
– Data from 100 speakers was collected.
– Each speaker uttered his/her full name or roll number as the voice password, which was recorded over the phone.
– Number of male speakers: 81; number of female speakers: 19
– Duration of data: 2-5 s
– Number of training sessions: 3; number of testing sessions: 5, with a minimum gap of one day between sessions
– During the verification task, each speaker was compared with his/her own reference and those of 19 other impostor speakers.

Dynamic Time Warping

• DTW is a template matching technique
• Test features and template (model) are sequences of feature vectors
• Aim is to find the distortion between test features and template
• They may have different lengths
• DTW uses dynamic programming to find the optimal path for normalizing the length variation.
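The dynamic programming idea described above can be sketched as follows. The Euclidean local distance, the three-way step pattern and the normalization by the summed sequence lengths are common choices assumed here for illustration. The slides do not specify which variants were used.

```python
import math


def euclidean(a, b):
    """Local distance between two feature vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(test, template):
    """Length-normalized DTW distortion between two sequences of feature vectors."""
    n, m = len(test), len(template)
    inf = float("inf")
    # cost[i][j] = minimum accumulated distance aligning test[:i] with template[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(test[i - 1], template[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Assumed normalization so that longer utterances are not penalized.
    return cost[n][m] / (n + m)


a = [[0.0], [1.0], [2.0]]
b = [[0.0], [1.0], [1.0], [2.0]]   # same pattern, one frame stretched
c = [[5.0], [5.0], [5.0]]          # different pattern
```

Here `dtw_distance(a, b)` is zero because the warping path absorbs the repeated frame, while `dtw_distance(a, c)` stays large.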


Experimental results for DTW based system for Voice password database

EER (%) by training session (columns) and test session (rows); 13 and 39 are the feature dimensions

              Train session 1             Train session 2             Train session 3
              Start to end   VAD          Start to end   VAD          Start to end   VAD
Test session  13     39      13    39     13     39      13    39     13     39      13    39
1             25     28      14    14.6   25     27.9    14.7  15     25.2   26.3    14.7  15.7
2             31     34      17    18.9   30     33.6    18    19.3   29.4   32.6    20    21.05
3             28     29      18    19     29     31      18.7  20     31.5   32.6    20    21.94
4             31     32      15    16     32     32.3    16.1  17.5   30.5   32.6    16.8  18.94
5             31     33      17    18     32     34      18.2  20.7   34.7   35.7    18.9  21.05


Experimental results for GMM based system for Voice password database

EER (%) by training session (columns) and test session (rows)

Test session   Train session 1   Train session 2   Train session 3
Session 1      17.9              19.7              21.2
Session 2      18.34             18.1              20.3
Session 3      18.69             19.6              18.7
Session 4      19.8              20.1              18.9
Session 5      20.6              20.6              20


DTW using only mean vector of GMM

EER (%) by training session (columns) and test session (rows)

Test session   Train session 1   Train session 2   Train session 3
1              15.9              19.7              21.2
2              16.26             18.1              20.3
3              18.69             19.6              18.7
4              19.8              20.1              18.9
5              20.6              20.6              20


Verification result comparison and discussion

• DTW based system best EER: 14%
• GMM based system best EER: 17.9%
• DTW using mean vector of GMM best EER: 15.9%
• Best result was obtained for DTW.
• Performance of DTW based system depends on detection of end points.
• Performance of DTW based system may be improved by robust end point detection and by emphasizing speaker-specific regions.
• Hence the motivation for the present work.

Vowel Regions In Speech Signal

• VOP and VEP are two important events in the speech signal
– VOP: instant at which the onset of a vowel takes place in the speech signal
– VEP: instant at which the offset of a vowel takes place in the speech signal


VOP (circle) and VEP (arrow head) events for an utterance /the sea/


• Vowel regions are prominent regions in the speech signal:
– High amplitude
– Near-periodic excitation
– Long duration
– Lower zero-crossing rate

• Due to their high amplitude, the SNR of vowel regions is high.


Empirical Mode Decomposition

• Empirical Mode Decomposition (EMD)
• Data-driven, multi-scale, robust to non-stationary signals
• A signal is viewed as fast oscillations superimposed on slow oscillations
• The local mean of each decomposed signal is zero and the signal is symmetric about its local mean.
• Impact of noise on the signal can be reduced

• A decomposed signal is defined as an Intrinsic Mode Function (IMF) if it satisfies the following conditions
– The number of extrema and the number of zero crossings differ at most by one
– The local average is zero, i.e. the mean of the upper envelope and the lower envelope is zero.

EMD Algorithm
• For a given input signal X to decompose:
– Identify the local extrema of the signal X.
– Construct the upper envelope Emax and the lower envelope Emin by interpolating the maxima and minima, respectively.
– Approximate the local average by the envelope mean Em, the average of the two envelopes Emax and Emin.
– Compute the candidate intrinsic mode h1 = X - Em.
– If h1 is an IMF, take imf = h1 and the residue signal r = X - imf. Otherwise, repeat the above steps with h1 as the input.
• If r has an intrinsic oscillation mode, set r as the input signal and repeat the steps.
• A signal S(n) can be represented through its IMFs as follows:

S(n) = \sum_{i=1}^{K} \mathrm{imf}_i(n) + r(n)

where K is the number of extracted IMFs and r(n) is the residue.
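The sifting steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used in this work. The envelopes are piecewise linear rather than the cubic splines commonly used, and sifting stops after a fixed number of passes instead of testing the IMF conditions explicitly. All function names are hypothetical.

```python
import math


def local_extrema(x):
    """Indices of interior local maxima and minima of the sequence x."""
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i - 1] < x[i] >= x[i + 1]:
            maxima.append(i)
        elif x[i - 1] > x[i] <= x[i + 1]:
            minima.append(i)
    return maxima, minima


def envelope(x, idx):
    """Piecewise-linear envelope through x at idx, held constant at both edges."""
    env = [0.0] * len(x)
    for i in range(len(x)):
        if i <= idx[0]:
            env[i] = x[idx[0]]
        elif i >= idx[-1]:
            env[i] = x[idx[-1]]
    for a, b in zip(idx, idx[1:]):
        for i in range(a, b + 1):
            w = (i - a) / (b - a)
            env[i] = (1 - w) * x[a] + w * x[b]
    return env


def sift(x, max_sift=10):
    """Extract one candidate IMF by repeatedly subtracting the envelope mean."""
    h = list(x)
    for _ in range(max_sift):  # assumed fixed-count stopping rule
        maxima, minima = local_extrema(h)
        if len(maxima) < 2 or len(minima) < 2:
            break
        e_max, e_min = envelope(h, maxima), envelope(h, minima)
        h = [v - (u + l) / 2.0 for v, u, l in zip(h, e_max, e_min)]
    return h


def emd(x, max_imfs=5):
    """Return (imfs, residue) such that sum(imfs) + residue reconstructs x."""
    imfs, r = [], list(x)
    for _ in range(max_imfs):
        maxima, minima = local_extrema(r)
        if len(maxima) < 2 or len(minima) < 2:
            break  # residue has no oscillation left to extract
        imf = sift(r)
        imfs.append(imf)
        r = [a - b for a, b in zip(r, imf)]
    return imfs, r


# Demo: a slow tone plus a faster, weaker tone.
signal = [math.sin(2 * math.pi * 0.02 * n) + 0.5 * math.sin(2 * math.pi * 0.2 * n)
          for n in range(200)]
imfs, residue = emd(signal)
```

Because each residue is formed by subtraction, adding the extracted IMFs back to the final residue recovers the input signal, which is the decomposition property stated in the formula for S(n).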


MOTIVATION FOR USE OF EMD

• Environmental effects on the speech data can be de-emphasized

• Excitation information present in different frequency ranges can be analyzed separately.

• To emphasize the weak transitions in the case of nasal-vowel, semivowel-vowel and diphthongs.


Flowchart for VOP detection


VOP EVIDENCE PLOT


Experiment

Speech data
• Complete TIMIT database
• Number of male speakers: 438
• Number of female speakers: 192
• Sampling frequency: 8 kHz
• VOP experiment was performed on 100 speakers.


Performance measure

• Identification rate (IR): Percentage of reference VOPs (VEPs) that are matched by detected VOPs (VEPs) within vowel regions

• Spurious rate (SR): Percentage of detected VOPs (VEPs) that are detected outside vowel regions
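A minimal sketch of how these two measures might be computed is given below. It assumes that a reference VOP counts as matched when some detected VOP lies within a tolerance window of it, and that a detection is spurious when it falls outside every vowel region. The function name `vop_rates`, the millisecond units and the toy values are illustrative only.

```python
def vop_rates(reference, detected, vowel_regions, tol=20):
    """Return (identification rate %, spurious rate %); all times in ms.

    reference     : reference VOP instants
    detected      : detected VOP instants
    vowel_regions : list of (start, end) intervals of vowel regions
    tol           : assumed matching tolerance around each reference VOP
    """
    # A reference VOP is identified if any detection lies within +/- tol of it.
    matched = sum(any(abs(d - r) <= tol for d in detected) for r in reference)
    # A detection is spurious if it falls outside every vowel region.
    spurious = sum(not any(s <= d <= e for s, e in vowel_regions) for d in detected)
    ir = 100.0 * matched / len(reference)
    sr = 100.0 * spurious / len(detected)
    return ir, sr


# Toy example with four vowels; the third onset is missed and one detection is spurious.
ref = [100, 500, 900, 1300]
regions = [(100, 250), (500, 650), (900, 1050), (1300, 1450)]
det = [110, 510, 1100, 1310]
ir, sr = vop_rates(ref, det, regions, tol=20)
```

Repeating the call with `tol` set to 10, 20, 30 and 40 would give one rate per tolerance window.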


Performance of proposed VOP detection method

Method     Detection rate (%) within            Spurious rate (%)
           10 ms   20 ms   30 ms   40 ms
Baseline   47      74      78      88           15
Proposed   62      83      90      96           13

Observation:
• Performance of the proposed method is better than the baseline in terms of both detection rate and spurious rate.
• 83% detection is achieved within a 20 ms window, which is beneficial when comparing strings of vowel regions.

SV System by applying DTW on Vowel regions only

SV System by applying DTW on mean of vowel regions only

Score Normalization

DET Plot for DTW & Normalized DTW

DET Plot for DTW on vowel regions only

DET Plot of DTW on mean of vowel regions

Conclusion

• The proposed VOP Detection algorithm performed better than the best method present in the literature.

• The performance of the proposed algorithm for the voice password SV system is better than that of any of the baseline systems.

• The complexity of the proposed algorithm for the voice password SV system is lower than that of any baseline system, which makes it useful for online SV tasks.
