SPEAKER RECOGNITION UNDER LIMITED DATA CONDITION
![Page 1: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/1.jpg)
Under the guidance of Dr. G. Pradhan, NIT Patna (ECE Dept.)
Presented by:
Kamlesh Kalvaniya (1104080)
Niranjan Kumar (1104087)
Piyush Kumar (1104091)
B.Tech 4th year (ECE Dept.)
N.I.T. PATNA ECE, DEPTT.
SPEAKER RECOGNITION UNDER LIMITED DATA CONDITION
![Page 2: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/2.jpg)
OUTLINE
1. Introduction
2. Baseline speaker verification system
3. Future Plan
![Page 3: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/3.jpg)
Introduction
Speaker recognition is the computing task of validating a person's identity claim from his/her voice.
Applications:
• Authentication
• Forensic tests
• Security systems
• ATM security key
• Personalized user interfaces
• Multi-speaker tracking
• Surveillance
![Page 4: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/4.jpg)
04/15/2023 N.I.T. PATNA ECE, DEPTT. 4
Identification v/s verification
![Page 5: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/5.jpg)
Phases of Speaker Verification
• Enrollment session (training phase)
• Operating session (testing phase)
![Page 6: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/6.jpg)
Training & Testing Phase
[Block diagram]
Training: Speech → Pre-processing → Feature extraction → Model building → Reference model
Testing: Speech and identity claim → Pre-processing → Feature extraction → Comparison with the reference model → Decision logic → Accept/reject
![Page 7: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/7.jpg)
Preprocessing
Preprocessing is an important step in a speaker verification system. This step is also called voice activity detection (VAD).
• VAD separates speech regions from non-speech regions [2-3].
• It is very difficult to implement a VAD algorithm that works consistently across different types of data.
• VAD algorithms can be classified into two groups:
  • Feature-based approaches
  • Statistical model-based approaches
• Each VAD method has its own merits and demerits in terms of accuracy, complexity, etc.
• Because of its simplicity, most speaker verification systems use signal energy for VAD.
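The energy-based VAD mentioned above can be sketched in a few lines. This is only an illustration: the function name `energy_vad`, the non-overlapping toy frames, and the toy signal are assumptions. The threshold of 0.6 times the average frame energy is the value listed in the baseline settings later in these slides.

```python
def energy_vad(signal, frame_len, hop, ratio=0.6):
    """Return one boolean per frame: True if the frame is kept as speech."""
    # Short-time energy of each frame (mean of the squared samples).
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    if not energies:
        return []
    # The threshold is a fraction of the average frame energy.
    threshold = ratio * sum(energies) / len(energies)
    return [e > threshold for e in energies]


# Toy signal (hypothetical): 4 quiet samples, 4 loud samples, 4 quiet samples.
toy = [0.01, -0.01, 0.01, -0.01, 1.0, -1.0, 1.0, -1.0, 0.01, -0.01, 0.01, -0.01]
flags = energy_vad(toy, frame_len=4, hop=4)
```

Only the frames flagged `True` would be passed on to feature extraction.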
![Page 8: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/8.jpg)
Feature Extraction
Along with speaker information, the speech signal carries much redundant information, such as the recording sensor, channel, and environment.
Speaker-specific information in the speech signal [2] comes from:
• The unique speech production system
• Physiological aspects
• Behavioral aspects
The feature extraction module transforms speech into a set of feature vectors of reduced dimension in order to:
• Enhance speaker-specific information
• Suppress redundant information
![Page 9: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/9.jpg)
Selection of Features
A good feature should:
• Be robust against noise and distortion
• Occur frequently and naturally in speech
• Be easy to measure from the speech signal
• Be difficult to impersonate/mimic
• Not be affected by the speaker's health or long-term variations in voice
![Page 10: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/10.jpg)
Types Of Features
![Page 11: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/11.jpg)
Feature Extraction Techniques
A wide range of approaches may be used to represent the speech signal parametrically for speaker recognition:
• Linear Predictive Coding
• Linear Predictive Cepstral Coefficients
• Mel Frequency Cepstral Coefficients
• Perceptual Linear Prediction
• Neural Predictive Coding
Most state-of-the-art speaker verification systems use Mel-frequency cepstral coefficients (MFCC), appended with their first- and second-order derivatives, as the feature vectors:
• Easy to extract
• Provide the best performance compared to other features
• MFCCs mostly contain information about the resonance structure of the vocal tract system
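Appending first- and second-order derivatives to the MFCCs can be illustrated as follows. This is a minimal sketch: the two-point difference with repeated edge frames is an assumed simplification (practical systems often use a regression over several frames), and `delta`, `append_deltas`, and the toy `mfcc` values are hypothetical.

```python
def delta(frames):
    """First-order difference (c[t+1] - c[t-1]) / 2, with edge frames repeated."""
    n_frames = len(frames)
    out = []
    for t in range(n_frames):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n_frames - 1)]
        out.append([(n - p) / 2.0 for n, p in zip(nxt, prev)])
    return out


def append_deltas(frames):
    """Concatenate static, delta, and delta-delta coefficients per frame."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [c + a + b for c, a, b in zip(frames, d1, d2)]


# 5 frames of 13 hypothetical coefficients, each frame filled with its index.
mfcc = [[float(t)] * 13 for t in range(5)]
features = append_deltas(mfcc)
```

With 13 static coefficients, each resulting feature vector has 39 dimensions.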
![Page 12: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/12.jpg)
MEL FREQUENCY CEPSTRAL COEFFICIENTS
1. Analog-to-digital conversion
2. Pre-emphasis
3. Framing & windowing
4. Fast Fourier Transform
5. Mel-scale warping
6. MFCC
![Page 13: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/13.jpg)
MFCC
Step 1: Analog-to-digital conversion. The analog speech signal is converted to digital form by sampling it at a given frequency.
![Page 14: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/14.jpg)
MFCC
Step 2: Pre-emphasis. The energy in the high-frequency components (important for speech) is boosted.
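Pre-emphasis is commonly implemented as a first-order high-pass filter, y[n] = x[n] - a·x[n-1]. The sketch below assumes this form and a coefficient of 0.97; neither is stated on the slide, and `pre_emphasis` is a hypothetical name.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is passed through."""
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


# A constant (low-frequency) signal is attenuated,
# while a sample-to-sample alternating (high-frequency) signal is boosted.
constant = pre_emphasis([1.0, 1.0, 1.0, 1.0])
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```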
![Page 15: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/15.jpg)
MFCC
Step 3: Framing. The signal is divided into frames of a given size.
![Page 16: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/16.jpg)
MFCC FRAMING
![Page 17: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/17.jpg)
MFCC FRAMING
![Page 18: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/18.jpg)
MFCC FRAMING
![Page 19: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/19.jpg)
MFCC FRAMING
[Figure: overlapping frames, labelled 25 ms (frame length) and 10 ms (frame shift)]
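Framing with the 25 ms and 10 ms values shown on the slide can be sketched as below, reading them as frame length and frame shift. The 8 kHz sampling rate and the name `frame_signal` are assumptions, and trailing samples that do not fill a whole frame are simply dropped here.

```python
def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping frames of frame_ms, advancing by shift_ms."""
    frame_len = int(round(sample_rate * frame_ms / 1000))
    hop = int(round(sample_rate * shift_ms / 1000))
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len])
    return frames


# One second of (silent) audio at an assumed 8 kHz sampling rate.
one_second = [0.0] * 8000
frames = frame_signal(one_second, sample_rate=8000)
```

At 8 kHz this gives 200-sample frames with an 80-sample shift, so consecutive frames overlap by 15 ms.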
![Page 20: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/20.jpg)
MFCC WINDOWING
• The next step is to window each individual frame to minimize the signal discontinuities at the beginning and end of the frame.
• The idea is to minimize spectral distortion by using the window to taper the signal toward zero at the beginning and end of each frame.
• We have used the Hamming window.
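The Hamming window is defined as w[n] = 0.54 - 0.46·cos(2πn / (N - 1)) for n = 0, …, N - 1. A minimal sketch of building it and applying it to one frame follows; the function names are hypothetical.

```python
import math


def hamming(n_points):
    """Hamming window of n_points samples."""
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def apply_window(frame):
    """Multiply a frame sample-by-sample with a Hamming window of the same length."""
    window = hamming(len(frame))
    return [s * w for s, w in zip(frame, window)]


window = hamming(5)
tapered = apply_window([1.0, 1.0, 1.0, 1.0, 1.0])
```

Note that the Hamming window tapers the frame edges to 0.08 rather than exactly to zero.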
![Page 21: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/21.jpg)
MFCC
![Page 22: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/22.jpg)
MFCC
![Page 23: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/23.jpg)
MEL FILTERBANK
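A mel filterbank places its filters at equal spacing on the mel scale. The sketch below only computes the filter centre frequencies, using the widely used mapping mel = 2595·log10(1 + f/700). The choice of 20 filters, the 0 to 4000 Hz range, and the function names are assumptions that are not taken from the slides.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Convert a mel value back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, low_hz, high_hz):
    """Centre frequencies (Hz) of n_filters spaced uniformly on the mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (n_filters + 1)
    return [mel_to_hz(low + step * (k + 1)) for k in range(n_filters)]


centers = mel_center_frequencies(20, 0.0, 4000.0)
```

The centres are dense at low frequencies and spread out at high frequencies, which is the warping effect of the mel scale.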
![Page 24: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/24.jpg)
MFCC
DCT
![Page 25: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/25.jpg)
MFCC
DCT
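The DCT step turns the log mel filterbank energies into cepstral coefficients. A minimal sketch using an unnormalised DCT-II is given below; the 20 filterbank outputs and the name `dct_ii` are assumptions, while keeping 13 coefficients matches the baseline feature dimension listed later in these slides.

```python
import math


def dct_ii(values, n_coeffs):
    """Unnormalised DCT-II: c[k] = sum_m values[m] * cos(pi * k * (m + 0.5) / M)."""
    n_values = len(values)
    return [
        sum(v * math.cos(math.pi * k * (m + 0.5) / n_values)
            for m, v in enumerate(values))
        for k in range(n_coeffs)
    ]


# Hypothetical flat log filterbank energies from 20 mel filters.
log_energies = [1.0] * 20
cepstrum = dct_ii(log_energies, 13)
```

For a flat input, all the energy ends up in the zeroth coefficient and the remaining coefficients vanish, which shows how the DCT decorrelates and compacts the filterbank outputs.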
![Page 26: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/26.jpg)
Speaker Modelling
• Vector Quantization
• Gaussian Mixture Model
• Gaussian Mixture Model-UBM
• Hidden Markov Model
• Artificial Neural Networks
• Support Vector Machines
• I-Vector
A Gaussian mixture model assumes that the feature vectors follow a mixture of Gaussian distributions, characterized by mean vectors, covariance matrices, and weights.
Test data that was not seen during training yields a low score.
A speaker model captures the statistical information present in the feature vectors; it enhances the speaker information and suppresses the redundant information.
![Page 27: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/27.jpg)
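Gaussian Mixture Model
A Gaussian mixture density is defined as
$$p(x \mid \lambda) = \sum_{i=1}^{K} w_i \, b_i(x), \qquad \sum_{i=1}^{K} w_i = 1$$
A unimodal Gaussian function for a D-dimensional feature vector x is defined as
$$b_i(x) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \right)$$
where
$\lambda = \{ w_i, \mu_i, \Sigma_i \}$, i = 1, …, K
$w_i$ = weight
$\mu_i$ = mean
$\Sigma_i$ = covariance
K = number of Gaussian components (8, 16, 32, 64)
Number of speaker models: M = 356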
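Evaluating a Gaussian mixture density for one feature vector can be sketched as follows. Diagonal covariance matrices are assumed here to keep the example short (the slides do not say which covariance type was used), the computation is done in the log domain for numerical stability, and all names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log of a D-dimensional Gaussian density with a diagonal covariance."""
    dim = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (dim * math.log(2 * math.pi) + log_det + maha)


def gmm_log_density(x, weights, means, variances):
    """Log of the weighted sum of component densities (log-sum-exp form)."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))


# One-dimensional check: a single standard normal component evaluated at 0.
single = gmm_log_density([0.0], [1.0], [[0.0]], [[1.0]])
# Two identical components with equal weights give the same density.
double = gmm_log_density([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
```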
![Page 28: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/28.jpg)
MAXIMUM LIKELIHOOD PARAMETER ESTIMATION
For a sequence of T training vectors X = {x1, x2, …, xT}, the GMM likelihood is defined as
$$P(X \mid \lambda) = \prod_{t=1}^{T} p(x_t \mid \lambda)$$
The expectation-maximization (EM) algorithm is used to estimate the speaker-specific GMM.
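The EM estimation can be illustrated on a toy problem. The sketch below fits a two-component GMM to one-dimensional data; the data, the initial parameters, the variance floor, and the function names are all assumptions chosen for illustration, and a real system would work on multi-dimensional MFCC vectors.

```python
import math


def normal_pdf(x, mean, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Sum over all samples of the log mixture density."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, var_floor=1e-3):
    """One EM iteration: responsibilities (E-step), then re-estimation (M-step)."""
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    new_w, new_m, new_v = [], [], []
    for i in range(len(weights)):
        n_i = sum(r[i] for r in resp)
        mean_i = sum(r[i] * x for r, x in zip(resp, data)) / n_i
        var_i = sum(r[i] * (x - mean_i) ** 2 for r, x in zip(resp, data)) / n_i
        new_w.append(n_i / len(data))
        new_m.append(mean_i)
        new_v.append(max(var_i, var_floor))  # floor keeps variances from collapsing
    return new_w, new_m, new_v


# Toy data with two clusters near -2 and +2 (hypothetical values).
data = [-2.1, -1.9, -2.0, 2.0, 2.1, 1.9]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
before = log_likelihood(data, *params)
for _ in range(10):
    params = em_step(data, *params)
after = log_likelihood(data, *params)
```

Each iteration should not decrease the likelihood of the training data, which is why `after` ends up larger than `before` on this toy set.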
![Page 29: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/29.jpg)
![Page 30: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/30.jpg)
GMM UBM
$\lambda_{target}$: X (the MFCCs of the testing data) is from the hypothesized speaker S
$\lambda_{UBM}$: X (the MFCCs of the testing data) is not from the hypothesized speaker S
The likelihood ratio test is given by
$$LR(X) = \frac{P(X \mid \lambda_{target})}{P(X \mid \lambda_{UBM})}$$
The probability of the alternative hypothesis is
$$P(X \mid \lambda_{UBM}) = F\big( P(X \mid \lambda_1), P(X \mid \lambda_2), \ldots, P(X \mid \lambda_M) \big)$$
where F( ) is a function, such as the average or maximum, of the likelihood values $P(X \mid \lambda_i)$ of the background speaker set.
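The likelihood ratio test can be sketched in the log domain, where the ratio becomes a difference. The function names, the example log-likelihood values, and the decision threshold of 0 are assumptions for illustration; the two choices of F (average or maximum over the background speaker set) follow the slide.

```python
import math


def log_likelihood_ratio(ll_target, ll_background, combine="average"):
    """log LR(X) = log P(X|target) - log F(P(X|lambda_1), ..., P(X|lambda_M))."""
    if combine == "maximum":
        ll_alt = max(ll_background)
    else:
        # Log of the arithmetic mean of the likelihoods, computed stably.
        top = max(ll_background)
        mean_ratio = sum(math.exp(l - top) for l in ll_background) / len(ll_background)
        ll_alt = top + math.log(mean_ratio)
    return ll_target - ll_alt


def decide(score, threshold=0.0):
    """Accept the identity claim if the score exceeds the threshold."""
    return "accept" if score > threshold else "reject"


# Hypothetical log-likelihoods of a test utterance.
score_avg = log_likelihood_ratio(-100.0, [-110.0, -110.0])
score_max = log_likelihood_ratio(-100.0, [-120.0, -105.0], combine="maximum")
```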
![Page 31: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/31.jpg)
Score Normalisation
$$\tilde{s} = \frac{s - \mu_I}{\sigma_I}$$
where
s = original score = log(LR(X))
$\mu_I$ = estimated mean of s
$\sigma_I$ = standard deviation of s
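Score normalisation of this kind subtracts an estimated mean and divides by a standard deviation. In the sketch below the statistics are estimated from a set of impostor scores, which is one reading of the subscript I and is an assumption; the function name and the example scores are hypothetical.

```python
import math


def normalise_score(score, impostor_scores):
    """Return (score - mean) / std, with mean and std taken from impostor scores."""
    n = len(impostor_scores)
    mean = sum(impostor_scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in impostor_scores) / n)
    return (score - mean) / std


# Hypothetical raw log-likelihood-ratio score and impostor score set.
normalised = normalise_score(3.0, [-1.0, 0.0, 1.0])
```

After normalisation, scores from different speaker models are on a comparable scale, so a single decision threshold can be shared.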
![Page 32: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/32.jpg)
PERFORMANCE EVALUATION
• NIST has conducted speaker recognition benchmarking evaluations regularly since 1996.
• NIST provides speech files as development data.
NIST 2003 data:
• Testing speech data: 2559
• Training speech data: 356
• UBM female speech data: 251
• UBM male speech data: 251
![Page 33: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/33.jpg)
Development of BASELINE SPEAKER VERIFICATION SYSTEM
The following parameters are used for the baseline speaker verification system:
• VAD: energy-based VAD (0.6 × average energy)
• Feature vector: 13-dimensional MFCC appended with delta and delta-delta
• Modeling: GMM
• GMM size: 8, 16, 32, 64
• Comparison: log-likelihood score
![Page 34: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/34.jpg)
Flowchart of Baseline Speaker Recognition System
![Page 35: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/35.jpg)
DET plot for test 15 sec and train 15 sec
![Page 36: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/36.jpg)
DET plot for test full and train 15 sec
![Page 37: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/37.jpg)
DET plot for test 15 sec and train full
![Page 38: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/38.jpg)
DET plot for test full and train full
![Page 39: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/39.jpg)
Comparison of Equal Error Rate (%) across training and test data conditions

| Gaussian size | Test 15 sec, Train 15 sec | Test full, Train 15 sec | Test 15 sec, Train full | Test full, Train full |
|---|---|---|---|---|
| 8 | 34.90 | 34.24 | 33.18 | 27.70 |
| 16 | 33.05 | 32.28 | 30.50 | 25.67 |
| 32 | 32.46 | 32.94 | 28.78 | 23.67 |
| 64 | 32.82 | 33.06 | 27.42 | 22.05 |
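The equal error rate is the operating point where the false acceptance rate and the false rejection rate are equal. A minimal sketch of estimating it from two score lists is given below; the threshold sweep over observed scores, the function name, and the toy scores are assumptions and do not reproduce the evaluation tooling used for the results above.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER (%) by sweeping the threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for th in thresholds:
        # False rejection: genuine trials scoring below the threshold.
        frr = sum(1 for s in target_scores if s < th) / len(target_scores)
        # False acceptance: impostor trials scoring at or above the threshold.
        far = sum(1 for s in impostor_scores if s >= th) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return 100.0 * eer


# Hypothetical trial scores: one target scores low, one impostor scores high.
targets = [2.0, 1.5, 1.0, -0.5]
impostors = [0.5, -1.0, -1.5, -2.0]
eer = equal_error_rate(targets, impostors)
```

On this toy set, one of four targets is rejected and one of four impostors is accepted at the balancing threshold, so the estimate is 25%.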
![Page 40: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/40.jpg)
Conclusion
Performance is more sensitive to the amount of training data than to the amount of test data.
![Page 41: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/41.jpg)
Future Plan
Synthetically generating training and testing speech from limited speech data.
Validating the results on state-of-the-art i-vector based speaker verification system.
![Page 42: SPEKER RECOGNITION UNDER LIMITED DATA CODITION](https://reader035.fdocuments.in/reader035/viewer/2022062515/55c5c865bb61eb4b748b4645/html5/thumbnails/42.jpg)
Thank you