Digital Voice Analysis
Prepared by
GAURAV MISHRA
BHUMIKA DWIVEDI
AKASH RAJAN RAI
KARTIC KUMAR
Table of Contents
INTRODUCTION
SPEAKER VERIFICATION
FEATURE EXTRACTION
FUTURE WORK
CONCLUSION
DIGITAL VOICE ANALYSIS - INTRODUCTION
Voice analysis is the study of speech sounds for purposes other than extracting linguistic content, as is done in speech recognition. Applications include medical analysis of the voice (phoniatrics), but also speaker identification.
Speaker recognition, the process of identifying a person from a spoken phrase, provides a secure method of authenticating speakers. Applications include voice dialing, banking over a telephone network, security control for confidential information, etc.
Challenges: a voice can be imitated to a certain degree; discriminating features must be captured; and a speaker's emotional and physical state affects voice quality.
SPEECH PROCESSING TAXONOMY
Recognition
- Speech Recognition
- Speaker Recognition
  - Speaker Identification: text-dependent (closed-set) or text-independent (closed-set)
  - Speaker Verification: text-dependent (closed-set) or text-independent (open-set)
- Language Recognition
• Determines whether a person is who they claim to be
• The user makes an identity claim: a one-to-one mapping
• The unknown voice could come from a large set of unknown speakers; this is referred to as open-set verification
SPEAKER IDENTIFICATION
[Figure: an unknown voice sample is matched against the set of enrolled speakers. "Is this Kartic's voice?"]
GENERAL THEORY OF SPEAKER VERIFICATION SYSTEM
[Figure: block diagram of a speaker verification system. The input speech ("My Name is Mishra") passes through feature extraction; the features are scored against the claimed speaker's model (Mishra's "voiceprint") and against an impostor model built from impostor "voiceprints". Given the identity claim, the decision block outputs ACCEPT or REJECT.]
There are two distinct phases to any speaker verification system:
Enrollment Phase: enrollment speech from each speaker (e.g. Akash, Bhumika) passes through feature extraction and model training, producing a voiceprint (model) for each speaker.
Verification Phase: the input speech and a claimed identity (e.g. Bhumika) pass through feature extraction and a verification decision, which accepts or rejects the claim.
TRAINING PHASE
The first phase of a speaker identification system (SIS) is the enrollment session, also known as the training phase.
During the training phase, the SIS generates a speaker model based on each speaker's characteristics.
[Figure: front-end processing produces feature vectors; speaker modeling turns these into speaker models (Speaker 1, Speaker 2, Speaker 3) stored in the speaker database.]
COMPONENTS OF SPEAKER IDENTIFICATION SYSTEM
There are three main components of an SI system:
Front-end Processing
Speaker Modeling
Pattern Matching and Classification
FRONT-END PROCESSING
Front-end processing consists of preprocessing followed by feature extraction.
Preprocessing generally consists of three sub-processes: removal of noise/silence from the speech, frame blocking, and windowing.
Feature extraction is needed because of "the curse of dimensionality": the number of training/test vectors needed for a classification problem grows exponentially with the dimension of the input vector. It transforms the speech signal into a compact, effective representation that is more stable and discriminative than the original signal.
PRE-PROCESSING
The speech signal is a slowly time-varying ("quasi-stationary") signal: when the signal is examined over a short period of time (5-100 ms), it is fairly stationary.
Speech signals are therefore analyzed in short time segments (short-time spectral analysis), typically 20-30 ms frames that overlap each other by 30-50%. The overlap ensures that no information is lost due to windowing.
At a sampling frequency of 11025 Hz, each frame lasts 23 ms and a new frame contains the last 11.5 ms of the previous frame's data. At 8000 Hz, each frame lasts 16 ms and a new frame contains the last 8 ms of the previous frame's data.
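The frame blocking described above can be sketched in a few lines. This is a hypothetical helper, not code from the project; at 8000 Hz a 16 ms frame is 128 samples and a 50% overlap is a 64-sample hop:

```python
def frame_blocking(signal, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples,
    advancing hop samples each time (hop < frame_len gives overlap)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len])
    return frames

# 8000 Hz example: 16 ms frames (128 samples) with 50% overlap (64-sample hop)
frames = frame_blocking(list(range(512)), frame_len=128, hop=64)
```

Each frame thus shares its first half with the tail of the previous frame, matching the overlap figures quoted on the slide.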
WINDOWING
After the signal has been framed, each individual frame is windowed so as to minimize the signal discontinuities at the beginning and end of the frame.
Each frame is multiplied by a window function w(n) of length N, where N is the length of the frame.
Typically the Hamming window is used: it preserves the higher-order harmonics and avoids problems due to truncation of the signal.
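A minimal sketch of applying a Hamming window to one frame, using the standard Hamming coefficients 0.54 and 0.46 (the slide does not give the formula, so these values are an assumption based on the usual definition):

```python
import math

def hamming(N):
    """Hamming window of length N: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(frame):
    """Multiply a frame sample-by-sample with the Hamming window."""
    w = hamming(len(frame))
    return [x * wn for x, wn in zip(frame, w)]

windowed = apply_window([1.0] * 8)  # tapers toward 0.08 at both ends
```

The taper at the frame edges is what suppresses the discontinuities mentioned above.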
MFCC
[Figure: MFCC pipeline. Continuous speech -> frame blocking and windowing -> frame -> FFT -> spectrum -> mel filter bank -> mel-weighted spectrum -> log compression -> DCT -> mel cepstral coefficients (the extracted feature coefficients).]
Fast Fourier Transform (FFT)
The FFT converts each frame of N samples from the time domain into the frequency domain. It computes the discrete Fourier transform, defined on the set of N samples {x_n} as follows:
X_k = \sum_{n=0}^{N-1} x_n e^{-j 2\pi k n / N}, \quad k = 0, 1, 2, \ldots, N-1
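The transform above can be written out directly as a naive O(N^2) DFT for illustration; a real system would use an FFT routine, but the result is the same:

```python
import cmath

def dft(x):
    """Naive DFT: X_k = sum_n x_n * exp(-j*2*pi*k*n/N), k = 0..N-1."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

spectrum = dft([1.0, 0.0, 0.0, 0.0])  # a unit impulse has a flat spectrum
```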
Mel-frequency Wrapping
[Figure: mel-spaced filterbank, triangular filters over the frequency range 0-7000 Hz.]
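The slide does not reproduce the mel mapping itself; assuming the commonly used formula mel(f) = 2595 * log10(1 + f/700), the filterbank center frequencies can be sketched as:

```python
import math

def hz_to_mel(f):
    """Standard mel scale: mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Filter edges are equally spaced on the mel axis over 0-7000 Hz,
# then mapped back to Hz (21 filters -> 22 edges, illustrative choice):
lo, hi = hz_to_mel(0.0), hz_to_mel(7000.0)
edges_mel = [lo + i * (hi - lo) / 21 for i in range(22)]
edges_hz = [mel_to_hz(m) for m in edges_mel]
```

Equal spacing in mel yields filters that grow wider with frequency, which is exactly the shape shown in the filterbank figure.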
Cepstrum
Finally, the log mel spectrum is converted back to the time domain; the result is called the mel-frequency cepstrum coefficients (MFCCs). Denoting the mel power spectrum coefficients that result from the last step by \tilde{S}_k, k = 1, 2, \ldots, K, we can calculate the MFCCs as
\tilde{c}_n = \sum_{k=1}^{K} (\log \tilde{S}_k) \cos\left[ n \left( k - \tfrac{1}{2} \right) \tfrac{\pi}{K} \right], \quad n = 1, 2, \ldots, K
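This final step (log compression followed by the cosine transform) can be sketched as follows, assuming the mel power spectrum coefficients S[1..K] have already been computed by the filterbank stage:

```python
import math

def mfcc_from_mel_spectrum(S, num_coeffs):
    """MFCCs c_n = sum_{k=1..K} log(S_k) * cos(n*(k - 0.5)*pi/K), n = 1..num_coeffs."""
    K = len(S)
    return [sum(math.log(S[k - 1]) * math.cos(n * (k - 0.5) * math.pi / K)
                for k in range(1, K + 1))
            for n in range(1, num_coeffs + 1)]

# A flat mel spectrum carries no spectral shape, so all cepstral
# coefficients (n >= 1) cancel to zero:
coeffs = mfcc_from_mel_spectrum([2.0] * 20, num_coeffs=12)
```

In practice only the first dozen or so coefficients are kept as the feature vector for a frame.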
PATTERN MATCHING AND CLASSIFICATION
The classifiers used for speaker identification can be grouped into two major types: template-based and stochastic-model-based classifiers.
Template-based classifiers are considered the simplest:
Dynamic Time Warping (useful for text-dependent speaker recognition)
Vector Quantization (useful for text-independent speaker recognition)
Stochastic models provide more flexibility and better results:
the Gaussian Mixture Model (useful for text-independent speaker recognition), the Hidden Markov Model (useful for text-dependent speaker recognition), and neural networks that model a speaker's acoustic space.
SPEAKER MODELING - VQ
Vector Quantization: it is not practical to use all the feature vectors of a given speaker occurring in the training data to form the speaker's model, because there are too many feature vectors per speaker.
A method of reducing/compressing the number of training vectors is therefore required, forming a codebook of a small number of highly representative vectors that efficiently capture the speaker-specific characteristics.
VQ is the process of mapping feature vectors in a vector space into a finite number of regions of that space. Each region is called a cluster, and each cluster is represented by its centroid. The collection of all centroids is called the codebook.
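A toy sketch of building a VQ codebook with k-means-style iteration. This uses 1-D scalar "features" for brevity and naive initialization; real speaker models typically cluster multidimensional MFCC vectors, often with the LBG algorithm:

```python
def build_codebook(vectors, k, iterations=20):
    """Cluster scalar feature values into k centroids (toy 1-D k-means)."""
    centroids = vectors[:k]  # naive initialization: first k points
    for _ in range(iterations):
        # Assign each vector to its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid as its cluster mean (keep old if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Two well-separated groups of "features" yield two centroids:
codebook = sorted(build_codebook([0.1, 0.2, 0.15, 5.0, 5.2, 4.9], k=2))
```

At identification time, a test utterance's features are quantized against each enrolled speaker's codebook, and the codebook with the smallest total distortion names the speaker.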
FUTURE WORK
We would develop selected speaker identification algorithms in MATLAB. The implementation would be modular and done with real-time operation in mind, although a complete real-time implementation is not in the scope of the project.
We would test and verify the performance of all the algorithms. For this purpose, the collected data will be divided into training and testing sets (70% training, 30% testing).
For hardware implementation, we would use either interfacing or a digital signal processor.
CONCLUSION
Speaker verification is one of the few recognition areas where machines can outperform humans.
Speaker verification technology is a viable technique currently available for applications.
Speaker verification can be augmented with other authentication techniques to add further security.