Voice Activity Detection Identifying Speech Segments.pdf
Transcript of Voice Activity Detection Identifying Speech Segments.pdf
8/10/2019 Voice Activity Detection Identifying Speech Segments.pdf
http://slidepdf.com/reader/full/voice-activity-detection-identifying-speech-segmentspdf 1/44
Voice Activity Detection 1 / 44
Voice Activity Detection: Identifying Speech
Segments within Audio Recordings
Dr. Christos Boukis
Autonomic and Grid Computing Group
Athens Information Technology
email: [email protected]
October 27, 2006
Need for Voice Activity Detection
Need for Voice Activity
Detection
Audio Processing
Technologies
Need for Voice Activity
Detection
Benefits of Voice Activity
Detection
Voice Activity Detection
Fundamentals
Performance Enhancement
Voice Activity Detection @
SmartLab
Need for Voice Activity Detection
Voice-controlled systems employ one or more audio sensors (microphones) and
capture audio signals
recognise verbal commands
transform them to computer-recognisable commands
execute the associated actions
A common component of all these systems is the voice activity detection
pre-processing step, during which
the presence or absence of speech within the captured audio signals is
detected
audio samples are separated into speech and non-speech segments
Benefits of Voice Activity Detection
Introducing a voice activity detection block into a voice command system has
several benefits
Reduction of computational requirements
Improvement of the efficiency of the overlying system
Other applications of voice activity detection are
speech coding
optimisation of bandwidth use in mobile communications
security / surveillance
Voice Activity detection can be applied to
online systems, where instant decisions about the captured audio signals are
required
offline systems that process recorded audio signals in order to extract speech
segments
Voice Activity Detection
Fundamentals
Need for Voice Activity
Detection
Voice Activity Detection
Fundamentals
Human Hearing
Human Hearing
Human Speech
Artificial Voice Activity
Detection
Voice Activity detection
Approaches
Supervised methods
Linear Discriminant Analysis
Linear discriminant Analysis
Linear Discriminant Analysis
LDA applied to Voice
Activity Detection
Separability of
Speech/Non-speech data
Unsupervised methods
Voice Activity Detection with a Likelihood Ratio Test
Hypotheses
Likelihood Ratio
LRT methods
multiple observations
Performance Enhancement
Voice Activity Detection @
SmartLab
Human Hearing
When humans hear a sound they can tell whether this is human speech or not by
using
audio information (auditory system) : main info
visual information (visual system) : complementary info
Moreover, additional information is extracted regarding
The gender of the speaker
The age of the speaker
The location of the speaker relative to our position
His/her distance
His/her identity, if we know him/her
How do we do it?
The detection of speech by humans
is not inherent
⇒ infants recognise their mother’s voice but require 14 days to get used to
their father’s voice
⇒ they identify babbling as noise
training of the human auditory system is required in order to identify sounds -
including speech!!!
the exact procedure is not known
Human Speech
[Figure: a speech waveform (sound pressure vs. time in seconds) and its spectrum (magnitude in dB vs. normalized frequency, ×π rad/sample)]
Human speech is distinguishable since it has specific
statistical properties
frequency content
periodicity etc
Artificial Voice Activity Detection
Artificial voice activity detection systems can be classified into two major
categories
Supervised methods, which use a priori information in order to train the
system and thus optimise its performance
linear discriminant analysis (LDA)
hidden Markov models (HMM)
neural networks (NNs)
Unsupervised methods, which compute a predefined statistical measure and
compare it to a threshold in order to decide whether the captured signal is
speech or not
maximum likelihood ratio (MLR) criterion algorithms
periodicity based algorithms
All these algorithms operate on a frame-by-frame basis, using close-talking
microphones
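The frame-by-frame principle can be sketched as follows. This is only an illustration, not the processing chain of any particular detector: the frame length, hop size, energy measure and threshold value are assumptions chosen for the example, and the function names are hypothetical.

```python
import numpy as np

def split_into_frames(signal, frame_len, hop):
    """Return a 2-D array whose rows are consecutive (overlapping) frames."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def frame_decisions(signal, frame_len=256, hop=128, threshold=0.01):
    """One speech/non-speech decision per frame.

    The statistical measure here is the mean frame energy and the threshold
    is a fixed, hypothetical value; any other measure could be substituted.
    """
    frames = split_into_frames(np.asarray(signal, dtype=float), frame_len, hop)
    energy = np.mean(frames ** 2, axis=1)
    return energy > threshold

# Toy input: half a second of silence followed by half a second of a 200 Hz tone.
fs = 8000
t = np.arange(fs // 2) / fs
signal = np.concatenate([np.zeros(fs // 2), 0.5 * np.sin(2 * np.pi * 200 * t)])
decisions = frame_decisions(signal)
```

Every later stage (likelihood ratio test, hang-over scheme and so on) can then operate on the sequence of per-frame values rather than on individual samples.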
Voice Activity Detection Approaches
                        Supervised techniques       Unsupervised techniques
training data           required                    not required
misclassifications      fewer                       more
fine-tuning             difficult                   simple
noisy conditions        similar to training data    any
application dependent   yes                         no
Supervised methods
Supervised classification techniques operate in two modes:
⇒ Training mode : During this stage the provided training data are employed in
order to optimise the parameters of the system
⇒ Testing or Decision mode : In testing mode decisions are made by using the
optimised parameters that were derived in the training mode
Linear Discriminant Analysis
LDA is a classification technique: it looks for directions that are efficient for
discriminating between data in different classes
It projects the data into a space with fewer dimensions
⇒ it reduces the dimensionality of the problem
It looks for a linear transformation that increases the separability of the data
It employs a threshold in the reduced-dimensionality space to make decisions
If the original distributions are multimodal and highly overlapping, even the
best direction is unlikely to provide adequate separation
For the two-class problem this linear transformation is found by maximising the Fisher
linear discriminant
$$J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
where
$m_i = \frac{1}{n_i} \sum_{x \in D_i} x$ is the mean value of the $i$-th class
$\tilde{m}_i = w^t m_i$ is the projected mean value of the $i$-th class
$\tilde{s}_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2$ is the scatter of the $i$-th projected class
Thus we look for a direction $w$ that
increases the distance between the projected means
reduces the scatter of each projected class
The Fisher linear discriminant can also be expressed as
$$J(w) = \frac{w^t S_B w}{w^t S_W w}$$
where
$S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t$ are the scatter matrices of the classes
$S_W = S_1 + S_2$ is the within-class scatter matrix
$S_B = (m_1 - m_2)(m_1 - m_2)^t$ is the between-class scatter matrix
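A minimal NumPy sketch of the two-class case follows. It relies on the standard result that the ratio of between-class to within-class scatter is maximised by a direction proportional to the inverse within-class scatter matrix applied to the difference of the class means. The two-dimensional Gaussian data, the class names and the midpoint threshold are assumptions made only for illustration; real feature vectors would come from audio frames.

```python
import numpy as np

def fisher_direction(class1, class2):
    """Direction w maximising J(w); rows of class1/class2 are feature vectors."""
    m1, m2 = class1.mean(axis=0), class2.mean(axis=0)
    d1, d2 = class1 - m1, class2 - m2
    s_w = d1.T @ d1 + d2.T @ d2            # within-class scatter S_W = S_1 + S_2
    w = np.linalg.solve(s_w, m1 - m2)      # w proportional to S_W^{-1} (m_1 - m_2)
    return w / np.linalg.norm(w)

def fisher_criterion(w, class1, class2):
    """J(w): squared distance of projected means over the summed projected scatters."""
    y1, y2 = class1 @ w, class2 @ w
    between = (y1.mean() - y2.mean()) ** 2
    within = ((y1 - y1.mean()) ** 2).sum() + ((y2 - y2.mean()) ** 2).sum()
    return between / within

# Synthetic stand-ins for speech / non-speech feature vectors (assumed data).
rng = np.random.default_rng(0)
cov = [[3.0, 2.5], [2.5, 3.0]]
speech = rng.multivariate_normal([2.0, 0.0], cov, size=500)
non_speech = rng.multivariate_normal([0.0, 2.0], cov, size=500)

w = fisher_direction(speech, non_speech)
# Hypothetical decision threshold: midpoint between the projected class means.
threshold = 0.5 * ((speech @ w).mean() + (non_speech @ w).mean())
is_speech = speech @ w > threshold
```

Computing `w` corresponds to the training mode and comparing a projected value to `threshold` corresponds to the decision mode.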
Separability of Speech/Non-speech data
Observe the energy values and the LDA projected values of audio signals
(speech and non-speech) captured with far-field microphones
[Figure: histograms of the number of frames for speech and non-speech data. (a) Separation with Energy: number of frames vs. energy values. (b) Separation with LDA: number of frames vs. LDA projected value.]
Conclusions
Using a threshold value ≈ 10 we can separate data based on LDA
This is not possible from the energy of the captured signals
misclassifications still exist but are significantly fewer
Unsupervised methods
Typical statistical measures that unsupervised methods use in order to
discriminate between speech and non-speech signals are
periodicity
zero crossings
energy
etc
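The measures listed above can each be computed from a single frame. The sketch below shows one possible definition of each; the function names, the normalised-autocorrelation definition of periodicity and the 50 to 400 Hz pitch search range are assumptions for illustration.

```python
import numpy as np

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return float(np.mean(frame ** 2))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

def periodicity(frame, min_lag, max_lag):
    """Largest normalised autocorrelation value within the given lag range."""
    frame = frame - np.mean(frame)
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if acf[0] <= 0:
        return 0.0
    return float(np.max(acf[min_lag:max_lag + 1]) / acf[0])

# A periodic tone versus white noise, 50 ms each at an assumed 8 kHz rate.
fs = 8000
n = np.arange(400)
tone = np.sin(2 * np.pi * 200 * n / fs)
noise = np.random.default_rng(0).standard_normal(400)

min_lag, max_lag = fs // 400, fs // 50   # lags for pitches between 400 Hz and 50 Hz
tone_periodicity = periodicity(tone, min_lag, max_lag)
noise_periodicity = periodicity(noise, min_lag, max_lag)
```

Voiced speech tends to behave like the tone (high periodicity, low zero-crossing rate), whereas broadband noise behaves like the second signal.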
Modern unsupervised methods rely on the likelihood ratio test (LRT) and include
soft decision LRT
decision directed LRT
multiple observation LRT
Components:
1. decision rule, which is a quantity that measures the difference between noise
and observed signal statistics
2. decision threshold, against which the decision rule is compared
3. noise statistics estimation scheme, to derive the dynamics of the background
noise
Voice Activity Detection with a Likelihood Ratio Test
$S$, $N$ and $X$ are
the $L$-dimensional coefficient vectors of speech, noise and noisy speech
obtained by the DFT of the captured audio signals
Their variances are given by
$\lambda_N(k) = S_N(2\pi k / L)$
$\lambda_S(k) = S_S(2\pi k / L)$
$\sigma(k) = \lambda_N(k) + \lambda_S(k)$
where $S_S(\omega)$ and $S_N(\omega)$ are the true power spectra of speech and noise
respectively
Assumptions
speech and noise are Gaussian random processes
speech and noise are independent of each other
Hypotheses
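The two hypotheses of the voice activity detection problem are
$H_0$ : speech absent : $X = N$
$H_1$ : speech present : $X = S + N$
Joint probability density functions
$$p(X|H_0) = \prod_{k=0}^{L-1} \frac{1}{\pi \lambda_N(k)} \exp\left\{ -\frac{|X_k|^2}{\lambda_N(k)} \right\}$$
$$p(X|H_1) = \prod_{k=0}^{L-1} \frac{1}{\pi [\lambda_N(k) + \lambda_S(k)]} \exp\left\{ -\frac{|X_k|^2}{\lambda_N(k) + \lambda_S(k)} \right\}$$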
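Dividing the two densities bin by bin gives the likelihood ratio of the $k$-th DFT coefficient. The shorthand $\xi_k = \lambda_S(k)/\lambda_N(k)$ (a priori SNR) and $\gamma_k = |X_k|^2/\lambda_N(k)$ (a posteriori SNR) follows the usual convention and is introduced here as an assumption, as is the threshold symbol $\eta$:

$$\Lambda_k = \frac{p(X_k|H_1)}{p(X_k|H_0)} = \frac{\lambda_N(k)}{\lambda_N(k) + \lambda_S(k)} \exp\left\{ \frac{|X_k|^2}{\lambda_N(k)} - \frac{|X_k|^2}{\lambda_N(k) + \lambda_S(k)} \right\} = \frac{1}{1 + \xi_k} \exp\left\{ \frac{\gamma_k \xi_k}{1 + \xi_k} \right\}$$

The last step uses $\frac{|X_k|^2}{\lambda_N(k)} - \frac{|X_k|^2}{\lambda_N(k) + \lambda_S(k)} = \frac{|X_k|^2}{\lambda_N(k)} \cdot \frac{\lambda_S(k)}{\lambda_N(k) + \lambda_S(k)}$. One common way to turn this into a frame decision is to average the log ratios over the bins and compare the result with a threshold:

$$\log \Lambda = \frac{1}{L} \sum_{k=0}^{L-1} \log \Lambda_k \;\gtrless\; \eta$$

deciding $H_1$ when the average exceeds $\eta$ and $H_0$ otherwise.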
LRT methods
LRT methods differ in how the a priori signal-to-noise ratio is computed
generalised estimator
$$\hat{\xi}_k^{(ML)} = \gamma_k - 1$$
decision directed estimation
$$\hat{\xi}_k^{(DD)}(n) = \alpha \frac{A_k^2(n-1)}{\lambda_N(k, n-1)} + (1 - \alpha) P[\gamma_k(n) - 1]$$
where
$P[x] = x$ if $x \geq 0$ and $P[x] = 0$ otherwise
$A_k(n-1)$ are the signal amplitude estimates of the previous frame
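The two estimators can be sketched in Python as below. Several points are assumptions and not taken from the slides: γ is taken to be the a posteriori SNR |X_k|²/λ_N(k); the per-bin log likelihood ratio γξ/(1+ξ) − log(1+ξ) is what results from dividing the two Gaussian densities p(X|H1) and p(X|H0) with ξ = λ_S/λ_N; the ML estimate is clipped at zero as a safeguard; α = 0.98 is a hypothetical value; the previous-frame amplitude is approximated with a Wiener-type gain; and the noise variance is assumed unchanged between frames.

```python
import numpy as np

def ml_a_priori_snr(gamma):
    """xi_ML = gamma - 1, clipped at zero (the clipping is an added safeguard)."""
    return np.maximum(gamma - 1.0, 0.0)

def dd_a_priori_snr(gamma, prev_amp_sq, prev_noise_var, alpha=0.98):
    """Decision-directed estimate: blend of the previous frame and P[gamma - 1]."""
    return alpha * prev_amp_sq / prev_noise_var + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)

def frame_log_lrt(spectrum, noise_var, prev_amp_sq=None, alpha=0.98):
    """Return (mean log likelihood ratio over bins, squared amplitude estimates)."""
    power = np.abs(spectrum) ** 2
    gamma = power / noise_var                      # assumed a posteriori SNR
    if prev_amp_sq is None:
        xi = ml_a_priori_snr(gamma)
    else:
        # Noise variance of the previous frame assumed equal to the current one.
        xi = dd_a_priori_snr(gamma, prev_amp_sq, noise_var, alpha)
    log_lr = gamma * xi / (1.0 + xi) - np.log1p(xi)
    amp_sq = (xi / (1.0 + xi)) ** 2 * power        # Wiener-type amplitude estimate (assumption)
    return float(np.mean(log_lr)), amp_sq

# Synthetic complex Gaussian spectra: unit-variance noise, speech variance 9 (assumed).
rng = np.random.default_rng(0)
bins = 256
noise_var = np.ones(bins)

def complex_gauss(var):
    return np.sqrt(var / 2) * (rng.standard_normal(bins) + 1j * rng.standard_normal(bins))

noise_frame = complex_gauss(1.0)
speech_frame_1 = complex_gauss(9.0) + complex_gauss(1.0)
speech_frame_2 = complex_gauss(9.0) + complex_gauss(1.0)

llr_noise, _ = frame_log_lrt(noise_frame, noise_var)
llr_speech, amp_sq = frame_log_lrt(speech_frame_1, noise_var)
llr_speech_dd, _ = frame_log_lrt(speech_frame_2, noise_var, prev_amp_sq=amp_sq)
```

With a suitably chosen threshold, the frame statistic separates the synthetic noise-only frame from the frames that contain the speech component.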
Multiple observations
If multiple frames $X(n-m), \ldots, X(n-1), X(n), X(n+1), \ldots, X(n+m)$
are used instead of a single frame $X$, then we can decide based on the measure
$$L(n+1) = L(n) - \log \Lambda(n-m) + \log \Lambda(n+m+1)$$
The multiple observation LRT
is more robust than single observation techniques
as the number of observations increases
the non-speech variance decreases
the speech distribution is shifted to the right ⇒ better separated from the
non-speech distribution
optimum performance for m = 6
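The recursion is a sliding sum of the per-frame log likelihood ratios over a window of 2m + 1 frames: the oldest frame leaves the window and the newest one enters. A short sketch, assuming the per-frame values log Λ(n) are already available in an array (random numbers stand in for them here, and the function name is hypothetical):

```python
import numpy as np

def multi_observation_statistic(log_lr, m):
    """L(n) = sum of log_lr[n-m .. n+m], computed with the recursive update.

    Output element i corresponds to frame n = i + m, i.e. only frames whose
    full window of 2m + 1 observations is available are returned.
    """
    log_lr = np.asarray(log_lr, dtype=float)
    width = 2 * m + 1
    n_out = len(log_lr) - width + 1
    out = np.empty(n_out)
    out[0] = log_lr[:width].sum()
    for i in range(1, n_out):
        # L(n+1) = L(n) - log Lambda(n-m) + log Lambda(n+m+1)
        out[i] = out[i - 1] - log_lr[i - 1] + log_lr[i - 1 + width]
    return out

rng = np.random.default_rng(0)
log_lr = rng.standard_normal(50)     # placeholder per-frame log likelihood ratios
stat = multi_observation_statistic(log_lr, m=6)
```

Because the window looks m frames ahead, an online system that uses this measure delays each decision by m frames.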
Performance Enhancement
Need for Voice Activity
Detection
Voice Activity Detection
Fundamentals
Performance Enhancement
Problems of Voice Activity
Detection
VAD Improvement
Techniques
Hang-Over Schemes
Linear Prediction Coding
Band-Pass filtering
Adaptive Thresholding
Voice Activity Detection @
SmartLab
Problems of Voice Activity Detection
Voice activity detection systems sometimes provide faulty decisions since
they are sensitive to impulsive sounds
hand clapping
coughing
knocking
etc . . .
they under-perform in highly varying environments
they fail under extremely noisy conditions
when far field microphones are employed their performance degrades with
the distance from the microphones
Moreover their behaviour, in terms of precision, depends on the employed frame
size
large: robust estimate but low precision
small: high precision but mis-triggering
VAD Improvement Techniques
Techniques used for the improvement of the performance of a voice activity
detector include
hang-over schemes
linear prediction coding
band-pass filtering
adaptive thresholding
Hang-Over Schemes
A hang-over scheme is a post-processing system that is applied on the raw
decisions of a voice activity detector
Its objective is to prevent misclassification of
sharp impulsive sounds as speech
speech pauses as silence
The fundamental idea is to impose time thresholds tsil, tsp such that
silence intervals of duration less than tsil are classified as speech
pauses and thus as speech
speech intervals of duration shorter than tsp are considered to be
impulsive sounds and not speech
Hang-over schemes can be implemented with
Markov models
Finite state machines
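The two time thresholds can be illustrated with a simple run-length post-processor. This is a sketch of the idea only, not a Markov model or a full finite state machine; the thresholds are expressed in frames, and the function names and the order of the two passes are assumptions.

```python
import numpy as np

def run_lengths(decisions):
    """List of (value, start, length) for each run of identical decisions."""
    runs, start = [], 0
    for i in range(1, len(decisions) + 1):
        if i == len(decisions) or decisions[i] != decisions[start]:
            runs.append((bool(decisions[start]), start, i - start))
            start = i
    return runs

def hang_over(decisions, min_speech, min_silence):
    """Smooth raw per-frame decisions (True = speech).

    min_silence plays the role of tsil and min_speech the role of tsp,
    both measured in frames.
    """
    out = np.array(decisions, dtype=bool)
    # Pass 1: short silence gaps enclosed by speech are speech pauses.
    for value, start, length in run_lengths(out):
        interior = start > 0 and start + length < len(out)
        if not value and interior and length < min_silence:
            out[start:start + length] = True
    # Pass 2: speech runs that are still too short are impulsive sounds.
    for value, start, length in run_lengths(out):
        if value and length < min_speech:
            out[start:start + length] = False
    return out

raw = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0]
smoothed = hang_over(raw, min_speech=3, min_silence=3)
```

In this example the isolated frame at index 4 is discarded as an impulsive sound, while the one-frame gap at index 13 is absorbed into the surrounding speech segment. Filling the gaps first keeps speech that is fragmented by short pauses from being discarded, at the cost of occasionally merging an impulse that lies very close to real speech.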
Linear Prediction Coding
Linear prediction coding
estimates, with a least squares approach, an autoregressive model that
simulates the vocal tract
with this autoregressive model the buzz (train of impulses) produced by the
glottis can be estimated
It has been extensively used in speech applications like
detection of glottal closure instants in voiced speech
pitch extraction
When applied as a pre-processing step the performance of voice activity
detection is significantly improved
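A least-squares fit of an autoregressive model can be sketched with NumPy as follows. The model order, the synthetic second-order test signal and the function name are assumptions for the example; for real speech the order would be considerably higher and the fit would be repeated for every frame.

```python
import numpy as np

def lpc_least_squares(signal, order):
    """Fit x[n] ~ sum_{k=1..order} a[k-1] * x[n-k]; return (coefficients, residual)."""
    x = np.asarray(signal, dtype=float)
    # Column k-1 holds x[n-k] for every predicted sample n = order .. len(x)-1.
    past = np.column_stack([x[order - k: len(x) - k] for k in range(1, order + 1)])
    target = x[order:]
    coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
    residual = target - past @ coeffs      # estimate of the excitation signal
    return coeffs, residual

# Synthetic AR(2) process driven by white noise (assumed test signal).
rng = np.random.default_rng(1)
excitation = rng.standard_normal(20000)
x = np.zeros_like(excitation)
for n in range(2, len(x)):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + excitation[n]

coeffs, residual = lpc_least_squares(x, order=2)
```

The residual is what remains after the predictable (vocal tract) part has been removed, so for voiced speech it approximates the glottal impulse train.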
Band-Pass filtering
The frequency content of human speech is mainly contained in the frequency
range from 200 Hz to 3 kHz
Microphones return audio signals that have a significantly wider range
A band-pass filter that isolates the speech frequencies
improves the performance
reduces the computational requirements
Adaptive Thresholding
Voice activity detection systems employ a statistical measure and a threshold to
decide whether a received/recorded audio signal contains speech. This threshold
reflects the dynamics of the background noise
is time-invariant
is set either heuristically or with the use of a priori info
To enable the detector to perform satisfactorily in environments with varying noise
statistics, adaptive thresholding can be employed
geometrically adaptive energy threshold
high order statistics adaptive threshold
gradient adaptive threshold
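The general principle of an adaptive threshold can be sketched as below. This is a generic recursive-averaging scheme and is not meant to reproduce any of the three named methods: the noise level is tracked only during frames classified as non-speech, and the threshold is a fixed multiple of that estimate. The margin, the smoothing factor and the assumption that the first few frames contain only noise are all hypothetical choices.

```python
import numpy as np

def adaptive_energy_vad(frame_energies, margin=3.0, smoothing=0.9, init_frames=5):
    """Return (decisions, thresholds) for a sequence of frame energies."""
    energies = np.asarray(frame_energies, dtype=float)
    noise = energies[:init_frames].mean()     # assumes the recording starts with noise only
    decisions = np.zeros(len(energies), dtype=bool)
    thresholds = np.empty(len(energies))
    for i, energy in enumerate(energies):
        thresholds[i] = margin * noise
        decisions[i] = energy > thresholds[i]
        if not decisions[i]:
            # Update the noise estimate only when no speech was detected.
            noise = smoothing * noise + (1.0 - smoothing) * energy
    return decisions, thresholds

# Noise floor that rises from 1 to 4, with two bursts of "speech" energy.
energies = np.concatenate([
    np.ones(20), np.full(10, 10.0),           # quiet room, then speech
    np.linspace(1.0, 4.0, 60), np.full(20, 4.0),  # background noise slowly increases
    np.full(10, 40.0),                        # speech over the louder background
])
decisions, thresholds = adaptive_energy_vad(energies)
```

In this example a time-invariant threshold fixed at its initial value would label the louder background as speech, whereas the adaptive threshold follows the noise floor upwards.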
Voice Activity Detection @ SmartLab
Need for Voice Activity
Detection
Voice Activity Detection
Fundamentals
Performance Enhancement
Voice Activity Detection @
SmartLab
System Overview
Data Collection
Spatial Averaging
Band-Pass Filtering
Linear Prediction Coding
Voice Activity Detector
Hang-Over Scheme
Hang-over scheme (Finite
State Machine)
Hang-over scheme (Markov
Model)
Some Results
Some Results
Open Issues
System Overview
[Block diagram: mic 1, mic 2, ..., mic N → Data Collection → Spatial Averaging → Band-Pass Filtering → Linear Prediction Coding → Voice Activity Detector → Hang-Over Scheme → Decision]
Requirements
A real-time system capable of operating both online and offline
Far field microphones are used since we want our system to be non-intrusive
Potential speakers move inside the room
the energy of the captured speech signals varies continuously
Data Collection
Data collection is performed with
a microphone array consisting of 64 microphones, with a sampling rate of 22-44 kHz
16 microphones placed on the walls in an "inverse T" formation
Spatial Averaging
The signals of the employed microphones are properly aligned and averaged in
order to remove the effect of the room impulse response
reduce the stochastic measurement noise
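Aligning and averaging can be sketched as a simple delay-and-sum operation. The alignment method is not specified above, so this example assumes integer sample delays estimated from the cross-correlation peak against the first channel; circular shifts (`np.roll`) are used for brevity, and the synthetic signals and function names are hypothetical.

```python
import numpy as np

def estimate_delay(reference, signal, max_delay):
    """Integer lag (in samples) by which `signal` trails `reference`."""
    lags = np.arange(-max_delay, max_delay + 1)
    scores = [np.dot(reference, np.roll(signal, -lag)) for lag in lags]
    return int(lags[int(np.argmax(scores))])

def align_and_average(channels, max_delay=50):
    """Shift every channel onto the first one and average the result."""
    channels = np.asarray(channels, dtype=float)
    reference = channels[0]
    aligned = [np.roll(ch, -estimate_delay(reference, ch, max_delay)) for ch in channels]
    return np.mean(aligned, axis=0)

# Four microphones observing the same broadband source with different delays
# and independent measurement noise (synthetic stand-in data).
rng = np.random.default_rng(0)
clean = rng.standard_normal(2000)
delays = [0, 7, 15, 23]
channels = [np.roll(clean, d) + 0.5 * rng.standard_normal(2000) for d in delays]
averaged = align_and_average(channels)
```

Since the measurement noise is independent across microphones, averaging N aligned channels reduces its variance by roughly a factor of N, while the aligned source signal is preserved.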
Band-Pass Filtering
A Butterworth filter is introduced which
isolates the frequencies of interest
discards the undesired frequency components
The Butterworth filter is chosen for its maximally flat pass band: it introduces no pass-band ripple, so the magnitude spectrum of the speech signals is not distorted
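As a rough illustration of the band-pass stage, the sketch below cascades a second-order Butterworth high-pass and a second-order Butterworth low-pass section (biquads with Q = 1/sqrt(2), obtained through the bilinear transform). The slides do not give the filter order or the band edges; the 300-3400 Hz defaults and all function names are assumptions, and a cascade of this kind is a simplification of a true Butterworth band-pass design.

```python
import math


def butter2_biquad(kind, cutoff_hz, fs_hz):
    """Second-order Butterworth low-pass or high-pass biquad coefficients."""
    w0 = 2.0 * math.pi * cutoff_hz / fs_hz
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)  # sin(w0) / (2 Q) with Q = 1/sqrt(2)
    if kind == "low":
        b = [(1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2]
    elif kind == "high":
        b = [(1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2]
    else:
        raise ValueError("kind must be 'low' or 'high'")
    a0 = 1 + alpha
    return [v / a0 for v in b], [1.0, -2 * cos_w0 / a0, (1 - alpha) / a0]


def biquad_filter(b, a, x):
    """Apply one biquad section (direct form I) to the sample list x."""
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def band_pass(x, fs_hz, low_hz=300.0, high_hz=3400.0):
    """High-pass at low_hz followed by low-pass at high_hz (assumed band edges)."""
    b, a = butter2_biquad("high", low_hz, fs_hz)
    y = biquad_filter(b, a, x)
    b, a = butter2_biquad("low", high_hz, fs_hz)
    return biquad_filter(b, a, y)
```

A tone inside the assumed band passes almost unchanged, while low-frequency hum well below the lower edge is strongly attenuated.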
Linear Prediction Coding
Linear prediction coding is performed on the band-pass filtered signal in order to
suppress the non-speech intervals
transform the speech segments into impulsive (buzz-like) signals
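One common way to obtain an impulsive signal from speech with linear prediction is to compute the prediction residual. The sketch below assumes this is what the slide refers to; it uses the autocorrelation method with the Levinson-Durbin recursion. The function names and the idea of processing the whole signal at once (rather than frame by frame) are assumptions made for brevity.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of the sample list x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns the prediction-error filter a (with a[0] = 1) and the final
    prediction error power.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def lpc_residual(x, a):
    """Filter x with the prediction-error filter a to obtain the residual."""
    p = len(a) - 1
    return [sum(a[j] * x[n - j] for j in range(p + 1) if n - j >= 0)
            for n in range(len(x))]
```

For voiced speech the residual is close to a pulse train at the pitch period, which matches the "buzz" description above.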
Voice Activity Detector
Two voice activity detectors have been developed
the first is a supervised method based on linear discriminant analysis (LDA)
the second, developed more recently, is an unsupervised method based on the decision-directed likelihood ratio test (LRT)
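To give a feel for the unsupervised detector, here is a simplified sketch of a likelihood ratio test with a decision-directed a priori SNR estimate, under the usual Gaussian model for the spectral coefficients. It is not the SmartLab detector: the inputs (per-frame power spectra and a known noise power per bin), the smoothing factor `alpha`, the `threshold`, and the use of a Wiener gain to approximate the previous frame's clean speech power are all assumptions.

```python
import math


def dd_lrt_vad(power_frames, noise_power, alpha=0.98, threshold=1.0):
    """Return one 0/1 decision per frame.

    power_frames: list of frames, each a list of per-bin signal power
    noise_power:  list of per-bin noise power (assumed known and stationary)
    """
    n_bins = len(noise_power)
    prev_clean = [0.0] * n_bins  # clean speech power estimate of previous frame
    decisions = []
    for frame in power_frames:
        llr_sum = 0.0
        for k in range(n_bins):
            gamma = frame[k] / noise_power[k]  # a posteriori SNR
            # decision-directed a priori SNR estimate
            xi = (alpha * prev_clean[k] / noise_power[k]
                  + (1.0 - alpha) * max(gamma - 1.0, 0.0))
            # log likelihood ratio of this bin under the Gaussian model
            llr_sum += gamma * xi / (1.0 + xi) - math.log(1.0 + xi)
            gain = xi / (1.0 + xi)  # Wiener gain (simplifying assumption)
            prev_clean[k] = gain * gain * frame[k]
        decisions.append(1 if llr_sum / n_bins > threshold else 0)
    return decisions
```

The decision compares the mean log likelihood ratio over all bins with a fixed threshold; in practice the noise power would have to be tracked during non-speech frames.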
Hang-Over Scheme
To smooth the decisions provided by the detector we have developed two hang-over schemes
a finite state machine designed for the supervised (LDA) voice activity detector
a Markov model developed for the unsupervised voice activity detector
Hang-over scheme (Finite State Machine)
Conditions
C2: Speech duration (SD) > t_sp
C3: Silence duration (SiD) > t_sil
C4: LDA value > n_LDA
Actions (l: length of frame in seconds)
A1: SiD = SiD + l
A2: SD = l
A3: SiD = SiD + SD
A4: SD = SD + l
A5: SiD = l
A6: SD = SiD = 0
Hang-over scheme (Markov Model)
[Figure: state diagram of the Markov-model hang-over scheme; each state is labelled with the output VAD (0 or 1) and each transition with the raw frame decision D (0 or 1)]
Transition states are drawn with dashed lines, final states with solid lines
Notice that to move from
speech to non-speech, eight consecutive non-speech frames must be detected
non-speech to speech, three consecutive speech decisions are needed
The thresholds are heuristically chosen
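The behaviour described above can be sketched with a single run-length counter instead of an explicit chain of states. The function name and the parameter names `n_on` and `n_off` are assumptions; their defaults (3 and 8) are the thresholds quoted above. In this sketch the output switches on the frame that completes the run.

```python
def hangover(decisions, n_on=3, n_off=8):
    """Smooth raw 0/1 frame decisions D into the output VAD sequence.

    n_on:  consecutive speech decisions needed to switch to speech
    n_off: consecutive non-speech decisions needed to switch to non-speech
    """
    state, run, smoothed = 0, 0, []
    for d in decisions:
        if d != state:
            run += 1  # one more consecutive frame contradicting the state
            if run >= (n_on if state == 0 else n_off):
                state, run = d, 0
        else:
            run = 0  # the contradicting run is broken
        smoothed.append(state)
    return smoothed
```

With these defaults a short pause of up to seven non-speech frames inside an utterance does not interrupt the speech segment, and isolated one- or two-frame speech decisions are ignored.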
Some Results
[Figure: decision versus time index (0 to 2 x 10^5) for three detectors: Multiple Observations LRT, Decision-Directed LRT, Generalized LRT]
Raw decisions from several unsupervised voice activity detectors based on the likelihood ratio test
Some Results
Method        LDA Thr.  En. Thr.  MR      SDER    NDER    ADER    Wpeps
LDA           4.9       -         10.09%  10.40%  8.62%   9.51%   0.09
Ad. En. Thr.  -         -         18.10%  18.40%  15.60%  17.00%  0.08
FSM+LDA       4.9       -         9.94%   10.19%  8.65%   9.42%   0.08
FSM+En.       -         0.043     17.28%  17.69%  14.63%  16.16%  0.08
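In every row the ADER figure coincides with the arithmetic mean of SDER and NDER. This relation is inferred from the numbers in the table, not stated on the slide; the short check below reproduces it, with the dictionary and function names chosen here for illustration.

```python
# (SDER, NDER, ADER) in percent, copied from the results table
RESULTS = {
    "LDA": (10.40, 8.62, 9.51),
    "Ad. En. Thr.": (18.40, 15.60, 17.00),
    "FSM+LDA": (10.19, 8.65, 9.42),
    "FSM+En.": (17.69, 14.63, 16.16),
}


def mean_error(sder, nder):
    """Arithmetic mean of the two detection error rates."""
    return (sder + nder) / 2.0


for method, (sder, nder, ader) in RESULTS.items():
    print(method, round(mean_error(sder, nder), 2), ader)
```

Read this way, the finite state machine lowers the average error slightly for both the LDA and the energy-based detector.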
Open Issues
Issues that need to be examined and addressed are
performance of supervised detectors when the training set and the testing set
are derived from different speakers
substitution of spatial averaging with beamforming
incorporation of an adaptive threshold
evaluation of voice activity detectors based on hidden Markov models
Thank You ...