Voice Activity Detection Identifying Speech Segments.pdf

8/10/2019 Voice Activity Detection Identifying Speech Segments.pdf http://slidepdf.com/reader/full/voice-activity-detection-identifying-speech-segmentspdf 1/44                         Voice Activity Detection 1 / 44 Voice Activity Detection: Identifying Speech Segments within Audio Recordings Dr. Christos Boukis Autonomic and Grid Computing Group Athens Information Technology email: [email protected] October 27, 2006


Voice Activity Detection: Identifying Speech

Segments within Audio Recordings

Dr. Christos Boukis

Autonomic and Grid Computing Group

Athens Information Technology

email: [email protected]

October 27, 2006


Need for Voice Activity Detection

  Need for Voice Activity Detection
    Audio Processing Technologies
    Need for Voice Activity Detection
    Benefits of Voice Activity Detection
  Voice Activity Detection Fundamentals
  Performance Enhancement
  Voice Activity Detection @ SmartLab


Need for Voice Activity Detection


Voice-controlled systems employ one or more audio sensors (microphones) and
  capture audio signals
  recognise verbal commands
  transform them into computer-recognisable commands
  execute the associated actions

A common component of all these systems is the voice activity detection pre-processing step, during which
  the presence or absence of speech within the captured audio signals is detected
  audio samples are separated into speech and non-speech segments


Benefits of Voice Activity Detection


Introducing a voice activity detection block into a voice command system brings several benefits
  reduction of computational requirements
  improvement of the efficiency of the overlying system

Other applications of voice activity detection are
  speech coding
  optimisation of bandwidth use in mobile communications
  security / surveillance

Voice activity detection can be applied to
  online systems, where instant decisions about the captured audio signals are required
  offline systems, which process recorded audio signals in order to extract speech segments


Voice Activity Detection Fundamentals

  Need for Voice Activity Detection
  Voice Activity Detection Fundamentals
    Human Hearing
    Human Speech
    Artificial Voice Activity Detection
    Voice Activity Detection Approaches
    Supervised methods
    Linear Discriminant Analysis
    LDA applied to Voice Activity Detection
    Separability of Speech/Non-speech data
    Unsupervised methods
    Voice Activity Detection with a Likelihood Ratio Test
    Hypotheses
    Likelihood Ratio
    LRT methods
    Multiple observations
  Performance Enhancement
  Voice Activity Detection @ SmartLab


Human Hearing


When humans hear a sound, they can tell whether it is human speech or not by using
  audio information (auditory system): main info
  visual information (visual system): complementary info

Moreover, additional information is extracted regarding
  the gender of the speaker
  the age of the speaker
  the location of the speaker relative to our position
  his/her distance
  his/her identity, if we know him/her

How do we do it?

  


Human Hearing


The detection of speech by humans
  is not inherent
    ⇒ infants recognise their mother's voice but require 14 days to get used to their father's voice
    ⇒ they identify babbling as noise
  training of the human auditory system is required in order to identify sounds, including speech
  the exact procedure is not known


Human Speech


[Figure: a speech waveform (sound pressure versus time in seconds) and its spectrum (magnitude in dB versus normalized frequency, ×π rad/sample)]

Human speech is distinguishable since it has specific
  statistical properties
  frequency content
  periodicity
  etc.


Artificial Voice Activity Detection


Artificial voice activity detection systems can be classified into two major categories
  Supervised methods, which use some a priori information in order to train the system and thus optimise its performance
    linear discriminant analysis (LDA)
    hidden Markov models (HMM)
    neural networks (NNs)
  Unsupervised methods, which compute a predefined statistical measure and compare it to a threshold in order to decide whether the captured signal is speech or not
    maximum likelihood ratio (MLR) criterion algorithms
    periodicity-based algorithms

All these algorithms operate on a frame-by-frame basis using close-talking microphones
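The frame-by-frame, measure-versus-threshold decision described above can be sketched as follows. This is only a minimal illustration, not the method of the slides: the frame length, the choice of mean squared amplitude as the measure, the threshold value and all function names are assumptions made for the example.

```python
import math

def frame_signal(samples, frame_len):
    """Split a list of samples into consecutive non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def frame_energy(frame):
    """Mean squared amplitude of one frame (the assumed statistical measure)."""
    return sum(x * x for x in frame) / len(frame)

def energy_vad(samples, frame_len=160, threshold=0.01):
    """Return one boolean per frame: True if the frame energy exceeds the threshold."""
    return [frame_energy(f) > threshold for f in frame_signal(samples, frame_len)]

# Synthetic input (assumed): 3 frames of silence, then 3 frames of a 200 Hz tone at 8 kHz
fs = 8000
silence = [0.0] * 480
tone = [0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(480)]
decisions = energy_vad(silence + tone)
```

A fixed energy threshold like this one is the simplest possible decision rule; the slides that follow discuss measures and thresholds that cope better with noise.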


Voice Activity detection Approaches


                        Supervised techniques       Unsupervised techniques
training data           required                    not required
misclassifications      fewer                       more
fine-tuning             difficult                   simple
noisy conditions        similar to training data    any
application dependent   yes                         no


Supervised methods


Supervised classification techniques operate in two modes:
⇒ Training mode: the provided training data are employed in order to optimise the parameters of the system
⇒ Testing or decision mode: decisions are made by using the optimised parameters that were derived in the training mode


Linear Discriminant Analysis


LDA is a classification technique: it looks for directions that are efficient for discriminating between data in different classes
  It projects the data into a space with fewer dimensions
    ⇒ it reduces the dimensionality of the problem
  It looks for a linear transformation that increases the separability of the data
  It employs a threshold in the reduced-dimensionality space to make decisions
  If the original distributions are multimodal and highly overlapping, even the best direction is unlikely to provide adequate separation

  


Linear discriminant Analysis


For the two-class problem this linear transformation is found by maximising the Fisher linear discriminant

$$J(\mathbf{w}) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$

where
  $\mathbf{m}_i = \frac{1}{n_i} \sum_{\mathbf{x} \in D_i} \mathbf{x}$ is the mean value of the $i$-th class
  $\tilde{m}_i = \mathbf{w}^t \mathbf{m}_i$ is the projected mean value of the $i$-th class
  $\tilde{s}_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2$ is the scatter of the $i$-th class

Thus we look for a direction $\mathbf{w}$ that
  increases the distance between the projected means
  reduces the standard deviation of each projected class


Linear Discriminant Analysis


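The Fisher linear discriminant can also be expressed as

$$J(\mathbf{w}) = \frac{\mathbf{w}^t S_B \mathbf{w}}{\mathbf{w}^t S_W \mathbf{w}}$$

where, with $\mathbf{x}$ and $\mathbf{m}_i$ taken as column vectors,
  $S_i = \sum_{\mathbf{x} \in D_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^t$ are the scatter matrices of the classes
  $S_W = S_1 + S_2$ is the within-class scatter matrix
  $S_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^t$ is the between-class scatter matrix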
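As an illustration of the training and decision modes for this criterion, the sketch below computes the within-class scatter matrix for two-dimensional features and uses the standard closed-form maximiser of J(w), w ∝ S_W⁻¹(m₁ − m₂). That closed form is a textbook result and is not stated in the surviving slide text. The toy feature values, the midpoint threshold and every function name are assumptions made for the example.

```python
def mean_vec(data):
    """Mean of a list of 2-D points."""
    n = len(data)
    return [sum(x[d] for x in data) / n for d in range(2)]

def scatter(data, m):
    """2x2 scatter matrix S_i = sum over x of (x - m)(x - m)^t."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in data:
        d = [x[0] - m[0], x[1] - m[1]]
        for r in range(2):
            for c in range(2):
                s[r][c] += d[r] * d[c]
    return s

def fisher_direction(class1, class2):
    """Training mode: return w = S_W^{-1} (m1 - m2) and the two class means."""
    m1, m2 = mean_vec(class1), mean_vec(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    sw = [[s1[r][c] + s2[r][c] for c in range(2)] for r in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    # Closed-form inverse of the 2x2 matrix S_W applied to (m1 - m2)
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    return w, m1, m2

def project(w, x):
    """Decision mode: scalar projection w^t x."""
    return w[0] * x[0] + w[1] * x[1]

def fisher_criterion(w, class1, class2):
    """J(w) = |m~1 - m~2|^2 / (s~1^2 + s~2^2), evaluated on the projected data."""
    y1 = [project(w, x) for x in class1]
    y2 = [project(w, x) for x in class2]
    mu1, mu2 = sum(y1) / len(y1), sum(y2) / len(y2)
    s1 = sum((y - mu1) ** 2 for y in y1)
    s2 = sum((y - mu2) ** 2 for y in y2)
    return (mu1 - mu2) ** 2 / (s1 + s2)

# Hypothetical 2-D feature vectors; the classes overlap on each axis taken alone
speech = [(2.0, 2.1), (2.5, 2.4), (3.0, 3.2), (3.5, 3.4)]
nonspeech = [(2.0, 1.0), (2.5, 1.6), (3.0, 2.0), (3.5, 2.6)]

w, m1, m2 = fisher_direction(speech, nonspeech)
threshold = 0.5 * (project(w, m1) + project(w, m2))  # midpoint rule, an assumption
labels = [project(w, x) > threshold for x in speech + nonspeech]
```

The closed-form 2x2 inverse keeps the example free of external libraries; for higher-dimensional features a linear solver would be needed instead.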


Separability of Speech/Non-speech data


Observe the energy values and the LDA projected values of audio signals (speech and non-speech) captured with far-field microphones

[Figure: histograms of the number of frames for speech and non-speech data. (a) Separation with Energy: number of frames versus energy values. (b) Separation with LDA: number of frames versus LDA projected value.]

Conclusions
  Using a threshold value ≈ 10 we can separate the data based on LDA
  This is not possible using the energy of the captured signals
  Misclassifications still exist but are significantly fewer


Unsupervised methods


Typical statistical measures that unsupervised methods use in order to discriminate between speech and non-speech signals are
  periodicity
  zero crossings
  energy
  etc.
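The three measures named above can each be computed per frame in a few lines. The sketch below is one possible formulation, assumed for illustration: the zero-crossing rate as the fraction of sign changes, energy as mean squared amplitude, and periodicity as the largest normalised autocorrelation over a lag range. The function names, the lag range and the synthetic test frames are not from the slides.

```python
import math
import random

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

def periodicity(frame, min_lag, max_lag):
    """Largest normalised autocorrelation in [min_lag, max_lag]; larger for periodic frames."""
    r0 = sum(x * x for x in frame)
    if r0 == 0.0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        best = max(best, r / r0)
    return best

# Assumed test frames: a 100 Hz tone (voiced-like) and uniform noise, 400 samples at 8 kHz
tone = [math.sin(2 * math.pi * 100 * n / 8000 + 0.3) for n in range(400)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(400)]
```

Voiced speech tends to show a low zero-crossing rate and high periodicity, while broadband noise shows the opposite, which is why these measures are useful as decision rules.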

Modern unsupervised methods rely on the likelihood ratio test (LRT) and include
  soft decision LRT
  decision directed LRT
  multiple observation LRT

Components:
1. decision rule, which is a quantity that measures the difference between noise and observed signal statistics
2. decision threshold, to which the decision rule is compared
3. noise statistics estimation scheme, to derive the dynamics of the background noise


Voice Activity Detection with a Likelihood Ratio Test


$S$, $N$, $X$ are
  the $L$-dimensional coefficient vectors of speech, noise and noisy speech
  obtained by the DFT of the captured audio signals

Their variances are given by

$$\lambda_N(k) = S_N(2\pi k/L)$$
$$\lambda_S(k) = S_S(2\pi k/L)$$
$$\sigma(k) = \lambda_N(k) + \lambda_S(k)$$

where $S_S(\omega)$ and $S_N(\omega)$ are the true power spectra of speech and noise respectively

Assumptions
  speech and noise are Gaussian random processes
  they are independent of each other
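In practice the true power spectra are not available, so the per-bin variances have to be estimated. One simple possibility, assumed here and not taken from the slides, is to average the squared DFT magnitudes of frames known to contain only noise. The naive DFT and the function names are illustrative, and the result matches the variances above only up to the scaling convention of the DFT.

```python
import cmath
import math

def dft(frame):
    """Naive O(L^2) DFT; adequate for short illustrative frames."""
    L = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / L) for n in range(L))
            for k in range(L)]

def estimate_variances(frames):
    """Average |X_k|^2 over the given frames, one value per DFT bin."""
    L = len(frames[0])
    acc = [0.0] * L
    for f in frames:
        for k, Xk in enumerate(dft(f)):
            acc[k] += abs(Xk) ** 2
    return [a / len(frames) for a in acc]

# Assumed toy input: three identical constant frames, whose energy sits in bin 0 only
lam = estimate_variances([[1.0] * 8] * 3)
```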


Hypotheses


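The two hypotheses of the voice activity detection problem are

$$H_0: \text{speech absent}: \quad X = N$$
$$H_1: \text{speech present}: \quad X = S + N$$

Joint probability density functions

$$p(X \mid H_0) = \prod_{k=0}^{L-1} \frac{1}{\pi \lambda_N(k)} \exp\left\{ -\frac{|X_k|^2}{\lambda_N(k)} \right\}$$

$$p(X \mid H_1) = \prod_{k=0}^{L-1} \frac{1}{\pi [\lambda_N(k) + \lambda_S(k)]} \exp\left\{ -\frac{|X_k|^2}{\lambda_N(k) + \lambda_S(k)} \right\}$$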
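Dividing the two densities bin by bin gives a per-bin likelihood ratio whose logarithm is γ·ξ/(1 + ξ) − log(1 + ξ). Here ξ = λ_S(k)/λ_N(k) and γ = |X_k|²/λ_N(k) are the conventional a priori and a posteriori signal-to-noise ratios; these names are assumptions, since the surviving text does not define them. The sketch below evaluates that expression. Averaging the log ratios over the bins and comparing with a threshold of 0 are illustrative choices, not something the slides specify.

```python
import math

def log_likelihood_ratio(power, lam_n, lam_s):
    """Per-bin log[p(X_k|H1) / p(X_k|H0)] for the two Gaussian densities.

    power: |X_k|^2 per bin; lam_n, lam_s: noise and speech variances per bin.
    """
    out = []
    for p, ln, ls in zip(power, lam_n, lam_s):
        xi = ls / ln      # a priori SNR (assumed name)
        gamma = p / ln    # a posteriori SNR (assumed name)
        out.append(gamma * xi / (1.0 + xi) - math.log(1.0 + xi))
    return out

def lrt_decision(power, lam_n, lam_s, threshold=0.0):
    """Declare speech if the mean log-likelihood ratio over the bins exceeds the threshold."""
    llr = log_likelihood_ratio(power, lam_n, lam_s)
    return sum(llr) / len(llr) > threshold

# Assumed toy values: two bins, noise variance 1 and speech variance 3 in each
lam_n, lam_s = [1.0, 1.0], [3.0, 3.0]
quiet = lrt_decision([1.0, 1.0], lam_n, lam_s)   # observed power at the noise level
loud = lrt_decision([8.0, 8.0], lam_n, lam_s)    # observed power well above the noise level
```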


LRT methods


LRT methods differ in how the a priori signal-to-noise ratio is computed
  generalised estimator

$$\hat{\xi}_k^{(ML)} = \gamma_k - 1$$

  decision directed estimation

$$\hat{\xi}_k^{(DD)}(n) = \alpha \frac{A_k^2(n-1)}{\lambda_N(k, n-1)} + (1 - \alpha) P[\gamma_k(n) - 1]$$

where
  $P[x] = x$ if $x \ge 0$ and $P[x] = 0$ otherwise
  $A_k(n-1)$ is the signal amplitude estimate of the previous frame
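The two estimators can be written down directly from the formulas above. In this sketch γ is assumed to be the a posteriori SNR |X_k|²/λ_N(k), which the surviving text does not define, and the default α = 0.98 is only a commonly quoted smoothing value, not one given in the slides. The function names and the toy numbers are hypothetical.

```python
def half_wave(x):
    """P[x]: x if x >= 0, otherwise 0."""
    return x if x >= 0.0 else 0.0

def ml_a_priori_snr(gamma):
    """Per-bin estimate gamma_k - 1, as written in the first formula."""
    return [g - 1.0 for g in gamma]

def dd_a_priori_snr(prev_amp, prev_lam_n, gamma, alpha=0.98):
    """Decision directed estimate for the current frame, one value per bin.

    prev_amp:   amplitude estimates A_k(n-1) from the previous frame
    prev_lam_n: noise variances lambda_N(k, n-1) from the previous frame
    gamma:      a posteriori SNR of the current frame (assumed definition)
    """
    return [alpha * a * a / ln + (1.0 - alpha) * half_wave(g - 1.0)
            for a, ln, g in zip(prev_amp, prev_lam_n, gamma)]

# Hypothetical single-bin example
xi_low = dd_a_priori_snr([2.0], [1.0], [0.5], alpha=0.9)
xi_high = dd_a_priori_snr([2.0], [1.0], [3.0], alpha=0.9)
```

With α close to 1 the estimate follows the previous frame closely, which smooths the frame-to-frame fluctuations of the instantaneous term.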

  

  


  multiple observations


If multiple frames $X(n-m), \ldots, X(n-1), X(n), X(n+1), \ldots, X(n+m)$ are used instead of a single frame $X$, then we can decide based on the measure

$$L(n+1) = L(n) - \log \Lambda(n-m) + \log \Lambda(n+m+1)$$

The multiple observation LRT
  is more robust than single observation techniques
  as the number of observations increases
    the non-speech variance decreases
    the speech distribution is shifted to the right ⇒ it is better separated from the non-speech distribution
  achieves optimum performance for $m = 6$
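The recursion can be read as a sliding-window sum of per-frame log-likelihood ratios. The sketch below assumes that L(n) is the sum of log Λ(j) for j from n − m to n + m, which is the definition the recursion implies but which the surviving text does not state. The function name and the toy input are hypothetical.

```python
def multiple_observation_lrt(log_lr, m):
    """Return L(n) for n = m .. len(log_lr) - m - 1 using the recursive update.

    log_lr: per-frame log-likelihood ratios log Lambda(j)
    m:      number of neighbouring frames used on each side
    """
    window = 2 * m + 1
    if len(log_lr) < window:
        return []
    current = sum(log_lr[:window])          # L(m): direct sum over the first window
    out = [current]
    for n in range(m, len(log_lr) - m - 1):
        # L(n+1) = L(n) - log Lambda(n-m) + log Lambda(n+m+1)
        current = current - log_lr[n - m] + log_lr[n + m + 1]
        out.append(current)
    return out

# Toy per-frame values (assumed) with m = 1, so each L(n) spans three frames
measures = multiple_observation_lrt([1, 2, 3, 4, 5, 6, 7], m=1)
```

Note that the window extends m frames into the future, so an online detector using this measure would incur a delay of m frames.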


Performance Enhancement

  Need for Voice Activity Detection
  Voice Activity Detection Fundamentals
  Performance Enhancement
    Problems of Voice Activity Detection
    VAD Improvement Techniques
    Hang-Over Schemes
    Linear Prediction Coding
    Band-Pass filtering
    Adaptive Thresholding
  Voice Activity Detection @ SmartLab


Problems of Voice Activity Detection


Voice activity detection systems sometimes provide faulty decisions since
  they are sensitive to impulsive sounds
    hand clapping
    coughing
    knocking
    etc.
  they under-perform in highly varying environments
  they fail under extremely noisy conditions
  when far-field microphones are employed, their performance degrades with the distance from the microphones

Moreover, their behaviour, in terms of precision, depends on the employed frame size
  large: robust estimate but low precision
  small: high precision but mis-triggering


VAD Improvement Techniques


Techniques used for the improvement of the performance of a voice activity detector include
  hang-over schemes
  linear prediction coding
  band-pass filtering
  adaptive thresholding

  

    


Hang-Over Schemes


  A hang-over scheme is a post-processing system that is applied to the raw decisions of a voice activity detector
  Its objective is to prevent misclassification of
    sharp impulsive sounds as speech
    speech pauses as silence
  The fundamental idea is to impose time thresholds $t_{sil}$, $t_{sp}$ such that
    silence intervals of duration less than $t_{sil}$ are classified as speech pauses and thus as speech
    speech intervals of duration shorter than $t_{sp}$ are considered to be impulsive sounds and not speech
  Hang-over schemes can be implemented with
    Markov models
    finite state machines
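The two time thresholds can be illustrated with a simple offline run-length pass over the raw frame decisions. This is a sketch under stated assumptions rather than the Markov-model or finite-state-machine implementations mentioned above: thresholds are expressed in frames, short pauses are merged before short speech bursts are discarded, and all names are hypothetical.

```python
def runs(decisions):
    """Run-length encode a list of booleans into [value, length] pairs."""
    out = []
    for d in decisions:
        if out and out[-1][0] == d:
            out[-1][1] += 1
        else:
            out.append([d, 1])
    return out

def expand(run_list):
    """Inverse of runs(): turn [value, length] pairs back into a flat list."""
    return [v for v, n in run_list for _ in range(n)]

def hang_over(decisions, t_sil, t_sp):
    """Smooth raw VAD decisions; t_sil and t_sp are durations in frames."""
    # Step 1: silence runs shorter than t_sil that lie between speech runs become speech
    r = runs(decisions)
    for i in range(1, len(r) - 1):
        if not r[i][0] and r[i][1] < t_sil:
            r[i][0] = True
    # Step 2: after merging, speech runs shorter than t_sp are treated as impulsive sounds
    r = runs(expand(r))
    for run in r:
        if run[0] and run[1] < t_sp:
            run[0] = False
    return expand(r)

# Assumed raw decisions: an isolated one-frame burst, then speech with a one-frame pause
F, T = False, True
raw = [F, F, T, F, F, F, F, T, T, T, F, T, T, T, F, F, F, F]
smoothed = hang_over(raw, t_sil=3, t_sp=3)
```

Merging pauses first is a design choice: it prevents short speech bursts separated by brief pauses from being discarded individually. An online system would instead need a state machine with counters, at the cost of some decision delay.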


Linear Prediction Coding


Linear prediction coding
  estimates, with a least squares approach, an autoregressive model that simulates the vocal tract
  with this autoregressive model the buzz (train of impulses) produced by the glottis can be estimated

It has been extensively used in speech applications such as
  detection of glottal closure instants in voiced speech
  pitch extraction

When applied as a pre-processing step, it significantly improves the performance of voice activity detection
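A minimal sketch of linear prediction follows, using the autocorrelation method solved with the Levinson-Durbin recursion, which is one standard way of solving the least squares problem and is not necessarily the variant used by the author. The prediction residual then serves as an estimate of the excitation. The test signal, the model order and the function names are assumptions made for the example.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of the sequence x."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k)) for k in range(max_lag + 1)]

def lpc(x, order):
    """Return (a, err) where x[n] is predicted as sum_k a[k] * x[n-1-k]."""
    r = autocorr(x, order)
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)         # remaining prediction error energy
    return a[1:], err

def residual(x, a):
    """Prediction error e[n] = x[n] - sum_k a[k] * x[n-1-k], an excitation estimate."""
    p = len(a)
    return [x[n] - sum(a[k] * x[n - 1 - k] for k in range(p)) for n in range(p, len(x))]

# Assumed test signal: impulse response of x[n] = 1.3 x[n-1] - 0.6 x[n-2] + e[n]
x = []
for n in range(200):
    e = 1.0 if n == 0 else 0.0
    x1 = x[n - 1] if n >= 1 else 0.0
    x2 = x[n - 2] if n >= 2 else 0.0
    x.append(1.3 * x1 - 0.6 * x2 + e)

coeffs, err = lpc(x, 2)
res = residual(x, coeffs)
```

For this synthetic signal the recursion recovers the generating coefficients and the residual is essentially zero after the initial impulse. For real speech the residual would approximate the glottal excitation.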

    


Band-Pass Filtering


  The frequency content of human speech lies mainly in the range from 200 Hz to 3 kHz
  Microphones return audio signals with a significantly wider frequency range
  A band-pass filter that isolates the speech frequencies
    improves the detection performance
    reduces the computational requirements


Adaptive Thresholding


Voice activity detection systems employ a statistical measure and a threshold to decide whether a received or recorded audio signal contains speech. This threshold
  reflects the dynamics of the background noise
  is time-invariant
  is set either heuristically or with the use of a priori information

To enable the detector to perform satisfactorily in environments with varying noise statistics, adaptive thresholding can be employed:
  geometrically adaptive energy threshold
  high-order statistics adaptive threshold
  gradient adaptive threshold
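As an illustration of the general principle, the following Python sketch implements one simple form of adaptive energy threshold. It is not necessarily any of the three variants named above: the threshold is a fixed multiple of a noise-energy estimate, and that estimate is updated by exponential forgetting on frames classified as non-speech. The names `adaptive_energy_vad`, `alpha` and `margin`, and their default values, are assumptions.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def adaptive_energy_vad(frames, init_noise, alpha=0.9, margin=3.0):
    """Per-frame decisions with a threshold that tracks the noise energy.

    alpha  : forgetting factor of the noise-energy estimate (assumed value)
    margin : the threshold is margin times the noise estimate (assumed value)
    Returns (decisions, thresholds), one entry per frame.
    """
    noise = init_noise
    decisions, thresholds = [], []
    for frame in frames:
        energy = frame_energy(frame)
        threshold = margin * noise
        is_speech = energy > threshold
        if not is_speech:
            # The noise estimate is updated only on non-speech frames,
            # so speech does not inflate the threshold.
            noise = alpha * noise + (1.0 - alpha) * energy
        decisions.append(1 if is_speech else 0)
        thresholds.append(threshold)
    return decisions, thresholds
```

A slowly rising noise floor raises the threshold with it, so a frame that would exceed the initial fixed threshold can still be classified as non-speech. A sudden large jump in noise level is not tracked by this simple rule, which is one reason more elaborate adaptation schemes exist.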

  


Voice Activity Detection @ SmartLab

  System Overview
  Data Collection
  Spatial Averaging
  Band-Pass Filtering
  Linear Prediction Coding
  Voice Activity Detector
  Hang-Over Scheme
  Hang-over scheme (Finite State Machine)
  Hang-over scheme (Markov Model)
  Some Results
  Open Issues

  


System Overview


[Block diagram: mic 1, mic 2, ..., mic N -> Data Collection -> Spatial Averaging -> Band-Pass Filtering -> Linear Prediction Coding -> Voice Activity Detector -> Hang-Over Scheme -> Decision]

Requirements
  A real-time system capable of operating both
    online
    offline
  Far-field microphones are used, since we want our system to be non-intrusive
  Potential speakers move inside the room
    the energy of the captured speech signals therefore varies continuously

    


Data Collection


Data collection is performed with
  a microphone array of 64 microphones with a sampling rate of 22-44 kHz
  16 microphones placed on the walls in an "inverse T" formation


Spatial Averaging


The signals of the employed microphones are properly aligned and averaged in order to
  remove the effect of the room impulse response
  reduce the stochastic measurement noise
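The align-and-average step can be sketched as follows, assuming the relative delay of each microphone is already known and is an integer number of samples. How the delays are estimated in the SmartLab system is not stated on the slide, and the function name is illustrative.

```python
def align_and_average(channels, delays):
    """Delay-compensate each channel, then average across microphones.

    channels : list of equal-length sample lists, one per microphone
    delays   : delays[m] is the number of samples by which the source
               signal arrives late at microphone m (assumed known)
    """
    # Keep only the part of the recording available on every channel.
    n = len(channels[0]) - max(delays)
    aligned = [ch[d:d + n] for ch, d in zip(channels, delays)]
    # Sample-by-sample mean over the aligned channels.
    return [sum(samples) / len(aligned) for samples in zip(*aligned)]
```

If the measurement noise is uncorrelated between microphones, averaging M aligned channels reduces the noise power by roughly a factor of M while leaving the aligned speech component unchanged.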

  

  


Band-Pass Filtering


A Butterworth filter is introduced, which
  isolates the frequencies of interest
  discards the undesired frequency components

The Butterworth filter is chosen for its maximally flat pass band: it does not distort the magnitude of the speech signals within the band
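A minimal pure-Python sketch of such a filter is given below. The deck does not state the filter order or design procedure, so the sketch makes several assumptions: the band edges are the 200 Hz and 3 kHz limits quoted for speech earlier in the deck, and the band-pass response is approximated by cascading a second-order Butterworth high-pass section with a second-order Butterworth low-pass section (standard bilinear-transform biquads with Q = 1/sqrt(2)). This is a stand-in, not a true band-pass Butterworth design.

```python
import math


def butterworth_biquad(kind, cutoff_hz, fs_hz):
    """Second-order Butterworth low-pass or high-pass section.

    Returns (b, a) with a[0] normalised to 1.
    """
    w0 = 2.0 * math.pi * cutoff_hz / fs_hz
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)  # Q = 1/sqrt(2): Butterworth
    if kind == "lowpass":
        b = [(1.0 - cos_w0) / 2.0, 1.0 - cos_w0, (1.0 - cos_w0) / 2.0]
    elif kind == "highpass":
        b = [(1.0 + cos_w0) / 2.0, -(1.0 + cos_w0), (1.0 + cos_w0) / 2.0]
    else:
        raise ValueError("kind must be 'lowpass' or 'highpass'")
    a0 = 1.0 + alpha
    a = [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0]
    return [c / a0 for c in b], a


def apply_biquad(x, b, a):
    """Direct-form I difference equation for one second-order section."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def band_pass(x, fs_hz, low_hz=200.0, high_hz=3000.0):
    """High-pass at low_hz followed by low-pass at high_hz."""
    b_hp, a_hp = butterworth_biquad("highpass", low_hz, fs_hz)
    b_lp, a_lp = butterworth_biquad("lowpass", high_hz, fs_hz)
    return apply_biquad(apply_biquad(x, b_hp, a_hp), b_lp, a_lp)
```

At a sampling rate of 22.05 kHz (an assumed value inside the 22-44 kHz range mentioned for data collection), a 1 kHz tone passes almost unchanged, while tones at 50 Hz and 8 kHz are strongly attenuated. In practice a library routine such as `scipy.signal.butter` with `btype='bandpass'` would normally be used instead of hand-written sections.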

  


Linear Prediction Coding


Linear prediction coding is performed on the band-pass filtered signal in order to
  suppress the non-speech intervals
  transform the speech segments into impulsive (buzz) signals


Voice Activity Detector


Two voice activity detectors have been developed
  the first is a supervised method based on linear discriminant analysis (LDA)
  the second, and most recently developed, is an unsupervised method based on the decision-directed likelihood ratio test


Hang-Over Scheme


To smooth the decisions provided by the detector, we have developed two hang-over schemes
  a finite state machine designed for the supervised (LDA) voice activity detector
  a Markov model developed for the unsupervised voice activity detector

  

  

  

Hang-over scheme (Finite State Machine)


Conditions
  C2: Speech duration (SD) > tsp
  C3: Silence duration (SiD) > tsil
  C4: LDA value > nLDA

Actions (l: length of frame in seconds)
  A1: SiD = SiD + l
  A2: SD = l
  A3: SiD = SiD + SD
  A4: SD = SD + l
  A5: SiD = l
  A6: SD = SiD = 0

  

Hang-over scheme (Markov Model)


[Figure: state diagram of the Markov-model hang-over scheme, with labels VAD=0, VAD=1, D=0 and D=1]

  transition states are drawn with dashed lines
  final states are drawn with solid lines
  notice that to move from
    speech to non-speech, eight consecutive non-speech frames must be detected
    non-speech to speech, three consecutive speech decisions are needed
  the thresholds are chosen heuristically
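The counting behaviour of this chain can be sketched in Python. The sketch assumes that the move from non-speech to speech is triggered by three consecutive speech decisions, that an interrupted streak returns the chain to its final state, and that the output changes on the frame that completes the streak. The exact topology of the original diagram is not recoverable from the slide text, and the function and parameter names are illustrative.

```python
def markov_hang_over(raw, to_speech=3, to_silence=8, start=0):
    """Smooth raw detector decisions D into hang-over decisions VAD.

    raw        : per-frame decisions (1 = speech, 0 = non-speech)
    to_speech  : consecutive speech frames needed to leave non-speech
    to_silence : consecutive non-speech frames needed to leave speech
    start      : initial final state (assumed to be non-speech)
    """
    state = start   # current final state: 1 = speech, 0 = non-speech
    count = 0       # position along the chain of transition states
    out = []
    for d in raw:
        if d == state:
            # The contradicting streak is broken: back to the final state.
            count = 0
        else:
            count += 1
            needed = to_silence if state == 1 else to_speech
            if count == needed:
                state, count = d, 0
        out.append(state)
    return out
```

The asymmetric thresholds make the scheme quick to enter speech and slow to leave it, so short pauses inside an utterance are kept as speech.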

  

Some Results


[Figure: detector decisions plotted against time index (0 to 2 x 10^5) for three detectors: Multiple Observations LRT, Decision Directed LRT, Generalized LRT]

Raw decisions from several unsupervised voice activity detectors based on the likelihood ratio test

  

  

Some Results


Method         LDA Thr.   En. Thr.   MR       SDER     NDER     ADER     Wpeps
LDA            4.9        -          10.09%   10.40%   8.62%    9.51%    0.09
Ad. En. Thr.   -          -          18.10%   18.40%   15.60%   17.00%   0.08
FSM+LDA        4.9        -          9.94%    10.19%   8.65%    9.42%    0.08
FSM+En.        -          0.043      17.28%   17.69%   14.63%   16.16%   0.08
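The slide does not define the column acronyms. One reading that the numbers support is that ADER is the arithmetic mean of SDER and NDER. The short check below tests that assumption against the four rows of the table; it says nothing about how the individual error rates were computed.

```python
RESULTS = {
    # method: (SDER %, NDER %, ADER % as printed in the table)
    "LDA": (10.40, 8.62, 9.51),
    "Ad. En. Thr.": (18.40, 15.60, 17.00),
    "FSM+LDA": (10.19, 8.65, 9.42),
    "FSM+En.": (17.69, 14.63, 16.16),
}


def mean_error(sder, nder):
    """Arithmetic mean of the two error rates (assumed definition of ADER)."""
    return (sder + nder) / 2.0


def max_deviation(results):
    """Largest gap between the assumed mean and the printed ADER value."""
    return max(abs(mean_error(s, n) - a) for s, n, a in results.values())
```

For all four methods the mean of SDER and NDER agrees with the printed ADER to within rounding of the second decimal place.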

  Open Issues


Issues that need to be examined and addressed are
  the performance of supervised detectors when the training set and the testing set are derived from different speakers
  the substitution of spatial averaging with beamforming
  the incorporation of an adaptive threshold
  the evaluation of voice activity detectors based on hidden Markov models

  

  


Thank You ...