Voice Activity Detection Identifying Speech Segments.pdf


Voice Activity Detection: Identifying Speech Segments within Audio Recordings

Dr. Christos Boukis

Autonomic and Grid Computing Group

Athens Information Technology

email: [email protected]

October 27, 2006



Need for Voice Activity Detection

Need for Voice Activity Detection
Audio Processing Technologies
Benefits of Voice Activity Detection
Voice Activity Detection Fundamentals
Performance Enhancement
Voice Activity Detection @ SmartLab




Need for Voice Activity Detection




Voice-controlled systems employ one or more audio sensors (microphones) and
  capture audio signals
  recognise verbal commands
  transform them into computer-recognisable commands
  execute the associated actions
A common component of all these systems is the voice activity detection pre-processing step, during which
  the presence or absence of speech within the captured audio signals is detected
  audio samples are separated into speech and non-speech segments



Benefits of Voice Activity Detection




The benefits of introducing a voice activity detection block into a voice command system are manifold
  reduction of computational requirements
  improvement of the efficiency of the overlying system
Other applications of voice activity detection are
  speech coding
  optimisation of bandwidth use in mobile communications
  security / surveillance
Voice activity detection can be applied to
  online systems, where instant decisions about the captured audio signals are required
  offline systems, which process recorded audio signals in order to extract speech segments




Voice Activity Detection Fundamentals

Human Hearing
Human Speech
Artificial Voice Activity Detection
Voice Activity Detection Approaches
Supervised methods
Linear Discriminant Analysis
LDA applied to Voice Activity Detection
Separability of Speech/Non-speech data
Unsupervised methods
Voice Activity Detection with a Likelihood Ratio Test
Hypotheses
Likelihood Ratio
LRT methods
multiple observations




Human Hearing




When humans hear a sound, they can tell whether it is human speech or not by using
  audio information (auditory system): the main source
  visual information (visual system): a complementary source
Moreover, additional information is extracted regarding
  the gender of the speaker
  the age of the speaker
  the location of the speaker relative to the listener
  his/her distance
  his/her identity, if the speaker is known
How do we do it?



Human Hearing




The detection of speech by humans
  is not inherent
    ⇒ infants recognise their mother's voice but require 14 days to get used to their father's voice
    ⇒ they identify babbling as noise
  training of the human auditory system is required in order to identify sounds, including speech
  the exact procedure is not known



Human Speech




[Figure: a speech signal in the time domain (sound pressure vs. time in seconds) and its spectrum (magnitude in dB vs. normalized frequency, ×π rad/sample)]

Human speech is distinguishable since it has specific
  statistical properties
  frequency content
  periodicity, etc.



Artificial Voice Activity Detection




Artificial voice activity detection systems can be classified into two major categories
  Supervised methods, which use a priori information to train the system and thus optimise its performance
    linear discriminant analysis (LDA)
    hidden Markov models (HMMs)
    neural networks (NNs)
  Unsupervised methods, which compute a predefined statistical measure and compare it to a threshold in order to decide whether the captured signal is speech or not
    maximum likelihood ratio (MLR) criterion algorithms
    periodicity-based algorithms
All these algorithms operate on a frame-by-frame basis using close-talking microphones



Voice Activity detection Approaches




                       Supervised techniques       Unsupervised techniques
training data          required                    not required
misclassifications     fewer                       more
fine-tuning            difficult                   simple
noisy conditions       similar to training data    any
application dependent  yes                         no



Supervised methods




Supervised classification techniques operate in two modes:
⇒ Training mode: the provided training data are used to optimise the parameters of the system
⇒ Testing or decision mode: decisions are made using the optimised parameters derived in training mode





Linear Discriminant Analysis


LDA is a classification technique; that is, it looks for directions that are efficient for discriminating between data in different classes
  it projects the data into a space with fewer dimensions
    ⇒ it reduces the dimensionality of the problem
  it looks for a linear transformation that increases the separability of the data
  it employs a threshold in the reduced-dimensionality space to make decisions
If the original distributions are multimodal and highly overlapping, even the best direction is unlikely to provide adequate separation





Linear Discriminant Analysis


For the two-class problem this linear transformation is found by maximising the Fisher linear discriminant

$$J(\mathbf{w}) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$

where
  $\mathbf{m}_i = \frac{1}{n_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}$ is the mean value of the i-th class
  $\tilde{m}_i = \mathbf{w}^t\mathbf{m}_i$ is the projected mean value of the i-th class
  $\tilde{s}_i^2 = \sum_{y\in Y_i}(y - \tilde{m}_i)^2$ is the scatter of the i-th class
Thus we look for a direction $\mathbf{w}$ that
  increases the distance between the projected means
  reduces the standard deviation of each projected class





LDA applied to Voice Activity Detection


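The Fisher linear discriminant can also be expressed as

$$J(\mathbf{w}) = \frac{\mathbf{w}^t S_B \mathbf{w}}{\mathbf{w}^t S_W \mathbf{w}}$$

where
  $S_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x}-\mathbf{m}_i)(\mathbf{x}-\mathbf{m}_i)^t$ are the scatter matrices of the classes
  $S_W = S_1 + S_2$ is the within-class scatter matrix
  $S_B = (\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^t$ is the between-class scatter matrix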
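As a rough illustration of the criterion above, the sketch below computes a Fisher direction for two-dimensional features. It relies on the standard closed-form maximiser w ∝ S_W^{-1}(m_1 − m_2), which the slides do not state explicitly. The toy feature vectors, the function names and the midpoint threshold are assumptions made for the example only.

```python
def mean(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(2)]

def scatter(points, m):
    # S_i = sum over the class of (x - m_i)(x - m_i)^t, as a 2x2 nested list
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for r in range(2):
            for c in range(2):
                s[r][c] += d[r] * d[c]
    return s

def fisher_direction(class1, class2):
    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    # Within-class scatter S_W = S_1 + S_2
    sw = [[s1[r][c] + s2[r][c] for c in range(2)] for r in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    dm = [m1[0] - m2[0], m1[1] - m2[1]]
    # w = S_W^{-1} (m_1 - m_2)
    w = [inv[0][0] * dm[0] + inv[0][1] * dm[1],
         inv[1][0] * dm[0] + inv[1][1] * dm[1]]
    return w, m1, m2

def project(w, x):
    return w[0] * x[0] + w[1] * x[1]

# Hypothetical 2-D feature vectors, one per frame (not data from the talk)
speech = [(2.0, 2.1), (2.5, 1.9), (3.0, 2.4), (2.2, 2.6)]
noise = [(0.1, 1.8), (0.4, 2.2), (0.0, 2.5), (0.3, 2.0)]

w, m1, m2 = fisher_direction(speech, noise)
# Assumed decision rule: threshold halfway between the projected class means
threshold = 0.5 * (project(w, m1) + project(w, m2))
labels = [project(w, x) > threshold for x in speech + noise]
```

In practice the feature vectors would have more than two dimensions and the threshold would be tuned on training data; the 2x2 inverse is used here only to keep the example free of dependencies.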



Separability of Speech/Non-speech data




Observe the energy values and the LDA projected values of audio signals

(speech and non-speech) captured with far-field microphones

[Figure: histograms of speech and non-speech frames. (a) Separation with Energy: number of frames vs. energy values. (b) Separation with LDA: number of frames vs. LDA projected value.]

Conclusions
  using a threshold value ≈ 10 we can separate the data based on the LDA projected values
  this is not possible using the energy of the captured signals
  misclassifications still exist but are significantly fewer





Unsupervised methods


Typical statistical measures that unsupervised methods use to discriminate between speech and non-speech signals are
  periodicity
  zero crossings
  energy
  etc.
Modern unsupervised methods rely on the likelihood ratio test (LRT) and include
  soft decision LRT
  decision directed LRT
  multiple observation LRT
Components:
1. decision rule, a quantity that measures the difference between the noise statistics and the observed signal statistics
2. decision threshold, to which the decision rule is compared
3. noise statistics estimation scheme, which derives the dynamics of the background noise
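Two of the classical measures listed above, short-time energy and zero crossings, can be sketched per frame as follows. The frame length, the sampling rate, the synthetic test signal and the fixed energy threshold are all assumptions for illustration; they do not come from the talk.

```python
import math

def frame_signal(samples, frame_len):
    # Split into non-overlapping frames, dropping an incomplete last frame
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def short_time_energy(frame):
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose signs differ
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

fs = 8000  # assumed sampling rate in Hz
tone = [0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(400)]
silence = [0.0] * 400
frames = frame_signal(silence + tone, 200)

energy_threshold = 0.01  # hypothetical fixed decision threshold
decisions = [short_time_energy(f) > energy_threshold for f in frames]
```

Here the decision rule is the frame energy and the threshold is fixed, so the third component, the noise statistics estimation scheme, is absent.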



Voice Activity Detection with a Likelihood Ratio Test




S, N, X are
  the L-dimensional coefficient vectors of speech, noise and noisy speech
  obtained by the DFT of the captured audio signals
Their variances are given by

$$\lambda_N(k) = S_N(2\pi k/L)$$
$$\lambda_S(k) = S_S(2\pi k/L)$$
$$\sigma(k) = \lambda_N(k) + \lambda_S(k)$$

where $S_S(\omega)$ and $S_N(\omega)$ are the true power spectra of speech and noise respectively
Assumptions
  speech and noise are Gaussian random processes
  they are independent of each other



Hypotheses




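The two hypotheses of the voice activity detection problem are

$$H_0: \text{speech absent}: \quad X = N$$
$$H_1: \text{speech present}: \quad X = S + N$$

Joint probability density functions

$$p(X|H_0) = \prod_{k=0}^{L-1} \frac{1}{\pi\lambda_N(k)} \exp\left\{-\frac{|X_k|^2}{\lambda_N(k)}\right\}$$

$$p(X|H_1) = \prod_{k=0}^{L-1} \frac{1}{\pi[\lambda_N(k)+\lambda_S(k)]} \exp\left\{-\frac{|X_k|^2}{\lambda_N(k)+\lambda_S(k)}\right\}$$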
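Dividing the two densities bin by bin gives a likelihood ratio that depends only on the a priori SNR ξ_k = λ_S(k)/λ_N(k) and the a posteriori SNR γ_k = |X_k|²/λ_N(k): log Λ_k = γ_k ξ_k/(1 + ξ_k) − log(1 + ξ_k). The sketch below checks this closed form against the densities evaluated directly. The spectral values and variances are made-up numbers, and averaging over the L bins is one common normalisation, assumed here rather than taken from the slides.

```python
import math

def log_pdf(X, variances):
    # log of prod_k 1/(pi * v_k) * exp(-|X_k|^2 / v_k)
    return sum(-math.log(math.pi * v) - abs(x) ** 2 / v
               for x, v in zip(X, variances))

def log_likelihood_ratio(X, lam_n, lam_s):
    total = 0.0
    for x, ln, ls in zip(X, lam_n, lam_s):
        xi = ls / ln                 # a priori SNR
        gamma = abs(x) ** 2 / ln     # a posteriori SNR
        total += gamma * xi / (1.0 + xi) - math.log(1.0 + xi)
    return total / len(X)            # assumed normalisation by the number of bins

# Hypothetical DFT coefficients and variances for L = 3 bins
X = [complex(0.3, -1.2), complex(2.0, 0.5), complex(-0.1, 0.2)]
lam_n = [1.0, 0.5, 0.8]
lam_s = [2.0, 3.0, 0.1]

direct = (log_pdf(X, [n + s for n, s in zip(lam_n, lam_s)])
          - log_pdf(X, lam_n)) / len(X)
closed = log_likelihood_ratio(X, lam_n, lam_s)
```

A detector would then declare speech when this value exceeds a decision threshold, which has to be chosen separately.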



LRT methods




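Depending on how the a priori signal-to-noise ratio $\xi_k$ is computed, LRT methods use
  the generalised (maximum likelihood) estimator

$$\hat{\xi}_k^{(ML)} = \gamma_k - 1$$

  the decision directed estimation

$$\hat{\xi}_k^{(DD)}(n) = \alpha\,\frac{A_k^2(n-1)}{\lambda_N(k, n-1)} + (1-\alpha)\,P[\gamma_k(n) - 1]$$

where
  $\gamma_k = |X_k|^2/\lambda_N(k)$ is the a posteriori signal-to-noise ratio
  $P[x] = x$ if $x \ge 0$ and $P[x] = 0$ otherwise
  $A_k(n-1)$ is the signal amplitude estimate of the previous frame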
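The two estimators can be written as one-line functions for a single frequency bin. In this sketch the previous amplitude estimate and noise variance are passed in as plain numbers, because the way they are obtained is not covered here. The default α = 0.98 is a commonly used value and an assumption, not a figure from the slides.

```python
def ml_xi(gamma):
    # Generalised (maximum likelihood) estimate of the a priori SNR
    return gamma - 1.0

def decision_directed_xi(prev_amp_sq, prev_noise_var, gamma, alpha=0.98):
    # alpha weights the previous-frame estimate A_k^2(n-1) / lambda_N(k, n-1);
    # P[.] clips negative values of (gamma - 1) to zero
    return (alpha * prev_amp_sq / prev_noise_var
            + (1.0 - alpha) * max(gamma - 1.0, 0.0))

# Hypothetical numbers for one bin
xi_ml = ml_xi(3.0)
xi_dd = decision_directed_xi(prev_amp_sq=4.0, prev_noise_var=1.0,
                             gamma=3.0, alpha=0.9)
xi_dd_low = decision_directed_xi(prev_amp_sq=4.0, prev_noise_var=1.0,
                                 gamma=0.5, alpha=0.9)
```

With α close to 1 the decision directed estimate changes slowly from frame to frame, which smooths the resulting likelihood ratio compared with the maximum likelihood estimate.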





Multiple observations


If multiple frames $X(n-m), \ldots, X(n-1), X(n), X(n+1), \ldots, X(n+m)$ are used instead of a single frame $X$, then we can decide based on the measure

$$L(n+1) = L(n) - \log\Lambda(n-m) + \log\Lambda(n+m+1)$$

The multiple observation LRT
  is more robust than single observation techniques
  as the number of observations increases
    the non-speech variance decreases
    the speech distribution is shifted to the right ⇒ it is better separated from the non-speech distribution
  achieves optimum performance for m = 6
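The recursion above is a sliding-window sum: it is consistent with L(n) being the sum of log Λ(j) over the 2m+1 frames j = n−m, ..., n+m. The sketch below implements that reading. The per-frame log likelihood ratios are made-up numbers and the function name is hypothetical.

```python
def multiple_observation_llr(log_lambda, m):
    """Return L(n) for every n that has a full window of 2m+1 frames."""
    window = 2 * m + 1
    if len(log_lambda) < window:
        return []
    current = sum(log_lambda[:window])   # L(m), computed directly once
    out = [current]
    for n in range(m, len(log_lambda) - m - 1):
        # L(n+1) = L(n) - log Lambda(n-m) + log Lambda(n+m+1)
        current = current - log_lambda[n - m] + log_lambda[n + m + 1]
        out.append(current)
    return out

# Hypothetical per-frame log likelihood ratios
log_lambda = [0.1, -0.2, 0.5, 1.0, 0.7, -0.3, 0.2]
smoothed = multiple_observation_llr(log_lambda, m=1)
brute_force = [sum(log_lambda[i:i + 3]) for i in range(len(log_lambda) - 2)]
```

Each update costs two additions regardless of m, but the decision for frame n needs the m following frames, so an online system incurs a delay of m frames.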





Performance Enhancement

Problems of Voice Activity Detection
VAD Improvement Techniques
Hang-Over Schemes
Linear Prediction Coding
Band-Pass Filtering
Adaptive Thresholding




Problems of Voice Activity Detection




Voice activity detection systems sometimes provide faulty decisions since
  they are sensitive to impulsive sounds
    hand clapping
    coughing
    knocking
    etc.
  they under-perform in highly varying environments
  they fail under extremely noisy conditions
  when far-field microphones are employed, their performance degrades with the distance from the microphones
Moreover, their behaviour, in terms of precision, depends on the employed frame size
  large: robust estimate but low precision
  small: high precision but mis-triggering



VAD Improvement Techniques




Techniques used for the improvement of the performance of a voice activity

detector include

hang-over schemes

linear prediction coding

band-pass filtering

adaptive thresholding




Hang-Over Schemes




A hang-over scheme is a post-processing system that is applied to the raw decisions of a voice activity detector
Its objective is to prevent the misclassification of
  sharp impulsive sounds as speech
  speech pauses as silence
The fundamental idea is to impose time thresholds t_sil, t_sp such that
  silence intervals of duration less than t_sil are classified as speech pauses and thus as speech
  speech intervals of duration shorter than t_sp are considered to be impulsive sounds and not speech
Hang-over schemes can be implemented with
  Markov models
  finite state machines
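The two duration rules can be illustrated offline with run-length smoothing of the raw decisions. This is a simplified sketch, not one of the Markov model or finite state machine implementations named above. Thresholds are expressed in frames, the values in the example are hypothetical, and filling short pauses before removing short bursts is an assumed ordering.

```python
from itertools import groupby

def runs(decisions):
    # Run-length encode, e.g. [0, 0, 1] -> [[0, 2], [1, 1]]
    return [[value, len(list(group))] for value, group in groupby(decisions)]

def expand(run_list):
    out = []
    for value, length in run_list:
        out.extend([value] * length)
    return out

def relabel_short_runs(decisions, label, min_frames, interior_only):
    run_list = runs(decisions)
    for i, (value, length) in enumerate(run_list):
        interior = 0 < i < len(run_list) - 1
        if value == label and length < min_frames and (interior or not interior_only):
            run_list[i][0] = 1 - label
    return expand(run_list)

def hang_over(decisions, t_sil_frames, t_sp_frames):
    # 1) Silence shorter than t_sil between speech runs is a speech pause
    filled = relabel_short_runs(decisions, 0, t_sil_frames, interior_only=True)
    # 2) Speech shorter than t_sp is treated as an impulsive sound
    return relabel_short_runs(filled, 1, t_sp_frames, interior_only=False)

raw = [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0]
smoothed = hang_over(raw, t_sil_frames=2, t_sp_frames=3)
```

In the example the isolated speech frame near the start is removed and the one-frame gap inside the speech segment is filled. An online system cannot look ahead in this way, which is one reason to use a state machine instead.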






Linear Prediction Coding


Linear prediction coding
  estimates, with a least squares approach, an autoregressive model that simulates the vocal tract
  with this autoregressive model the buzz (train of impulses) produced by the glottis can be estimated
It has been extensively used in speech applications like
  detection of glottal closure instants in voiced speech
  pitch extraction
When applied as a pre-processing step, it significantly improves the performance of voice activity detection
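A minimal sketch of the idea, assuming the autocorrelation method solved with the Levinson-Durbin recursion (one common least squares formulation, not necessarily the one used by the author). A synthetic second-order autoregressive "vocal tract" is excited by an impulse train, and inverse filtering with the estimated model recovers the impulses. The model coefficients, the pulse spacing and the function names are made up for the example.

```python
def autocorrelation(x, max_lag):
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a, err) with a[0] = 1, so that e[n] = sum_k a[k] * x[n-k]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                    # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)              # remaining prediction error energy
    return a, err

def residual(x, a):
    # Inverse filtering: the prediction error approximates the excitation
    p = len(a) - 1
    return [sum(a[k] * x[n - k] for k in range(p + 1) if n - k >= 0)
            for n in range(len(x))]

# Synthetic signal: impulse train through x[n] = 1.3 x[n-1] - 0.6 x[n-2] + e[n]
excitation = [1.0 if n % 80 == 0 else 0.0 for n in range(800)]
x = []
for n, e in enumerate(excitation):
    x1 = x[n - 1] if n >= 1 else 0.0
    x2 = x[n - 2] if n >= 2 else 0.0
    x.append(1.3 * x1 - 0.6 * x2 + e)

a, err = levinson_durbin(autocorrelation(x, 2), 2)
e_hat = residual(x, a)
```

Real speech needs a higher model order and short, windowed analysis frames; the example only shows that the residual concentrates the energy at the excitation instants.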




Band-Pass filtering




The frequency content of human speech is mainly contained in the range from 200 Hz to 3 kHz
Microphones return audio signals that have a significantly wider range
A band-pass filter that isolates the speech frequencies
  improves the performance
  reduces the computational requirements






Adaptive Thresholding


Voice activity detection systems employ a statistical measure and a threshold to decide whether a received/recorded audio signal contains speech. This threshold
  reflects the dynamics of the background noise
  is time-invariant
  is set either heuristically or with the use of a priori information
To enable the detector to perform satisfactorily in environments with varying noise statistics, adaptive thresholding can be employed
  geometrically adaptive energy threshold
  high order statistics adaptive threshold
  gradient adaptive threshold
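The general idea of an adaptive threshold can be sketched with a recursively tracked noise floor. This is a generic illustration and not an implementation of any of the three named schemes. The margin, the smoothing factor β, the initial noise level and the frame energies are assumptions.

```python
def adaptive_energy_vad(energies, init_noise, margin=3.0, beta=0.9):
    noise = init_noise
    decisions = []
    for e in energies:
        threshold = margin * noise            # threshold follows the noise floor
        is_speech = e > threshold
        if not is_speech:
            # Update the noise estimate only on frames classified as non-speech
            noise = beta * noise + (1.0 - beta) * e
        decisions.append(is_speech)
    return decisions

# Hypothetical frame energies: quiet room, speech, louder background,
# a moderately loud disturbance, then speech again
energies = [1.0] * 5 + [10.0] * 3 + [2.0] * 20 + [5.0] * 3 + [10.0] * 2
decisions = adaptive_energy_vad(energies, init_noise=1.0)
```

With a fixed threshold of 3.0 the frames of energy 5.0 would be labelled as speech, whereas here the threshold has already risen with the background level. A known weakness of this simple tracker is that a sudden large jump in noise is classified as speech, which freezes the noise estimate.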



Voice Activity Detection @ SmartLab


System Overview
Data Collection
Spatial Averaging
Band-Pass Filtering
Linear Prediction Coding
Voice Activity Detector
Hang-Over Scheme
Hang-over scheme (Finite State Machine)
Hang-over scheme (Markov Model)
Some Results
Open Issues



System Overview




[Block diagram: mic 1, mic 2, ..., mic N → Data Collection → Spatial Averaging → Band-Pass Filtering → Linear Prediction Coding → Voice Activity Detector → Hang-Over Scheme → Decision]

Requirements
  a real-time system capable of operating both
    online
    offline
  far-field microphones are used since we want our system to be non-intrusive
  potential speakers move inside the room
    the energy of the captured speech signals varies continuously

Data Collection

Data collection is performed with
  a microphone array consisting of 64 microphones, with a sampling rate of 22-44 kHz
  16 microphones placed on the walls in an "inverse T" formation

Spatial Averaging

The signals of the employed microphones are properly aligned and averaged in order to
  remove the effect of the room impulse response
  reduce the stochastic measurement noise
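Alignment followed by averaging can be sketched as below, assuming the per-channel delays are already known and are whole numbers of samples. Estimating the delays is a separate problem that is not addressed here. The function name and the toy signals are hypothetical.

```python
def align_and_average(channels, delays):
    """channels: equal-length sample lists; delays[i]: lag of channel i in samples."""
    usable = len(channels[0]) - max(delays)
    out = []
    for n in range(usable):
        # Advance each channel by its own delay so the source lines up, then average
        out.append(sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels))
    return out

source = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0]
mic_near = source[:]                 # arrives with no delay
mic_far = [0.0, 0.0] + source[:-2]   # same waveform, two samples later
averaged = align_and_average([mic_near, mic_far], delays=[0, 2])
```

When the channels carry independent zero-mean noise, averaging N aligned channels reduces the noise variance by roughly a factor of N while the aligned source adds coherently.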

Band-Pass Filtering

A Butterworth filter is introduced which
  isolates the frequencies of interest
  disregards the undesired frequency components
The Butterworth filter is chosen due to its maximally flat pass band: it introduces no pass-band ripple and therefore does not distort the magnitude of the speech components
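One way to approximate such a filter without external libraries is to cascade a second-order Butterworth high-pass and a second-order Butterworth low-pass section. The sketch below uses the widely published biquad formulas obtained through the bilinear transform with Q = 1/√2. The cut-off frequencies of 200 Hz and 3 kHz, the 22.05 kHz sampling rate and the filter order are assumptions; the filter order and design method used at SmartLab are not given.

```python
import math

def butterworth_biquad(kind, cutoff_hz, fs):
    w0 = 2.0 * math.pi * cutoff_hz / fs
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)    # sin(w0) / (2Q) with Q = 1/sqrt(2)
    if kind == "lowpass":
        b = [(1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2]
    else:  # "highpass"
        b = [(1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2]
    a0 = 1 + alpha
    a = [1.0, -2 * cos_w0 / a0, (1 - alpha) / a0]
    return [c / a0 for c in b], a

def apply_biquad(b, a, x):
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def band_pass(x, fs, low_hz=200.0, high_hz=3000.0):
    b, a = butterworth_biquad("highpass", low_hz, fs)
    y = apply_biquad(b, a, x)
    b, a = butterworth_biquad("lowpass", high_hz, fs)
    return apply_biquad(b, a, y)

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

fs = 22050  # assumed sampling rate in Hz

def tone(freq_hz):
    return [math.sin(2 * math.pi * freq_hz * n / fs) for n in range(4000)]

# Steady-state gain at three test frequencies (start-up transient skipped)
gain = {f: rms(band_pass(tone(f), fs)[2000:]) / rms(tone(f)[2000:])
        for f in (50, 1000, 9000)}
```

The 1 kHz tone passes almost unchanged while the 50 Hz and 9 kHz tones are strongly attenuated. A higher filter order would give steeper transitions at the band edges.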






Linear Prediction Coding

Linear prediction coding is performed on the band-pass filtered signal in order to
  suppress the non-speech intervals
  transform the speech segments into impulsive (buzz) signals






Voice Activity Detector

Two voice activity detectors have been developed
  the first is a supervised method based on LDA
  the second, and most recently developed, is an unsupervised method based on the decision directed likelihood ratio test

Hang-Over Scheme

To smooth the decisions provided by the detector we have developed two hang-over schemes
  a finite state machine designed for the supervised (LDA) voice activity detector
  a Markov model developed for the unsupervised voice activity detector

Hang-over scheme (Finite State Machine)


Conditions
  C2: speech duration (SD) > t_sp
  C3: silence duration (SiD) > t_sil
  C4: LDA value > n_LDA
Actions (l: length of frame in seconds)
  A1: SiD = SiD + l
  A2: SD = l
  A3: SiD = SiD + SD
  A4: SD = SD + l
  A5: SiD = l
  A6: SD = SiD = 0

Hang-over scheme (Markov Model)


[State diagram: Markov model whose states are labelled with the output VAD=0 or VAD=1 and whose transitions are labelled with the raw decision D=0 or D=1]

The transition states are displayed with dashed lines
The final states are displayed with solid lines
Notice that to move from
  speech to non-speech, eight consecutive non-speech frames must be detected
  non-speech to speech, three consecutive speech decisions are needed
The thresholds are heuristically chosen
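The behaviour described above can be approximated with a single counter that plays the role of the chain of transition states. This is a simplified sketch; the exact topology of the Markov model in the slide, including when the output switches, is not recoverable from the text. The function and parameter names are hypothetical.

```python
def markov_hang_over(raw, to_speech=3, to_silence=8):
    state = 0      # final states: 0 = non-speech, 1 = speech
    streak = 0     # how far along the chain of transition states we are
    out = []
    for d in raw:
        if d != state:
            streak += 1
            needed = to_speech if state == 0 else to_silence
            if streak >= needed:
                state, streak = d, 0   # enough consecutive decisions: switch
        else:
            streak = 0                 # a contradicting decision returns to the final state
        out.append(state)
    return out

raw = [1, 1, 0, 1, 1, 1, 0, 0, 0, 1] + [0] * 8
smoothed = markov_hang_over(raw)
```

In the example the first two speech decisions are ignored because the third one does not follow immediately, and the three non-speech decisions inside the speech segment do not end it.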

Some Results






[Figure: decision vs. time index for the Multiple Observations LRT, Decision Directed LRT and Generalized LRT detectors]
Raw decisions from several likelihood ratio test unsupervised voice activity detectors

Some Results




Method        LDA Thr.  En. Thr.  MR      SDER    NDER    ADER    Wpeps
LDA           4.9       -         10.09%  10.40%  8.62%   9.51%   0.09
Ad. En. Thr.  -         -         18.10%  18.40%  15.60%  17.00%  0.08
FSM+LDA       4.9       -         9.94%   10.19%  8.65%   9.42%   0.08
FSM+En.       -         0.043     17.28%  17.69%  14.63%  16.16%  0.08

Open Issues






Issues that need to be examined and addressed are
  performance of supervised detectors when the training set and the testing set are derived from different speakers
  substitution of spatial averaging with beamforming
  incorporation of an adaptive threshold
  evaluation of voice activity detectors based on hidden Markov models




Thank You ...
