Leakage Problems in Array Speech Processing Julien Bourgeois Martigny - September 2003


Leakage Problems in Array Speech Processing

Julien Bourgeois

Martigny - September 2003

Context of the work

Several simultaneous speakers (sources), spatially located: s1(t), s2(t)

Road noise, spatially diffuse

Microphone array: gets mixtures of the sources and noise, x1(t), ..., x4(t)

Array processor: recovers clean individual speech flows (separates and denoises the sources)

Beamforming

Beamforming: minimization of the output power with unit gain in the direction of arrival (DOA) of the target

+ robust against noise; sources do not have to be active

- array geometry and target location must be known, and the target must be in the far field
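Minimizing the output power under a unit-gain constraint reads like the standard minimum-variance distortionless-response (MVDR) formulation. As a sketch, with symbols that are assumptions and not taken from the slides (w the weight vector, R_xx the covariance of the microphone signals, d(theta_T) the far-field steering vector toward the target DOA):

```latex
\begin{aligned}
\min_{\mathbf{w}} \;& \mathbf{w}^H \mathbf{R}_{xx}\, \mathbf{w}
\quad \text{subject to} \quad \mathbf{w}^H \mathbf{d}(\theta_T) = 1, \\[4pt]
\mathbf{w}_{\mathrm{opt}} &=
\frac{\mathbf{R}_{xx}^{-1}\, \mathbf{d}(\theta_T)}
     {\mathbf{d}(\theta_T)^H\, \mathbf{R}_{xx}^{-1}\, \mathbf{d}(\theta_T)} .
\end{aligned}
```

Whether the talk used this closed form or an adaptive implementation of the same criterion is not stated.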

Leakage Problem (Beamforming)

With an echo or a source location error, the source signal arrives from a direction other than the constrained DOA.

The beamformer can then produce a zero output, and it does, because a zero output minimizes the output power.

[Diagram: microphones x1 ... xN; labels: +1, -1 (constraint), 0 (output)]

In a reverberant environment, or with a target location error, beamforming can cancel the target signal.
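The cancellation can be reproduced in a small narrowband simulation. This is a minimal sketch under assumptions that are not in the slides: a half-wavelength-spaced linear array, spatially white noise instead of diffuse road noise, and hypothetical power levels and function names. The weights are adapted while the target is active, once with the exact DOA and once with a 10 degree error.

```python
import numpy as np

def steering_vector(theta_deg, n_mics):
    """Far-field steering vector of a half-wavelength-spaced linear array (assumed geometry)."""
    m = np.arange(n_mics)
    return np.exp(-1j * np.pi * m * np.sin(np.deg2rad(theta_deg)))

def min_power_weights(R, d):
    """Weights minimizing w^H R w subject to w^H d = 1."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

def target_gain(true_doa_deg, assumed_doa_deg, n_mics=4,
                target_power=100.0, noise_power=1.0):
    """Magnitude of the beamformer response in the true target direction."""
    d_true = steering_vector(true_doa_deg, n_mics)
    d_assumed = steering_vector(assumed_doa_deg, n_mics)
    # Covariance while the target is active: target plus spatially white noise.
    R = target_power * np.outer(d_true, d_true.conj()) + noise_power * np.eye(n_mics)
    w = min_power_weights(R, d_assumed)
    return abs(w.conj() @ d_true)

gain_exact = target_gain(true_doa_deg=0.0, assumed_doa_deg=0.0)
gain_mismatch = target_gain(true_doa_deg=10.0, assumed_doa_deg=0.0)
print(f"target gain with exact DOA:       {gain_exact:.3f}")
print(f"target gain with 10 degree error: {gain_mismatch:.4f}")
```

With the exact DOA the constraint protects the target. With the error, the constraint is still met in the assumed direction, but the strong target is treated as an interferer and is suppressed.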

Solution to the Leakage Problem

Do not adapt the beamformer when the target is active (the speaker is speaking).

With the constraint, good behavior should be preserved for the target.

When the target is off, minimizing the output power cancels the noise sources.

[Diagram: microphones x1 ... xN with the target silent ("Do not speak"); labels: +1 (constraint), 0]

A beamformer needs a voice activity detector (VAD) to control its adaptation.
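One way to let a VAD control the adaptation is to gate the update of the statistics from which the weights are computed. The sketch below is an assumption about how this could be done, not the method of the talk: a recursive covariance estimate with a hypothetical forgetting factor, frozen whenever the VAD reports target activity.

```python
import numpy as np

def update_covariance(R, x, target_active, forgetting=0.95):
    """Recursive covariance update that is frozen while the target speaks."""
    if target_active:
        return R  # freeze: keep target energy out of the noise statistics
    return forgetting * R + (1.0 - forgetting) * np.outer(x, x.conj())

rng = np.random.default_rng(0)
n_mics = 4
R = np.eye(n_mics, dtype=complex)
snapshot = rng.standard_normal(n_mics) + 1j * rng.standard_normal(n_mics)

R_frozen = update_covariance(R, snapshot, target_active=True)
R_adapted = update_covariance(R, snapshot, target_active=False)
```

Only `R_adapted` changes, so weights derived from the covariance keep tracking the noise while the target is silent and stay fixed while it speaks.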


VAD in an unknown noise field

M = 4 microphones

Estimate the target power PT with a delay-sum beam.

Estimate the noise power PN with M-1 orthogonal beams.

Voice activity detector: VAD(t) = PN(t) / PT(t) (frame-wise)
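The power ratio above can be sketched for one frequency bin as follows. Several details are assumptions, since the slides do not give them: the orthogonal beams are taken as an orthonormal basis of the complement of the steering vector, all beams have unit norm, PN is averaged over the M-1 beams, and the array is a half-wavelength-spaced linear array.

```python
import numpy as np

def vad_ratio(frame, d, eps=1e-12):
    """PN / PT for one frame of narrowband snapshots, shape (n_mics, n_snapshots)."""
    n_mics = d.shape[0]
    # Left singular vectors: the first is parallel to d (delay-sum beam),
    # the remaining n_mics - 1 span the orthogonal complement of d.
    U, _, _ = np.linalg.svd(d.reshape(n_mics, 1), full_matrices=True)
    target_beam = U[:, :1]
    noise_beams = U[:, 1:]
    PT = np.mean(np.abs(target_beam.conj().T @ frame) ** 2)
    PN = np.mean(np.abs(noise_beams.conj().T @ frame) ** 2)
    return PN / (PT + eps)

rng = np.random.default_rng(0)
n_mics, n_snapshots = 4, 2000

def complex_noise(shape):
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

# Steering vector toward the assumed target DOA (20 degrees, hypothetical).
d = np.exp(-1j * np.pi * np.arange(n_mics) * np.sin(np.deg2rad(20.0)))

frame_noise = complex_noise((n_mics, n_snapshots))
frame_speech = np.outer(d, complex_noise(n_snapshots)) \
    + 0.1 * complex_noise((n_mics, n_snapshots))

ratio_noise = vad_ratio(frame_noise, d)
ratio_speech = vad_ratio(frame_speech, d)
```

With this definition the ratio is small when the target dominates and close to 1 in spatially white noise; the decision threshold is left open here, as it is in the slides.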

[Figure: VAD(t) [dB] at 586 Hz, horizontal axis 0 to 140, values between -5 and 15 dB; second panel with axes 20 to 120 and 1000 to 6000]

Realistic scenario: road noise always present. Prior: DOA of the target speaker.

Situations: noisy speech (freeze), noisy jammer (adapt), noisy double talk (freeze).

It can be difficult to discriminate between double-talk and talk situations.

Leakage Problem (Beamforming)

Is caused by:
  an echoic environment (such as a car)
  a target location error
  a calibration error
  a wrong propagation model (far-field)

A solution: no adaptation during target activity (speech)
  requires a voice activity detector
  is a trade-off between noise tracking and robustness

Blind Source Separation

Blind Source Separation: Minimization of a dependence measure

+ only a statistical assumption on the sources (independence)

+ no prior on the array geometry or the source locations

- ambiguities: permutation and scaling at each frequency

- not robust against noise; needs all sources to be active
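The permutation and scaling ambiguities can be checked numerically at a single frequency. In this sketch the mixing matrix, the source powers, and the names are hypothetical, and the separator is applied as W^H x. Any separator obtained from a valid one by reordering and rescaling its outputs decorrelates them equally well, so the criterion alone cannot tell the two apart.

```python
import numpy as np

def off_diagonal_energy(W, Rxx):
    """Squared off-diagonal energy of the output covariance W^H Rxx W."""
    Rss = W.conj().T @ Rxx @ W
    return np.sum(np.abs(Rss - np.diag(np.diag(Rss))) ** 2)

A = np.array([[1.0, 0.5], [0.3j, 1.0]])        # hypothetical mixing matrix
Rxx = A @ np.diag([2.0, 0.5]) @ A.conj().T     # independent sources with powers 2.0 and 0.5

W = np.linalg.inv(A).conj().T                  # separator: W^H A = I
P = np.array([[0.0, 1.0], [1.0, 0.0]])         # swaps the two outputs
D = np.diag([3.0, -0.2j])                      # arbitrary complex scaling
W_ambiguous = W @ (P @ D).conj().T             # W_ambiguous^H = P D W^H

energy_exact = off_diagonal_energy(W, Rxx)
energy_ambiguous = off_diagonal_energy(W_ambiguous, Rxx)
```

Both energies vanish up to rounding, which is why the ordering and scaling have to be fixed separately at each frequency.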

Robust Blind Source Separation: Multiple Decorrelation

Find W such that the components of s = W^H x are decorrelated at several times,

i.e. such that R_ss(t_k) = W^H R_xx(t_k) W is diagonal for k = 1, ..., K.

[Diagram: time axis with the estimation times t1, t2, t3, t4, ..., tK]

W is found by gradient descent and is constrained to unit gain.
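A common way to turn the joint-diagonalization condition into a cost is to sum the squared off-diagonal energy of W^H R_xx(t_k) W over the K times. The sketch below only evaluates such a cost; the exact cost, the step size, and the form of the unit-gain constraint used in the talk are not given, so the cost and the hypothetical mixing matrix and source powers are assumptions.

```python
import numpy as np

def off(C):
    """Return C with its diagonal set to zero."""
    return C - np.diag(np.diag(C))

def decorrelation_cost(W, covariances):
    """Sum over k of the squared off-diagonal energy of W^H Rxx(t_k) W."""
    return sum(np.linalg.norm(off(W.conj().T @ R @ W), "fro") ** 2
               for R in covariances)

A = np.array([[1.0, 0.5], [0.3j, 1.0]])                  # hypothetical mixing matrix
powers_per_time = [(2.0, 0.1), (0.2, 1.5), (1.0, 1.0)]   # source powers at t_1, t_2, t_3
covariances = [A @ np.diag(p) @ A.conj().T for p in powers_per_time]

W_separating = np.linalg.inv(A).conj().T   # W^H A = I
W_identity = np.eye(2)
W_zero = np.zeros((2, 2))

cost_separating = decorrelation_cost(W_separating, covariances)
cost_identity = decorrelation_cost(W_identity, covariances)
cost_zero = decorrelation_cost(W_zero, covariances)
```

The separating matrix and the all-zero matrix both reach zero cost, while the identity does not. The trivial zero solution is the reason an unconstrained descent is not enough and some gain constraint on W is needed.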

[Figures: results with W initialized to identity, for 2 microphones and for 3 microphones]

Leakage Problem (BSS)

[Figures: results with W initialized to identity, for 4 microphones and for 8 microphones]

Leakage Problem (BSS)

A solution with a prior on the source locations

[Figure: result for 8 microphones, with W initialized to delay-sum beams at the source locations]
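A delay-sum initialization can be sketched as follows for one frequency, with the separator applied as W^H x. The array geometry is not given in the slides, so a uniform linear array with 5 cm spacing, far-field propagation, and the DOAs and frequency below are all assumptions, as is the function name.

```python
import numpy as np

def delay_sum_init(doas_deg, n_mics, freq_hz, spacing_m=0.05, c=343.0):
    """Return (W0, D): delay-sum beams and steering vectors, one column per source."""
    m = np.arange(n_mics)[:, None]
    sines = np.sin(np.deg2rad(np.asarray(doas_deg)))[None, :]
    delays = m * spacing_m * sines / c            # inter-microphone delays in seconds
    D = np.exp(-2j * np.pi * freq_hz * delays)    # far-field steering vectors
    return D / n_mics, D

W0, D = delay_sum_init(doas_deg=[-30.0, 20.0], n_mics=8, freq_hz=1000.0)

# Response of each initial beam in its own source direction.
unit_gains = np.diag(W0.conj().T @ D)
```

Each column of `W0` has unit gain toward its own source, so the descent starts from a separator that already passes every source instead of from the identity.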

Conclusion & Future Plans

Leakage Problem

Beamformers need to detect who speaks and when (VAD). Double talk is difficult to detect because of low directivity at low frequencies, where speech has more power.

For source separation, an unbiased spatial prior (source locations) prevents the separator from converging to zero.

Future Work

1. Set a spatial constraint at low frequencies, where location errors have little effect.

2. Estimate the location of the source at higher frequencies.

3. Is it possible to make constructive use of the early reflections (multiple beamforming, matched filtering)?