Sound, mixtures, learning @ OSU - Dan Ellis 2002-08-10 - 1/37

Sound, Mixtures, and Learning

Dan Ellis <[email protected]>

Laboratory for Recognition and Organization of Speech and Audio (LabROSA)

Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/

Outline

1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
3 Fragment Recognition
4 Alarm Sound Detection
5 Future Work



Auditory Scene Analysis

• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do

• Hearing is ecologically grounded
- reflects ‘natural scene’ properties
- subjective, not absolute

[Figure: spectrogram, frq/Hz (0-4000) vs. time/s (0-12), level/dB (-60 to 0), with labeled sources: Voice (evil), Stab, Rumble, Strings, Choir, Voice (pleasant)]


Sound, mixtures, and learning

• Sound
- carries useful information about the world
- complements vision

• Mixtures
- ... are the rule, not the exception
- medium is ‘transparent’, sources are many
- must be handled!

• Learning
- the ‘speech recognition’ lesson: let the data do the work
- like listeners

[Figure: three spectrograms, freq/kHz (0-4) vs. time/s (0-1): Speech + Noise = Speech + Noise]


The problem with recognizing mixtures

“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”

(after Bregman’90)

• Received waveform is a mixture
- two sensors, N signals ... underconstrained

• Disentangling mixtures as the primary goal?
- perfect solution is not possible
- need experience-based constraints
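A minimal sketch of why two sensors and more than two sources is underconstrained. It assumes instantaneous linear mixing, which is a simplification for illustration; the gains in `A` and the source amplitudes are made-up numbers, not values from the talk.

```python
# Two sensors, three sources: instantaneous linear mixing y = A s.
# The gains in A are arbitrary illustrative values (an assumption).
A = [[1.0, 0.5, 0.2],
     [0.3, 1.0, 0.8]]

def mix(A, s):
    """Return the two sensor readings for source amplitudes s."""
    return [sum(a * x for a, x in zip(row, s)) for row in A]

def null_vector(A):
    """Cross product of the two rows: a direction in source space
    that both sensors are blind to (A applied to it gives zero)."""
    (a, b, c), (d, e, f) = A
    return [b * f - c * e, c * d - a * f, a * e - b * d]

s_true = [1.0, 2.0, 3.0]
n = null_vector(A)
# A different set of source amplitudes, shifted along the blind direction.
s_other = [x + 4.0 * v for x, v in zip(s_true, n)]

y_true = mix(A, s_true)
y_other = mix(A, s_other)
print(y_true, y_other)  # the sensors cannot tell the two scenes apart
```

Because both scenes give the same readings, choosing between them needs something beyond the sensor data, which is where experience-based constraints come in.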


Human Auditory Scene Analysis

(Bregman 1990)

• How do people analyze sound mixtures?
- break mixture into small elements (in time-freq)
- elements are grouped into sources using cues
- sources have aggregate attributes

• Grouping ‘rules’ (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...

[Figure: frequency analysis feeding an onset map, harmonicity map and position map, combined by a grouping mechanism into source properties (after Darwin, 1996)]


Cues to simultaneous grouping

• Elements + attributes

• Common onset

- simultaneous energy has common source

• Periodicity

- energy in different bands with same cycle

• Other cues

- spatial (ITD/IID), familiarity, ...

[Figure: spectrogram, freq/Hz (0-8000) vs. time/s (0-9)]


The effect of context

• Context can create an ‘expectation’: i.e. a bias towards a particular interpretation

• e.g. Bregman’s “old-plus-new” principle: A change in a signal will be interpreted as an added source whenever possible
- a different division of the same energy depending on what preceded it

[Figure: spectrograms, frequency/kHz (0-2) vs. time/s (0.0-1.2)]


Computational Auditory Scene Analysis (CASA)

• Goal: Automatic sound organization; systems to ‘pick out’ sounds in a mixture
- ... like people do

• E.g. voice against a noisy background
- to improve speech recognition

• Approach:
- psychoacoustics describes grouping ‘rules’
- ... just implement them?

[Figure: CASA system producing Object 1 description, Object 2 description, Object 3 description, ...]


The Representational Approach

(Brown & Cooke 1993)

• Implement psychoacoustic theory
- ‘bottom-up’ processing
- uses common onset & periodicity cues

[Figure: processing chain: input mixture → front end → signal features (maps: onset, period, frq.mod) → object formation → discrete objects → grouping rules → source groups]

• Able to extract voiced speech:

[Figure: spectrograms, frq/Hz (100-3000) vs. time/s (0.2-1.0), for brn1h.aif and brn1h.fi.aif]


Restoration in sound perception

• Auditory ‘illusions’ = hearing what’s not there

• The continuity illusion

• SWS

- duplex perception

• How to model in CASA?

[Figure: spectrograms: ‘ptshort’, f/Hz (1000-4000) vs. time/s (0.0-1.4), and ‘S1−env.pf:0’, f/Bark (5-15) vs. time/s (0.0-1.8)]


Adding top-down constraints

Perception is not direct but a search for plausible hypotheses

• Data-driven (bottom-up)...
- objects irresistibly appear

vs. Prediction-driven (top-down)
- match observations with parameters of a world-model
- need world-model constraints...

[Figure: data-driven chain (input mixture → front end → signal features → object formation → discrete objects → grouping rules → source groups) compared with a prediction-driven loop (input mixture → front end → signal features → compare & reconcile → prediction errors → hypothesis management → hypotheses, with periodic components and noise components → predict & combine → predicted features)]


Approaches to sound mixture recognition

• Recognize combined signal
- ‘multicondition training’
- combinatorics...

• Separate signals
- e.g. CASA, ICA
- nice, if you can do it

• Segregate features into fragments
- then missing-data recognition


Aside: Evaluation

• Evaluation is a big problem for CASA
- what is the goal, really?
- what is a good test domain?
- how do you measure performance?

• SNR improvement
- not easy given only before-after signals: correspondence problem
- can do with fixed filtering mask; rewards removing signal as well as noise

• ASR improvement
- recognizers typically very sensitive to artefacts

• ‘Real’ task?
- mixture corpus with specific sound events...
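A hedged sketch of the fixed-mask SNR measurement and its weakness. It assumes the target and noise energies are known per time-frequency cell (because the test mixture was built from separate recordings) and that the same 0/1 mask is applied to each; the cell energies and masks are made-up values.

```python
import math

def snr_db(signal_energy, noise_energy):
    """Signal-to-noise ratio in dB from total energies."""
    return 10.0 * math.log10(signal_energy / noise_energy)

def masked_snr_db(target, noise, mask):
    """SNR after applying one fixed 0/1 mask to target and noise cells."""
    s = sum(t * m for t, m in zip(target, mask))
    n = sum(v * m for v, m in zip(noise, mask))
    return snr_db(s, n)

# Toy per-cell energies for four time-frequency cells (assumed values).
target = [10.0, 8.0, 1.0, 1.0]
noise = [1.0, 4.0, 2.0, 2.0]

snr_in = snr_db(sum(target), sum(noise))
keep_everything = [1, 1, 1, 1]
keep_best_cell = [1, 0, 0, 0]   # also throws away half of the target energy

gain_all = masked_snr_db(target, noise, keep_everything) - snr_in
gain_best = masked_snr_db(target, noise, keep_best_cell) - snr_in
kept_fraction = target[0] / sum(target)
print(round(gain_all, 2), round(gain_best, 2), kept_fraction)
```

The aggressive mask scores a large SNR gain while discarding half of the target, which is the sense in which the measure rewards removing signal as well as noise.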


Outline

1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
- standard ASR
- approaches to speech + noise
3 Fragment Recognition
4 Alarm Sound Detection
5 Future Work


Speech recognition & mixtures

• Speech recognizers are the most successful and sophisticated acoustic recognizers to date

• ‘State of the art’ word-error rates (WERs):

- 2% (dictation) to 30% (phone conversations)

[Figure: ASR pipeline: sound → feature calculation → feature vectors → acoustic classifier (acoustic model parameters) → phone probabilities → HMM decoder (word models, e.g. “s ah t”; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone / word sequence → understanding/application...]

Learning acoustic models

• Goal: describe p(X|M) with e.g. GMMs

• Separate models for each class
- generalization as blurring

• Training data labels from:
- manual annotation
- ‘best path’ from earlier classifier (Viterbi)
- EM: joint estimation of labels & pdfs

[Figure: labeled training examples {x_n, ω_xn} → sort according to class → estimate conditional pdf for class ω1 → p(x|ω1)]
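The sort-then-estimate procedure can be sketched as follows. For brevity this uses a single one-dimensional Gaussian per class rather than a GMM, which is a simplification, and the data points and labels are made up.

```python
import math
from collections import defaultdict

def fit_class_gaussians(examples):
    """examples: iterable of (x, label). Returns {label: (mean, var, prior)}."""
    by_class = defaultdict(list)
    for x, label in examples:            # sort according to class
        by_class[label].append(x)
    total = sum(len(xs) for xs in by_class.values())
    models = {}
    for label, xs in by_class.items():   # estimate conditional pdf per class
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        models[label] = (mean, max(var, 1e-6), len(xs) / total)
    return models

def log_gauss(x, mean, var):
    """Log of the Gaussian density p(x | class)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def classify(x, models):
    """Pick the class maximizing log p(x | class) + log P(class)."""
    return max(models, key=lambda c: log_gauss(x, models[c][0], models[c][1])
               + math.log(models[c][2]))

data = [(0.9, "a"), (1.1, "a"), (1.0, "a"), (2.9, "b"), (3.2, "b"), (3.0, "b")]
models = fit_class_gaussians(data)
print(classify(1.2, models), classify(2.8, models))
```

When labels are not given by hand, the same estimation step can be rerun on labels produced by an earlier classifier, which is the Viterbi and EM route listed above.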


Speech + noise mixture recognition

• Background noise is biggest (?) problem facing current ASR

• Feature invariance approach: Design features to reflect only speech
- e.g. normalization, mean subtraction

• Ideally, models of clean speech will match speech in noise
- ... although training on noisy examples can’t hurt

• Static noise is relatively easy
- but: non-static noise?

• Alternative: More complex models of the signal
- separate models for speech and ‘rest’
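As an illustration of the mean-subtraction idea: a fixed channel multiplies the spectrum, so it adds a constant offset to log-spectral (or cepstral) features, and subtracting the per-utterance mean removes that offset. The frame values and channel offset below are made up.

```python
def mean_subtract(frames):
    """Subtract the per-dimension mean over time from a list of feature frames."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]

# Toy log-domain frames (assumed values) and a fixed channel offset.
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
channel = [0.7, -1.5]
filtered = [[x + c for x, c in zip(f, channel)] for f in clean]

norm_clean = mean_subtract(clean)
norm_filtered = mean_subtract(filtered)
print(norm_clean)
print(norm_filtered)  # matches norm_clean up to rounding
```

This only cancels a constant offset. Additive noise that changes over time does not appear as a fixed shift in these features, which is why non-static noise remains the hard case.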


HMM decomposition (e.g. Varga & Moore 1991, Roweis 2000)

• Total signal model has independent state sequences for 2+ component sources

• New combined state space q' = {q1, q2}
- new observation pdfs for each combination: p(X|q1,q2)

[Figure: state sequences of model 1 and model 2 jointly explaining the observations over time]
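A small sketch of the combined state space. If the two state sequences are independent, the joint transition probability is the product of the individual ones, so the joint matrix is the Kronecker product of the two. The transition values are made up, and the combined observation pdfs p(X|q1,q2) are not modeled here.

```python
from itertools import product

def combine_transitions(A1, A2):
    """Transition matrix over joint states (i, j) for two independent chains."""
    states = list(product(range(len(A1)), range(len(A2))))
    joint = [[A1[i][k] * A2[j][l] for (k, l) in states] for (i, j) in states]
    return states, joint

# Illustrative transition matrices (assumed values).
A1 = [[0.9, 0.1],
      [0.2, 0.8]]                # e.g. a 2-state source model
A2 = [[0.5, 0.5, 0.0],
      [0.0, 0.5, 0.5],
      [0.5, 0.0, 0.5]]           # e.g. a 3-state source model

states, joint = combine_transitions(A1, A2)
print(len(states))               # number of joint states: 2 * 3
```

The joint state count is the product of the component counts, so it grows quickly as sources or states are added.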


Problems with HMM decomposition

• O(q^k)^N is exponentially large...

• Feature normalization no longer holds!
- each source has a different gain → model at various SNRs?
- models typically don’t use overall energy C_0
- each source has a different channel H[k]

• Modeling every possible sub-state combination is inefficient, inelegant and impractical


Outline

1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
3 Fragment Recognition
- separating signals vs. separating features
- missing data recognition
- recognizing multiple sources
4 Alarm Sound Detection
5 Future Work


Fragment Recognition (Jon Barker & Martin Cooke, Sheffield)

• Signal separation is too hard! Instead:
- segregate features into partially-observed sources
- then classify

• Made possible by ‘missing data’ recognition
- integrate over uncertainty in observations for optimal posterior distribution

• Goal: Relating clean speech models P(X|M) to speech + noise mixture observations
- ... and making it tractable


Comparing different segregations

• Standard classification chooses between models M to match source features X:

$$M^* = \arg\max_M P(M \mid X) = \arg\max_M \frac{P(X \mid M)\,P(M)}{P(X)}$$

• Mixtures → observed features Y, segregation S, all related by P(X|Y,S)
- spectral features allow clean relationship

[Figure: observation Y(f), segregation S and source X(f) plotted against freq]

• Joint classification of model and segregation:

$$P(M,S \mid Y) = P(M) \int P(X \mid M) \cdot \frac{P(X \mid Y,S)}{P(X)}\,dX \cdot P(S \mid Y)$$

- integral collapses in several cases...


Calculating fragment matches

• P(X|M) - the clean-signal feature model

• P(X|Y,S)/P(X) - is X ‘visible’ given segregation?

• Integration collapses some channels...

• P(S|Y) - segregation inferred from observation
- just assume uniform, find S for most likely M
- use extra information in Y to distinguish S’s, e.g. harmonicity, onset grouping

• Result:
- probabilistically-correct relation between clean-source models P(X|M) and inferred contributory source P(M,S|Y)

$$P(M,S \mid Y) = P(M) \int P(X \mid M) \cdot \frac{P(X \mid Y,S)}{P(X)}\,dX \cdot P(S \mid Y)$$
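A hedged sketch of how the integral can collapse per channel, assuming independent Gaussian channels in the clean model. Where the segregation marks a channel as belonging to the source, X equals Y and the term is a density evaluated at Y. Where it is masked, this sketch assumes the source energy is only known to lie below the observed value and uses the Gaussian cumulative probability up to Y; that bounded form is one common choice in missing-data recognition and an assumption here. Model names, means and the observation are made up.

```python
import math

def log_pdf(y, mean, std):
    """Log Gaussian density at y."""
    z = (y - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))

def log_cdf(y, mean, std):
    """Log of P(x <= y) for a Gaussian, floored to avoid log(0)."""
    p = 0.5 * (1.0 + math.erf((y - mean) / (std * math.sqrt(2.0))))
    return math.log(max(p, 1e-300))

def log_match(y, mask, means, stds):
    """Match of observation y to one clean model under segregation mask."""
    total = 0.0
    for yc, present, m, s in zip(y, mask, means, stds):
        total += log_pdf(yc, m, s) if present else log_cdf(yc, m, s)
    return total

# Two hypothetical clean models: (means, standard deviations) per channel.
models = {"A": ([5.0, 1.0, 1.0], [1.0, 1.0, 1.0]),
          "B": ([4.0, 2.0, 6.0], [1.0, 1.0, 1.0])}
y = [5.0, 1.0, 6.0]   # source A plus loud noise in the third channel

def best_model(y, mask):
    return max(models, key=lambda name: log_match(y, mask, *models[name]))

full = best_model(y, [1, 1, 1])     # treats every channel as clean source
masked = best_model(y, [1, 1, 0])   # third channel assigned to the noise
print(full, masked)
```

With a uniform P(S|Y), comparing such scores across candidate masks and models amounts to the joint search over M and S described above.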


Speech fragment decoder results

• Simple P(S|Y) model forces contiguous regions to stay together
- big efficiency gain when searching S space

• Clean-models-based recognition rivals trained-in-noise recognition

[Figure: "1754" + noise → SNR mask → fragments → fragment decoder → "1754"]

[Figure: AURORA 2000 - Test Set A: WER / % (0-90) vs. SNR / dB (-5 to 20, and clean) for MD Soft SNR, HTK clean training and HTK multicondition]


Multi-source decoding

• Search for more than one source

• Mutually-dependent data masks

• Use e.g. CASA features to propose masks
- locally coherent regions

• Theoretical vs. practical limits

[Figure: observation Y(t) explained by two sources, with masks S1(t), S2(t) and state sequences q1(t), q2(t)]


Outline

1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
3 Fragment Recognition
4 Alarm Sound Detection
- sound
- mixtures
- learning
5 Future Work


Alarm sound detection

• Alarm sounds have particular structure
- people ‘know them when they hear them’
- clear even at low SNRs

• Why investigate alarm sounds?
- they’re supposed to be easy
- potential applications...

• Contrast two systems:
- standard, global features, P(X|M)
- sinusoidal model, fragments, P(M,S|Y)

[Figure: spectrogram, freq/kHz (0-4) vs. time/s (0-25), level/dB (-40 to 20), of alarms hrn01, bfr02 and buz01 in noise (s0n6a8+20)]


Alarms: Sound (representation)

• Standard system: Mel Cepstra
- have to model alarms in noise context: each cepstral element depends on whole signal

• Contrast system: Sinusoid groups
- exploit sparse, stable nature of alarm sounds
- 2D-filter spectrogram to enhance harmonics
- simple magnitude threshold, track growing
- form groups based on common onset

• Sinusoid representation is already fragmentary
- does not record non-peak energies

[Figure: three spectrogram panels, freq/Hz (0-5000) vs. time/sec (1-2.5)]
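The threshold, track-growing and onset-grouping steps can be sketched on a toy spectrogram. This is not the system from the talk: the 2D harmonic-enhancement filter is omitted, and the threshold, the one-bin jump limit and the onset tolerance are all assumed values.

```python
def pick_peaks(frame, threshold):
    """Bins that are local maxima in one spectral frame and exceed threshold."""
    return [k for k in range(1, len(frame) - 1)
            if frame[k] > threshold
            and frame[k] >= frame[k - 1] and frame[k] > frame[k + 1]]

def grow_tracks(spectrogram, threshold, max_jump=1):
    """Link peaks in successive frames into tracks [(frame, bin), ...]."""
    active, finished = [], []
    for t, frame in enumerate(spectrogram):
        peaks = pick_peaks(frame, threshold)
        still_active = []
        for track in active:
            last_bin = track[-1][1]
            near = [k for k in peaks if abs(k - last_bin) <= max_jump]
            if near:
                k = min(near, key=lambda b: abs(b - last_bin))
                peaks.remove(k)
                track.append((t, k))
                still_active.append(track)
            else:
                finished.append(track)       # track ends: no peak to continue it
        still_active.extend([[(t, k)] for k in peaks])  # unclaimed peaks start tracks
        active = still_active
    return finished + active

def group_by_onset(tracks, tolerance=1):
    """Put tracks whose first frames lie within tolerance into one group."""
    groups = []
    for track in sorted(tracks, key=lambda tr: tr[0][0]):
        if groups and track[0][0] - groups[-1][0][0][0] <= tolerance:
            groups[-1].append(track)
        else:
            groups.append([track])
    return groups

def toy_frame(peak_bins, n_bins=10, floor=0.1):
    return [1.0 if k in peak_bins else floor for k in range(n_bins)]

# One silent frame, a two-partial sound for three frames, then a single partial.
spectrogram = [toy_frame([])] + [toy_frame([2, 5])] * 3 + [toy_frame([8])] * 2
tracks = grow_tracks(spectrogram, threshold=0.5)
groups = group_by_onset(tracks)
print([len(g) for g in groups])
```

Only peak bins are recorded, so the representation says nothing about energy between the peaks, which is what makes it fragmentary.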


Alarms: Mixtures

• Effect of varying SNR on representations:
- sinusoid peaks have ~ invariant properties

[Figure: sine track groups and normalized cepstra vs. time/s (0-25) at 60 dB SNR, 10 dB SNR and 0 dB SNR]


Alarms: Learning

• Standard: train MLP on noisy examples

[Figure: sound mixture → feature extraction (PLP cepstra) → neural net acoustic classifier (alarm probability) → median filtering → detected alarms]

• Alternate: learn distributions of group features
- duration, frequency deviation, amp. modulation...
- underlying models are clean (isolated)
- recognize in different contexts...

[Figure: scatter plots of alarm vs. non-alarm groups: spectral moment vs. spectral centroid, and magnitude SD vs. inverse frequency SD]
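A minimal sketch of the median-filtering stage in the standard system, applied to a per-frame alarm probability. The window width, the 0.5 decision threshold and the probability values are assumptions for illustration; the slides do not give them.

```python
from statistics import median

def median_filter(values, width=3):
    """Sliding median; windows are truncated at the sequence edges."""
    half = width // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def detect(probabilities, width=3, threshold=0.5):
    """Per-frame alarm decisions after smoothing the classifier output."""
    return [p > threshold for p in median_filter(probabilities, width)]

# Made-up classifier output: one isolated spike, then a sustained alarm.
probs = [0.0, 0.9, 0.1, 0.1, 0.8, 0.9, 0.7, 0.2, 0.1]
decisions = detect(probs)
print(decisions)
```

The isolated spike in the second frame is suppressed while the sustained run survives, which is the kind of insertion the smoothing is meant to reduce.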


Alarms: Results

• Both systems commit many insertions at 0dB SNR, but in different circumstances:

[Figure: Restaurant + alarms (snr 0 ns 6 al 8): spectrograms, freq/kHz (0-4) vs. time/sec, with MLP classifier output and sound object classifier output]

Noise      Neural net system          Sinusoid model system
           Del       Ins   Tot        Del       Ins   Tot
1 (amb)    7 / 25    2     36%        14 / 25   1     60%
2 (bab)    5 / 25    63    272%       15 / 25   2     68%
3 (spe)    2 / 25    68    280%       12 / 25   9     84%
4 (mus)    8 / 25    37    180%       9 / 25    135   576%
Overall    22 / 100  170   192%       50 / 100  147   197%
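The Tot column appears to be (Del + Ins) divided by the number of alarms, expressed as a percentage. That formula is inferred from the numbers rather than stated in the slides; the check below reproduces every entry from the table.

```python
def total_error_percent(deletions, insertions, n_alarms):
    """Deletions plus insertions, as a percentage of the true alarm count."""
    return 100.0 * (deletions + insertions) / n_alarms

# (Del, Ins, reported Tot %) for each system, plus the number of alarms.
results = {
    "1 (amb)": {"neural": (7, 2, 36), "sinusoid": (14, 1, 60), "alarms": 25},
    "2 (bab)": {"neural": (5, 63, 272), "sinusoid": (15, 2, 68), "alarms": 25},
    "3 (spe)": {"neural": (2, 68, 280), "sinusoid": (12, 9, 84), "alarms": 25},
    "4 (mus)": {"neural": (8, 37, 180), "sinusoid": (9, 135, 576), "alarms": 25},
    "Overall": {"neural": (22, 170, 192), "sinusoid": (50, 147, 197), "alarms": 100},
}

mismatches = []
for noise, row in results.items():
    for system in ("neural", "sinusoid"):
        dels, ins, reported = row[system]
        computed = total_error_percent(dels, ins, row["alarms"])
        if abs(computed - reported) > 1e-9:
            mismatches.append((noise, system, computed, reported))
print(mismatches)
```

Totals above 100% come from insertions: in babble and speech the neural net system's errors are almost all false alarms, while the sinusoid system's insertions are concentrated in music.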


Alarms: Summary

• Sinusoid domain
- feature components belong to 1 source
- simple ‘segregation’ (grouping) model
- alarm model as properties of group
- robust to partial feature observation

• Future improvements
- more complex alarm class models
- exploit repetitive structure of alarms


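Outline

1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
3 Fragment Recognition
4 Alarm Sound Detection
5 Future Work
- generative models & inference
- model acquisition
- ambulatory audio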


Future work

• CASA as generative model parameterization:

[Figure: generation model linking model dependence Θ, source models M1, M2, source signals Y1, Y2, channel parameters C1, C2, received signals X1, X2 and observations O; analysis structure linking observations O, fragment formation, mask allocation {Ki}, likelihood evaluation p(X|Mi,Ki) and model fitting {Mi}]


Learning source models

• The speech recognition lesson: Use the data as much as possible
- what can we do with unlimited data feeds?

• Data sources
- clean data corpora
- identify near-clean segments in real sound
- build up ‘clean’ views from partial observations?

• Model types
- templates
- parametric/constraint models
- HMMs

• Hierarchic classification vs. individual characterization...


Personal Audio Applications

• Smart PDA records everything

• Only useful if we have index, summaries
- monitor for particular sounds
- real-time description

• Scenarios
- personal listener → summary of your day
- future prosthetic hearing device
- autonomous robots

• Meeting data, ambulatory audio


Summary

• Sound
- carries important information

• Mixtures
- need to segregate different source properties
- fragment-based recognition

• Learning
- information extracted by classification
- models guide segregation

• Alarm sounds
- simple example of fragment recognition

• General sounds
- recognize simultaneous components
- acquire classes from training data
- build index, summary of real-world sound
