Speech Segregation Based on Oscillatory Correlation
DeLiang Wang, The Ohio State University

Transcript of Speech Segregation Based on Oscillatory Correlation

Page 1: Speech Segregation Based on Oscillatory Correlation

Speech Segregation Based on Oscillatory Correlation

DeLiang Wang

The Ohio State University

Page 2

Outline of Presentation

- Introduction
  - Auditory Scene Analysis (ASA) Problem
  - Binding Problem
- Oscillatory Correlation Theory
  - LEGION network
- Multistage Model for Computational ASA (CASA)
- Recent Results
- Discussion and Summary

Page 3

ASA Problem (Bregman’90)

Listeners are able to parse the complex mixture of sounds arriving at the ears in order to retrieve a mental representation of each sound source.

ASA takes place in two conceptual stages:
- Segmentation: decompose the acoustic signal into 'sensory elements' (segments)
- Grouping: combine segments into groups, such that segments in the same group are likely to have arisen from the same environmental source

Page 4

ASA Problem - continued

The grouping process involves two mechanisms:
- Primitive grouping: innate, data-driven mechanisms, consistent with those described by the Gestalt psychologists for visual perception (proximity, similarity, common fate, good continuation, etc.)
- Schema-driven grouping: application of learned knowledge about speech, music and other environmental sounds

Page 5

Binding Problem

Information about acoustic features (pitch, spectral shape, interaural differences, AM, FM) is extracted in distributed areas of the auditory system

How are these features combined to form a whole?

Hierarchies of feature-detecting cells exist, but do not constitute a solution to the binding problem; no evidence for ‘grandmother cells’

Page 6

Oscillatory Correlation (von der Malsburg & Schneider’86; Wang’96)

Neural oscillators used to represent auditory features

Oscillators representing features of the same source are synchronized (phase-locked with zero phase lag), and are desynchronized from oscillators representing different sources

Supported by experimental findings, e.g. oscillations in auditory cortex measured by EEG, MEG and local field potentials

Page 7

Oscillatory Correlation Theory

[Figure: feature detectors (FD1, FD2, …, FDn; FDa, FDb, …, FDx), each paired with an oscillatory activity trace over time; traces of detectors belonging to the same source are synchronized, and desynchronized from those of other sources. FD: feature detector.]

Page 8

LEGION Architecture for Stream Segregation

LEGION: Locally Excitatory Globally Inhibitory Oscillator Network (Terman & Wang’95)

[Figure: a time-frequency grid of oscillators, each frequency row fed by one input channel and extended along the time axis through delay lines, with a global inhibitor coupled to the entire network.]

Page 9

Single Relaxation Oscillator

[Figure: phase plane of a single relaxation oscillator. With stimulus, the cubic x-nullcline (ẋ = 0) and the sigmoid y-nullcline (ẏ = 0) intersect on the middle branch of the cubic, producing a limit cycle that alternates between the left branch (LB) and the right branch (RB); without stimulus, the nullclines intersect on the left branch at a stable fixed point. A typical x trace (the membrane potential) shows periodic activity with stimulus and quiescence without.]

Page 10

LEGION on a Chip

The chip area is 6.7 mm² (core: 3 mm²) and implements a 16×16 LEGION network (by Jordi Cosp, Polytechnic University of Catalonia, Spain)

Page 11

Computational Auditory Scene Analysis

The ASA problem and the binding problem are closely related; the oscillatory correlation framework can address both issues

Previous work also suggests that:
- Representation of the auditory scene is a key issue
- Temporal continuity is important (although it is ignored in most frame-based sound processing algorithms)
- Fundamental frequency (F0) is a strong cue for grouping

Page 12

A Multi-stage Model for CASA

[Figure: processing pipeline. Speech and noise → cochlear filtering → hair cells → correlogram → cross-channel correlation → neural oscillator network → resynthesis, yielding resynthesized speech and resynthesized noise. Detail panels show the correlogram (frequency channel vs. time lag) and the neural oscillator network, which comprises a segmentation layer and a grouping layer over the time-frequency grid, with a global inhibitor.]

Page 13

Auditory Periphery Model

A bank of gammatone filters

$g_i(t) = t^{n-1} \exp(-2\pi b_i t) \cos(2\pi f_i t + \phi_i)\, H(t)$

- n: filter order (fourth order is used)
- b_i: bandwidth of channel i
- f_i, φ_i: center frequency and phase of channel i
- H: Heaviside function

The Meddis hair cell model converts the gammatone output to neural firing
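As a sketch of the formula above, the impulse response of one gammatone channel can be sampled directly. The slide leaves the bandwidth unspecified; the common choice b = 1.019·ERB(f) with the Glasberg & Moore ERB formula is assumed here, so treat the bandwidth line as an assumption rather than the model's stated value.

```python
import math

def gammatone_ir(fc, fs, n=4, duration=0.025, phase=0.0):
    """Sampled gammatone impulse response
    g(t) = t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phase), t >= 0."""
    # Assumed bandwidth: b = 1.019 * ERB(fc), ERB(f) = 24.7*(4.37*f/1000 + 1)
    b = 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)
    g = []
    for k in range(int(duration * fs)):
        t = k / fs
        # H(t) is implicit: only t >= 0 is sampled; t^(n-1) zeroes the onset
        g.append(t ** (n - 1) * math.exp(-2.0 * math.pi * b * t)
                 * math.cos(2.0 * math.pi * fc * t + phase))
    return g
```

For a mid-frequency channel the envelope peaks a few milliseconds after onset and has decayed to near zero well within a 25 ms window.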

Page 14

Fourth-order Gammatone Filters - Example

Impulse responses of gammatone filters

Page 15

Auditory Periphery - Example

Hair cell response to the utterance "Why were you all weary?" mixed with a telephone ringing

128 filter channels with center frequencies arranged on the ERB scale

[Figure: hair cell response; time 0.0–1.5 s, channel center frequencies from 80 to 5000 Hz.]
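The 128 center frequencies between 80 and 5000 Hz can be placed uniformly on the ERB-rate scale. The channel count and frequency range come from the slide; the ERB-rate formula (Glasberg & Moore) is an assumed, conventional choice.

```python
import math

def erb_rate(f):
    # ERB-rate scale (Glasberg & Moore): E(f) = 21.4 * log10(0.00437*f + 1)
    return 21.4 * math.log10(0.00437 * f + 1.0)

def erb_rate_inv(e):
    # Inverse of erb_rate: recover frequency in Hz from an ERB-rate value
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def erb_space(f_lo=80.0, f_hi=5000.0, n=128):
    # n center frequencies equally spaced on the ERB-rate scale,
    # matching the 128 channels from 80 to 5000 Hz on the slide
    e_lo, e_hi = erb_rate(f_lo), erb_rate(f_hi)
    return [erb_rate_inv(e_lo + k * (e_hi - e_lo) / (n - 1)) for k in range(n)]
```

This spacing packs channels densely at low frequencies and sparsely at high frequencies, mimicking cochlear frequency resolution.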

Page 16

Mid-level Auditory Representations

- Mid-level representations form the basis for segment formation and subsequent grouping processes
- The correlogram extracts periodicity information from simulated auditory nerve firing patterns
- The summary correlogram can be used to identify F0
- Cross-correlation between adjacent correlogram channels identifies regions that are excited by the same frequency component
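A minimal sketch of the correlogram and summary correlogram idea: autocorrelate each channel, pool the autocorrelations across channels, and read F0 off the peak lag. The two sinusoidal "channels" below (harmonics of a 100 Hz fundamental) are stand-ins for filterbank outputs, not the model's actual front end; frame and lag limits are illustrative.

```python
import math

def autocorr(x, max_lag):
    # r[lag] = sum_t x[t] * x[t + lag] over a fixed-length window
    m = len(x) - max_lag
    return [sum(x[t] * x[t + lag] for t in range(m)) for lag in range(max_lag + 1)]

fs = 16000
win = int(0.040 * fs)       # 40 ms analysis frame (illustrative)
# Stand-ins for two filterbank channels dominated by the 2nd and 3rd
# harmonics of a 100 Hz fundamental:
channels = [[math.sin(2.0 * math.pi * h * 100.0 * t / fs) for t in range(win)]
            for h in (2, 3)]

max_lag = fs // 80          # longest lag considered: an 80 Hz fundamental
summary = [0.0] * (max_lag + 1)
for ch in channels:
    for lag, r in enumerate(autocorr(ch, max_lag)):
        summary[lag] += r   # pool channels into the summary correlogram

lo = fs // 500              # shortest lag considered: a 500 Hz fundamental
best = max(range(lo, max_lag + 1), key=lambda lag: summary[lag])
f0 = fs / best              # common periodicity across channels, near 100 Hz
```

The summary peaks at the lag shared by both harmonics (10 ms), even though neither channel contains energy at 100 Hz itself.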

Page 17

Mid-level Representations - Example

Correlogram and cross-correlation for the speech/telephone mixture

Page 18

Oscillator Network: Segmentation Layer

An oscillator consists of a reciprocally connected excitatory variable x_ij and inhibitory variable y_ij (Terman & Wang'95):

$\dot{x}_{ij} = 3x_{ij} - x_{ij}^3 + 2 - y_{ij} + I_{ij} + S_{ij} + \rho$

$\dot{y}_{ij} = \varepsilon \left[ \gamma \left( 1 + \tanh(x_{ij}/\beta) \right) - y_{ij} \right]$

- A stable limit cycle occurs for $I_{ij} > 0$
- Each oscillator is connected to its four nearest neighbors
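The equations above for a single uncoupled unit (S = 0) can be integrated with forward Euler to reproduce the behavior sketched on the phase-plane slide. The parameter values (ε = 0.02, γ = 6.0, β = 0.1, ρ = 0, dt = 0.005) are illustrative assumptions, not values stated in the slides.

```python
import math

def simulate(I, steps=200000, dt=0.005, eps=0.02, gamma=6.0, beta=0.1, rho=0.0):
    # Forward-Euler integration of one relaxation oscillator:
    #   x' = 3x - x^3 + 2 - y + I + rho
    #   y' = eps * (gamma * (1 + tanh(x / beta)) - y)
    x, y = -2.0, 0.0
    xs = []
    for _ in range(steps):
        dx = 3.0 * x - x ** 3 + 2.0 - y + I + rho
        dy = eps * (gamma * (1.0 + math.tanh(x / beta)) - y)
        x += dt * dx
        y += dt * dy
        xs.append(x)
    return xs

def upward_crossings(xs):
    # Count upward zero crossings of x in the second half of the run
    # (the first half is discarded as transient)
    tail = xs[len(xs) // 2:]
    return sum(1 for a, b in zip(tail, tail[1:]) if a < 0.0 <= b)
```

With positive input the unit cycles between the silent (left) and active (right) branches; with negative input it settles onto the stable fixed point on the left branch and never fires.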

Page 19

Segmentation Layer - continued

Horizontal weights are unity; vertical weights are unity if the cross-channel correlation exceeds a threshold, and zero otherwise

Oscillators receive input if energy in corresponding channel exceeds a threshold

All oscillators are connected to a global inhibitor, which ensures that different segments are desynchronized from one another

A LEGION network
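In effect, the grouping that LEGION synchronization computes under these weights is connected-component labeling over the stimulated time-frequency units: temporal neighbors are always linked, cross-channel neighbors only when their correlation exceeds the threshold. A plain flood-fill sketch of that computation follows; the grid layout, field names and threshold values are made up for illustration.

```python
def segments(energy, corr, e_thresh=0.5, c_thresh=0.9):
    """Label connected segments on a frequency x time grid.

    energy[f][t] -- channel energy; units at or below e_thresh get no input
    corr[f][t]   -- cross-channel correlation between channels f and f+1 at
                    frame t; a vertical link exists only above c_thresh
    Horizontal (temporal) links between stimulated neighbors always exist;
    each connected component stands in for one synchronized oscillator block."""
    F, T = len(energy), len(energy[0])
    label = [[0] * T for _ in range(F)]
    next_id = 0
    for f0 in range(F):
        for t0 in range(T):
            if energy[f0][t0] <= e_thresh or label[f0][t0]:
                continue
            next_id += 1
            label[f0][t0] = next_id
            stack = [(f0, t0)]
            while stack:
                f, t = stack.pop()
                nbrs = []
                if t > 0: nbrs.append((f, t - 1))            # horizontal links
                if t < T - 1: nbrs.append((f, t + 1))
                if f < F - 1 and corr[f][t] > c_thresh:      # vertical link up
                    nbrs.append((f + 1, t))
                if f > 0 and corr[f - 1][t] > c_thresh:      # vertical link down
                    nbrs.append((f - 1, t))
                for nf, nt in nbrs:
                    if energy[nf][nt] > e_thresh and not label[nf][nt]:
                        label[nf][nt] = next_id
                        stack.append((nf, nt))
    return label
```

The global inhibitor's role (keeping different segments desynchronized) corresponds here simply to the components receiving distinct labels.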

Page 20

Segmentation Layer - Example

[Figure: time 0.0–1.5 s, channel center frequencies from 80 to 5000 Hz.]

Output of the segmentation layer to the speech/telephone mixture

Page 21

Oscillator Network: Grouping Layer

The second layer is a two-dimensional oscillator network without global inhibition, which embodies the grouping stage of ASA

Oscillators in the second layer only receive input if the corresponding oscillator in the first layer is stimulated

At each time frame, an F0 estimate from the summary correlogram is used to classify channels into two categories: those that are consistent with the F0, and those that are not

Page 22

Grouping Layer - continued

A rule is enforced that, within each segment, all channels of the same time frame must take the F0 category of the majority of channels

Result for the speech/telephone example:

[Figure: time 0.0–1.5 s, channel center frequencies from 80 to 5000 Hz.]
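The majority rule on this slide reduces to a per-frame vote within each segment: every channel the segment covers at a frame is relabeled with the frame's majority F0 category. A minimal sketch, assuming a boolean encoding of the two categories (True = consistent with the frame's F0 estimate); the tie-breaking choice is mine, not the model's.

```python
def enforce_majority(categories):
    # categories[t]: per-channel F0 labels (True = consistent with the
    # frame's F0 estimate) for the channels one segment covers at frame t.
    out = []
    for frame in categories:
        majority = 2 * sum(frame) >= len(frame)  # ties resolved to True (a choice)
        out.append([majority] * len(frame))      # relabel the whole frame
    return out
```

After this step, each frame of a segment belongs wholly to one F0 category, so whole segments can be routed cleanly to foreground or background.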

Page 23

Grouping Layer - continued

Grouping is limited to the time window of the longest segment

There are horizontal connections between oscillators in the same segment

Vertical connections are formed between pairs of channels within each time frame; mutual excitation if the channels belong to the same F0 category, otherwise mutual inhibition

Page 24

Grouping Layer - Example

Two streams emerge from the grouping layer. Foreground: left (original mixture); background: right

[Figure: two panels, each spanning 0.0–1.5 s with channel center frequencies from 80 to 5000 Hz: the foreground stream (left) and the background stream (right).]

Page 25

Evaluation

- Evaluated on a corpus of 100 mixtures (Cooke'93): 10 voiced utterances × 10 noise intrusions
- The noise intrusions are highly varied
- The resynthesis pathway allows estimation of SNR after segregation; SNR improves after processing for each noise condition
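Since the resynthesis pathway yields separate speech and noise waveforms, SNR before and after segregation can be computed with the standard energy-ratio definition. A sketch, with variable names of my own choosing; the slides do not specify the exact measurement procedure.

```python
import math

def snr_db(signal, noise):
    # SNR = 10 * log10( sum(s^2) / sum(n^2) )
    p_signal = sum(s * s for s in signal)
    p_noise = sum(v * v for v in noise)
    return 10.0 * math.log10(p_signal / p_noise)

# SNR improvement for one mixture would then be:
#   snr_db(speech_out, noise_out) - snr_db(speech_in, noise_in)
```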

Page 26

Results of Evaluation

[Charts: changes in SNR, and speech energy retained.]

Page 27

Summary

An oscillatory correlation framework has been proposed for ASA
- Neurobiologically plausible
- Engineering applications: robust automatic speech recognition in noisy environments, hearing prostheses, and speech communication

A key issue is the integration of various grouping cues