Lecture 4: Auditory Perception - Columbia Universitydpwe/e6820/lectures/E... · E6820 SAPR - Dan...

E6820 SAPR - Dan Ellis L04 - Perception 2006-02-09 - 1

EE E6820: Speech & Audio Processing & Recognition

Lecture 4: Auditory Perception

Motivation: Why & how

Auditory physiology

Psychophysics: Detection & discrimination

Pitch perception

Speech perception

Auditory organization & Scene analysis

Dan Ellis <[email protected]>http://www.ee.columbia.edu/~dpwe/e6820/

Columbia University Dept. of Electrical EngineeringSpring 2006

1

2

3

4

5

6


Why study perception?

• Perception is messy: Can we avoid it?

No!

• Audition provides the ‘ground truth’ in audio

- what is relevant and irrelevant- subjective importance of distortion (coding etc.)- (there could be other information in sound...)

• Some sounds are ‘designed’ for audition

- co-evolution of speech and hearing

• The auditory system is very successful

- we would do extremely well to duplicate it

• We are now able to model complex systems

- faster computers, bigger memories

1


How to study perception?

Three different approaches:

• Analyze the example: physiology

- dissection & nerve recordings

• Black box input/output: psychophysics

- fit simple models of simple functions

• Information processing models

- investigate and model complex functions- e.g. scene analysis, speech perception


Outline

Motivation

Physiology

- Outer, middle & inner ear- The Auditory Nerve and beyond- Models

Psychophysics

Pitch perception

Speech perception

Scene analysis

1

2

3

4

5

6


Physiology

• Processing chain from air to brain:

• Study via:

- anatomy- nerve recordings

• Signals flow in both directions

2

Outerear

Middleear

Inner ear(cochlea)

Auditorynerve

Midbrain

Cortex


Outer & middle ear

• Pinna ‘horn’

- complex reflections give spatial (elevation) cues

• Ear canal

- acoustic tube

• Middle ear

- bones provide impedance matching

Pinna

Ear canal

Eardrum(tympanum)

Middle earbones

Cochlea(inner ear)


Inner ear: Cochlea

• Mechanical input from middle ear starts traveling wave moving down Basilar Membrane

• Varying stiffness and mass of BM gives results in continuous variation of resonant frequency

• At resonance, traveling wave energy is dissipated in BM vibration

→

Frequency (Fourier) analysis

Cochlea

Oval window(from ME bones)

Basilar Membrane(BM)

Travellingwave

Resonantfrequency

Position

16 kHz

50 Hz

0 35mm


Cochlea hair cells

• Ear converts sound to BM motion;Each point on BM corresponds to a frequency

• Hair cells on BM convert motion into nerve impulses (firings)

• Inner Hair Cells detect motion

• Outer Hair Cells? Variable damping?

[BM animation]

Cochlea

Basilarmembrane

Tectorialmembrane

Inner Hair Cell(IHC) Outer Hair Cell

(OHC)

Auditory nerve


Inner Hair Cells

• IHCs convert BM vibration into nerve firings

• Human hear has ~3500 IHCs;Each IHC has ~7 connections to Auditory Nerve

• Each nerve fires (sometimes) near peak displacement:

• Histogram to get firing probability:

Local BMdisplacement

Typical nervesignal (mV)

time / ms50

Firingcount

Cycleangle


Auditory nerve (AN) signals

• Single nerve measurements:

• Hard to measure: probe living ANs?

(log) frequency

100 Hz 1 kHz 10 kHz

20

40

60

80

dB SPL

Tone burst histogram Frequency thresholdSpikecount

Time

100

100 ms

Tone burst

Spi

kes/

sec

Intensity / dB SPL

300

200

100

2000

40 60 80 100

One fiber:~ 25 dB dynamic range

Hearing dynamic range > 100 dB

Rate vs.intensity

(approx.constant-Q)


AN population response

• All the information the brain has about sound:

- average rate & spike timings on 30,000 fibers

• Not unlike a (constant-Q) spectrogram?

time / ms

freq

/ 8v

e re

100

Hz

p ( )

0

1

2

3

4

5

0 10 20 30 40 50 60


Beyond the auditory nerve

• Ascending and descending

• Tonotopic

×

?

- modulation - position - source??

(from lloydwatts.com)


Periphery models

• Modeled aspects:

- outer/middle ear - cochlea filtering- hair cell transduction - efferent feedback?

• Result: ‘neurogram’ / ‘cochleagram’

Outer/middleear

filteringSound

Cochleafilterbank

IHC

IHC

time / s

chan

nel

SlaneyPatterson 12 chans/oct from 180 Hz, BBC1tmp (20010218)

0 0.1 0.2 0.3 0.4 0.5

10

20

30

40

50

60


Outline

Motivation

Physiology

Psychophysics

- Detection theory modeling- Intensity perception- Masking

Pitch perception

Speech perception

Scene analysis

1

2

3

4

5

6


Psychophysics

• Physiology looks at the implementation;Psychology looks at the function/behavior

• Analyze audition as signal detection:

- psychological tests reflect internal decisions- assume optimal decision process- infer nature of internal representations, noise, ...

→

lower bounds on more complex functions

• Different aspects to measure

- time, frequency, intensity- tones, complexes, noise- binaural- pitch, detuning

3

p ω O( )


Basic psychophysics

• Relate physical and perceptual variables

- e.g. intensity

→

loudnessfrequency

→

pitch

• Methodology: subject tests

- just noticeable difference (jnd)- magnitude scaling e.g. ‘adjust to twice as loud’

• Results for Intensity vs. Loudness:

Weber’s law

∆

I

α

I

→

log(

L

) =

k

·log(

I

)

-20 -10 0 101.4

1.6

1.8

2.0

2.2

2.4

2.6

Sound level / dB

Log(

loud

ness

rat

ing)

Hartmann(1993) Classroom loudness scaling data

Power law fit:

L α I 0.22

Textbook figure:

L α I 0.3

L( )2log 0.3 I( )2log=

0.3I10log210log

--------------⋅=

0.3210log

-------------- dB10-------⋅=

dB 10Ú=


Loudness as a function of frequency

• Fletcher-Munson equal-loudness curves:

• Hearing impairment: exaggerates

freq / HzIn

tens

ity /

dB S

PL

0

40

20

60

100

80

120

1000100 10,000

Intensity / dB

Equ

ival

ent l

oudn

ess

@ 1

kHz

400 0

40

80

80

100

60

20

20 60 20 600

100

60

20

0Intensity / dB

Equ

ival

ent l

oudn

ess

@ 1

kHz

40

40

80

80

rapidloudnessgrowth

100 Hz 1 kHz


Loudness as a function of bandwidth

• Same total energy, different distribution:

- e.g. 2 chans at -6 dB (not -10 dB)

• Critical bands: independent freq. channels- ~ 25 total (4-6 / octave) [sndex]

time

freq

freq

mag

freq

mag

Sametotal

energyI·B

... but wider perceivedas louder

I0I1

B0 B1

Bandwidth B‘Critical’bandwidth

Loud

ness


Simultaneous masking

• A louder tone can ‘mask’ the perception of a second tone nearby in frequency:

• Suggests an ‘internal noise’ model:

maskedthreshold

log freq

absolutethreshold

maskingtone

Inte

nsity

/ dB

decision variablex

internal noise

p(x | I)p(x | I)

p(x | I+∆I)

σn

I


Sequential masking

• Backward/forward in time:

- suggests temporal envelope of decision var.

→ Time-frequency masking ‘skirt’:

time

Inte

nsity

/ dB

masker envelope

masked threshold

simultaneous masking~10 dB

backward masking~5 ms

forward masking~100 ms

time

freq

inte

nsity

Masking tone

Masked threshold


What we do and don’t hear

• Timing: 2ms attack resolution, 20ms discrim- but: spectral splatter

• Tuning: ~ 1% discrimination- but: beats

• Spectrum: profile changes, formants- variable time-frequency resolution

• Harmonic phase?

• Noisy signals & texture

• (Trace vs. categorical memory)

A B XX = A or B?

“two-intervalforced-choice”:

time


Outline

Motivation

Physiology

Psychophysics

Pitch perception- ‘Place’ models- ‘Time’ models- Multiple cues & competition

Speech perception

Scene analysis

1

2

3

4

5

6


Pitch perception:A classic argument in psychophysics

• Harmonic complexes are a pattern on AN

- .. but give a fused percept (ecological)

• What determines the pitch percept?- not the fundamental

• How is it computed?Two competing models: place and time

4

10

20

30

40

50

60

70

0 0.05 0.1 time/s

freq

. cha

n.


Place model of pitch

• AN excitation pattern shows individual peaks

• ‘Pattern matching’ method to find pitch:

• Support: Low harmonics are very important

• But: Flat-spectrum noise can carry pitch

frequency channel

frequency channel

AN

exc

itatio

n

Pitch strength

resolvedharmonics

broader HF channelscannot resolve

harmonics

Correlate with harmonic ‘sieve’:


Time model of pitch

• Timing information is preserved in AN down to ~ 1ms scale

• Extract periodicity by e.g. autocorrelation& combine across frequency chans:

• But: HF gives weak pitch (in practice)

lag / ms

time

freq per-channel

autocorrelation

autocorrelation

Summaryautocorrelation

0 10 20 30common period

(pitch)


Alternate & competing cues

• Pitch perception could rely on various cues- average excitation pattern- summary autocorrelation- more complex pattern matching

• Relying on just one cue is brittle- e.g. missing fundamental

→ Perceptual system appears to use a flexible, opportunistic combination

• Optimal detector justification?

if o1 and o2 are conditionally independent

p ω o( )ω

argmax

p o ω( ) p ω( )⋅p o( )

-----------------------------------ω

argmax =

p o1 ω( ) p o2 ω( ) p ω( )⋅ ⋅ω

argmax =


Outline

Motivation

Physiology

Psychophysics

Pitch perception

Speech perception- The sounds of speech- Phoneme perception- Context and top-down influences

Scene Analysis

1

2

3

4

5

6


Speech perception

• Highly specialized function- subsequent to source organization?- .. but also can interact

• Kinds of speech sounds:

- vowels - glides - nasals- stops - fricatives ...

5

20

30

40

50

60

1.4 1.6 1.8 2 2.2 2.4 2.6 time/slevel/dB

freq

/ H

z

0

1000

2000

3000

4000

watch thin as a dimeahas

stop burstfricativevowelnasalglide


Cues to phoneme perception

• Linguists describe speech with phonemes:

- phonemes define minimal word contrasts

• Acoustic-phoneticians describe phonemes by:

watch thin as a dimeahas

mdnctcl

^

θ zwzh e

III ayε

• formants & transitions

• bursts & onset times

time

freq

vowelformants

transition stop burst

voicing onset time


Categorical perception

• (Some) speech sounds perceived categorically rather than analogically- e.g. stop-burst & timing:

- tokens within category are hard to distinguish- category boundaries are very sharp

• Categories are learned for native tongue- “merry” / “mary” / “marry”

T

P K P

Pi e a c o uε

following vowel

burs

t fre

q f b

/ H

z

1000

2000

3000

4000

time

freq stop burst vowel

formants

fb


Where is the information in speech?

• ‘Articulation’ of high/low-pass filtered speech:

- sums to more than 1...

• Speech message is highly redundant- e.g. constraints of language, context→listeners can understand with very few cues

Art

icul

atio

n / %

1000

20

40

60

80

2000 3000 4000freq / Hz

high-pass low-pass


Top-down influences:Phonemic restoration

(Warren 1970)

• What if a noise burst obscures speech?

- auditory system ‘restores’ the missing phoneme... based on semantic context... even in retrospect!

• Subjects are typically unaware of which sounds are restored

1.4 1.6 1.8 2 2.2 2.4 2.6 time / s

freq

/ H

z

0

1000

2000

3000

4000


A predisposition for speech:Sinewave replicas

(Remez et al. 1994)

• Replace each formant with a single sinusoid:

- speech is (somewhat) intelligible- people hear both whistles and speech (“duplex”)- processed as speech despite un-speech-like

• What does it take to be speech?

010002000300040005000

0.5 1 1.5 2 2.5 30

10002000300040005000

time / s

freq

/ H

zfr

eq /

Hz

Speech

Sines


Simultaneous vowels

• Mix synthetic vowels with different f0s:

• Pitch difference helps (though not necessary):

freq+ =

dB

/iy/ @ 100 Hz

/ah/ @ 125 Hz

% b

oth

vow

els

corr

ect

∆f0 (semitones)

0 11/4 1/2 2 4

25

50

75


Computational models of speech perception

• Various theoretical - practical models of speech comprehension, e.g. :

• Open questions:- mechanism of phoneme classification- mechanism of lexical recall- mechanism of grammar constraints

• ASR is a practical implementation (?)

Phonemerecognition

Lexicalaccess

Grammarconstraints

Speech

Words


Outline

Motivation

Physiology

Psychophysics

Pitch perception

Speech perception

Scene analysis- Events and sources- Fusion and streaming- Continuity & restoration- Simultaneous vowels

1

2

3

4

5

5


Auditory Organization

• Detection model is huge simplification

• Real role of hearing is much more general:Recover useful information from outside world

→ Sound organization into events and sources:

• Research questions:- what determines perception of sources?- how do humans separate mixtures?- how much can we tell about a source?

6

0 2 4 time/s

frq/Hz

0

2000

4000 Voice

Stab

Rumble


Auditory scene analysis:Simultaneous fusion

• Harmonics are distinct on AN, but perceived as one sound (“fused”):

- depends on common onset- depends on harmonicity (common period)

• Methodologies:- ask subject how many ‘objects’- match attributes e.g. object pitch- manipulate higher level e.g. vowel identity

time

freq


Sequential grouping: streaming

• Pattern / rhythm: property of set of objects

- subsequent to fusion employs fused events?

• Measure by relative timing judgments- cannot compare between streams

• Separate ‘coherence’ and ‘fusion’ boundaries

• Can interact and compete with fusion

[sndex]

∴

–2 octaves

TRT: 60-150 ms

time

freq

uenc

y∆f:

1 kHz


Continuity & restoration

• Tone is interrupted by noise burst:What happened?

- masking makes tone undetectable during noise

• Need to infer most probable real-world events- observation equally likely for either explanation- prior on continuous tone much higher → choose

• Top-down influence on perceived events...

pulsation threshold [sndex]

time

freq

+

+ +

?


Models of auditory organization

• Psychological accounts suggest bottom-up:

- (Brown 1991)

• Complications in practice:- formation of separate elements- contradictory cues- influence of top-down constraints (context,

expectations) ...

inputmixture

signalfeatures

(maps)

discreteobjects

Front end Objectformation

Groupingrules

Sourcegroups

onset

period

frq.mod

time

freq


Summary

• Auditory perception provides the ‘ground truth’ underlying audio processing

• Physiology specifies information available

• Psychophysics measures basic sensitivities

• Sound sources requires further organization

• Strong contextual effects in speech perception

Parting thought:

Is pitch central to communication? Why?

Transduce Sceneanalysis

Multiplerepresent'ns

High-levelrecognitionSound

Lecture 4: Auditory Perception - Columbia Universitydpwe/e6820/lectures/E... · E6820 SAPR - Dan...

Documents

Transcript of Lecture 4: Auditory Perception - Columbia Universitydpwe/e6820/lectures/E... · E6820 SAPR - Dan...