E6820 SAPR - Dan Ellis L04 - Perception 2006-02-09 - 1
EE E6820: Speech & Audio Processing & Recognition
Lecture 4: Auditory Perception
1. Motivation: Why & how
2. Auditory physiology
3. Psychophysics: Detection & discrimination
4. Pitch perception
5. Speech perception
6. Auditory organization & Scene analysis
Dan Ellis <[email protected]> http://www.ee.columbia.edu/~dpwe/e6820/
Columbia University Dept. of Electrical Engineering, Spring 2006
Why study perception?
• Perception is messy: Can we avoid it?
No!
• Audition provides the ‘ground truth’ in audio
- what is relevant and irrelevant
- subjective importance of distortion (coding etc.)
- (there could be other information in sound...)
• Some sounds are ‘designed’ for audition
- co-evolution of speech and hearing
• The auditory system is very successful
- we would do extremely well to duplicate it
• We are now able to model complex systems
- faster computers, bigger memories
How to study perception?
Three different approaches:
• Analyze the example: physiology
- dissection & nerve recordings
• Black box input/output: psychophysics
- fit simple models of simple functions
• Information processing models
- investigate and model complex functions
- e.g. scene analysis, speech perception
Outline
1. Motivation
2. Physiology
- Outer, middle & inner ear
- The Auditory Nerve and beyond
- Models
3. Psychophysics
4. Pitch perception
5. Speech perception
6. Scene analysis
Physiology
• Processing chain from air to brain:
  Outer ear → Middle ear → Inner ear (cochlea) → Auditory nerve → Midbrain → Cortex
• Study via:
- anatomy
- nerve recordings
• Signals flow in both directions
Outer & middle ear
• Pinna ‘horn’
- complex reflections give spatial (elevation) cues
• Ear canal
- acoustic tube
• Middle ear
- bones provide impedance matching
[Figure: ear anatomy, labeled: pinna, ear canal, eardrum (tympanum), middle ear bones, cochlea (inner ear)]
Inner ear: Cochlea
• Mechanical input from middle ear starts traveling wave moving down Basilar Membrane
• Varying stiffness and mass of BM results in continuous variation of resonant frequency
• At resonance, traveling wave energy is dissipated in BM vibration
→ Frequency (Fourier) analysis
[Figure: cochlea, labeled: oval window (from ME bones), basilar membrane (BM), travelling wave; resonant frequency vs. position, 16 kHz to 50 Hz over 0 to 35 mm]
Cochlea hair cells
• Ear converts sound to BM motion; each point on BM corresponds to a frequency
• Hair cells on BM convert motion into nerve impulses (firings)
• Inner Hair Cells detect motion
• Outer Hair Cells? Variable damping?
[BM animation]
[Figure: cochlea cross-section, labeled: basilar membrane, tectorial membrane, Inner Hair Cell (IHC), Outer Hair Cell (OHC), auditory nerve]
Inner Hair Cells
• IHCs convert BM vibration into nerve firings
• Human ear has ~3500 IHCs; each IHC has ~7 connections to the Auditory Nerve
• Each nerve fires (sometimes) near peak displacement:
• Histogram to get firing probability:
[Figure: local BM displacement and typical nerve signal (mV) vs. time / ms; histogram of firing count vs. cycle angle]
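The firing histogram described above can be illustrated with a toy simulation. This is only a sketch under an assumed rule that is not from the lecture: the firing probability per sample follows the half-wave rectified BM displacement. The function name `period_histogram` and all parameter values are hypothetical.

```python
import math
import random

def period_histogram(freq_hz=200.0, duration_s=2.0, fs=20000,
                     peak_prob=0.05, n_bins=16, seed=0):
    """Simulate a fiber that fires (sometimes) near peak BM displacement,
    then histogram the spikes by stimulus phase (cycle angle)."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for n in range(int(duration_s * fs)):
        phase = (freq_hz * n / fs) % 1.0              # cycle position in [0, 1)
        displacement = math.sin(2 * math.pi * phase)  # toy BM displacement
        # Assumed toy rule: firing probability follows the half-wave
        # rectified displacement, so spikes cluster near the positive peak.
        if rng.random() < peak_prob * max(displacement, 0.0):
            counts[int(phase * n_bins)] += 1
    return counts

counts = period_histogram()
print(counts)  # spikes concentrate in the first half-cycle, around its peak
```

Dividing each count by the number of stimulus cycles would turn the histogram into an estimated firing probability per phase bin.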
Auditory nerve (AN) signals
• Single nerve measurements:
• Hard to measure: probe living ANs?
[Figure: single-fiber measurements: frequency threshold (dB SPL, 20 to 80, vs. (log) frequency, 100 Hz to 10 kHz; approx. constant-Q); tone burst histogram (spike count vs. time, 100 ms tone burst); rate vs. intensity (spikes/sec vs. intensity / dB SPL)]
- One fiber: ~25 dB dynamic range
- Hearing dynamic range > 100 dB
AN population response
• All the information the brain has about sound:
- average rate & spike timings on 30,000 fibers
• Not unlike a (constant-Q) spectrogram?
[Figure: AN population response; freq / 8ve re 100 Hz (0 to 5) vs. time / ms (0 to 60)]
Beyond the auditory nerve
• Ascending and descending
• Tonotopic × ?
- modulation
- position
- source??
(from lloydwatts.com)
Periphery models
• Modeled aspects:
- outer/middle ear
- cochlea filtering
- hair cell transduction
- efferent feedback?
• Result: ‘neurogram’ / ‘cochleagram’
[Figure: block diagram: Sound → outer/middle ear filtering → cochlea filterbank → IHC (one per channel)]
[Figure: cochleagram, channel (10 to 60) vs. time / s (0 to 0.5); SlaneyPatterson, 12 chans/oct from 180 Hz, BBC1tmp (20010218)]
Outline
1. Motivation
2. Physiology
3. Psychophysics
- Detection theory modeling
- Intensity perception
- Masking
4. Pitch perception
5. Speech perception
6. Scene analysis
Psychophysics
• Physiology looks at the implementation; psychology looks at the function/behavior
• Analyze audition as signal detection:
- psychological tests reflect internal decisions
- assume optimal decision process
- infer nature of internal representations, noise, ...
→ lower bounds on more complex functions
• Different aspects to measure
- time, frequency, intensity
- tones, complexes, noise
- binaural
- pitch, detuning
[Figure label: p(ω | O)]
Basic psychophysics
• Relate physical and perceptual variables
- e.g. intensity → loudness, frequency → pitch
• Methodology: subject tests
- just noticeable difference (jnd)
- magnitude scaling e.g. ‘adjust to twice as loud’
• Results for Intensity vs. Loudness:
Weber’s law: $\Delta I \propto I \;\rightarrow\; \log(L) = k \cdot \log(I)$
[Figure: Hartmann (1993) classroom loudness scaling data; log(loudness rating) (1.4 to 2.6) vs. sound level / dB (-20 to 10)]
Power law fit: $L \propto I^{0.22}$
Textbook figure: $L \propto I^{0.3}$
$\log_2(L) = 0.3 \log_2(I) = 0.3 \cdot \frac{\log_{10} I}{\log_{10} 2} = \frac{0.3}{\log_{10} 2} \cdot \frac{\mathrm{dB}}{10} \approx \frac{\mathrm{dB}}{10}$
i.e. with the textbook exponent, loudness doubles for every 10 dB increase in level.
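The power-law arithmetic above can be checked numerically. The sketch below assumes only the relation L ∝ I^exponent from the slide; the function names are illustrative.

```python
import math

def loudness_ratio(delta_db, exponent=0.3):
    """Loudness ratio for a level change of delta_db, assuming L ∝ I**exponent."""
    intensity_ratio = 10 ** (delta_db / 10)
    return intensity_ratio ** exponent

def db_for_loudness_ratio(ratio, exponent=0.3):
    """Level change in dB needed to scale loudness by `ratio`."""
    return 10 * math.log10(ratio) / exponent

print(round(db_for_loudness_ratio(2.0), 2))        # 10.03 (textbook exponent 0.3)
print(round(db_for_loudness_ratio(2.0, 0.22), 2))  # 13.68 (classroom fit 0.22)
```

With the classroom-data exponent of 0.22, doubling the loudness rating takes closer to 14 dB than 10 dB.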
Loudness as a function of frequency
• Fletcher-Munson equal-loudness curves:
[Figure: equal-loudness curves; intensity / dB SPL (0 to 120) vs. freq / Hz (100 to 10,000)]
[Figure: equivalent loudness @ 1 kHz vs. intensity / dB, panels for 100 Hz and 1 kHz; annotated ‘rapid loudness growth’]
• Hearing impairment: exaggerates
Loudness as a function of bandwidth
• Same total energy, different distribution:
- e.g. 2 chans at -6 dB (not -10 dB)
• Critical bands: independent freq. channels
- ~ 25 total (4-6 / octave) [sndex]
[Figure: two spectra (mag vs. freq) with the same total energy I·B, at levels I0, I1 and bandwidths B0, B1; ... but wider perceived as louder; loudness vs. bandwidth B, with the ‘critical’ bandwidth marked]
Simultaneous masking
• A louder tone can ‘mask’ the perception of a second tone nearby in frequency:
[Figure: intensity / dB vs. log freq, showing absolute threshold, masking tone and masked threshold]
• Suggests an ‘internal noise’ model:
[Figure: distributions p(x | I) and p(x | I+∆I) of decision variable x, with internal noise of width σn]
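One way to make the internal-noise model concrete is the standard equal-variance Gaussian detection calculation. The Gaussian form is an assumption here, not something the slide states: p(x | I) and p(x | I+∆I) are taken to be Gaussians with the same σn, and the listener in a two-interval forced-choice trial picks the interval with the larger x. Function names are illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def d_prime(delta_i, sigma_n):
    """Separation of p(x | I) and p(x | I+ΔI) in units of internal noise."""
    return delta_i / sigma_n

def percent_correct_2ifc(delta_i, sigma_n):
    """P(correct) when the listener picks the interval with larger x.
    The difference of two noisy observations has std sigma_n * sqrt(2)."""
    return normal_cdf(d_prime(delta_i, sigma_n) / math.sqrt(2.0))

for delta in (0.0, 0.5, 1.0, 2.0):
    print(delta, round(percent_correct_2ifc(delta, sigma_n=1.0), 3))
```

Under these assumptions an increment equal to σn gives about 76% correct, which is near the level often taken as threshold in such tests.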
Sequential masking
• Backward/forward in time:
- suggests temporal envelope of decision var.
[Figure: intensity / dB vs. time, showing masker envelope and masked threshold: simultaneous masking ~10 dB, backward masking ~5 ms, forward masking ~100 ms]
→ Time-frequency masking ‘skirt’:
[Figure: intensity vs. time and freq, showing masking tone and masked threshold]
What we do and don’t hear
• Timing: 2ms attack resolution, 20ms discrim
- but: spectral splatter
• Tuning: ~ 1% discrimination
- but: beats
• Spectrum: profile changes, formants
- variable time-frequency resolution
• Harmonic phase?
• Noisy signals & texture
• (Trace vs. categorical memory)
[Figure: “two-interval forced-choice” test over time: A, B, X; X = A or B?]
Outline
1. Motivation
2. Physiology
3. Psychophysics
4. Pitch perception
- ‘Place’ models
- ‘Time’ models
- Multiple cues & competition
5. Speech perception
6. Scene analysis
Pitch perception: A classic argument in psychophysics
• Harmonic complexes are a pattern on AN
- .. but give a fused percept (ecological)
• What determines the pitch percept?
- not the fundamental
• How is it computed?
Two competing models: place and time
[Figure: AN response to a harmonic complex; freq. chan. (10 to 70) vs. time / s (0 to 0.1)]
Place model of pitch
• AN excitation pattern shows individual peaks
[Figure: AN excitation vs. frequency channel; resolved harmonics at low frequencies; broader HF channels cannot resolve harmonics]
• ‘Pattern matching’ method to find pitch:
[Figure: correlate with harmonic ‘sieve’ to get pitch strength vs. frequency channel]
• Support: Low harmonics are very important
• But: Flat-spectrum noise can carry pitch
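The ‘pattern matching’ idea can be sketched with a simple harmonic sieve. This is a minimal illustration rather than the model in the figure: it works on a list of resolved peak frequencies instead of a full excitation pattern, and the scoring rule, the 3% tolerance, and the function names are all assumptions.

```python
def sieve_score(peaks_hz, f0, tolerance=0.03):
    """Average closeness of each peak to the nearest harmonic of f0.
    1.0 means every peak sits exactly on a harmonic; a peak further than
    `tolerance` (relative) from its nearest harmonic contributes 0."""
    total = 0.0
    for f in peaks_hz:
        harmonic = max(1, round(f / f0))
        rel_error = abs(f - harmonic * f0) / (harmonic * f0)
        total += max(0.0, 1.0 - rel_error / tolerance)
    return total / len(peaks_hz)

def estimate_pitch(peaks_hz, candidates):
    """Pick the best-scoring candidate f0. Ties go to the highest f0, so
    subharmonics (f0/2, f0/4, ...) that also catch every peak do not win."""
    return max(candidates, key=lambda f0: (sieve_score(peaks_hz, f0), f0))

# Harmonics 2 to 5 of 200 Hz, with no energy at the fundamental itself.
peaks = [400, 600, 800, 1000]
print(estimate_pitch(peaks, range(50, 501, 5)))  # 200
```

The example returns 200 Hz even though no peak is present there, in line with the earlier point that the fundamental does not determine the pitch percept.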
Time model of pitch
• Timing information is preserved in AN down to ~ 1ms scale
• Extract periodicity by e.g. autocorrelation & combine across frequency chans:
[Figure: per-channel autocorrelation (freq vs. lag / ms, 0 to 30) combined into a summary autocorrelation; peak at the common period (pitch)]
• But: HF gives weak pitch (in practice)
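A minimal sketch of the per-channel and summary autocorrelation follows. It replaces the cochlear filterbank and hair-cell stage with a crude stand-in, one pure sinusoid per "channel", which is an assumption made only to keep the example short; the sample rate, duration, and function names are likewise illustrative.

```python
import math

def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x[n] * x[n + lag], for lag = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def summary_autocorrelation(channels, max_lag):
    """Add the per-channel autocorrelations together, lag by lag."""
    per_channel = [autocorrelation(ch, max_lag) for ch in channels]
    return [sum(r[lag] for r in per_channel) for lag in range(max_lag + 1)]

def pitch_from_sacf(sacf, fs, min_lag):
    """Pitch = sample rate / lag of the largest summary peak (lag >= min_lag)."""
    best_lag = max(range(min_lag, len(sacf)), key=lambda lag: sacf[lag])
    return fs / best_lag

fs = 8000
n_samples = 400  # 50 ms
# Toy stand-in for filterbank outputs: harmonics 2, 3, 4 of 200 Hz,
# one per channel, with the fundamental itself absent.
channels = [[math.sin(2 * math.pi * f * n / fs) for n in range(n_samples)]
            for f in (400.0, 600.0, 800.0)]
sacf = summary_autocorrelation(channels, max_lag=160)  # lags up to 20 ms
print(pitch_from_sacf(sacf, fs, min_lag=8))            # 200.0
```

Each channel peaks at multiples of its own period, but only the 5 ms lag is a peak in all three, so the summary picks out the common period.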
Alternate & competing cues
• Pitch perception could rely on various cues
- average excitation pattern
- summary autocorrelation
- more complex pattern matching
• Relying on just one cue is brittle
- e.g. missing fundamental
→ Perceptual system appears to use a flexible, opportunistic combination
• Optimal detector justification?
$\arg\max_{\omega} p(\omega \mid o) = \arg\max_{\omega} \frac{p(o \mid \omega) \cdot p(\omega)}{p(o)} = \arg\max_{\omega} p(o_1 \mid \omega) \cdot p(o_2 \mid \omega) \cdot p(\omega)$
if $o_1$ and $o_2$ are conditionally independent
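The cue-combination rule above can be sketched for a discrete set of hypotheses ω. The numbers below are made up purely for illustration, and the names `place_cue` and `time_cue` are hypothetical labels for p(o1 | ω) and p(o2 | ω).

```python
def map_estimate(prior, likelihood_1, likelihood_2):
    """argmax over ω of p(o1|ω) · p(o2|ω) · p(ω).
    p(o) is omitted because it does not depend on ω."""
    scores = {w: likelihood_1[w] * likelihood_2[w] * prior[w] for w in prior}
    return max(scores, key=scores.get), scores

prior = {"100 Hz": 0.5, "200 Hz": 0.5}      # p(ω), hypothetical
place_cue = {"100 Hz": 0.6, "200 Hz": 0.4}  # p(o1 | ω), hypothetical
time_cue = {"100 Hz": 0.2, "200 Hz": 0.8}   # p(o2 | ω), hypothetical

best, scores = map_estimate(prior, place_cue, time_cue)
print(best)  # 200 Hz
```

Here the place cue alone weakly favors 100 Hz, but the more decisive time cue dominates the product, which is the sense in which a combination is less brittle than any single cue.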
Outline
1. Motivation
2. Physiology
3. Psychophysics
4. Pitch perception
5. Speech perception
- The sounds of speech
- Phoneme perception
- Context and top-down influences
6. Scene analysis
Speech perception
• Highly specialized function
- subsequent to source organization?
- .. but also can interact
• Kinds of speech sounds:
- vowels
- glides
- nasals
- stops
- fricatives ...
[Figure: spectrogram (freq / Hz, 0 to 4000, vs. time / s, 1.4 to 2.6; level / dB) of “has a watch thin as a dime”, with labeled examples: stop burst, fricative, vowel, nasal, glide]
Cues to phoneme perception
• Linguists describe speech with phonemes:
- phonemes define minimal word contrasts
• Acoustic-phoneticians describe phonemes by:
[Figure: phoneme labels aligned with the words “has a watch thin as a dime”]
- formants & transitions
- bursts & onset times
[Figure: schematic spectrogram (freq vs. time) showing vowel formants, transition, stop burst, voicing onset time]
Categorical perception
• (Some) speech sounds perceived categorically rather than analogically
- e.g. stop-burst & timing:
[Figure: perceived stop (P, T, K) as a function of burst freq fb / Hz (1000 to 4000) and following vowel; schematic (freq vs. time) of a stop burst at fb followed by vowel formants]
- tokens within category are hard to distinguish
- category boundaries are very sharp
• Categories are learned for native tongue
- “merry” / “mary” / “marry”
Where is the information in speech?
• ‘Articulation’ of high/low-pass filtered speech:
[Figure: articulation / % (20 to 80) vs. freq / Hz (1000 to 4000), curves for high-pass and low-pass filtered speech]
- sums to more than 1...
• Speech message is highly redundant
- e.g. constraints of language, context
→ listeners can understand with very few cues
Top-down influences: Phonemic restoration (Warren 1970)
• What if a noise burst obscures speech?
- auditory system ‘restores’ the missing phoneme
  ... based on semantic context
  ... even in retrospect!
• Subjects are typically unaware of which sounds are restored
[Figure: spectrogram, freq / Hz (0 to 4000) vs. time / s (1.4 to 2.6)]
A predisposition for speech: Sinewave replicas (Remez et al. 1994)
• Replace each formant with a single sinusoid:
- speech is (somewhat) intelligible
- people hear both whistles and speech (“duplex”)
- processed as speech despite un-speech-like
• What does it take to be speech?
[Figure: spectrograms (freq / Hz, 0 to 5000, vs. time / s, 0.5 to 3) of the original speech and of its sinewave replica (‘Speech’, ‘Sines’)]
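The construction of a sinewave replica can be sketched as follows, assuming the formant frequency tracks have already been estimated by some other means. The tracks and amplitudes below are invented for illustration, and the fixed per-formant amplitude is a simplification made here; a real replica would follow each formant's changing amplitude as well.

```python
import math

def sinewave_replica(formant_tracks, amplitudes, fs=8000):
    """Sum one sinusoid per formant.
    formant_tracks: list of per-sample frequency lists in Hz, one per formant.
    amplitudes: one fixed amplitude per formant (a simplification)."""
    n = len(formant_tracks[0])
    out = [0.0] * n
    for track, amp in zip(formant_tracks, amplitudes):
        phase = 0.0
        for i in range(n):
            # Accumulate phase from the instantaneous frequency so the
            # sinusoid glides smoothly as the formant moves.
            phase += 2 * math.pi * track[i] / fs
            out[i] += amp * math.sin(phase)
    return out

# Hypothetical tracks: F1 rising from 300 to 700 Hz, F2 falling from 2200 to 1200 Hz.
n = 800  # 100 ms at 8 kHz
f1 = [300 + 400 * i / n for i in range(n)]
f2 = [2200 - 1000 * i / n for i in range(n)]
signal = sinewave_replica([f1, f2], amplitudes=[1.0, 0.5])
print(len(signal))  # 800
```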
Simultaneous vowels
• Mix synthetic vowels with different f0s:
[Figure: spectra (dB vs. freq): /iy/ @ 100 Hz + /ah/ @ 125 Hz = mixture]
• Pitch difference helps (though not necessary):
[Figure: % both vowels correct (25 to 75) vs. ∆f0 (semitones: 0, 1/4, 1/2, 1, 2, 4)]
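The ∆f0 axis is in semitones, so it may help to relate it to the f0 values in Hz. A short sketch using the standard 12-semitones-per-octave definition (function names are illustrative):

```python
import math

def semitones(f_a, f_b):
    """Interval from f_a up to f_b in semitones (12 per octave)."""
    return 12 * math.log2(f_b / f_a)

def shift_by_semitones(f0, n_semitones):
    """Frequency that lies n_semitones above f0."""
    return f0 * 2 ** (n_semitones / 12)

print(round(semitones(100.0, 125.0), 2))          # 3.86
print(round(shift_by_semitones(100.0, 0.25), 2))  # 101.45
```

So the /iy/ @ 100 Hz and /ah/ @ 125 Hz example is close to the 4-semitone end of the axis, while the smallest nonzero step, 1/4 semitone, is only about a 1.5 Hz difference at 100 Hz.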
Computational models of speech perception
• Various theoretical and practical models of speech comprehension, e.g.:
  Speech → Phoneme recognition → Lexical access → Grammar constraints → Words
• Open questions:
- mechanism of phoneme classification
- mechanism of lexical recall
- mechanism of grammar constraints
• ASR is a practical implementation (?)
Outline
1. Motivation
2. Physiology
3. Psychophysics
4. Pitch perception
5. Speech perception
6. Scene analysis
- Events and sources
- Fusion and streaming
- Continuity & restoration
- Simultaneous vowels
Auditory Organization
• Detection model is a huge simplification
• Real role of hearing is much more general:
Recover useful information from outside world
→ Sound organization into events and sources:
[Figure: spectrogram (frq/Hz, 0 to 4000, vs. time/s, 0 to 4) with labeled events: Voice, Stab, Rumble]
• Research questions:
- what determines perception of sources?
- how do humans separate mixtures?
- how much can we tell about a source?
Auditory scene analysis: Simultaneous fusion
• Harmonics are distinct on AN, but perceived as one sound (“fused”):
- depends on common onset
- depends on harmonicity (common period)
• Methodologies:
- ask subject how many ‘objects’
- match attributes e.g. object pitch
- manipulate higher level e.g. vowel identity
Sequential grouping: streaming
• Pattern / rhythm: property of set of objects
- subsequent to fusion ∴ employs fused events?
• Measure by relative timing judgments
- cannot compare between streams
• Separate ‘coherence’ and ‘fusion’ boundaries
• Can interact and compete with fusion
[sndex]
[Figure: tone sequence, frequency vs. time; labels: 1 kHz, ∆f: –2 octaves, TRT: 60-150 ms]
Continuity & restoration
• Tone is interrupted by noise burst: What happened?
- masking makes tone undetectable during noise
• Need to infer most probable real-world events
- observation equally likely for either explanation
- prior on continuous tone much higher → choose
• Top-down influence on perceived events...
pulsation threshold [sndex]
[Figure: freq vs. time; tone + noise burst + tone = ?]
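The inference on this slide can be written as posterior odds. The notation is introduced here for illustration and is not from the lecture: $C$ is "the tone continues behind the noise", $G$ is "the tone stops and restarts", and $o$ is the observation.

```latex
\[
\frac{P(C \mid o)}{P(G \mid o)}
  = \frac{p(o \mid C)}{p(o \mid G)} \cdot \frac{P(C)}{P(G)}
  = \frac{P(C)}{P(G)} \gg 1
\]
```

The likelihood ratio is 1 because masking makes the observation equally likely under either explanation, so the decision rests entirely on the prior, which favors the continuous tone.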
Models of auditory organization
• Psychological accounts suggest bottom-up:
  input mixture → Front end → signal features (maps: onset, period, frq. mod) → Object formation → discrete objects → Grouping rules → Source groups
- (Brown 1991)
• Complications in practice:
- formation of separate elements
- contradictory cues
- influence of top-down constraints (context, expectations) ...
Summary
• Auditory perception provides the ‘ground truth’ underlying audio processing
• Physiology specifies information available
• Psychophysics measures basic sensitivities
• Sound sources require further organization
• Strong contextual effects in speech perception
Parting thought:
Is pitch central to communication? Why?
[Figure: Sound → Transduce → Multiple represent'ns → Scene analysis → High-level recognition]