
Speech Intelligibility Derived from Asynchronous Processing of Auditory-Visual Information

Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704, USA
http://www.icsi.berkeley.edu/~steveng
steveng@icsi.berkeley.edu

Ken W. Grant
Army Audiology and Speech Center, Walter Reed Army Medical Center
Washington, D.C. 20307, USA
http://www.wramc.amedd.army.mil/departments/aasc/avlab
grant@tidalwave.net

Acknowledgements and Thanks

Technical Assistance: Takayuki Arai, Rosaria Silipo

Research Funding: U.S. National Science Foundation

BACKGROUND

Superior recognition and intelligibility under many conditions

Provides phonetic-segment information that is potentially redundant with acoustic information

Vowels

Provides segmental information that complements acoustic information

Consonants

Directs auditory analyses to the target signal

Who, where, when, what (spectral)

What’s the Big Deal with Speech Reading?

[Figure: Percent Correct Recognition (0–100%) as a function of Speech-to-Noise Ratio (-15 to +20 dB) for six conditions: NH auditory consonants, NH audiovisual consonants, HI auditory consonants, HI audiovisual consonants, HI auditory sentences, and ASR sentences]

Audio-Visual vs. Audio-Only Recognition

NH = Normal Hearing; HI = Hearing Impaired

The visual modality provides a significant gain in speech processing

Particularly under low signal-to-noise-ratio conditions

And for hearing-impaired listeners

Figure courtesy of Ken Grant

[Figure: Percent Information Transmitted relative to Total Information Received via visual cues, broken down by articulatory feature (voicing, manner, place, other). Place of articulation accounts for 93%; the remaining features account for only 0–4% each.]

Articulatory Information via Visual Cues

Figure courtesy of Ken Grant

Place of Articulation Most Important
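Scores of this kind are conventionally derived from a stimulus-response confusion matrix with a relative information-transmission measure in the style of Miller and Nicely (1955); it is assumed here (not stated on the slide) that the figure reflects such an analysis. The sketch below shows the computation; the function name and the toy confusion matrix are illustrative only, not data from the study.

    # Illustrative sketch (not the study's analysis code): relative information
    # transmitted for one articulatory feature, computed from a stimulus-by-response
    # confusion matrix in the style of Miller and Nicely (1955). Counts are made up.
    import numpy as np

    def relative_info_transmitted(confusions):
        """Return T(X;Y) / H(X) for a count matrix (stimuli in rows, responses in columns)."""
        p = confusions / confusions.sum()          # joint probabilities
        px = p.sum(axis=1, keepdims=True)          # stimulus marginals
        py = p.sum(axis=0, keepdims=True)          # response marginals
        nz = p > 0
        transmitted = np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz]))   # T(X;Y)
        source_entropy = -np.sum(px[px > 0] * np.log2(px[px > 0]))     # H(X)
        return transmitted / source_entropy

    # Toy example: two hypothetical place categories, visual-only responses
    toy = np.array([[45.0, 5.0],
                    [4.0, 46.0]])
    print(round(relative_info_transmitted(toy), 2))   # 0 = chance, 1 = perfectly transmitted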

Key issues pertaining to:

Early versus late integration models of bi-modal information

Most contemporary models favor late integration of information

However … preliminary evidence (Sams et al., 1991) that silent speechreading can activate auditory cortex (in humans) (but Bernstein et al., 2002 say “nay”)

Superior colliculus (an upper brainstem nucleus) may also serve as a site of bimodal integration (or at least interaction; Stein and colleagues)

Are Auditory & Visual Processing Independent?

What are the temporal factors underlying integration of audio-visual information for speech processing?

Two sets of data are examined:

Spectro-temporal integration – audio-only signals

Audio-visual integration using sparse spectral cues and speechreading

In each experiment the cues (acoustic and/or visual) are desynchronized and the impact on word intelligibility is measured (for English sentences)

Time Constraints Underlying A/V Integration

EXPERIMENT OVERVIEW

Time course of integration

Within (the acoustic) modality – Four narrow spectral slits; the central slits are desynchronized relative to the lateral slits

Across modalities – Two acoustic slits (the lateral channels) plus speechreading video information; the video and audio streams are desynchronized relative to each other

Spectro-temporal Integration

Auditory-Visual Asynchrony - Paradigm

Video of spoken (Harvard/IEEE) sentences, presented in tandem with a sparse spectral representation (low- and high-frequency slits) of the same material

[Figure: The video leads the audio, or the audio leads the video, by 40 – 400 ms; the baseline condition is synchronous A/V]
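As a rough sketch of the manipulation (not the actual experimental software, whose details are not given here), desynchronizing the two streams amounts to shifting the audio waveform against the video timeline by a fixed offset; all names and parameters below are illustrative.

    # Hypothetical sketch of the A/V desynchronization (not the actual experimental
    # software): shift the audio waveform relative to the video timeline by a fixed
    # offset while keeping its overall length constant.
    import numpy as np

    def shift_audio(audio, sample_rate, video_lead_ms):
        """Positive video_lead_ms delays the audio (video leads); negative advances it."""
        shift = int(round(abs(video_lead_ms) * sample_rate / 1000.0))
        if video_lead_ms > 0:      # video leads: insert silence before the audio
            return np.concatenate([np.zeros(shift), audio])[:len(audio)]
        if video_lead_ms < 0:      # audio leads: drop the first samples, pad the end
            return np.concatenate([audio[shift:], np.zeros(shift)])
        return audio

    # Example: the 40 - 400 ms video-lead conditions at a 16 kHz sampling rate
    fs = 16000
    audio = np.random.randn(3 * fs)                 # stand-in for one recorded sentence
    video_lead_conditions = {ms: shift_audio(audio, fs, ms) for ms in (40, 120, 200, 400)}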

Auditory-Visual Integration - Preview

When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals

When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as long as 200 ms

Why? Why? Why?

We’ll return to these data shortly

But first, let’s take a look at audio-alone speech intelligibility data in order to gain some perspective on the audio-visual case

The audio-alone data come from earlier studies by Greenberg and colleagues using TIMIT sentences

9 Subjects

AUDIO-ALONE EXPERIMENTS

The edge of each slit was separated from its nearest neighbor by an octave

Can listeners decode spoken sentences using just four narrow (1/3 octave) channels (“slits”) distributed across the spectrum? – YES (cf. next slide)

What is the intelligibility of each slit alone and in combination with others?

Audio (Alone) Spectral Slit Paradigm
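The sketch below illustrates, hypothetically, how four 1/3-octave slits at the center frequencies listed in the figures that follow (334, 841, 2120, and 5340 Hz) could be carved out of a sentence and recombined; it is not the stimulus-generation code used in the study, and the particular filter design is an assumption.

    # Illustrative sketch of the four-slit stimuli (not the study's stimulus code):
    # band-pass the speech into narrow 1/3-octave "slits" and sum any desired subset.
    from scipy.signal import butter, sosfiltfilt

    CENTER_FREQS_HZ = [334, 841, 2120, 5340]        # slit center frequencies (from the figure)

    def third_octave_slit(signal, fs, cf):
        """Extract a 1/3-octave band centered on cf (band edges at cf * 2**(+/- 1/6))."""
        lo, hi = cf * 2 ** (-1 / 6), cf * 2 ** (1 / 6)
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        return sosfiltfilt(sos, signal)

    def slit_stimulus(signal, fs, which=(0, 1, 2, 3)):
        """Sum a subset of slits, e.g. which=(0, 3) for the two lateral channels only."""
        return sum(third_octave_slit(signal, fs, CENTER_FREQS_HZ[i]) for i in which)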

[Figure: Spectral slit paradigm and word intelligibility scores – single slits yield 2%, 9%, 9%, and 4% correct, while slit combinations yield 89%, 60%, and 13%]

Word Intelligibility - Single and Multiple Slits

[Figure: Word intelligibility for single and multiple slits, arranged by slit number (1–4) and center frequency (334, 841, 2120, and 5340 Hz)]

Word Intelligibility - Single Slits

The intelligibility associated with any single slit is only 2 to 9%

The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits

Word Intelligibility - 4 Slits

Word Intelligibility - 2 Slits

Word Intelligibility - 2 Slits

Slit Asynchrony Affects Intelligibility

Desynchronizing the slits by more than 25 ms results in a significant decline in intelligibility

The effect of asynchrony on intelligibility is relatively symmetrical

These data are from a different set of subjects than those participating in the study described earlier - hence slightly different numbers for the baseline conditions

Intelligibility and Slit Asynchrony

Desynchronizing the two central slits relative to the lateral ones has a pronounced effect on intelligibility

Asynchrony greater than 50 ms results in intelligibility lower than baseline
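A corresponding, purely illustrative sketch of this within-modality manipulation: delay the summed central-slit waveform relative to the summed lateral-slit waveform before remixing (the function and argument names are assumptions, not the study's code).

    # Hypothetical sketch of the slit-asynchrony manipulation: delay the summed
    # central slits (2+3) relative to the summed lateral slits (1+4), then remix.
    import numpy as np

    def desynchronize_slits(lateral, central, fs, asynchrony_ms):
        """Delay the central-slit waveform by asynchrony_ms and add it to the lateral slits."""
        shift = int(round(asynchrony_ms * fs / 1000.0))
        delayed = np.concatenate([np.zeros(shift), central])[:len(central)]
        return lateral + delayed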

AUDIO-VISUAL EXPERIMENTS

Focus on Audio-Leading-Video Conditions

When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals

These data are next compared with those from the previous slide to illustrate the similarity in the slope of the function

Comparison of A/V and Audio-Alone Data

The decline in intelligibility for the audio-alone condition is similar to that of the audio-leading-video condition

The similarity in the slopes of the intelligibility functions for the two experiments suggests that the underlying mechanisms may be similar

The intelligibility of the audio-alone signals is higher than that of the A/V signals because slits 2+3 are highly intelligible by themselves

When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms

These data are rather strange, implying some form of “immunity” against intelligibility degradation when the video channel leads the audio

We’ll consider a variety of interpretations in a few minutes

Focus on Video-Leading-Audio Conditions

The slope of intelligibility-decline associated with the video-leading-audio conditions is rather different from the audio-leading-video conditions

WHY? WHY? WHY?

There are several interpretations of these data – we’ll consider them on the following slides

Auditory-Visual Integration - the Full Monty

INTERPRETATION OF THE DATA


The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….

Light travels faster than sound – therefore the video signal arrives in advance of the audio signal and consequently the brain is adapted to dealing with video-leading-audio situations much more than vice versa

Some problems with this interpretation (at least by itself) …..

The speed of light is ca. 186,300 miles per second (effectively instantaneous)

The speed of sound is ca. 1129 feet per second (at sea level, 70° F, etc.)

Subjects in this study were wearing headphones

Therefore the time disparity between audio and visual signals was short (perhaps a few milliseconds)

(Let’s put this potential interpretation aside for a few moments)

Possible Interpretations of the Data – 1
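For scale (a back-of-the-envelope check, not a figure from the study): at a conversational distance of roughly 2 meters, the acoustic signal lags the visual signal by only

    2 m / 344 m/s (about 1129 ft/s) ≈ 6 ms

which is indeed just a few milliseconds, far shorter than the 200 ms of video lead that listeners tolerate in this experiment.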


The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….

Visual information is processed in the brain much more slowly than auditory information, therefore the video data actually arrives after the audio data in the current experimental situation. Thus, when the video channel leads the audio channel the asynchrony compensates for an internal (neural) asynchrony and the auditory and visual information arrive relatively in synch with each other

Some problems with this interpretation ….

Even if we assume the validity of this assumption (visual processing lagging auditory processing) this interpretation would merely imply that the intelligibility-degradation functions associated with the audio-leading and video-leading conditions should be parallel (but offset from each other)

However, the data do not correspond to this pattern

Possible Interpretations of the Data – 2

Auditory-Visual Integration

The slope of intelligibility-decline associated with the video-leading-audio conditions is rather different from that of the audio-leading-video conditions

WHY? WHY? WHY?


The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….

The brain has evolved under conditions where the visual signal arrives prior to the audio signal, but where the time disparity between the two modalities varies from situation to situation. Under such conditions the brain must be tolerant of audio-visual asynchrony since it is so common and ubiquitous

Some problems with this interpretation ….

If the brain were merely tolerant of audio-visual asynchrony then why would the audio-leading-the-video condition be so much more vulnerable to asynchronies less than 200 ms?

There must be some other factor (or set of factors) associated with this perceptual integration asymmetry. What would it (they) be?

Possible Interpretations of the Data – 3


The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….

There is certain information in the video component of the speech signal that evolves over a relatively long interval of time (e.g., 200 ms) and is thus relatively immune to asynchronous combination with information contained in the audio channel. What might this information be?

The visual component of the speech signal is most closely associated with place-of-articulation information (Grant and Walden, 1996)

In the (audio) speech signal, place-of-articulation information usually evolves over two or three phonetic segments (i.e., a syllable in length)

This syllable interval pertaining to place-of-articulation cues would be appropriate for information that is encoded in a modality (in this instance, visual) that exhibits a variable degree of asynchrony with the auditory modality

BUT ….. the data imply that the modality arriving first determines the mode (and hence the time constant of processing) for combining information across sensory channels

Possible Interpretations of the Data – 4

VARIABILITY AMONG SUBJECTS


Perhaps the most intriguing property of the experimental results concerns the intelligibility patterns associated with individual subjects

For eight of the nine subjects, the condition associated with the highest intelligibility was one in which the video signal led the audio

The length of optimal asynchrony (in terms of intelligibility) varies from subject to subject, but is generally between 80 and 120 ms

One Further Wrinkle to the Story ….

Variation across subjects

Video signal leading is better than synchronous for 8 of 9 subjects

Auditory-Visual Integration - by Individual Ss

These data are complex, but the implications are clear.

Audio-visual integration is a complicated, poorly understood process, at least with respect to speech intelligibility

SUMMARY


Spectrally sparse audio and speech-reading information provide minimal intelligibility when presented alone in the absence of the other modality

This same information can, when combined across modalities, provide good intelligibility (63% average accuracy)

When the audio signal leads the video, intelligibility falls off rapidly as a function of modality asynchrony

When the video signal leads the audio, intelligibility is maintained for asynchronies as long as 200 ms

For eight out of nine subjects, the highest intelligibility is associated with conditions in which the video signal leads the audio (often by 80-120 ms)

There are many potential interpretations of the data

The interpretation currently favored by the presenter posits a relatively long (200 ms) integration buffer for audio-visual integration when the brain is confronted exclusively (even for short intervals) with speech-reading information (as occurs when the video signal leads the audio)

The data further suggest that place-of-articulation cues evolve over syllabic intervals of ca. 200 ms in length and could therefore potentially apply to models of speech processing in general

Audio-Video Integration – Summary

That’s All

Many Thanks for Your Time and Attention