Speech Intelligibility Derived from Asynchronous Processing of Auditory-Visual Information

Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704, USA
http://www.icsi.berkeley.edu/~steveng

Ken W. Grant
Army Audiology and Speech Center, Walter Reed Army Medical Center
Washington, D.C. 20307, USA
http://www.wramc.amedd.army.mil/departments/aasc/avlab
Acknowledgements and Thanks
Technical Assistance: Takayuki Arai, Rosaria Silipo
Research Funding: U.S. National Science Foundation
BACKGROUND
What's the Big Deal with Speech Reading?

- Superior recognition and intelligibility under many conditions
- Provides phonetic-segment information that is potentially redundant with acoustic information (vowels)
- Provides segmental information that complements acoustic information (consonants)
- Directs auditory analyses to the target signal: who, where, when, what (spectral)
Audio-Visual vs. Audio-Only Recognition

[Figure: Percent Correct Recognition (0–100%) vs. Speech-to-Noise Ratio (−15 to +20 dB) for six conditions: NH auditory consonants, NH audiovisual consonants, HI auditory consonants, HI audiovisual consonants, HI auditory sentences, and ASR sentences]

NH = Normal Hearing; HI = Hearing Impaired
The visual modality provides a significant gain in speech processing
Particularly under low signal-to-noise-ratio conditions
And for hearing-impaired listeners
Figure courtesy of Ken Grant
Articulatory Information via Visual Cues

[Figure: Percent information transmitted relative to total information received, by articulatory feature (voicing, manner, place, other) – place of articulation: 93%; the remaining features: 0–4% each]

Place of Articulation Most Important

Figure courtesy of Ken Grant
Are Auditory & Visual Processing Independent?

Key issues pertaining to:
- Early versus late integration models of bi-modal information
- Most contemporary models favor late integration of information
- However … there is preliminary evidence (Sams et al., 1991) that silent speechreading can activate auditory cortex in humans (but Bernstein et al., 2002 say "nay")
- The superior colliculus (an upper brainstem nucleus) may also serve as a site of bimodal integration (or at least interaction; Stein and colleagues)
Time Constraints Underlying A/V Integration

What are the temporal factors underlying integration of audio-visual information for speech processing?

Two sets of data are examined:
- Spectro-temporal integration – audio-only signals
- Audio-visual integration using sparse spectral cues and speechreading

In each experiment the cues (acoustic and/or visual) are desynchronized and the impact on word intelligibility is measured (for English sentences)
EXPERIMENT OVERVIEW
Spectro-temporal Integration

Time course of integration:
- Within (the acoustic) modality – four narrow spectral slits; the central slits are desynchronized relative to the lateral slits
- Across modalities – two acoustic slits (the lateral channels) plus speechreading (video) information; the video and audio streams are desynchronized relative to each other
Auditory-Visual Asynchrony – Paradigm

Video of spoken (Harvard/IEEE) sentences, presented in tandem with a sparse spectral representation (low- and high-frequency slits) of the same material

Baseline condition: SYNCHRONOUS A/V
Video leads audio by 40–400 ms, or audio leads video by 40–400 ms
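The asynchrony manipulation amounts to a simple time shift of one stream relative to the other. A minimal sketch in Python (an illustrative helper, not the authors' stimulus-preparation code; the specific offset values below are examples spanning the stated 40–400 ms range):

```python
# Sketch of the A/V asynchrony manipulation: shift the audio stream
# relative to the video by a signed offset. Hypothetical helper, not
# taken from the study.

def shift_audio(audio, offset_ms, sample_rate=16000):
    """Positive offset: video leads (audio onset is zero-padded, i.e.
    delayed). Negative offset: audio leads (its onset is trimmed)."""
    n = int(abs(offset_ms) * sample_rate / 1000)
    if offset_ms > 0:
        return [0.0] * n + list(audio)
    if offset_ms < 0:
        return list(audio)[n:]
    return list(audio)

# Illustrative offsets within the 40-400 ms range used in the study
offsets_ms = [-400, -40, 0, 40, 400]
```

The video stream is left untouched; only the audio track is padded or trimmed, so a single video rendering serves every asynchrony condition.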
Auditory-Visual Integration – Preview

- When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals
- When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as long as 200 ms

Why? Why? Why?

We'll return to these data shortly. But first, let's take a look at audio-alone speech intelligibility data in order to gain some perspective on the audio-visual case. The audio-alone data come from earlier studies by Greenberg and colleagues using TIMIT sentences (9 subjects).
AUDIO-ALONE EXPERIMENTS
Audio (Alone) Spectral Slit Paradigm

- Can listeners decode spoken sentences using just four narrow (1/3-octave) channels ("slits") distributed across the spectrum? – YES (cf. next slide)
- The edge of each slit was separated from its nearest neighbor by an octave
- What is the intelligibility of each slit alone and in combination with others?
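One way to realize such 1/3-octave slits is with narrow bandpass filters centered at the stated CFs. A sketch using SciPy (the filter order and implementation are assumptions for illustration, not details from the study):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

CENTER_FREQS_HZ = [334, 841, 2120, 5340]  # slit CFs from the slides

def third_octave_slit(signal, cf, fs):
    """Bandpass one 1/3-octave-wide slit centered at cf.
    1/3-octave band edges: cf * 2**(-1/6) to cf * 2**(1/6)."""
    lo, hi = cf * 2 ** (-1 / 6), cf * 2 ** (1 / 6)
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

def four_slit_stimulus(signal, fs=16000):
    """Sum of the four slits: a sparse-spectrum speech stimulus."""
    return sum(third_octave_slit(signal, cf, fs) for cf in CENTER_FREQS_HZ)
```

Note that adjacent band edges end up roughly an octave apart, consistent with the separation described above.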
Word Intelligibility - Single and Multiple Slits

| Slit Number | CF (Hz) | Intelligibility (slit alone) |
|-------------|---------|------------------------------|
| 1           | 334     | 2%                           |
| 2           | 841     | 9%                           |
| 3           | 2120    | 9%                           |
| 4           | 5340    | 4%                           |

Multiple-slit combinations yielded 89%, 60%, and 13% word intelligibility.
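These percentages make plain how super-additive slit integration is: the four slits heard individually account for at most 24 percentage points in total, yet combined they support 89% word intelligibility. As a back-of-the-envelope check (figures taken from the slide):

```python
# Word-intelligibility percentages from the slide
single_slit_pct = {1: 2, 2: 9, 3: 9, 4: 4}   # each slit presented alone
all_four_pct = 89                             # all four slits together

additive_prediction = sum(single_slit_pct.values())      # 24
super_additive_gain = all_four_pct - additive_prediction
print(additive_prediction, super_additive_gain)          # -> 24 65
```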
Word Intelligibility - Single Slits

- The intelligibility associated with any single slit is only 2 to 9%
- The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits
Word Intelligibility - 4 Slits

Word Intelligibility - 2 Slits

Word Intelligibility - 2 Slits
Slit Asynchrony Affects Intelligibility

- Desynchronizing the slits by more than 25 ms results in a significant decline in intelligibility
- The effect of asynchrony on intelligibility is relatively symmetrical
- These data are from a different set of subjects than those participating in the study described earlier – hence the slightly different numbers for the baseline conditions
Intelligibility and Slit Asynchrony

- Desynchronizing the two central slits relative to the lateral ones has a pronounced effect on intelligibility
- Asynchrony greater than 50 ms results in intelligibility lower than baseline
AUDIO-VISUAL EXPERIMENTS
Focus on Audio-Leading-Video Conditions

- When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals
- These data are next compared with data from the previous slide to illustrate the similarity in the slope of the function
Comparison of A/V and Audio-Alone Data

- The decline in intelligibility for the audio-alone condition is similar to that of the audio-leading-video condition
- The similarity in the slopes associated with intelligibility in the two experiments suggests that the underlying mechanisms may be similar
- The intelligibility of the audio-alone signals is higher than that of the A/V signals because slits 2+3 are highly intelligible by themselves
Focus on Video-Leading-Audio Conditions

- When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms
- These data are rather strange, implying some form of "immunity" against intelligibility degradation when the video channel leads the audio
- We'll consider a variety of interpretations in a few minutes
Auditory-Visual Integration – the Full Monty

- The slope of intelligibility-decline associated with the video-leading-audio conditions is rather different from that of the audio-leading-video conditions
- WHY? WHY? WHY?
- There are several interpretations of these data – we'll consider several on the following slides
INTERPRETATION OF THE DATA
Possible Interpretations of the Data – 1

The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ….

Light travels faster than sound – therefore the video signal arrives in advance of the audio signal, and consequently the brain is adapted to dealing with video-leading-audio situations much more than vice versa

Some problems with this interpretation (at least by itself):
- The speed of light is ca. 186,300 miles per second (effectively instantaneous)
- The speed of sound is ca. 1129 feet per second (at sea level, 70° F, etc.)
- Subjects in this study were wearing headphones
- Therefore the time disparity between audio and visual signals was short (perhaps a few milliseconds)

(Let's put this potential interpretation aside for a few moments)
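The physical disparity at issue is easy to quantify: at ca. 1129 ft/s, sound trails the (effectively instantaneous) visual signal by roughly 0.9 ms per foot of talker distance. A quick back-of-the-envelope sketch (the distances are illustrative, not from the slides):

```python
SPEED_OF_SOUND_FT_PER_S = 1129.0  # sea level, 70 deg F (figure from the slide)

def audio_lag_ms(distance_ft):
    """Milliseconds by which the audio trails the visual signal for a
    talker at the given distance."""
    return distance_ft / SPEED_OF_SOUND_FT_PER_S * 1000.0

for d in (3, 20, 100):
    print(f"{d:>3} ft: audio lags by {audio_lag_ms(d):.1f} ms")
```

Even at 100 feet the natural lag is under 90 ms, and a 200 ms video lead corresponds to a talker roughly 226 feet away, far beyond conversational distance, which is one reason the slides treat the purely physical account as insufficient by itself.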
Possible Interpretations of the Data – 2

The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ….

Visual information is processed in the brain much more slowly than auditory information; therefore the video data actually arrive after the audio data in the current experimental situation. Thus, when the video channel leads the audio channel, the asynchrony compensates for an internal (neural) asynchrony, and the auditory and visual information arrive relatively in synch with each other

Some problems with this interpretation:
- Even if we assume the validity of this premise (visual processing lagging auditory processing), it would merely imply that the intelligibility-degradation functions associated with the audio-leading and video-leading conditions should be parallel (but offset from each other)
- However, the data do not correspond to this pattern
Possible Interpretations of the Data – 3

The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ….

The brain has evolved under conditions where the visual signal arrives prior to the audio signal, but where the time disparity between the two modalities varies from situation to situation. Under such conditions the brain must be tolerant of audio-visual asynchrony, since it is so common and ubiquitous

Some problems with this interpretation:
- If the brain were merely tolerant of audio-visual asynchrony, why would the audio-leading-the-video condition be so much more vulnerable to asynchronies of less than 200 ms?
- There must be some other factor (or set of factors) associated with this perceptual integration asymmetry. What would it (they) be?
Possible Interpretations of the Data – 4

The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ….

There is certain information in the video component of the speech signal that evolves over a relatively long interval of time (e.g., 200 ms) and is thus relatively immune to asynchronous combination with information contained in the audio channel. What might this information be?

- The visual component of the speech signal is most closely associated with place-of-articulation information (cf. Grant and Walden, 1996)
- In the (audio) speech signal, place-of-articulation information usually evolves over two or three phonetic segments (i.e., a syllable in length)
- This syllable-length interval for place-of-articulation cues would be appropriate for information that is encoded in a modality (in this instance, visual) that exhibits a variable degree of asynchrony with the auditory modality
The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….
There is certain information in the video component of the speech signal that evolves over a relatively long interval of time (e.g., 200 ms) and is thus relatively immune to asynchronous combination with information contained in the audio channel. What might this information be?
The visual component of the speech signal is most closely associated with place-of-articulation information (Grant and Walden, 1996)
In the (audio) speech signal, place-of-articulation information usually evolves over two or three phonetic segments (i.e., a syllable in length)
This syllable interval pertaining to place-of-articulation cues would be appropriate for information that is encoded in a modality (in this instance, visual) that exhibits a variable degree of asynchrony with the auditory modality
BUT ….. the data imply that the modality arriving first determines the mode (and hence the time constant of processing) for combining information across sensory channels
Possible Interpretations of the Data – 4
![Page 53: Speech Intelligibility Derived from Asynchronous Processing of Auditory-Visual Information Steven Greenberg International Computer Science Institute 1947.](https://reader036.fdocuments.in/reader036/viewer/2022062722/56649f385503460f94c54948/html5/thumbnails/53.jpg)
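The asymmetric tolerance described above can be sketched as a toy model. The function below is an illustration with made-up time constants, not a model proposed by the authors: it predicts intelligibility from audio-video asynchrony using a narrow tolerance window when the audio leads and a broad, roughly 200-ms window when the video leads, and it deliberately ignores the small benefit of a video lead discussed later in the deck.

```python
import math

def predicted_intelligibility(asynchrony_ms, peak=0.63,
                              audio_lead_tau=40.0, video_lead_tau=200.0):
    """Toy model of intelligibility vs. audio-video asynchrony.

    Negative asynchrony means the audio leads the video (rapid fall-off);
    positive asynchrony means the video leads the audio (tolerant out to
    ca. 200 ms). The time constants are illustrative, not fitted values.
    """
    if asynchrony_ms < 0:
        # Audio leads: narrow tolerance window
        return peak * math.exp(-abs(asynchrony_ms) / audio_lead_tau)
    # Video leads: broad tolerance window
    return peak * math.exp(-asynchrony_ms / video_lead_tau)

# A 120-ms video lead is predicted to cost far less intelligibility
# than a 120-ms audio lead:
print(predicted_intelligibility(120) > predicted_intelligibility(-120))  # True
```

Swapping the single exponential for a window peaking at a small video lead would capture the subject data more faithfully; the point here is only the asymmetry.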
VARIABILITY AMONG SUBJECTS
One Further Wrinkle to the Story ...

Perhaps the most intriguing property of the experimental results concerns the intelligibility patterns associated with individual subjects

For eight of the nine subjects, the condition associated with the highest intelligibility was one in which the video signal led the audio

The optimal asynchrony (in terms of intelligibility) varies from subject to subject, but generally lies between 80 and 120 ms
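One way to extract each subject's optimal asynchrony from data of this kind is a simple argmax over the tested conditions. The scores below are hypothetical numbers for illustration only, not the study's data:

```python
import numpy as np

# Hypothetical per-subject intelligibility (proportion correct) at each
# tested audio-video asynchrony; negative = audio leads, positive = video
# leads. Illustrative values only.
asynchronies_ms = np.array([-160, -80, 0, 40, 80, 120, 160, 200])
scores = np.array([
    [0.21, 0.42, 0.60, 0.63, 0.66, 0.65, 0.61, 0.58],  # subject 1
    [0.18, 0.39, 0.58, 0.62, 0.64, 0.67, 0.60, 0.55],  # subject 2
])

# Optimal asynchrony per subject = condition with the highest score
best = asynchronies_ms[scores.argmax(axis=1)]
print(best.tolist())  # [80, 120]: a video lead is optimal for both subjects
```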
Auditory-Visual Integration – by Individual Subjects

Variation across subjects: a leading video signal yields better intelligibility than synchronous presentation for 8 of 9 subjects

These data are complex, but the implications are clear: audio-visual integration is a complicated, poorly understood process, at least with respect to speech intelligibility
SUMMARY
Audio-Video Integration – Summary

Spectrally sparse audio and speech-reading information each provide minimal intelligibility when presented alone

The same information, when combined across modalities, can provide good intelligibility (63% average accuracy)

When the audio signal leads the video, intelligibility falls off rapidly as a function of modality asynchrony

When the video signal leads the audio, intelligibility is maintained for asynchronies as long as 200 ms

For eight of the nine subjects, the highest intelligibility is associated with conditions in which the video signal leads the audio (often by 80-120 ms)

There are many potential interpretations of the data

The interpretation currently favored by the presenter posits a relatively long (200-ms) integration buffer for audio-visual integration when the brain is confronted exclusively (even for short intervals) with speech-reading information, as occurs when the video signal leads the audio

The data further suggest that place-of-articulation cues evolve over syllabic intervals of ca. 200 ms and could therefore inform models of speech processing in general
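The asynchronous presentations summarized above can be simulated by shifting the audio track relative to the video. A minimal sketch, assuming 16-kHz audio (this is not the stimulus-preparation code used in the study):

```python
import numpy as np

def delay_audio(audio, delay_ms, sample_rate=16000):
    """Delay an audio signal relative to its video track by zero-padding
    the front, simulating a video-leading-audio presentation.
    A negative delay trims the front instead (the audio leads the video).
    Illustrative utility only.
    """
    shift = int(round(delay_ms * sample_rate / 1000))
    if shift >= 0:
        return np.concatenate([np.zeros(shift), audio])
    return audio[-shift:]

# A 440-Hz test tone standing in for a speech waveform
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
shifted = delay_audio(tone, 200)       # video leads audio by 200 ms
print(len(shifted) - len(tone))        # 3200 samples = 200 ms at 16 kHz
```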
That’s All
Many Thanks for Your Time and Attention