An Introduction to Speech Perception CDIS 4017/5017 Speech & Hearing Science I November 18, 2009.
An Introduction to Speech Perception
CDIS 4017/5017
Speech & Hearing Science I
November 18, 2009
Web resources used in talk
www.sil.org/computing/sa/index.htm
www.bl.uk/nsa
www.ntid.rit.edu/speechlang/slpros/instruction/segmentall/focusing.php
cslu.cse.ogi.edu/tutordemos/SpectrogramReading/ipa/ipahome.html
http://library.thinkquest.org/19537/
www.indiana.edu/~acoustic/spsites.html
www.haskins.yale.edu/facilities/
Speech Perception
• Is speech special?
– Do we perceive it using different mechanisms or strategies than we’d use on other sounds?
– For those in the speech perception literature, this is a significant issue:
• Perceptual tasks reveal that human discrimination ability differs when listening to the components of speech sounds in isolation rather than within a speech context
• This is important because it indicates that two different processing strategies may be used on the same (or similar) stimuli
Speech Perception
• Is speech special?
– Speech stimuli activate the auditory pathways in a unique manner
– Many other sounds, however, evoke unique responses, so a cognitive component is considered along with the peripheral mechanisms
• Required of the receiver if there is to be an appreciation for the symbolic nature of the speech signal
• Each speech token can convey significant meaning, but context, situational cues, and suprasegmental information are also present
Speech Perception
• Is speech special?
– Consider the “click” sound audible as part of any brief waveform
• In isolation, we perceive the sound as a click, similar to the sound a light switch makes when flipped
• Another example is the glide associated with a formant transition
– Those particular sounds are not heard as clicks or glides in a phonetic context
• Yet they obviously influence (and are influenced by) the neighboring vowel(s)
• This overlap, or running together, of sounds that affects the identity of the phonemes is termed coarticulation
Speech Perception
• Communication (written/verbal/gestural) is an exchange of information, a transaction
– The amount of information transmitted relies upon the system used by the sender of the message
– The two forms of communication systems that should be considered are ciphers and codes
– A cipher is the simpler system of communication, one in which each symbol corresponds to one, and only one, item in the alphabet or lexicon
• Ciphers typically cannot transmit information quickly
• For example, Morse code can be processed by an expert individual at a rate of 1-2 words/sec.
Difference between cipher and code
• Ciphers require that every word be spelled out, one letter at a time, in a very precise manner
– Examples of ciphers include finger spelling, semaphore, and Morse code transmitted visually, acoustically, or electronically
• There is no overlap between adjacent components of a signal
– That is, *** --- *** contains three different ciphered letters that are combined to form a word or, in this case, an abbreviation
– Each component is placed in a specific place and is transmitted independently of the other components
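The one-to-one, non-overlapping property of a cipher can be sketched in code. Below is a minimal Python illustration (the two-symbol table is just a fragment of the full Morse alphabet): each signal unit maps to exactly one letter, and decoding is an independent lookup, with no influence from neighboring units.

```python
# A cipher: each signal unit maps to one, and only one, letter.
# Only the fragment needed for this example is included.
MORSE = {
    "...": "S",
    "---": "O",
}

def decipher(signal: str) -> str:
    """Decode a space-separated Morse signal, one unit at a time.

    Each unit is looked up independently -- its identity never
    depends on the units around it (no coarticulation)."""
    return "".join(MORSE[unit] for unit in signal.split())

print(decipher("... --- ..."))  # SOS
```

Contrast this with speech, where no such table can exist: the acoustic realization of each "unit" changes with its neighbors.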
Difference between cipher and code
• In a true code, at least one of the above rules is violated
– Speech violates both
– Each component’s identity is not maintained in all conditions
– There is overlap between components, and this overlap is not the same in all productions of the sound
– Signal units are not discrete, but are stuck together in different ways depending upon the sounds to be produced
– Again, this process is called coarticulation: the temporal and spectral overlap between phonemes produced in speech
Speech Perception
• Difference between cipher and code
– The identity of a phoneme can be significantly different in different contexts
• Consider the CVs /di/ and /du/
• During production of the /d/, the articulators are in drastically different positions
• The output is affected, as the formant structure of the vowel is different in the two cases
• Each CV settles into its vowel’s steady-state position and formant frequencies, but the formant transitions into the two vowels are clearly distinct
• Therefore, although the spectra of the two CVs are very different, the sound we hear as /d/ is the same in both cases
Difference between cipher and code
• The identity of a phoneme can be significantly different in different contexts
– It is impossible to hear the /d/ without some portion of the vowel contained in the sound
– Therefore, the acoustic signal cannot be divided up into discrete time slices, unlike Morse code, which obviously can
– In other words, there is no one-to-one mapping between the acoustic signal and the articulations that produce it
– Similarly, there is no one-to-one mapping between the acoustic signal and the phoneme’s identity, because what we hear is affected by the other phonemes, or components, of the speech signal
Speech Perception - Theories
• Motor Theory of Speech: basis and assumptions
– Observation that place of articulation does not vary as much as the acoustic consequences of coarticulation
– Therefore, there may be an underlying articulatory, or gestural, invariance that is not reflected in acoustic invariance
– Emphasis is put on the ability to recognize the articulatory gestures that produce the speech output
– This theory requires that we process speech in a unique manner, different from the processing we use for other sounds
– Supported by evidence showing that we can understand more phonemes per second than we can distinguish non-phonetic sounds in the same amount of time
Speech Perception – Motor Theory
• Formulated at Haskins Labs (Liberman et al., 1967); suggests that the sounds of speech are encoded, rather than enciphered (paralleling the difference between cipher and code)
– Therefore, the invariant perception may not require an invariant acoustic signal
– Speech sounds cannot effectively be segmented, or separated, as we’ve seen, due to the effects of coarticulation
Speech Perception – Motor Theory
• The theory must account for the encoding process
– One problem is that the motor commands producing speech are not invariant, as originally suggested by motor theorists (consider bite-block experiments and talking while chewing)
– The response: the gestures themselves are not invariant, but the abstract (neural) representation of the speech event IS invariant; moreover, we have a specialized module that has evolved for the processing, or perceiving, of speech
– Problematic, because there’s no way to actually measure or observe this “black box” or specialized module
– MT suggests that, because we have an understanding of the gestures used to produce speech sounds, we are able to decode the variable signal and (somehow) perceive and understand the intended gestures (vocal tract manipulations) that produced the sounds
Speech Perception
• Categorical Perception – the observation that phonemes produced with different spectral characteristics are still identified correctly
– Obviously, the more discriminable pairs of stimuli are, the more easily they’ll be identified correctly
– Additionally, changes in some physical parameter, such as frequency, may not be equally discriminable in all regions
• For example, a 100 Hz change in a formant frequency under some conditions may not change a vowel’s identity
• But, to the same listener, a 100 Hz change in a different frequency region might produce a perceptual distinction
• Similar categorization schemes are revealed for temporal information as well
Speech Perception
• Categorical Perception
– There appear to be perceptual boundaries between phonemes across which the phonemes are very discriminable, but within which phonemes are not discriminable
– That is, some changes in a stimulus may be acoustically discernible, but not phonemically discernible – they are sub-phonemic, acoustic differences (same phoneme, different sound)
– This process is often modeled psychometrically using a graph (called an ogive) that contains two category regions divided by a boundary region
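The ogive-shaped identification function can be modeled with a logistic curve. The sketch below is illustrative only: the boundary location (30 ms VOT) and slope are hypothetical values chosen to mimic the ba/pa continuum discussed next, not measured data.

```python
import math

def p_pa(vot_ms: float, boundary: float = 30.0, slope: float = 0.5) -> float:
    """Probability of a /pa/ response as a logistic (ogive) function of VOT.

    boundary and slope are hypothetical illustrative parameters."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary)))

# Within-category VOTs give near-certain labels; the boundary gives ~50%.
for vot in (10, 30, 50):
    print(f"VOT {vot:2d} ms -> P(/pa/) = {p_pa(vot):.2f}")
```

The steep middle of the curve is the boundary region; the flat tails on either side are the two category regions.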
Speech Perception
• Categorical Perception – the categorization function for VOT (ba/pa continuum)
– Region I
• The stimuli occupying this region will be identified with nearly 100% accuracy as a target stimulus (in this case /ba/)
• Note that a range of VOTs produces the same percept
– Region III
• Similarly, the stimuli in this region will be identified with nearly 100% accuracy as the other target (in this case /pa/)
• Again, a range of VOT values produces the same percept
– In both regions I and III, the discrimination ability of the listener is poor, in that all the stimuli from the same region sound similar (which is why they were categorized correctly in the first place)
Speech Perception
• Categorical Perception – the categorization function for VOT (ba/pa continuum)
– Region II – the boundary region
• The stimuli occupying this region will be more difficult for the listener to categorize
• The difficulty is reflected in the listener’s tendency to perform at nearly a chance rate when trying to categorize
• In other words, about half the time they say /ba/ and half the time they say /pa/; there is substantial uncertainty or ambiguity for the listener, due at least in part to the listener’s lack of experience with such stimuli
– When the listener compares stimuli from either region I or III to region II, the difference is obvious, as is the difference between stimuli from regions I and III
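The three regions can be thought of as thresholds on the identification proportions: near-100% /ba/ responses in Region I, near-chance in Region II, near-100% /pa/ in Region III. A minimal sketch, where the 0.1/0.9 cutoffs are hypothetical illustrative values rather than figures from the lecture:

```python
def region(p_pa: float) -> str:
    """Assign a stimulus to a region of the ba/pa continuum based on the
    proportion of /pa/ responses it elicits.

    The 0.1 and 0.9 cutoffs are hypothetical illustrative values."""
    if p_pa <= 0.1:
        return "I"    # consistently labeled /ba/
    if p_pa >= 0.9:
        return "III"  # consistently labeled /pa/
    return "II"       # boundary region: near-chance, ambiguous

print(region(0.02), region(0.5), region(0.98))  # I II III
```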
[Figure: VOT categorization function showing Regions I, II, and III]
Speech Perception
• Categorical Perception – summary
– Regions corresponding to different portions of a stimulus continuum reveal that listeners generalize more readily to some stimuli than to others
– You should be able to discern the difference between within- and across-category perception
• Within categories, substantial variability is acceptable to listeners; a great deal of spectral/temporal variability is tolerated
• Across categories, a small spectral/temporal difference between two stimuli will be discerned consistently
• It is helpful to view the boundary region as a “discontinuity”
Speech Perception
• Categorical Perception – evident in infants as early as 6 months (also evident in animals)
– The difference in VOT between voiced and unvoiced stop consonants is, on average, about 20-60 ms, depending on the phonetic environment
– If the perceptual consequence of changing VOT is to change the identity of a CV from unvoiced to voiced, experiments can be done to determine what specific manipulations of VOT produce the perceptual change
– Eimas (1970s) and Kuhl (1980s) demonstrated the effect of these manipulations as evidence of perceptual categories
Speech Perception
• Categorical Perception – the Eimas experiment looking at VOT
– The right-most panel depicts the response of the infant to repeated CVs that, over time, do not change
• The continued decline in sucking rate is consistent with the idea that the child is habituating to the stimulus (or getting bored and no longer interested)
• The infant hears /ba/ repeatedly, which evokes a strong response at first (an increase from the baseline sucking rate)
• But after a certain number of presentations, when no perceptible change to the stimulus is presented, the infant loses interest
• As far as the infant is concerned, the same category was presented each time
Speech Perception
• Categorical Perception – the Eimas experiment looking at VOT
– The left-most panel depicts the response of the infant to CVs that have a 20 ms VOT at first, but that change to a 40 ms VOT after several presentations
• As before, the infant hears /ba/ repeatedly, which evokes a strong response at first (an increase from the baseline sucking rate), and the sucking rate slows as the infant habituates
• This time, however, a change in VOT is presented, and the infant, in response, shows renewed interest in the stimulus (in the form of a heightened sucking rate)
• The change in rate is attributed to the acoustic characteristic of the signal (a 20 ms difference in VOT) that produces a perceivable change in the stimulus
• Therefore, to the infant, a new category was presented
Speech Perception
• Acoustic Theories
– Generally, these theories prioritize the auditory sensitivity and filtering that listeners may draw upon for processing speech signals
– Another important aspect of such theories is that they require some aspect of the speech waveform to be invariant
– The “noninvariance problem” must be resolved
• Acoustic noninvariance refers to the observation that variable production of speech nonetheless results in an invariant perception
Speech Perception
• Acoustic Theories – Feature Detection
– The acoustic components of speech contain specific, audible cues that listeners can use to identify specific components of speech signals
• F2 transitions
• VOT and the concurrent rapid changes in the spectral content of CVs as the consonant changes into the vowel
– The acoustic theories presuppose that humans are born with neural receptors that are sensitive to a variety of acoustic features specific to speech
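VOT itself is a simple temporal cue: the interval between the stop burst and the onset of voicing. A toy computation, assuming the two event times have already been located in the waveform (the times below are made-up values):

```python
def vot_ms(burst_time_s: float, voicing_onset_s: float) -> float:
    """Voice onset time in milliseconds: voicing onset minus burst release.

    Longer positive values (voicing lags the burst) are typical of unvoiced
    stops like /p/; short values are typical of voiced stops like /b/."""
    return (voicing_onset_s - burst_time_s) * 1000.0

# Hypothetical event times for a /pa/-like token: ~55 ms VOT
print(vot_ms(0.100, 0.155))
```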
Speech Perception
• Acoustic Theories – Feature Detection
– Template matching is also an important consideration for acoustic theories
• Templates may include spectral or temporal information that is consistent across utterances within a language
• Inconsistencies across speakers may be overcome if the talker produces speech that is, spectrally, close enough to a template the listener has established and encoded neurally
– The templates are created through years of experience with a language or languages
• The greater the familiarity with the language, the greater the amount of variability in production the listener can tolerate
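Template matching of this kind can be sketched as a nearest-template comparison: the incoming spectrum is accepted if it lies "close enough" to one of the listener's stored templates. Everything below (the feature vectors, the vowel labels, the distance tolerance) is hypothetical and purely illustrative.

```python
import math

# Hypothetical stored templates: crude spectral feature vectors
# (e.g., relative energy in a few frequency bands), not real measurements.
TEMPLATES = {
    "/i/": [0.9, 0.1, 0.8],
    "/a/": [0.8, 0.7, 0.2],
    "/u/": [0.9, 0.2, 0.1],
}

def match(spectrum, templates=TEMPLATES, tolerance=0.5):
    """Return the label of the nearest template, or None if the input
    falls outside the listener's tolerance for production variability."""
    label, dist = min(
        ((name, math.dist(spectrum, tpl)) for name, tpl in templates.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= tolerance else None

# A slightly variable production of /a/ still matches the /a/ template:
print(match([0.75, 0.65, 0.25]))  # /a/
```

Raising the tolerance corresponds to the text's point that greater familiarity with a language lets a listener tolerate more production variability.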
Speech Perception
• Acoustic Theories – Feature Detection
– An example of one simplified form of template can be seen in the two-formant synthetic vowels that were used in Liberman’s experiment
• They were originally designed at Haskins Labs in the 1950s (Pattern Playback device)
• The device used a photoreceptor cell that produced sounds corresponding to the bands of darkness on the page
• Two lines, such as those that schematize vowels, would be passed through the Pattern Playback, which would transform the visual image into a sound
• Different formant locations on the paper (corresponding to different frequencies) would then produce different sounds, well controlled and of use experimentally
• An idealized version of the stimulus need not be presented for a listener to identify the stimulus
Speech Perception
• Acoustic Theories – Template Matching
– Stevens and Blumstein (1981): templates for the detection of stop-consonant CVs
• The perception of /b,p/ is most consistently triggered by a spectrum that is diffuse (spread out across frequency) and falling (lower in energy at high frequencies than at low frequencies)
• The perception of /d,t/ is most consistently triggered by a spectrum that is diffuse and rising
• The perception of /g,k/ is most consistently triggered by a spectrum that is compact, with energy centered at a mid-range frequency and falling off at both low and high frequencies
• The CV is sampled as the stop is released and the formants are beginning to emerge
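The diffuse-falling / diffuse-rising / compact distinction can be sketched as a rule-based classifier over two gross spectral-shape measures. The features and thresholds below are hypothetical simplifications for illustration, not the actual Stevens and Blumstein (1981) templates.

```python
def classify_onset(tilt: float, compactness: float) -> str:
    """Classify a stop-release spectrum from two gross shape measures.

    tilt:        spectral slope; negative = falling, positive = rising
    compactness: fraction of energy near a single mid-frequency peak
                 (0 = fully diffuse, 1 = fully compact)

    The 0.6 compactness cutoff is a hypothetical illustrative value."""
    if compactness > 0.6:
        return "/g,k/"                           # compact, mid-frequency peak
    return "/d,t/" if tilt > 0 else "/b,p/"      # diffuse rising vs. falling

print(classify_onset(tilt=-0.3, compactness=0.2))  # /b,p/
print(classify_onset(tilt=0.4, compactness=0.3))   # /d,t/
print(classify_onset(tilt=0.0, compactness=0.8))   # /g,k/
```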
Speech Perception
• Acoustic Theories – Template Matching
– Stevens and Blumstein (1981): templates for the detection of stop-consonant CVs
• The investigators viewed spectrograms and predicted the CV (specifically the stop) that was presented
• Their predictions were accurate about 80% of the time, which is still one of the better performance records for the identification of stops in a context lacking other phonetic/syllabic/lexical information
• The suggestion is that listeners could improve performance to nearly 100% if they had access to more phonetic or suprasegmental information (words, sentences)
Speech Perception
• Feature Detection - Nasalization
– Produces one type of resonant consonant (the others are the glides)
• The nasal cavity acts as a third resonator, along with the oral and pharyngeal cavities
• Used for three consonants: /n/, /m/, and /ng/
• Most speech in English utilizes oral sounds, but these exceptions are considered nasal sounds
• Nasal sounds are produced when the velopharyngeal port is opened
• The levator palatini is the primary muscle that produces closure of the port
Speech Perception
• Feature Detection - Nasalization
– The musculature involved includes other muscles as well as the levator palatini
• These muscles are small and weak, and their involvement in closing and opening the port is not clear at this time
• The palatoglossus (called the glossopalatine by some authors) muscle couples the velum to the tongue
• In either case, the muscles are innervated by CNs X and XI, and on the sensory side by CN IX
• Additional mass and thickness is provided to the system by the uvula (or uvular muscle)
Speech Perception
• Feature Detection - Nasalization
– The tightness of velopharyngeal closure depends upon the sound(s) produced (or the phonetic context)
• It is lowest for the nasals
• Intermediate for the low vowels /a/, /ae/
• Somewhat higher for the high vowels /i/, /u/
• The tightest seal occurs for oral consonants, particularly the stops – these sounds are distinguished from the nasals NOT by the visible position of the articulators, but by VP port function that alters the airflow through the upper respiratory tract
Speech Perception
• Feature Detection - Nasalization
– Velar position also alters the volume of air, and therefore the air pressure, in the cavities below it
• When the port is closed, supraglottal air pressure can be altered by contracting the pharyngeal constrictors
• When the port is open, the supraglottal space becomes much larger
• Because stops require the cessation of airflow through the vocal tract, the velum may be manipulated to produce rapid changes in pressure, airflow, etc.
• Velar contraction closes the VP port, reducing nasalance
• Velar relaxation opens the port, increasing nasalance
Speech Perception
• Feature Detection - Nasalization
– To force air through the nasal cavity during consonant production, it is necessary to close off the oral cavity (thereby forcing the airflow through the nasal cavity)
• The orbicularis oris is used to close off the oral cavity
• Use of the nasal cavities lengthens the vocal tract, which lowers the resonance frequencies of the entire system
• The use of the nasal cavities also creates substantial turbulence lower in the vocal tract (particularly in and around the oral cavity)
• The turbulence reduces the efficiency with which subglottal air pressure excites air in the vocal tract, thereby reducing the intensity of vocal output
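The claim that lengthening the tract lowers its resonances follows from the standard quarter-wave (closed-open tube) approximation of the vocal tract, f_n = (2n - 1)c / (4L). The sketch below uses the common textbook values c ≈ 350 m/s and a 17.5 cm tract; the extra 4 cm of effective length for the nasal path is a hypothetical illustrative figure, not a measurement.

```python
def quarter_wave_resonance(length_m: float, n: int = 1, c: float = 350.0) -> float:
    """n-th resonance (Hz) of a tube closed at one end (glottis) and open
    at the other (lips/nares): f_n = (2n - 1) * c / (4 * L)."""
    return (2 * n - 1) * c / (4.0 * length_m)

oral = quarter_wave_resonance(0.175)   # ~500 Hz for a 17.5 cm tract
nasal = quarter_wave_resonance(0.215)  # hypothetical longer nasal path
print(f"First resonance, oral tract:  {oral:.0f} Hz")
print(f"First resonance, longer path: {nasal:.0f} Hz (lower, as the text predicts)")
```

Any increase in L lowers every f_n, which is the acoustic basis for the lowered resonances of nasalized speech.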
Speech Perception
• Nasalization – summary
– The nasal cavity is opened to the airflow through the lowering of the velum
– The velum is raised through contraction of the levator palatini muscle, but its mass must be great enough to tightly seal the port
• This may not be the case for a person with a cleft palate
– The vocal tract’s resonance frequencies are lowered due to the lengthening of the tract during use of the nasal cavity, so the voice sounds lower in pitch
– Vocal intensity is also reduced due to turbulence in the tract that produces, at specific locations, antiresonances (regions of high damping)
Speech Perception: Summary
• Theories and concepts
– Motor Theory
– Categorical Perception
– Template Matching
– Feature Detection / Acoustic Theories
• Priority: the noninvariance problem must be addressed
• Consider the ways in which production and perception are linked (VT manipulations and their acoustic consequences)