An Introduction to Speech Perception CDIS 4017/5017 Speech & Hearing Science I November 18, 2009.
An Introduction to Speech Perception
CDIS 4017/5017
Speech & Hearing Science I
November 18, 2009
Web resources used in talk
www.sil.org/computing/sa/index.htm
www.bl.uk/nsa
www.ntid.rit.edu/speechlang/slpros/instruction/segmentall/focusing.php
cslu.cse.ogi.edu/tutordemos/SpectrogramReading/ipa/ipahome.html
http://library.thinkquest.org/19537/
www.indiana.edu/~acoustic/spsites.html
www.haskins.yale.edu/facilities/
Speech Perception
• Is speech special?
– Do we perceive it using different mechanisms or strategies than we’d use on other sounds?
– For those in the speech perception literature, this is a significant issue:
• Perceptual tasks reveal that human discrimination ability differs when listening to the components of speech sounds in isolation rather than within a speech context
• This is important because it indicates that two different processing strategies may be used on the same (or similar) stimuli
Speech Perception
• Is speech special?
– Speech stimuli activate the auditory pathways in a unique manner
– Many other sounds, however, evoke unique responses, so a cognitive component is considered along with the peripheral mechanisms
• Required of the receiver if there is to be an appreciation for the symbolic nature of the speech signal
• Each speech token can convey significant meaning, but context, situational cues, and suprasegmental information are also present
Speech Perception
• Is speech special?
– Consider the “click” sound audible as part of any brief waveform
• In isolation, we perceive the sound as a click, similar to the sound a light switch makes when flipped
• Another example is the glide associated with a formant transition
– Those particular sounds are not heard as clicks or glides in a phonetic context
• Yet they obviously influence (and are influenced by) the neighboring vowel(s)
• This overlap, or running together, of sounds that affects the identity of the phonemes is termed coarticulation
Speech Perception
• Communication (written/verbal/gestural) is an exchange of information, a transaction
– The amount of information transmitted relies upon the system used by the sender of the message
– The two forms of communication systems that should be considered are ciphers and codes
– A cipher is the simpler system of communication, one in which each symbol corresponds to one, and only one, item in the alphabet or lexicon
• Ciphers typically cannot transmit information quickly
• For example, Morse code can be processed by an expert individual at a rate of 1-2 words/sec.
Difference between cipher and code
• Ciphers require that every word be spelled out, one letter at a time, in a very precise manner
– Examples of ciphers include finger spelling, semaphore, and Morse code transmitted visually, acoustically, or electronically
• There is no overlap between adjacent components of a signal
– That is, *** --- *** contains three different ciphered letters that are combined to form a word or, in this case, an abbreviation
– Each component is placed in a specific place and is transmitted independently of the other components
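The one-to-one, non-overlapping property of a cipher can be sketched in code. Below is a minimal Python illustration (the two-symbol table is just a fragment of the full Morse alphabet): each signal unit maps to exactly one letter, and decoding is an independent lookup, with no influence from neighboring units.

```python
# A cipher: each signal unit maps to one, and only one, letter.
# Only the fragment needed for this example is included.
MORSE = {
    "...": "S",
    "---": "O",
}

def decipher(signal: str) -> str:
    """Decode a space-separated Morse signal, one unit at a time.

    Each unit is looked up independently -- its identity never
    depends on the units around it (no coarticulation)."""
    return "".join(MORSE[unit] for unit in signal.split())

print(decipher("... --- ..."))  # SOS
```

Contrast this with speech, where no such table can exist: the acoustic realization of each "unit" changes with its neighbors.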
Difference between cipher and code
• In a true code, at least one of the above rules is violated
– Speech violates both
– Each component’s identity is not maintained in all conditions
– There is overlap between components, and this overlap is not the same in all productions of the sound
– Signal units are not discrete, but are stuck together in different ways depending upon the sounds to be produced
– Again, this process is called coarticulation: the temporal and spectral overlap between phonemes produced in speech
Speech Perception
• Difference between cipher and code
– The identity of a phoneme can be significantly different in different contexts
• Consider the CVs /di/ and /du/
• During production of the /d/, the articulators are in drastically different positions
• The output is affected, as the formant structure of the vowel is different in the two cases
• Each CV settles into its vowel’s steady-state position and formant frequencies, but the formant transitions into the two vowels are clearly distinct
• Therefore, although the spectra of the two CVs are very different, the sound we hear as /d/ is the same in both cases
Difference between cipher and code
• The identity of a phoneme can be significantly different in different contexts
– It is impossible to hear the /d/ without some portion of the vowel contained in the sound
– Therefore, the acoustic signal cannot be divided up into discrete time slices, unlike Morse code, which obviously can
– In other words, there is no one-to-one mapping between the acoustic signal and the articulations that produce it
– Similarly, there is no one-to-one mapping between the acoustic signal and the phoneme’s identity, because what we hear is affected by the other phonemes, or components, of the speech signal
Speech Perception - Theories
• Motor Theory of Speech: basis and assumptions
– Observation that place of articulation does not vary as much as the acoustic consequences of coarticulation
– Therefore, there may be an underlying articulatory, or gestural, invariance that is not reflected in acoustic invariance
– Emphasis is put on the ability to recognize the articulatory gestures that produce the speech output
– This theory requires that we process speech in a unique manner, different from the processing we use for other sounds
– Supported by evidence showing that we can understand more phonemes per second than we can distinguish non-phonetic sounds in the same amount of time
Speech Perception – Motor Theory
• Formulated at Haskins Labs (Liberman et al., 1967); suggests that the sounds of speech are encoded, rather than enciphered (paralleling the difference between cipher and code)
– Therefore, the invariant perception may not require an invariant acoustic signal
– Speech sounds cannot effectively be segmented, or separated, as we’ve seen, due to the effects of coarticulation
Speech Perception – Motor Theory
• The theory must account for the encoding process
– One problem is that the motor commands producing speech are not invariant, as originally suggested by motor theorists (consider bite-block experiments and talking while chewing)
– The response: the gestures themselves are not invariant, but the abstract (neural) representation of the speech event IS invariant; moreover, we have a specialized module that has evolved for the processing, or perceiving, of speech
– Problematic, because there’s no way to actually measure or observe this “black box” or specialized module
– MT suggests that, because we have an understanding of the gestures used to produce speech sounds, we are able to decode the variable signal and (somehow) perceive and understand the intended gestures (vocal tract manipulations) that produced the sounds
Speech Perception
• Categorical Perception – the observation that phonemes produced with different spectral characteristics are still identified correctly
– Obviously, the more discriminable pairs of stimuli are, the more easily they’ll be identified correctly
– Additionally, changes in some physical parameter, such as frequency, may not be equally discriminable in all regions
• For example, a 100 Hz change in a formant frequency under some conditions may not change a vowel’s identity
• But, to the same listener, a 100 Hz change in a different frequency region might produce a perceptual distinction
• Similar categorization schemes are revealed for temporal information as well
Speech Perception
• Categorical Perception
– There appear to be perceptual boundaries between phonemes across which the phonemes are very discriminable, but within which phonemes are not discriminable
– That is, some changes in a stimulus may be acoustically discernible, but not phonemically discernible – they are sub-phonemic, acoustic differences (same phoneme, different sound)
– This process is often modeled psychometrically using a graph (called an ogive) that contains two category regions divided by a boundary region
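The ogive-shaped identification function can be modeled with a logistic curve. The sketch below is illustrative only: the boundary location (30 ms VOT) and slope are hypothetical values chosen to mimic the ba/pa continuum discussed next, not measured data.

```python
import math

def p_pa(vot_ms: float, boundary: float = 30.0, slope: float = 0.5) -> float:
    """Probability of a /pa/ response as a logistic (ogive) function of VOT.

    boundary and slope are hypothetical illustrative parameters."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary)))

# Within-category VOTs give near-certain labels; the boundary gives ~50%.
for vot in (10, 30, 50):
    print(f"VOT {vot:2d} ms -> P(/pa/) = {p_pa(vot):.2f}")
```

The steep middle of the curve is the boundary region; the flat tails on either side are the two category regions.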
Speech Perception
• Categorical Perception – the categorization function for VOT (ba/pa continuum)
– Region I
• The stimuli occupying this region will be identified with nearly 100% accuracy as a target stimulus (in this case /ba/)
• Note that a range of VOTs produces the same percept
– Region III
• Similarly, the stimuli in this region will be identified with nearly 100% accuracy as the other target (in this case /pa/)
• Again, a range of VOT values produces the same percept
– In both regions I and III, the discrimination ability of the listener is poor, in that all the stimuli from the same region sound similar (which is why they were categorized correctly in the first place)
Speech Perception
• Categorical Perception – the categorization function for VOT (ba/pa continuum)
– Region II – the boundary region
• The stimuli occupying this region will be more difficult for the listener to categorize
• The difficulty is reflected in the listener’s tendency to perform at nearly a chance rate when trying to categorize
• In other words, about half the time they say /ba/ and half the time they say /pa/; there is substantial uncertainty or ambiguity for the listener, due at least in part to the listener’s lack of experience with such stimuli
– When the listener compares stimuli from either region I or III to region II, the difference is obvious, as is the difference between stimuli from regions I and III
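The three regions can be thought of as thresholds on the identification proportions: near-100% /ba/ responses in Region I, near-chance in Region II, near-100% /pa/ in Region III. A minimal sketch, where the 0.1/0.9 cutoffs are hypothetical illustrative values rather than figures from the lecture:

```python
def region(p_pa: float) -> str:
    """Assign a stimulus to a region of the ba/pa continuum based on the
    proportion of /pa/ responses it elicits.

    The 0.1 and 0.9 cutoffs are hypothetical illustrative values."""
    if p_pa <= 0.1:
        return "I"    # consistently labeled /ba/
    if p_pa >= 0.9:
        return "III"  # consistently labeled /pa/
    return "II"       # boundary region: near-chance, ambiguous

print(region(0.02), region(0.5), region(0.98))  # I II III
```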
[Figure: VOT categorization function showing Regions I, II, and III]
Speech Perception
• Categorical Perception – summary
– Regions corresponding to different portions of a stimulus continuum reveal that listeners generalize more readily to some stimuli than to others
– You should be able to discern the difference between within- and across-category perception
• Within categories, substantial variability is acceptable to listeners; a great deal of spectral/temporal variability is tolerated
• Across categories, a small spectral/temporal difference between two stimuli will be discerned consistently
• It is helpful to view the boundary region as a “discontinuity”
Speech Perception
• Categorical Perception – evident in infants as early as 6 months (also evident in animals)
– The difference in VOT between voiced and unvoiced stop consonants is, on average, about 20-60 ms, depending on the phonetic environment
– If the perceptual consequence of changing VOT is to change the identity of a CV from unvoiced to voiced, experiments can be done to determine what specific manipulations of VOT produce the perceptual change
– Eimas (1970s) and Kuhl (1980s) demonstrated the effect of these manipulations as evidence of perceptual categories
Speech Perception
• Categorical Perception – the Eimas experiment looking at VOT
– The right-most panel depicts the response of the infant to repeated CVs that, over time, do not change
• The continued decline in sucking rate is consistent with the idea that the child is habituating to the stimulus (or getting bored and no longer interested)
• The infant hears /ba/ repeatedly, which evokes a strong response at first (an increase from the baseline sucking rate)
• But after a certain number of presentations, when no perceptible change to the stimulus is presented, the infant loses interest
• As far as the infant is concerned, the same category was presented each time
Speech Perception
• Categorical Perception – the Eimas experiment looking at VOT
– The left-most panel depicts the response of the infant to CVs that have a 20 ms VOT at first, but that change to a 40 ms VOT after several presentations
• As before, the infant hears /ba/ repeatedly, which evokes a strong response at first (an increase from the baseline sucking rate), and the sucking rate slows as the infant habituates
• This time, however, a change in VOT is presented, and the infant, in response, shows renewed interest in the stimulus (in the form of a heightened sucking rate)
• The change in rate is attributed to the acoustic characteristic of the signal (a 20 ms difference in VOT) that produces a perceivable change in the stimulus
• Therefore, to the infant, a new category was presented
Speech Perception
• Acoustic Theories
– Generally, these theories prioritize the auditory sensitivity and filtering that listeners may draw upon for processing speech signals
– Another important aspect of such theories is that they require some aspect of the speech waveform to be invariant
– The “noninvariance problem” must be resolved
• Acoustic noninvariance refers to the observation that variable production of speech nonetheless results in an invariant perception
Speech Perception
• Acoustic Theories – Feature Detection
– The acoustic components of speech contain specific, audible cues that listeners can use to identify specific components of speech signals
• F2 transitions
• VOT and the concurrent rapid changes in the spectral content of CVs as the consonant changes into the vowel
– The acoustic theories presuppose that humans are born with neural receptors that are sensitive to a variety of acoustic features specific to speech
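VOT itself is a simple temporal cue: the interval between the stop burst and the onset of voicing. A toy computation, assuming the two event times have already been located in the waveform (the times below are made-up values):

```python
def vot_ms(burst_time_s: float, voicing_onset_s: float) -> float:
    """Voice onset time in milliseconds: voicing onset minus burst release.

    Longer positive values (voicing lags the burst) are typical of unvoiced
    stops like /p/; short values are typical of voiced stops like /b/."""
    return (voicing_onset_s - burst_time_s) * 1000.0

# Hypothetical event times for a /pa/-like token: ~55 ms VOT
print(vot_ms(0.100, 0.155))
```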
Speech Perception
• Acoustic Theories – Feature Detection
– Template matching is also an important consideration for acoustic theories
• Templates may include spectral or temporal information that is consistent across utterances within a language
• Inconsistencies across speakers may be overcome if the talker produces speech that is, spectrally, close enough to a template the listener has established and encoded neurally
– The templates are created through years of experience with a language or languages
• The greater the familiarity with the language, the greater the amount of variability in production the listener can tolerate
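Template matching of this kind can be sketched as a nearest-template comparison: the incoming spectrum is accepted if it lies "close enough" to one of the listener's stored templates. Everything below (the feature vectors, the vowel labels, the distance tolerance) is hypothetical and purely illustrative.

```python
import math

# Hypothetical stored templates: crude spectral feature vectors
# (e.g., relative energy in a few frequency bands), not real measurements.
TEMPLATES = {
    "/i/": [0.9, 0.1, 0.8],
    "/a/": [0.8, 0.7, 0.2],
    "/u/": [0.9, 0.2, 0.1],
}

def match(spectrum, templates=TEMPLATES, tolerance=0.5):
    """Return the label of the nearest template, or None if the input
    falls outside the listener's tolerance for production variability."""
    label, dist = min(
        ((name, math.dist(spectrum, tpl)) for name, tpl in templates.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= tolerance else None

# A slightly variable production of /a/ still matches the /a/ template:
print(match([0.75, 0.65, 0.25]))  # /a/
```

Raising the tolerance corresponds to the text's point that greater familiarity with a language lets a listener tolerate more production variability.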
Speech Perception
• Acoustic Theories – Feature Detection
– An example of one simplified form of template can be seen in the two-formant synthetic vowels that were used in Liberman’s experiment
• They were originally designed at Haskins Labs in the 1950s (Pattern Playback device)
• The device used a photoreceptor cell that produced sounds corresponding to the bands of darkness on the page
• Two lines, such as those that schematize vowels, would be passed through the Pattern Playback, which would transform the visual image into a sound
• Different formant locations on the paper (corresponding to different frequencies) would then produce different sounds, well controlled and of use experimentally
• An idealized version of the stimulus need not be presented for a listener to identify the stimulus
Speech Perception
• Acoustic Theories – Template Matching
– Stevens and Blumstein (1981): templates for the detection of stop-consonant CVs
• The perception of /b,p/ is most consistently triggered by a spectrum that is diffuse (spread out across frequency) and falling (lower in energy at high frequencies than at low frequencies)
• The perception of /d,t/ is most consistently triggered by a spectrum that is diffuse and rising
• The perception of /g,k/ is most consistently triggered by a spectrum that is compact, with energy centered at a mid-range frequency and falling off at both low and high frequencies
• The CV is sampled as the stop is released and the formants are beginning to emerge
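The diffuse-falling / diffuse-rising / compact distinction can be sketched as a rule-based classifier over two gross spectral-shape measures. The features and thresholds below are hypothetical simplifications for illustration, not the actual Stevens and Blumstein (1981) templates.

```python
def classify_onset(tilt: float, compactness: float) -> str:
    """Classify a stop-release spectrum from two gross shape measures.

    tilt:        spectral slope; negative = falling, positive = rising
    compactness: fraction of energy near a single mid-frequency peak
                 (0 = fully diffuse, 1 = fully compact)

    The 0.6 compactness cutoff is a hypothetical illustrative value."""
    if compactness > 0.6:
        return "/g,k/"                           # compact, mid-frequency peak
    return "/d,t/" if tilt > 0 else "/b,p/"      # diffuse rising vs. falling

print(classify_onset(tilt=-0.3, compactness=0.2))  # /b,p/
print(classify_onset(tilt=0.4, compactness=0.3))   # /d,t/
print(classify_onset(tilt=0.0, compactness=0.8))   # /g,k/
```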
Speech Perception
• Acoustic Theories – Template Matching
– Stevens and Blumstein (1981): templates for the detection of stop-consonant CVs
• The investigators viewed spectrograms and predicted the CV (specifically the stop) that was presented
• Their predictions were accurate about 80% of the time, which is still one of the better performance records for the identification of stops in a context lacking other phonetic/syllabic/lexical information
• The suggestion is that listeners could improve performance to nearly 100% if they had access to more phonetic or suprasegmental information (words, sentences)
Speech Perception
• Feature Detection - Nasalization
– Produces one type of resonant consonant (the others are the glides)
• The nasal cavity acts as a third resonator, along with the oral and pharyngeal cavities
• Used for three consonants: /n/, /m/, and /ng/
• Most speech in English utilizes oral sounds, but these exceptions are considered nasal sounds
• Nasal sounds are produced when the velopharyngeal port is opened
• The levator palatini is the primary muscle that produces closure of the port
Speech Perception
• Feature Detection - Nasalization
– The musculature involved includes other muscles as well as the levator palatini
• These muscles are small and weak, and their involvement in closing and opening the port is not clear at this time
• The palatoglossus (called the glossopalatine by some authors) muscle couples the velum to the tongue
• In either case, the muscles are innervated by CNs X and XI, and on the sensory side by CN IX
• Additional mass and thickness is provided to the system by the uvula (or uvular muscle)
Speech Perception
• Feature Detection - Nasalization
– The tightness of velopharyngeal closure depends upon the sound(s) produced (or the phonetic context)
• It is lowest for the nasals
• Intermediate for the low vowels /a/, /ae/
• Somewhat higher for the high vowels /i/, /u/
• The tightest seal occurs for oral consonants, particularly the stops – these sounds are distinguished from the nasals NOT by the visible position of the articulators, but by VP port function that alters the airflow through the upper respiratory tract
Speech Perception
• Feature Detection - Nasalization
– Velar position also alters the volume of air, and therefore the air pressure, in the cavities below it
• When the port is closed, supraglottal air pressure can be altered by contracting the pharyngeal constrictors
• When the port is open, the supraglottal space becomes much larger
• Because stops require the cessation of airflow through the vocal tract, the velum may be manipulated to produce rapid changes in pressure, airflow, etc.
• Velar contraction closes the VP port, reducing nasalance
• Velar relaxation opens the port, increasing nasalance
Speech Perception
• Feature Detection - Nasalization
– To force air through the nasal cavity during consonant production, it is necessary to close off the oral cavity (thereby forcing the airflow through the nasal cavity)
• The orbicularis oris is used to close off the oral cavity
• Use of the nasal cavities lengthens the vocal tract, which lowers the resonance frequencies of the entire system
• The use of the nasal cavities also creates substantial turbulence lower in the vocal tract (particularly in and around the oral cavity)
• The turbulence reduces the efficiency with which subglottal air pressure excites air in the vocal tract, thereby reducing the intensity of vocal output
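The claim that lengthening the tract lowers its resonances follows from the standard quarter-wave (closed-open tube) approximation of the vocal tract, f_n = (2n - 1)c / (4L). The sketch below uses the common textbook values c ≈ 350 m/s and a 17.5 cm tract; the extra 4 cm of effective length for the nasal path is a hypothetical illustrative figure, not a measurement.

```python
def quarter_wave_resonance(length_m: float, n: int = 1, c: float = 350.0) -> float:
    """n-th resonance (Hz) of a tube closed at one end (glottis) and open
    at the other (lips/nares): f_n = (2n - 1) * c / (4 * L)."""
    return (2 * n - 1) * c / (4.0 * length_m)

oral = quarter_wave_resonance(0.175)   # ~500 Hz for a 17.5 cm tract
nasal = quarter_wave_resonance(0.215)  # hypothetical longer nasal path
print(f"First resonance, oral tract:  {oral:.0f} Hz")
print(f"First resonance, longer path: {nasal:.0f} Hz (lower, as the text predicts)")
```

Any increase in L lowers every f_n, which is the acoustic basis for the lowered resonances of nasalized speech.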
Speech Perception
• Nasalization – summary
– The nasal cavity is opened to the airflow through the lowering of the velum
– The velum is raised through contraction of the levator palatini muscle, but its mass must be great enough to tightly seal the port
• This may not be the case for a person with a cleft palate
– The vocal tract’s resonance frequencies are lowered due to the lengthening of the tract during use of the nasal cavity, so the voice sounds lower in pitch
– Vocal intensity is also reduced due to turbulence in the tract that produces, at specific locations, antiresonances (regions of high damping)
Speech Perception: Summary
• Theories and concepts
– Motor Theory
– Categorical Perception
– Template Matching
– Feature Detection / Acoustic Theories
• Priority: the noninvariance problem must be addressed
• Consider the ways in which production and perception are linked (VT manipulations and their acoustic consequences)