

Perception & Psychophysics, 1974, Vol. 15, No. 1, 101-107

The phantom in the phoneme: Invariant cues for stop consonants*

RONALD A. COLE and BRIAN SCOTT

University of Waterloo, Waterloo, Ontario, Canada

The stop consonants /b, d, g, p, t, k/ were recorded before /i/, /a/, /u/. The energy spectrum for each stop consonant was removed from its original vowel and spliced onto a different steady-state vowel. Results of a recognition test revealed that consonants were accurately recognized in all cases except when /k/ or /g/ was spliced from /i/ to /u/. Further demonstrations suggested that /k/ and /g/ do have invariant characteristics before /i/, /a/, and /u/. These results support the general notion that stop consonants may be recognized before different vowels in normal speech in terms of invariant acoustic features.

*This research was supported by a grant from the National Research Council of Canada awarded to the first author. We would like to express our appreciation to the University of Waterloo computing center for the use of their facilities. We thank Hillary Stauntan for preparing the experimental tapes and running Ss.

The simplest view of speech perception states that each phoneme in a language is accompanied by a unique combination of acoustic features; whenever a particular phoneme occurs in speech, these features also occur. According to this view, speech perception need only involve, at the simplest level of analysis, the identification of particular configurations of acoustic features.

In fact, a number of consonant phonemes in English are accompanied by invariant acoustic features. The noise which accompanies /s/, /z/, /sh/, /ch/, /j/, and /zh/ provides sufficient information for listeners to discriminate these phonemes in any syllable environment (Harris, 1958; Heinz & Stevens, 1961). Moreover, it is possible to demonstrate the featural composition of these consonants. For example, /s/-/z/ differ mainly in terms of voicing; /z/ is accompanied by low-frequency vowel resonance, while /s/ is not. When /z/ is passed through a filter which eliminates energy below 800 Hz, listeners hear /s/. The consonants /sh/, /ch/, /j/, and /d/ differ mainly in terms of their duration. This is demonstrated by splicing successive 10-msec segments from the onset of /sha/, in which case listeners eventually hear /cha/, then /ja/, and then /da/ (Scott, 1971).

Other sounds, such as the consonants /f/-/θ/, /v/-/ð/, and /m/-/n/, require the identification of both invariant and variant cues for their perception (Harris, 1958; Malecot, 1956). As shown in Fig. 1, the two phonemes in each pair share invariant cues that allow listeners to discriminate these consonants from all others in English. For example, /f/ and /θ/ contain low-intensity high-frequency energy, /v/ and /ð/ are accompanied by this same energy plus a low-frequency voicing component, while /m/ and /n/ are accompanied by nasal resonance and a complete lack of high-frequency energy. Thus, the two consonant phonemes in each row in Fig. 1 share invariant cues. In order for listeners to discriminate between the two members of each pair, information is needed from the vowel portion of the syllable. The first member of each pair is articulated toward the front of the mouth and is accompanied by a rising frequency transition to the following vowel, while the second consonant in each pair is articulated at the center of the mouth and is accompanied by a falling frequency transition. Thus, perception of these consonants requires simultaneous identification of invariant and transitional cues.

[Figure 1: speech spectrograms; panel labels include fa, va, ma.]

Fig. 1. Speech spectrograms of pairs of sounds illustrating the complex relationship between invariant and transitional cues. The two syllables in each row share invariant cues that distinguish the two consonant phonemes from all others in English. The two syllables in each row differ by place of articulation, which is cued by the slope of the second (upper) formant transition (F2). In syllables on the left of the figure, F2 is rising; in syllables on the right, F2 is falling.
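The two-cue scheme described for the fricative and nasal pairs can be sketched as a small lookup: an invariant cue selects the pair, and the direction of the F2 transition selects the member. This is only an illustration under stated assumptions. The cue labels, the function name `identify`, and the ASCII stand-ins "th" and "dh" (for the voiceless and voiced dental fricatives) are hypothetical and do not come from the paper.

```python
# Hypothetical mapping: invariant cue -> (front member, central member).
# "th" and "dh" are ASCII stand-ins for the dental fricatives.
INVARIANT_PAIRS = {
    "weak high-frequency noise": ("f", "th"),
    "weak high-frequency noise + voicing": ("v", "dh"),
    "nasal resonance": ("m", "n"),
}


def identify(invariant_cue, f2_start_hz, f2_end_hz):
    """Pick a consonant from an invariant cue plus the F2 transition direction.

    A rising F2 selects the member articulated toward the front of the mouth;
    a falling F2 selects the member articulated at the center of the mouth.
    """
    front, central = INVARIANT_PAIRS[invariant_cue]
    if f2_end_hz == f2_start_hz:
        # A flat transition carries no place information in this toy model.
        raise ValueError("F2 transition is flat; place of articulation is ambiguous")
    return front if f2_end_hz > f2_start_hz else central


print(identify("nasal resonance", 1000, 1200))            # rising F2
print(identify("weak high-frequency noise", 1800, 1500))  # falling F2
```

Neither cue alone is sufficient in this sketch: the invariant cue narrows the choice to a pair, and the transition resolves the pair.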

The view of distinctive features as patterns of invariant acoustic energy appears to break down when one considers the stop consonants. These consonants (/b, d, g, p, t, k/) carry the highest information load of any class of phonemes in English. The difficulty is that the cues which signal the stop consonants seem to change, depending upon the vowel context in which the consonant occurs. For this reason, various investigators (e.g., Stevens & Halle, 1964, p. 90) argue that phonemes may be adequately described as sets of distinctive features, but these features cannot be identified in the speech wave.

In normal speech, a stop consonant is preceded by silence and is accompanied by an explosive burst of energy when the consonant is released, followed by a period of aspiration which is longer for the unvoiced stops /p, t, k/ than for the voiced stops /b, d, g/. When producing a voiced stop, the speaker may initiate voicing while his oral cavity is still gliding from the position assumed for the consonant to the position assumed for the vowel. When this gliding movement is accompanied by vocalization, the result is a rapid frequency change at the beginning of the vowel. These "vowel transitions" have been shown to be sufficient cues for the perception of the voiced stops, and these transitions differ for a particular consonant in different vowel environments (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). For the unvoiced stops, the consonant is followed by steady-state vowel formants, although frequency transitions may be observed in the aspirated portion of the sound, especially for /k/.

Several investigators have examined the role of the stop burst in the perception of unvoiced stops. Halle, Hughes, and Radley (1957) gated out the bursts for /p/, /t/, /k/ occurring at the end of a word and randomly presented these isolated bursts to listeners for recognition. This task was found to be very difficult for Ss because of the extremely short burst duration of unvoiced stops in final position. However, with practice, the five best Ss correctly recognized 65%, 70%, 75%, 80%, and 96% of the bursts.

Liberman, Delattre, and Cooper (1952) used a pattern playback machine to present artificial bursts at specific frequencies before synthetically produced vowels. The synthetically produced "teardrop shaped" bursts were all 15 msec in duration and were presented at 12 different frequencies at 360-Hz intervals before the vowels /i, e, ɛ, a, ɔ, o, u/. Listeners were told to report whether they heard /p/, /t/, or /k/ before each vowel. The results showed that burst frequencies above 3,240 Hz were heard as /t/ before all vowels. A burst presented at 360 Hz was always heard as /p/. In addition, bursts between 1,080 and 2,160 Hz were heard as /p/ before /i, e, o, u/ but not before /ɛ, a, ɔ/. Bursts at 720 and 1,080 Hz were heard as /k/ in front of all vowels except /i/ (no single frequency was found to produce /k/ before /i/). In addition, bursts presented at the frequency of the second formant of the following vowel were heard as /k/. In their discussion of the results, Liberman et al. chose to emphasize the variability of the data for /p/ and /k/, and concluded that "the results of this experiment show clearly the influence of the following vowel on the perception of the schematic stops /p/ and /k/ [p. 512]." However, it seems to us that the most significant result of the Liberman et al. study is that there does exist a single burst frequency for each of the unvoiced stops which serves as a cue for that consonant (the single exception being /k/ before /i/, for which no burst frequency was identified).
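The stimulus grid in the Liberman et al. experiment follows from simple arithmetic, sketched below. The sketch assumes that the lowest of the 12 burst frequencies is 360 Hz (the text mentions a 360-Hz burst and 360-Hz spacing but does not state the starting point explicitly). The vowel labels are ASCII stand-ins chosen for illustration, and `vowel_independent_percept` is a hypothetical helper that encodes only the two vowel-independent outcomes reported above.

```python
BURST_STEP_HZ = 360   # spacing between adjacent burst frequencies
N_BURSTS = 12         # number of burst frequencies
# ASCII stand-ins for the seven vowels; "E" and "O" denote the open-mid vowels.
VOWELS = ["i", "e", "E", "a", "O", "o", "u"]

# Assumption: the series starts at 360 Hz, giving 360, 720, ..., 4320 Hz.
burst_frequencies = [BURST_STEP_HZ * k for k in range(1, N_BURSTS + 1)]

# Every burst frequency paired with every vowel.
stimuli = [(freq, vowel) for freq in burst_frequencies for vowel in VOWELS]


def vowel_independent_percept(freq_hz):
    """Return the consonant reported before all vowels, or None if it varied."""
    if freq_hz > 3240:
        return "t"
    if freq_hz == 360:
        return "p"
    return None  # percept depended on the following vowel


always_t = [f for f in burst_frequencies if vowel_independent_percept(f) == "t"]
print(burst_frequencies)
print(len(stimuli), always_t)
```

Under these assumptions, three of the twelve bursts fall in the region heard as /t/ before every vowel.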

The role of vowel context in the perception of the unvoiced stop /k/ was investigated by Schatz (1954). Schatz removed the initial 20-msec explosive burst from /ki/, /ka/, /ku/ and spliced the burst from each of these syllables onto /hi/, /ha/, and /hu/. Since /k/ in initial position in a syllable consists of an explosive burst of 5-10 msec followed by aspiration for a slightly longer period (30-60 msec) prior to the vowel, the syllables /hi/, /ha/, /hu/ were used to create the impression of aspiration. For each of these syllables, the /h/ was shortened to 60 msec. When the burst from /ki/ was spliced before /hi/, /ha/, /hu/, Ss heard /ki/, /ta/, /pu/; the burst from /ka/ yielded /pi/, /ka/, /pu/; and the burst from /ku/ yielded /pi/ or /ki/, /pa/, /ku/. Thus, when the original 20-msec burst from /k/ was spliced onto the same vowel (preceded by aspiration) from which it had been removed, Ss always heard /k/. However, when the vowel context was changed, perception of the consonant sound was altered. These results for /k/ agree almost perfectly with the results obtained by Liberman et al. for /k/ using synthetic speech.

The experiments by Liberman et al. and Schatz demonstrate that a single cue, burst frequency, does not provide sufficient information to identify /k/ in all syllable environments. However, the stimuli presented to Ss in these experiments do not contain the same information that accompanies stop consonants in normal speech. For example, using aspirated portions of exactly 60 msec eliminates duration of the consonant as a cue to its identity and distorts the shape of the envelope. Fischer-Jorgensen (1954) found that in 1,368 spectrograms, the duration of aspiration following the burst was always longer for /g/ than for either /b/ or /d/.

Harris (1953) used tape splicing to transpose the entire energy spectrum of a consonant phoneme from one syllable to another. Harris identified the initial consonant portion of a CVC syllable by manually drawing recording tape over the playback head of a tape recorder and listening for the onset of the consonant and the onset of vowel resonance. The entire energy spectrum for the initial consonant was then removed from its original syllable and spliced onto a new VC syllable from which the original consonant had previously been removed. Harris used 20 different consonants, which were transposed between six different vowel environments: /I/-/I/; /u/-/I/; hi-/æ/; /i/-/æ/; /eI/-/o/; /a/-/o/. In each case, the 20 consonants were transposed from the first vowel to the second. Results of the listening test revealed that syllables were identified with an overall accuracy of slightly better than 80% (chance equals 5%). The results for the six stop consonants (summed over the different vowel conditions) were: /b/ 87%, /d/ 78%, /g/ 78%, /p/ 33%, /t/ 95%, /k/ 80%. It is clear that the energy spectrum which accompanies a stop consonant contains invariant information over a wide range of vowel environments. In the Harris experiment, only /p/ was identified with less than 75% accuracy when it was transposed from one syllable to another. When /p/ was misperceived, it was consistently heard as /t/.

The scores on the listening test might have improved if Harris had removed formant transitions from the vowels prior to splicing a consonant from one vowel environment to another. Since vowel formants change, depending upon the place of articulation of the preceding consonant, transposed syllables often had formant transitions that were inappropriate to the syllable. For example, in one condition, the syllable /gok/ was formed by splicing /g/ from /ga/ onto /ok/ removed from /mok/. Since /m/ is articulated in the front of the mouth, the vowel transitions before /o/ are quite different from those which would normally accompany /gok/. Thus, S may have had difficulty integrating the /g/ in /gok/ with the vowel /o/ because of the abnormal vowel transitions.

The present research systematically investigates the amount of perceptual invariance for stop consonants, using the tape splicing technique originally employed by Harris (1953). Experiment I examines the amount of invariant information by transposing consonants between the steady-state vowels /i/ and /u/. These two vowels have the most disparate energy spectra of any two vowels in English. If, for example, the energy spectrum for /d/ contains invariant information, it should be possible to transpose the consonant energy between /di/ and /du/ (after removing vowel transitions) without altering perception of the original syllables. Experiment II extends the results of Experiment I to the middle vowel /a/ for the unvoiced stops /p/, /t/, /k/.

EXPERIMENT I

Method

Subjects

Forty-two students from the University of Waterloo served as Ss. Each S was tested individually in a small sound-attenuated room with E, in a session lasting about 30 min.

Stimuli

Each of the six stop consonants was spoken 40 times before /i/ and before /u/, producing a total of 480 stimuli. From these original stimuli two types of syllables were constructed: (a) syllables in which transitions were removed, and the initial stop burst plus aspiration was spliced back onto the steady-state vowel, and (b) syllables in which the consonant (burst plus aspiration) originally taken from /i/ was spliced onto /u/, or vice versa. Stimulus tapes were constructed by first recording all 480 syllables in a randomly determined order. Then the initial consonant energy was removed from each syllable, using the following values (in milliseconds): /b/ 20, /d/ 30, /g/ 40, /p/ 60, /t/ 80, /k/ 100. These values were arrived at by inspection of speech spectrograms of the stimuli. The onset of each consonant was determined by manually drawing the recording tape over the playback head of the tape recorder, and monitoring the output via earphones and a Tektronix oscilloscope.

All 480 consonants excised from the vowel were played in isolation. By listening to these sounds (burst plus aspiration), we were able to ascertain that consonants were not followed by vowel resonance. For the voiced stops /b/, /d/, /g/, the energy accompanying the consonant is quite short, and it is difficult to discriminate one voiced stop from another. However, the unvoiced stops /p/, /t/, /k/ were readily identifiable by both authors when the consonant energy was removed from the vowel.

After removing the burst and aspiration portion of each CV syllable according to the above values, the next 50 msec was excised from the remaining vowel portion of the syllable to insure that a steady-state vowel remained. Then all 480 vowels were played to insure that a pure vowel sound was heard and not a stop consonant cued by formant transitions. In those cases where /b/ was heard (due to the abrupt onset caused by tape splicing), additional 10-msec segments were removed until a pure vowel was heard. Thus, perceptual criteria were used to insure that the noise portion of the stop consonant was paired with a steady-state vowel.

For half of the syllables, the consonant was spliced back onto the original vowel to produce transitionless syllables. For the remaining one-half of the syllables, the consonant was transposed between /i/ and /u/. Energy was always transposed between vowels originally having the same consonant, such as /bi/ and /bu/. Examples of original and transposed syllables are shown for each stop consonant in Figs. 2 and 3.
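The splicing procedure can be pictured with a digital analogue, sketched below. The original work was done by cutting magnetic tape, so everything here is an assumption made for illustration: the 10-kHz sample rate, the list-of-samples representation, and the function names `split_syllable` and `transpose`. The perceptual step of trimming further 10-msec segments until a pure vowel is heard is not modelled.

```python
SAMPLE_RATE_HZ = 10_000  # assumed; the original stimuli were analog tape

# Consonant (burst plus aspiration) durations in msec, from the text.
CONSONANT_MS = {"b": 20, "d": 30, "g": 40, "p": 60, "t": 80, "k": 100}
TRANSITION_MS = 50       # vowel onset discarded to leave a steady-state vowel

# Six consonants x 40 repetitions x 2 vowels.
N_STIMULI = len(CONSONANT_MS) * 40 * 2


def ms_to_samples(ms, rate=SAMPLE_RATE_HZ):
    """Convert a duration in msec to a whole number of samples."""
    return ms * rate // 1000


def split_syllable(samples, consonant):
    """Return (consonant segment, steady-state vowel), dropping the transition."""
    c_end = ms_to_samples(CONSONANT_MS[consonant])
    v_start = c_end + ms_to_samples(TRANSITION_MS)
    return samples[:c_end], samples[v_start:]


def transpose(syllable_a, syllable_b, consonant):
    """Swap consonant segments between two syllables sharing the same stop."""
    cons_a, vowel_a = split_syllable(syllable_a, consonant)
    cons_b, vowel_b = split_syllable(syllable_b, consonant)
    return cons_a + vowel_b, cons_b + vowel_a


# Toy 300-msec "recordings": each sample is tagged with its source syllable.
di = [("di", n) for n in range(3000)]
du = [("du", n) for n in range(3000)]
di_onto_u, du_onto_i = transpose(di, du, "d")
print(N_STIMULI, len(di_onto_u), di_onto_u[0][0], di_onto_u[-1][0])
```

Splicing a consonant back onto its own vowel (the control condition) corresponds to concatenating the two parts returned by `split_syllable` for a single syllable.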

Procedure

The stimulus tape was presented on a Sony Model TC630 tape recorder connected to a Dynaco loudspeaker by means of a Dynaco amplifier and preamplifier. S was seated in front of the loudspeaker in a sound-attenuated room. Stimuli were presented at the rate of one every 4 sec. After each presentation, S was instructed to write one of the 12 possible syllables on his answer sheet. S was instructed to guess if he was uncertain of the sound.

Results

The results of the listening test are shown in Table 1 (see Note 1). Inspection of this table reveals a high degree of perceptual invariance for stop consonants. Performance was essentially perfect for sounds with vowel transitions removed. For the transposed syllables, Ss consistently identified syllables in which the initial segment was transposed from /u/ to /i/. When the initial segment was transposed from /i/ to /u/, Ss accurately identified bilabial and alveolar stops, but fell to 54% accuracy for /kiU/ (/k/ from /ki/ spliced onto /u/), and were completely at chance for /giU/ (/g/ from /gi/ spliced onto /u/). An examination of intrusion errors revealed that /kiU/ was identified as /pu/ 63% of the time and as /tu/ 27%, while /giU/ was perceived as /bu/ 90% of the time and as /du/ 9%.

[Figure 2: speech spectrograms plotted against time (scale bar .1 sec); legible panel labels: bu, du.]

Fig. 2. Speech spectrograms of voiced stops used in this experiment before /i/ and /u/. The stimuli in the top row consist of syllables in which vowel transitions were excised from the syllable, and the stop burst and aspiration were spliced onto the remaining steady-state vowel. Note the absence of transitions at the onset of the vowel as compared to the vowels in Fig. 1. The syllables in the lower panel are examples of stimuli in which the stop burst and aspiration for each voiced stop were transposed between /i/ and /u/.

[Figure 3: speech spectrograms plotted against time (scale bar .1 sec); legible panel labels: pi, pu, ku, pu+i.]

Fig. 3. Speech spectrograms of control syllables (upper panel) and transposed syllables (lower panel) for the unvoiced stops.

The relatively poor results for /kiU/ and /giU/ may not reflect a lack of acoustic invariance for the velar consonants. When /k/ and /g/ are spoken before /i/, they are accompanied by energy in the region of the second vowel formant (about 3,000 Hz), which is not present when /k/ and /g/ are spoken before /u/. This is due to the fact that /k/ and /g/ are produced with a closure nearer to the front of the mouth for /i/. This frequency component at 3,000 Hz is not a necessary cue for the perception of /k/ or /g/, since Ss had no difficulty identifying syllables in which the initial energy for /ku/ or /gu/ was spliced onto /i/. It is only when the stop burst from /k/ or /g/ is transposed from /i/ to /u/ that Ss had difficulty identifying the sounds. This suggests that the energy at 3,000 Hz which accompanies /kiU/ and /giU/ caused Ss to misperceive these syllables, probably because /ku/ and /gu/ are never accompanied by high-frequency energy in normal speech.

To test this hypothesis, all of the transposed syllables were passed through a filter which eliminated energy above 2,000 Hz. While filtering had little effect upon the perception of any of the other transposed syllables, Ss correctly identified /kiU/ and /giU/.
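The effect of such a filter on the 3,000-Hz component can be illustrated with a toy signal. The sketch below is not the authors' filter, which is not described; it assumes an idealized brick-wall low-pass filter implemented with a plain discrete Fourier transform, an 8-kHz sample rate, and a synthetic signal made of one 500-Hz and one 3,000-Hz sinusoid standing in for low- and high-frequency energy. All function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Plain O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def low_pass(x, sample_rate_hz, cutoff_hz):
    """Zero every frequency bin above cutoff_hz (idealized brick-wall filter)."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        # Bins above n/2 mirror the lower bins (negative frequencies).
        freq = min(k, n - k) * sample_rate_hz / n
        if freq > cutoff_hz:
            spectrum[k] = 0
    return idft(spectrum)


def amplitude_at(x, sample_rate_hz, freq_hz):
    """Amplitude of the sinusoidal component at freq_hz (must fall on a bin)."""
    n = len(x)
    k = round(freq_hz * n / sample_rate_hz)
    s = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
    return 2 * abs(s) / n


FS = 8000  # assumed sample rate in Hz
N = 160    # 20 msec of signal; bin spacing is 50 Hz
signal = [math.sin(2 * math.pi * 500 * t / FS)
          + 0.5 * math.sin(2 * math.pi * 3000 * t / FS) for t in range(N)]
filtered = low_pass(signal, FS, 2000)

print(amplitude_at(signal, FS, 3000), amplitude_at(filtered, FS, 3000))
print(amplitude_at(signal, FS, 500), amplitude_at(filtered, FS, 500))
```

In this toy case the component below the cutoff passes unchanged while the 3,000-Hz component is removed, which is the manipulation the demonstration relies on.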

In a second demonstration, the consonant portion from /gi/ or /ki/ was substituted with the consonant portion from /gu/ or /ku/, respectively, in normal speech. For this experiment, we recorded 10 short phrases such as "the sterile cuckoo" or "Hungarian goulash," and replaced the initial stop in /kuku/ and /gulas/ with the appropriate consonant removed from /gi/ or /ki/. For all of these altered phrases, listeners heard a perfectly clear /g/ or /k/ at the beginning of the word. The high-frequency energy which accompanies the consonant from /ki/ and /gi/ was typically heard as a click or as background noise.

In a final demonstration, tape loops were made of each of the transposed stimuli, and five Ss were randomly presented with each of these syllables at the rate of 2/sec for 1 min. When /kiU/ and /giU/ were presented, all Ss reported hearing /ku/ and /gu/ after three or four repetitions. Repetition of these syllables caused Ss to regard the high-frequency component of /kiU/ and /giU/ as irrelevant information which was heard as nonspeechlike background noise. None of the other transposed syllables were heard to change in this way, although several Ss reported hearing complete satiation of the consonant portion of the syllable. This latter finding is discussed more fully by Scott and Cole (1972).

Table 1
Percent Correct Recognition Scores

Control
          /b/    /d/    /g/    /p/    /t/    /k/
/i/        99    100     98    100     99     99
/u/       100    100    100     99     99     99

Transposed
         /buI/  /duI/  /guI/  /puI/  /tuI/  /kuI/
           96     92     82     98     97     98

         /biU/  /diU/  /giU/  /piU/  /tiU/  /kiU/
           94     99     21     98     89     54

EXPERIMENT II

Experiment II was performed in order to extend the results of Experiment I to syllables produced before the midvowel /a/.

Method

Subjects

Ten students from the University of Waterloo served as Ss. Each S was tested in a small room with E in a session lasting approximately 30 min.

Stimuli

The syllables /pi/, /pu/, /ti/, /tu/, /ki/, /ku/ were recorded 10 times each, while /pa/, /ta/, /ka/ were recorded 20 times each in random order on magnetic tape. Following the identical procedures used in Experiment I, the consonant portion of the energy spectrum was transposed between /i/ and /a/ and between /u/ and /a/ for each of the unvoiced stops. Thus, Ss heard 120 syllables in random order, 10 each of /pia/, /pai/, /pua/, /pau/, /tia/, /tai/, /tua/, /tau/, /kia/, /kai/, /kua/, /kau/.

Procedure

S was seated in front of a Dynaco loudspeaker and was presented the 120 syllables at the rate of 1 every 4 sec. S was instructed that on each trial he would hear /p/, /t/, or /k/ paired with /i/, /a/, or /u/. S was instructed to write /p/, /t/, or /k/ on his answer sheet after hearing a sound, and to guess if he was uncertain of a particular sound.

Results

The results of the listening test are presented in Table 2. It can be seen that scores were uniformly high for all of the transposed syllables. Overall, Ss identified unvoiced stops transposed between /i/-/a/ with 89% accuracy and between /u/-/a/ with 91% accuracy (chance = 33%). It is interesting that performance was quite high for /kia/ (78%) since performance was relatively low for /kiU/ in Experiment I. This suggests that the low scores for /kiU/ may represent a single case when invariance breaks down for the unvoiced stops.
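The two overall accuracies quoted above can be checked against the per-syllable scores reported in Table 2. The sketch below assumes an unweighted mean over the six syllables in each vowel pairing, which seems reasonable because each syllable was presented equally often; the dictionary and function names are hypothetical.

```python
# Percent correct for each transposed syllable, as reported in Table 2.
TABLE_2 = {
    "pia": 94, "pai": 99, "tia": 98, "tai": 88, "kia": 78, "kai": 75,
    "pua": 83, "pau": 99, "tua": 93, "tau": 100, "kua": 100, "kau": 70,
}

# Three response alternatives (/p/, /t/, /k/) give the chance level.
CHANCE_PERCENT = 100 / 3


def mean_accuracy(vowel):
    """Unweighted mean over syllables pairing /a/ with the given vowel."""
    scores = [score for syllable, score in TABLE_2.items() if vowel in syllable]
    return sum(scores) / len(scores)


i_a = mean_accuracy("i")   # syllables transposed between /i/ and /a/
u_a = mean_accuracy("u")   # syllables transposed between /u/ and /a/
print(round(i_a), round(u_a), round(CHANCE_PERCENT))
```

Rounded to whole percentages, the means agree with the 89% and 91% figures given in the text, and both lie far above the 33% chance level.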

Table 2
Percent Accuracy Identification Scores for /p/, /t/, /k/ Transposed Between /i/-/a/ and /u/-/a/

/pua/  83    /tua/  93    /kua/ 100
/pau/  99    /tau/ 100    /kau/  70
/pia/  94    /tia/  98    /kia/  78
/pai/  99    /tai/  88    /kai/  75

DISCUSSION

This research demonstrates that the energy spectrum which accompanies the noise portion of a stop consonant in initial position contains invariant perceptual information for /b/, /d/, /p/, and /t/. In addition, a number of demonstrations suggest that /g/ and /k/ also have an underlying acoustic invariance. Our data, considered together with the results of Harris (1953), show that listeners can accurately identify certain stop consonants that have been transposed from one vowel environment to another.

It is possible that the noise portions of the stop consonants that were transposed from one syllable to another did contain some frequency transitions embedded within the noise (although these were not evident in speech spectrograms). For example, in some cases, the noise burst is continuous with the onset of the vowel transition, and in these cases it is difficult to isolate a "pure" burst for the stop consonant. However, the presence of these transitions should not have aided in perception of the stop consonant in transposed syllables, since these transitions by themselves are not commutable elements (Liberman et al., 1967, p. 436). The essential point demonstrated in this experiment is that the entire energy spectrum that accompanies a stop consonant prior to the vowel transitions contains invariant perceptual information.

A recent experiment by Winitz, Scheib, and Reeds (1972) strongly supports our results for the unvoiced stops /p, t, k/. These authors removed the acoustic energy that accompanies /p, t, k/ from conversational speech and presented the isolated segments to listeners for identification. All consonants were identified accurately when they were removed from initial or final position in a word before or after /i, a, u/. The only exception was /k/ before /i/, which was often misperceived. These findings are in perfect agreement with the results of the present experiment.

Examination of speech spectrograms (see Figs. 2 and 3) reveals that the overall energy spectrum for a given stop consonant does change from one vowel environment to another. In particular, the aspirated portion of /p/ and /k/ following the stop burst is accompanied by concentrations of energy in the region of the second formant of the following vowel. This frequency information is not an essential cue for perception of /p/ or /k/ since it may be removed by filtering without altering perception of the consonant. However, this frequency information can interfere with perception of /k/ when it is transposed from /i/ to /u/.

Thus, a stop consonant is composed of invariant features as well as features which vary from one vowel environment to another. For /t/, the invariant feature consists of high-frequency noise. But what are the invariant characteristics of /p/ and /k/? Recent research in our laboratory has shown that the shape of the envelope that accompanies /p/ and /k/ may contain invariant information for identifying these consonants (Scott, 1973).

The majority of research in speech perception has involved synthetic stimuli and has focused on isolated cues such as stop bursts or vowel transitions. The conclusion of much of the research with synthetic speech is that the acoustic information needed to identify the stop consonants is embedded in the vowel portion of a consonant-vowel syllable, so that consonants must be recoded from the energy in the speech wave before they are identified (Liberman et al., 1967). In fact, Liberman (1970) maintains that there is a grammar of consonants and vowels no different in principle from the grammar which speakers use to generate and understand sentences.

The present research demonstrates a high degree of acoustic invariance for the most problematic phonemes in the English language, the stop consonants. The stop consonants in initial position consist of both invariant cues and transitional cues which vary from one vowel environment to another. The literature on speech perception has shown that the perceptual system can and does make use of both types of cues; however, the overwhelming emphasis has been placed on the variable transitional cues. It is highly unlikely that a process as evolutionarily advanced as speech perception would fail to utilize all of the available information contained in the speech wave. It is our belief that the process of speech perception is not confined to the analysis of primarily one type of cue, but rather an analysis of all the available cues.

REFERENCES

Fischer-Jorgensen, E. Acoustic analysis of stop consonants. Miscellanea Phonetica, 1954, 2, 42-59.

Halle, M., Hughes, G. W., & Radley, J.-P. A. Acoustic properties of stop consonants. Journal of the Acoustical Society of America, 1957, 29, 107-116.

Harris, C. A study of the building blocks in speech. Journal of the Acoustical Society of America, 1953, 25, 962-969.

Harris, K. S. Cues for the discrimination of American English fricatives in spoken syllables. Language & Speech, 1958, 1, 1-7.

Heinz, J. M., & Stevens, K. N. On the properties of voiceless fricative consonants. Journal of the Acoustical Society of America, 1961, 33, 589-596.

Liberman, A. M. The grammars of speech and language. Cognitive Psychology, 1970, 1, 301-323.

Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. Perception of the speech code. Psychological Review, 1967, 74, 431-461.

Liberman, A. M., Delattre, P., & Cooper, F. The role of selected stimulus variables in the perception of the unvoiced stop consonants. American Journal of Psychology, 1952, 65, 497-516.

Malecot, A. Acoustic cues for nasal consonants. Language, 1956, 32, 274-284.

Schatz, C. The role of context in perception of stops. Language, 1954, 30, 47-56.

Scott, B. The verbal transformation effect as a function of embedded sounds. Unpublished master's thesis, University of Waterloo, 1971.

Scott, B. The waveform envelope revisited. Paper presented to meeting of the Canadian Psychological Association, Victoria, B.C., June 1973.

Scott, B., & Cole, R. A. Auditory illusions as caused by embedded sounds. Journal of the Acoustical Society of America, 1972, 51, 112 (abstract).

Stevens, K., & Halle, M. Remarks on analysis by synthesis and distinctive features. Proceedings of the AFCRL Symposium on Models for the Perception of Speech and Visual Form, Boston, November 1964. Cambridge: M.I.T. Press, 1967.

Winitz, H., Scheib, M. E., & Reeds, J. A. Identification of stops and vowels for the burst portion of /p, t, k/ isolated from conversational speech. Journal of the Acoustical Society of America, 1972, 51, 1309-1317.

NOTE

1. The data for /g/ and /k/ are based on the listening scores of the second 21 Ss only. Scores for the first 21 Ss were not considered for the velars because it was discovered that a 30-msec burst had been used instead of a 40-msec burst. An analysis of variance revealed no difference in listening scores for the first or second group of Ss for any of the other phonemes.

(Received for publication February 6, 1973; revision accepted August 15, 1973.)