J. Acoust. Soc. Jpn. (E) 16, 5 (1995)

Role of prosodic features in the human process of perceiving spoken words and sentences in Japanese

Nobuaki Minematsu and Keikichi Hirose

Department of Electronic Engineering, Faculty of Engineering, University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo, 113 Japan

(Received 9 December 1994)

Although prosodic features of speech are known to play an important role in the transmission of linguistic information, quantitative experiments on their roles in the speech perception process are rather rare. As a step toward the clarification and formulation of the process, three perceptual experiments were conducted. In the first experiment, synthetic speech of isolated words was generated after accent type manipulation. Results showed that the prosodic features are important for word perception, especially in the case of type 1 accent. The gating paradigm was applied to natural word utterances in the second experiment. Using the gated utterances as stimuli, the minimum period required for correct identification was investigated for words with each accent type. Results showed that, utilizing the prosodic features, the perception of words with type 1 accent completes earlier than that of words with other accent types. In the last experiment, sentence stimuli were synthesized after manipulating the phrase and accent components of the fundamental frequency contour. Results showed that a phrase component, even with a small command magnitude, can group words into a phrase unit and, thus, can work as a cue for detecting syntactic structures.

Keywords: Speech perception, Prosodic features, Fundamental frequency contour, Analysis-resynthesis, Gating paradigm

PACS number: 43.71.An, 43.71.Es

1. INTRODUCTION

In recognizing lexical items of spoken sentences, it is thought that a listener utilizes various kinds of information on acoustic features and linguistic properties, besides the basic information of phoneme sequences detected by a left-to-right matching process of segmental features between inputs and templates. In order to clarify the human process of speech perception, a series of psycholinguistic experiments has been conducted, with the following results1-3):

1) If a syllable in a word is replaced by a silence, the replacement is perceived at a lower rate when it is uttered in a longer context. This result indicates that the unit of speech perception is not limited to phonemes but includes longer segments, such as words.

2) If a word is uttered correctly, the access to it in our mental lexicon completes in a period shorter than the time required for detecting a pronunciation error located at the end of another utterance of the same word. This result indicates that the lexical access occurs before the identification of all the constituent segments completes.

3) If a spoken word is reproduced in a noisy environment, the rate of correct identification is largely related to its familiarity: a higher rate for more familiar words. This indicates that the correct access to familiar words completes in our mental lexicon with less information than in the case of unfamiliar words.


4) If a word is reproduced in a sentence under the gating paradigm, the period necessary for its correct identification is largely related to the degree of semantic commonness of the sentence: shorter periods for more common sentences. This indicates that the process of lexical access is influenced by higher-level linguistic information, such as the meaning.

Based on these findings, a tentative model was constructed for the human process of speech perception.2) In the experiments, however, no systematic control was conducted on the prosodic features of the stimuli and, consequently, the model failed to include any processing modules for the prosodic features.

Through the analyses of prosodic features, it was already shown that the prosodic features play an important role in the transmission of the linguistic information of an utterance, such as the lexical meaning, the syntactic structure, and the focal condition.4) These analyses, however, were conducted with the major purpose of constructing prosodic rules for speech synthesis and were only followed by primitive experiments on perception. Although several perceptual works have already been reported on the role of prosodic features, they were mostly limited to special cases, viz., cases where syntactic (and, therefore, semantic) ambiguities arise from the segmental information alone.5-9) In view of these considerations, perceptual experiments were planned for spoken words and sentences to examine the role of prosodic features in recognizing words and phrases.

Although the process of speech perception is largely affected by the acoustic and linguistic contexts1,2,10) as mentioned already, experiments with word stimuli are still very useful in clarifying the process because of their limited correlates. Two experiments were designed to investigate the effect of accent on word identification, where word stimuli were generated from synthetic speech with original and modified accent types. Although it is known that the prosodic features have several acoustic correlates, such as the fundamental frequency contour, the glottal source power, and the phoneme duration, only the first was directly manipulated in the current experiments, taking its priority in the generation and perception of Japanese prosody into account.5,11)

An experiment was further designed using sentence stimuli to investigate the effects of the accent components of the F0 contour on the perception of words in a context and the effects of the phrase components on the perception of syntactic boundaries. In this experiment, both the phrase and accent components were manipulated. In the forthcoming sections, each of the above experiments is described in detail.

2. ROLE OF LEXICAL ACCENTS IN THE PERCEPTION OF SPOKEN WORDS

2.1 Background and Objective

When a content word of Japanese is uttered in isolation, its prosodic features can be fully represented by the accent type, denoted by a high-low pattern of F0 for each constituent mora. Although, for n-mora words, 2^n high-low combinations are possible, the number of accent types in actual use is strongly limited. In the Tokyo dialect, only (n+1) accent types are used, which are usually denoted as "type i" accents (i=0, 1, ..., n). Figure 1 schematically shows all of the accent types for the case of n=4. Type i accent is characterized by a rapid downfall in the F0 contour around the end of the i-th mora, except for the case of i=0, which shows no apparent downfall.

Although it is true that a spoken word can be distinguished from other words without information on the accent type, except in the case of identical phonemic sequences, we cannot conclude that word accents have no role in word perception. Since accent types can easily be recognized by humans, it is quite natural to assume that they have some roles. From this point of view, an experiment was designed,12,13) where 4-mora words with original accent types of 0 to 3 were selected as speech materials. These materials were put through the analysis-resynthesis process to generate synthetic speech with original and modified accent types. Type 4 accent was excluded in this experiment because it can only be distinguished from type 0 accent when the word is followed by a particle such as /ga/.

Fig. 1 Binary description of the F0 contours for each of the five accent types of 4-mora words of the Tokyo dialect in Japanese.
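The high-low rule summarized in Fig. 1 can be stated compactly. The following Python sketch is illustrative only (the function name is ours, not the authors'); it generates the binary pattern for an n-mora word of the Tokyo dialect:

```python
def accent_pattern(n_morae: int, accent_type: int) -> str:
    """Binary high (H) / low (L) pattern of an n-mora Tokyo-dialect word.

    accent_type = 0 means unaccented (no downfall); accent_type = i (1 <= i <= n)
    means the F0 falls right after the i-th mora, as sketched in Fig. 1.
    """
    pattern = []
    for m in range(1, n_morae + 1):          # mora index, 1-based
        if accent_type == 0:                 # type 0: L H H H ... (no downfall)
            pattern.append("L" if m == 1 else "H")
        elif accent_type == 1:               # type 1: H L L L ... (early downfall)
            pattern.append("H" if m == 1 else "L")
        else:                                # type i >= 2: L, then H up to mora i, then L
            pattern.append("H" if 2 <= m <= accent_type else "L")
    return "".join(pattern)

# The five accent types of 4-mora words (cf. Fig. 1); type 4 coincides with type 0
# within the word itself, which is why type 4 was excluded from the experiment.
for i in range(5):
    print(f"type {i}: {accent_pattern(4, i)}")
```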


2.2 Speech Material

As shown in Table 1, 12 nouns were selected for each of the type 0 to type 3 accents. They were uttered by an adult male speaker of the Tokyo dialect at a rate of around 7 mora/s. After being recorded on digital audio tape, the utterances were resampled at 10 kHz with 12-bit accuracy to be used as the speech materials. Then, during the PARCOR analysis-resynthesis process, the following three types of F0 contour manipulation were applied to each speech material.

CASE 1: making F0 constant at 100 Hz.
CASE 2: converting the accent type to another type.
CASE 3: no change in F0 (original accent type).

The CASE 2 manipulation was performed using a functional model of F0 contour generation14) (henceforth, the F0 model) by changing the onset and offset timings of the accent command. This is because, with the F0 model, every accent type is well characterized by the location of the accent command with respect to the segmental boundaries. Although the command amplitude may differ among the accent types, it was left unchanged during the manipulation. A band-elimination of 0.5 kHz to 3.0 kHz was further applied to all the synthetic speech to generate the speech stimuli. This final process was conducted to make it difficult to perceive the stimuli only by the left-to-right process of syllable-based matching.
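For reference, the F0 model14) represents a log F0 contour as the sum of a baseline, phrase components, and accent components. A standard formulation (reproduced here from the general literature on the model, not from this paper) is:

$$\ln F_0(t) = \ln F_b + \sum_{i=1}^{I} A_{pi}\,G_p(t - T_{0i}) + \sum_{j=1}^{J} A_{aj}\left\{G_a(t - T_{1j}) - G_a(t - T_{2j})\right\},$$

$$G_p(t) = \begin{cases}\alpha^2 t\,e^{-\alpha t}, & t \ge 0\\ 0, & t < 0\end{cases} \qquad
G_a(t) = \begin{cases}\min\left[1 - (1 + \beta t)\,e^{-\beta t},\;\gamma\right], & t \ge 0\\ 0, & t < 0\end{cases}$$

where $F_b$ is the baseline frequency, $A_{pi}$ and $T_{0i}$ are the magnitude and timing of the $i$-th phrase command, and $A_{aj}$, $T_{1j}$, and $T_{2j}$ are the amplitude, onset, and offset of the $j$-th accent command. Under this formulation, the CASE 2 manipulation amounts to shifting $T_{1j}$ and $T_{2j}$ while leaving $A_{aj}$ unchanged, and CASE 1 amounts to removing all commands and fixing F0 at 100 Hz.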

Table 1 Four-mora words of Japanese used for the experiment on word perception. Every word in the i-th group of three rows has type (i-1) accent. The accent conversion for a word under the CASE 2 manipulation is indicated in the leftmost column of the corresponding row.
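The band-elimination from 0.5 kHz to 3.0 kHz described above could be realized, for example, as a digital band-stop filter. The sketch below is only an illustration under assumed parameters; the paper does not specify the filter type or order, and a Butterworth design is assumed here:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_eliminate(x: np.ndarray, fs: int = 10_000,
                   lo: float = 500.0, hi: float = 3000.0, order: int = 6) -> np.ndarray:
    """Attenuate the 0.5-3.0 kHz band of a synthesized stimulus (band-stop filter).

    Hypothetical implementation: the filter type, order, and zero-phase
    filtering are assumptions, not details given in the paper.
    """
    sos = butter(order, [lo, hi], btype="bandstop", fs=fs, output="sos")
    return sosfiltfilt(sos, x)
```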

2.3 Method and Procedure

The stimuli were presented through headphones with a 4-s inter-stimulus interval to 10 male Japanese subjects, who were asked to reproduce the words orally with the original accent types. The reproduced words were recorded to be used for the off-line calculation of the correct recognition rate. The reproduction was counted as correct regardless of the reproduced accent type. Experiments were conducted for the three cases in the order of CASE 2, 1, and 3.

In organizing the experiment, we should note that subjects may memorize the words if they are repeated many times during the experiment, which would make the results unreliable. In order to cope with the effect of memorization, different words were selected for each combination of the original and converted accent types. For instance, as shown in Table 1, "raioN (lion)," "akabou (railway porter)," "niNjiN (carrot)," and "naiyou (content)" were selected as the words with type 0 accent to be converted into type 1 accent by the CASE 2 manipulation.

2.4 Results and Discussions

Figure 2 shows the recognition rates for each CASE and each "original" accent type as averages over all the subjects. Bars for CASE 3 indicate the word recognition rates for stimuli with the "original" F0 contours under the band-elimination condition. Each of these bars shows a similar value of around 75%, indicating that the quality of the synthesized speech is almost the same for all the accent types. If we compare these results with those for CASE 1, large drops in recognition rate are observed for all the accent types except type 0 accent, implying that the prosodic features play a definite role in the perception of words with accent types 1, 2, and 3. As for the words with type 0 accent, the role seems very small, but we should also note that their F0 contours are originally flat and that no drastic change results from the CASE 1 manipulation. The figure also shows further drops in recognition rate for CASE 2, with a large drop for type 0 accent. From this result, however, we cannot conversely conclude that the role of prosodic features is also important for the perception of words with type 0 accent, considering the following properties of the CASE 2 manipulation. While the word accent was neutralized with a flattened F0 contour in CASE 1, it was converted into a "known" type in CASE 2. In addition to the effects of modifying the F0 contour, the influence of the conversion into a "known" type must be considered in CASE 2. Namely, the perception of another known accent type may prevent correct recognition more strongly than a merely flat contour does. Additional experiments are necessary to clarify this issue.

Fig. 2 Word recognition rates shown separately for each case and for each "original" accent type.

Among the drops of recognition rate in CASE 1 and CASE 2, the largest ones are clearly observed for words with type 1 accent in both cases. These results imply that the role of prosodic features is largest in the perception of words with type 1 accent. If this is the case, the recognition rate after the accent type conversion should be lowest for the conversion into type 1 accent from non-type-1 accents. Using the same data as those for Fig. 2, this type of recognition rate is shown in Fig. 3 as the bars on the left-hand side (denoted by CASE 2), separately for each converted accent type. As expected, the lowest value is observed for type 1 accent (after conversion). Figure 3 also shows the rate of accent type recognition for the CASE 2 stimuli as the bars on the right-hand side (denoted by CASE 2'), where correct/wrong judgments were made with respect to the converted accent types of the stimuli. (If the subject's reproduction had the same accent type as the converted accent type, the result was counted as correct.) The largest rate is obtained for stimuli with type 1 accent (as the converted accent type), indicating that the subjects utilize prosodic features best in the case of type 1 accent. The reason for the largest role of prosodic features in the case of type 1 accent is considered to be as follows: an early downfall in the F0 contour for type 1 accent makes it possible to identify the accent before the word recognition process is completed. Accordingly, the prosodic information can be utilized to facilitate the access to the mental lexicon in our LTM (long term memory) by limiting the search space.

Fig. 3 Recognition rates of words (bars on CASE 2) and of accent types (bars on CASE 2'). Bars on CASE 2 indicate the recognition rate for words with accent type conversion into type i from non-type i by the CASE 2 manipulation.

In order to prove the above hypothesis of the earlier perception of words with type 1 accent, an additional experiment was designed using the gating technique.15) In this experiment, 45 target utterances of Japanese 4-mora nouns (15 words for each of accent types 0 to 2) and 20 dummy utterances were prepared as speech materials. As shown in Fig. 4, the initial portion of duration d was retained and the rest was replaced by silence for each material. The gating interval d was varied from 150 ms to 500 ms in steps of 50 ms, producing 8 sets of 65 stimuli. They were presented to 6 subjects, speakers of the Tokyo dialect, through headphones in 8 sessions; an earlier session used a set with a shorter gating interval. In each session, the 65 stimuli were randomized and presented with an inter-stimulus interval of 5 s. The subjects' task was to guess the missing part and write down the entire word on a sheet immediately after hearing each stimulus. If the subjects were not able to guess, they were allowed to write "×." After all the sessions, the minimum durations for correct identification were obtained separately for each word.

Fig. 4 Original word utterance and gated stimulus for the experiment. The gating interval d was increased from 150 ms to 500 ms in 50-ms steps.

Figure 5 shows the averages and standard deviations of the minimum durations separately for words with each accent type. According to the analyses of variance for these results, significant differences were observed only between types 0 and 1 (F(1,178)=18.74, p=2.50×10^-5) and between types 1 and 2 (F(1,178)=47.41, p=9.53×10^-11); between types 0 and 2, F(1,178)=6.26, p=1.32×10^-2. These results clearly show that an earlier downfall in the F0 contour makes the earlier identification of the accent type possible and facilitates the word recognition process.

Fig. 5 Minimum durations required for correct identification of 4-mora words with accent types 0 to 2.
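To make the construction of the gated stimuli concrete, a minimal sketch is given below. The function name, the use of NumPy arrays, and the variable word_waveforms are our assumptions; only the keep-the-initial-portion-and-silence-the-rest operation is taken from the text.

```python
import numpy as np

def gate_utterance(samples: np.ndarray, d_ms: float, fs: int = 10_000) -> np.ndarray:
    """Keep the first d_ms milliseconds of a word utterance and replace the rest with silence."""
    gated = np.zeros_like(samples)
    n_keep = min(len(samples), int(round(d_ms * fs / 1000)))
    gated[:n_keep] = samples[:n_keep]
    return gated

# One stimulus set per gating interval: d = 150, 200, ..., 500 ms (8 sets of 65 stimuli).
# `word_waveforms` stands for the 65 digitized utterances (hypothetical variable name).
# stimulus_sets = {d: [gate_utterance(w, d) for w in word_waveforms]
#                  for d in range(150, 501, 50)}
```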

Supposing that the effects of prosodic features on the access to the mental lexicon are determined only by the location of the downfall in the F0 contour, the role should gradually decrease for types 1, 2, 3, and 0, in this order. No result, however, was found to support this hypothesis in these two experiments. This implies the existence of perceptual mechanisms that make it possible to identify the word only with the acoustic features of the beginning of the utterance.16)

3. ROLE OF PHRASE COMPONENTS IN THE PERCEPTION OF SPOKEN SENTENCES

3.1 Background and Objective

Although, in the previous section, an investigation was conducted on the role of F0 contours in speech perception, it was limited to the accent components of word F0 contours. In the case of spoken sentences, however, the phrase components should be taken into account as well as the accent components, because they also play an important role in the transmission of linguistic information, such as the syntactic structure.4) Several works have already been carried out using spoken phrases and sentences with the major concern of resolving syntactic ambiguities by prosodic features.5-9) In these works, however, the discussions were mainly on the prosodic features of small portions such as particles or content words, without quantitative formulations of F0 contours. Based on these considerations, we designed an experiment with sentence stimuli, where the phrase and accent components of the F0 contours were controlled using the F0 model during the analysis-resynthesis process. The investigation covered the role of accent components in recognizing words in a sentence and the role of phrase components in spoken sentence perception.13)

3.2 Speech Material

For the experiment on sentence perception, 16 target sentences were prepared together with 27 dummy sentences. Every target sentence has the same syntactic structure, shown in Fig. 6, to avoid the experimental results being affected by differences in the level of syntactic complexity. The upper row of the figure shows the syntactic structure in Japanese, while the lower row shows it converted into English word order for the readers' sake.

Fig. 6 Syntactic structure of the target sentences. It is shown in Japanese word order in the upper row, while it is converted into English word order in the lower row. Bold-italic letters indicate the major syntactic elements.

Each target sentence contains 11 content words, each of which is followed by a particle to make a Japanese "bunsetsu." If the differences in the level of familiarity were large among the words included in the sentences, they might strongly influence the word recognition rate. In order to avoid this situation, all of the content words were selected from Japanese textbooks for the lower grades of primary school. An example of the target sentences is:

pasokoNo (O) dounyuushita (V) buchouga (S) teNiNga (S) shimeshita (V) sousahouo (O1) kiiboodoo (O) tsukaenai (V) yakuiNni (O2) kaigishitsude (adv.) hiroushita (V).

(The section head, who has introduced personal computers, showed to the executives, who cannot use keyboards, the operations which the salesclerk explained.)

Although each target sentence consists of 4 phrases, the predicate phrase (the last phrase, without underline in Fig. 6) was excluded from the calculation of the recognition rate. This is because the first 3 phrases have a similar configuration, viz., a combination of a subjective/objective noun, a verb, and an objective/subjective noun, while the predicate phrase has a different configuration, a verb and its adverb. In order to examine the role of accent in word recognition in a sentence context, 4- to 5-mora words with accent types 0 to 3 were selected and placed at the O1 and O2 positions: 8 words for each of the 4 accent types. As for the dummy sentences, the number of content words in a sentence varied from 5 to 18, with various syntactic structures. All the target and dummy sentences were uttered by an adult male speaker of the Tokyo dialect. A pause usually works as an important cue for a syntactic boundary, and it may hide the role of phrase components in the perception of the syntactic structure. To avoid this situation, the speaker was asked to read each sentence in one breath. As a result, the speech rate of the samples was approximately 9 mora/s, slightly faster than that of the word utterances used in the previous section.

After being recorded and digitized, all of the speech samples were put through the analysis-resynthesis process to generate the stimuli for the experiment. Unlike the case of the word stimuli in section 2, the LMA (Log Magnitude Approximation) filter was used in the process to improve the speech quality in segmental features.17,18) During the process, the F0 contours were manipulated by controlling all of the phrase commands and all of the accent commands of the F0 model simultaneously. By setting the magnitude of every phrase command to zero, the phrase components can be deleted from the stimuli. A preliminary experiment using stimuli after the phrase component deletion, however, indicated that they still sounded as if they had phrase components. This is because, in natural utterances, the amplitude of an accent command is usually larger if it is located at the beginning of a phrase, and smaller if located at the end. To cope with this effect, the amplitude of every accent command in a sentence was controlled to have the same value. A similar control was also applied to the magnitude of the phrase commands. In the current experiment, both the magnitude of the phrase command and the amplitude of the accent command were varied from 0.0 to 0.3 in steps of 0.1, yielding 16 (=4×4) F0 contours for a sentence. The placement and timing control of the phrase and accent commands were conducted as follows:

(1) Based upon the binary description of accent types as shown in Fig. 1, the onset of the accent command was placed 40 ms prior to the voice onset of the first high-pitch mora in each bunsetsu, and the offset was placed 40 ms prior to the voice onset of the first low-pitch mora after a high-pitch mora (or high-pitch morae).

(2) A phrase component was placed for each phrase (including the predicate phrase). The onset timing of a phrase command was positioned 50 ms prior to the voice onset of the first mora of the corresponding phrase.

In the above control, the onset and offset timings were placed 40/50 ms prior to the corresponding segmental features. This is because, in the F0 model, the actual rise (or downfall) in the F0 contour has some delay due to the control mechanism of the model.14) Figure 7 shows an example of the F0 contours obtained with the procedures above. The baseline of each sentence F0 contour was adjusted so that the logarithmic mean of F0 was 110 Hz.

Fig. 7 An example of the F0 contours used for the synthesis of the target stimuli. Crosses in the top panel indicate the F0 values extracted from the original utterance.
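The timing rules (1) and (2) above can be summarized in a short sketch. Everything below (the function names and the representation of morae as voice onset times with high/low flags) is a hypothetical illustration of the 40-ms and 50-ms leads, not the authors' implementation:

```python
def accent_command_times(mora_onsets, highs, lead=0.040):
    """Accent command onset/offset for one bunsetsu.

    mora_onsets : voice onset time (s) of each mora in the bunsetsu
    highs       : True for high-pitch morae in the binary description (Fig. 1)
    Returns (onset, offset); offset stays None when the high stretch reaches
    the end of the bunsetsu, i.e. no following low-pitch mora is available.
    """
    onset = offset = None
    prev_high = False
    for t, high in zip(mora_onsets, highs):
        if high and not prev_high and onset is None:
            onset = t - lead      # 40 ms before the first high-pitch mora
        if (not high) and prev_high and offset is None:
            offset = t - lead     # 40 ms before the first low-pitch mora after the highs
        prev_high = high
    return onset, offset

def phrase_command_time(first_mora_onset, lead=0.050):
    """Phrase command onset 50 ms before the voice onset of the first mora of the phrase."""
    return first_mora_onset - lead
```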

3.3 Method and Procedure

As mentioned already, 16 types of F0 contours are possible for a sentence, and all of these were realized for each of the 16 target sentences in synthesizing the stimuli. As a result, 256 target stimuli were generated, which were divided evenly into 16 groups. The division was conducted so that each one of the 16 target sentences and each one of the 16 F0 contour types would appear once in a group. (Within a group, the F0 contour types are different among the target sentences.) The 27 dummy stimuli (27 different sentences) were also included in each group to form the 16 stimulus groups for the experiment. For a stimulus group, the 7-bunsetsu dummy stimuli were generated with flat F0 contours, while the other dummy stimuli were synthesized with F0 contours randomly selected from the 16 combinations of the command values. The experiment was conducted with 8 male subjects in two sessions separated by an interval of two days. For the first session, 8 stimulus groups were selected and each group was presented to one of the subjects. The other 8 groups were used in the second session in a similar way. Each stimulus was followed by an inter-stimulus interval with a length equal to that of the stimulus, during which the subjects were asked to reproduce the whole sentence orally or, if not possible, as many content words as they could. For the correct/wrong judgment of the reproduction, particles were not taken into account. While listening to the stimuli, the subjects were required to perform an extra task, in which they had to trace a slowly-moving mark on the CRT display of a personal computer. No mark appeared during the inter-stimulus interval for the reproduction of the stimuli.
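One way to realize the stated division (each of the 16 target sentences and each of the 16 F0 contour types appearing exactly once per group) is a cyclic, Latin-square-like assignment. The sketch below is only an illustration; the paper does not specify which particular assignment was used:

```python
N = 16  # 16 target sentences x 16 F0-contour types = 256 target stimuli

# groups[g] is a list of (sentence index, contour-type index) pairs for group g.
groups = [[(s, (s + g) % N) for s in range(N)] for g in range(N)]

# Sanity check: every sentence and every contour type appears exactly once per group.
for g in groups:
    assert len({s for s, _ in g}) == N and len({c for _, c in g}) == N
```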

After the experiments above, evaluation tests were performed on the quality of the synthetic speech stimuli, where bunsetsu speech segmented from the following three types of dummy sentence stimuli was presented to 4 subjects: 1) natural speech, 2) synthetic speech with flat F0 contours, and 3) synthetic speech with all of the phrase and accent command values set to 0.3. The word recognition rates for the three cases were 98.3%, 97.7%, and 98.7%, respectively. These high and mutually similar recognition rates indicated that the segmental features of the synthetic speech were not seriously degraded by the analysis-resynthesis process, even with the manipulations of the F0 contours.

3.4 Results and Discussions

Figure 8 shows the word recognition rates obtained for each magnitude of the phrase command and for each amplitude of the accent command. The magnitude of the phrase command is indicated on the horizontal axis, while the amplitude of the accent command is denoted by the symbols in the figure. This figure also shows the word recognition rate for the 7-bunsetsu dummy sentences with flat F0 contours, which is much higher than the rates for the 11-bunsetsu target sentences. This is considered to be partly due to the limited capacity of our short term memory (STM), the so-called "seven chunks," indicating that the number of possible entries is around 7. While all of the constituent content words in a sentence can be stored in the STM in the case of 7-bunsetsu sentences, they should partly overflow from the STM in the case of 11-bunsetsu sentences. Another explanation is also possible: the complexity of the syntactic structure of the target sentences degraded the recognition performance.

Fig. 8 Word recognition rate for each magnitude of phrase command and for each amplitude of accent command.

Unlike the case of the isolated word stimuli in section 2, no drastic change was observed among any combinations of the phrase and accent command values. This result seems to indicate a very limited role of the prosodic features in a sentence context, but the role can be considered to be obscured by the rather large degradation in the word recognition rate caused by the limited size of the STM and the complexity of the syntactic structure, as mentioned above.

It was already reported that the phrase components have no effect on understanding the content of spoken sentences.19) The results shown in Fig. 8 seem to support this hypothesis, but an inspection of the results from a different viewpoint indicates that the phrase component has a definite role in the process of sentence speech perception. Figure 9 shows the ratios of the phrase recognition rate to the word recognition rate (henceforth, P/W ratios), which increase as the magnitude of the phrase command increases. Since a phrase is counted as being correctly recognized only when all of its constituent content words are correctly recognized, a higher P/W ratio means that a correctly recognized word is included with higher probability in one of the correctly recognized phrases. As schematically indicated in Fig. 10, even with no increase in the word recognition rate, the P/W ratio can still increase largely. Therefore, from the results in Fig. 9, we can conclude that words in a sentence tend to be recognized after being grouped into a phrase unit by the phrase component and that the phrase component should work as a cue for detecting the syntactic structure of a sentence. It should also be noted that a rather steep rise is observable in the P/W ratio from the case of phrase command magnitude 0.0 to that of 0.1. This indicates that even a small phrase component is useful for the detection of the syntactic structure. The analysis of variance showed a significant difference in the P/W ratio between the cases of 0.0 and 0.1 (F(1,6)=14.49, p=8.89×10^-3).

Fig. 9 Ratio of phrase recognition rate to word recognition rate for each magnitude of phrase command and for each amplitude of accent command.
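To make the P/W ratio concrete, the following sketch (with toy data in the spirit of Fig. 10, not the actual experimental results) computes it from per-word correctness grouped by phrase:

```python
def pw_ratio(phrases):
    """Ratio of phrase recognition rate to word recognition rate.

    phrases : list of phrases, each a list of booleans (True = content word
              correctly recognized).  A phrase counts as recognized only when
              all of its constituent content words are recognized.
    """
    words = [w for ph in phrases for w in ph]
    word_rate = sum(words) / len(words)
    phrase_rate = sum(all(ph) for ph in phrases) / len(phrases)
    return phrase_rate / word_rate

# Six of nine words recognized in both cases, but the P/W ratio is higher when
# the recognized words cluster into whole phrases (cf. Fig. 10).
scattered = [[True, True, False], [True, False, True], [False, True, True]]
grouped   = [[True, True, True],  [True, True, True],  [False, False, False]]
print(pw_ratio(scattered), pw_ratio(grouped))   # 0.0 vs 1.0
```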

Figure 11 shows the recognition rates for the words at the locations O1 and O2 as functions of the amplitude of the accent command for each accent type. Although the rate stays within a rather limited range for the accents of types 0, 2, and 3, it increases with the increase of the command amplitude for type 1 accent. The larger role of type 1 accent in word perception was thus indicated also in a sentence context.

Fig. 10 A model example showing the increase of the P/W ratio with the increase of the magnitude of the phrase command. Bold letters indicate the words correctly recognized, while plain letters indicate the words not recognized.

Fig. 11 Word recognition rate as a function of the amplitude of the accent command.

4. CONCLUSIONS

From the results of the perceptual experiments using synthetic word and sentence stimuli, the following findings were obtained on the role of prosodic features in the human process of speech perception:

(1) The role of prosodic features in word perception is largest for words with type 1 accent. This is true not only in the case of isolated utterances but also in the case of a sentence context.

(2) A phrase component has the function of grouping words into the corresponding phrase and, therefore, plays a role in detecting the syntactic structure of a sentence.

Although these findings can be incorporated into the speech perception model,2) further experiments are still necessary to confirm them. For instance, the length of the target words was restricted to 4 morae in the experiments in section 2. For longer words, a larger role of prosodic features may be found also in the case of non-type-1 accents. As for the experiment in section 3, it should be reorganized with simple sentences containing a smaller number of words (say, less than 7) to avoid the effect of STM overflow. The segmental features should also be degraded, such as by the band-elimination process used in section 2. These problems are left for future research.

Although it has not been mentioned so far, a preliminary experiment was conducted on the perception of lexical accents.20) The result implied the existence of an accent type dictionary in our mental lexicon, separate from the ordinary word dictionary. This hypothesis, however, should also be checked by further experiments.

REFERENCES

1) H. Fujisaki, K. Hirose, and H. Udagawa, "A study on units of processing in the perception of continuous speech," IEICE Tech. Rep. SP86-53, 15-22 (1986).
2) H. Fujisaki, K. Hirose, S. Ohno, and N. Minematsu, "Influence of context and knowledge on the perception of continuous speech," Proc. ICSLP 90, Vol. 1, 417-420 (1990).
3) N. Minematsu, S. Ohno, K. Hirose, and H. Fujisaki, "The influence of semantic and syntactic information on spoken sentence recognition," Proc. ICSLP 92, Vol. 1, 153-156 (1992).
4) H. Fujisaki, K. Hirose, and N. Takahashi, "Manifestation of linguistic information in the voice fundamental frequency contours of spoken Japanese," Trans. IEICE E76-A, 1919-1926 (1993).
5) J. Azuma and Y. Tsukuma, "Prosodic features marking the major syntactic boundary of Japanese: A study on syntactically ambiguous sentences of the Kinki dialect," Proc. ICSLP 90, Vol. 1, 453-455 (1990).
6) C. M. Beach, "The interaction of prosodic patterns at points of syntactic structure ambiguity: Evidence for cue trading relations," J. Mem. Lang. 30, 644-663 (1991).
7) J. J. Venditti and H. Yamashita, "Prosodic information and processing of temporarily ambiguous constructions in Japanese," Proc. ICSLP 94, Vol. 3, 1147-1150 (1994).
8) I. Lehiste, J. P. Olive, and L. A. Streeter, "Role of duration in disambiguating syntactically ambiguous sentences," J. Acoust. Soc. Am. 60, 1199-1202 (1976).
9) A. Ichikawa, "Comparison of comprehension test results of reading and hearing of Japanese garden-path sentences," Proc. Spring Meet. Acoust. Soc. Jpn. 1-Q-17, 143-144 (1994).
10) A. Salasoo and D. B. Pisoni, "Interaction of knowledge sources in spoken word identification," J. Mem. Lang. 24, 210-231 (1985).
11) S. Takeda and A. Ichikawa, "Analysis of prosodic features of prominence in spoken Japanese sentences," Proc. ICSLP 90, Vol. 1, 493-496 (1990).
12) N. Minematsu, K. Hirose, and M. Ito, "Experimental study on the role of prosodic features in the human process of speech perception," Proc. Autumn Meet. Acoust. Soc. Jpn. 2-9-18, 381-382 (1992).
13) N. Minematsu and K. Hirose, "Role of prosodic features in the human process of speech perception," Proc. ICSLP 94, Vol. 3, 1151-1154 (1994).
14) H. Fujisaki and K. Hirose, "Analysis of voice fundamental frequency contours for declarative sentences of Japanese," J. Acoust. Soc. Jpn. (E) 5, 233-242 (1984).
15) F. Grosjean, "Spoken word recognition processes and the gating paradigm," Percept. Psychophys. 28, 267-283 (1980).
16) H. Fujisaki, K. Hirose, H. Udagawa, and N. Kanedera, "A new approach to continuous speech recognition based on considerations on human processes of speech perception," Proc. ICASSP 86, Vol. 3, 1959-1962 (1986).
17) S. Imai and T. Kitamura, "Speech analysis synthesis system using the log magnitude approximation filter," Trans. IEICE J61-A, 527-534 (1978).
18) S. Imai, "Log magnitude approximation (LMA) filter," Trans. IEICE J63-A, 886-893 (1980).
19) Y. Kitahara, S. Takeda, K. Ichikawa, and Y. Tohkura, "Role of prosody in cognitive process of spoken language," Trans. IEICE J70-D, 2095-2101 (1987).


20) K. Hirose, N. Minematsu, and M. Ito, "Experimental study on the role of prosodic features in the human process of spoken word perception," Proc. ESCA '93 Workshop on Prosody, 200-203 (1993).

Nobuaki Minematsu was born in Hyogo, Japan, on December 17, 1966. He received the B.E. degree in electrical engineering in 1990, and the M.E. and Ph.D. degrees in electronic engineering in 1992 and 1995, respectively, from the University of Tokyo. Since April 1995, he has been with Toyohashi University of Technology as a research assistant. He has been engaged in speech information processing by humans and machines, especially in the fields of speech perception and speech recognition. He is a member of the Institute of Electronics, Information and Communication Engineers, the Information Processing Society of Japan, the Japanese Society for Artificial Intelligence, and the Acoustical Society of Japan.


Keikichi Hirose was born in Kanagawa, Japan, on December 3, 1949. He received the B.E. degree in electrical engineering in 1972, and the M.E. and Ph.D. degrees in electronic engineering in 1974 and 1977, respectively, from the University of Tokyo. Since 1977, he has been a faculty member at the University of Tokyo, where he became a Professor of the Department of Electronic Engineering in 1994 and has been a Professor of the Department of Information and Communication Engineering since 1995. From March 1987 until January 1988 he was a Visiting Scientist at the Research Laboratory of Electronics, Massachusetts Institute of Technology. Although his research interests widely cover the field of speech information processing, such as analysis, synthesis, perception, and recognition, his major interest is in prosody. He is a member of the Institute of Electrical and Electronics Engineers, the Acoustical Society of America, the European Speech Communication Association, the Institute of Electronics, Information and Communication Engineers, the Japan Society of Applied Physics, the Information Processing Society of Japan, the Japanese Society for Artificial Intelligence, the Association for Natural Language Processing, and the Acoustical Society of Japan.
