A Text-to-Multivoice converter for Spanish: Real time operation analysis

10
North-Holland 1 Microprocessing and Microprogramming 22 (1988) 1-10 A Text-to- Multivoice Converter for Spanish: Real Time Operation Analysis A. Santos, C. L6pez and E. Mufioz Dpto. Electr6nica y Sistemas Digitales, E. T.S.I. Telecomunica- ci6n, Ciudad Universitaria, 28040 Madrid, Spain The hardware and software structures for a real time, unlimit- ed vocabulary, text-to-speech converter for Spanish are de- scribed. The structures are based on two processors that per- form the basic processes involved in the text-to-speech by rule conversion. One of the basic tasks is to obtain periodically the set of parameters that control the speech synthesizer. A general-purpose INTEL 8086 processor is used for this pro- cess: it is responsible for the linguistic processes and for the input text source interface. The second basic task is digital speech synthesis which is performed by a specialized digital signal processor (NEC 77P20) that implements a modified Klatt synthesizer. Programs developed for this processor are detailed in the text. An occupation factor for each of the two processors is also given: this factor reflects the ability of the processor to perform the proposed tasks in real time. Keywords: Speech synthesis, Text-to-Speech Conversion, Real-time processing. 1. Introduction Text-to-Speech Conversion (Tsc) can be of great in- terest when simple man-computer interface is sought, and as an aid for the blind. Making com- puters accessible to the general public is a task that includes the use of speech as a vehicle of interaction and, in this respect, a stand-alond converter, cap- able of performing a high-quality and high-intelligi- bility TSC in real time, would appear to be a good solution. Our department has developed two versions of a high-quality, unlimited vocabulary text-to-speech converter for the Spanish language. One of the ver- sions was simulated on a general-purpose minicom- puter (a PDP 11/60) [1], while a specialized board, developed for an English text-to-speech converter [2, 3], was used to implement the second. Specific hardware architecture has been designed for the task and is described in this article. 2. Text-to-Speech Conversion in Spanish Text-to-Speech Conversion is a two-step process. First, the set of synthesizer control parameters is obtained from the input orthographic text. Second, the actual synthesis takes place; this step is basically language independent. Calculation of the synthesizer parameters can in turn be subdivided into four tasks (F~. 1) or linguistic processes as they are called [2]: • Input Text Preprocessing." This normalizes the input text and comprises four basic tasks: (1) phonetic-stress location in every word; (2) number-to-text conversion with special proces- sing for cardinal numbers (decimals included), ordinals and dates; (3) abbreviation and acronym expansion; (4) punctuation-mark processing. • Letter-to-Sound Conversion." This is a very simple process in Spanish and it is based on well-deter- mined rules. The system employs 42 phonemes, 14 of which are diphthongs. Conversion for 14 letters is unambiguous and straightforward; cer- tain rules that take into account only the follow- ing and previous sounds are used for the remain- ing letters. Furthermore, there is a special table for those words that do not follow the above rules (mostly foreign words). • Prosodic Functions: To obtain highly-intelligible and natural-sounding speech, it is very important to assign the exact intonation and duration to each phoneme. A set of rules for determining both the correct duration and intonation for each sound has been obtained by an extensive analysis of human speech. An intonation algorithm has been found that identifies the fundamental fre- quency of each sound in a sentence; in this re- spect, factors such as the kind of sentence (enun- ciative or interrogative, basically), the stress and pause positions and certain initial and final phra- se reference levels, are considered. The duration

Transcript of A Text-to-Multivoice converter for Spanish: Real time operation analysis

Page 1: A Text-to-Multivoice converter for Spanish: Real time operation analysis

North-Hol land 1 Microprocessing and Microprogramming 22 (1988) 1-10

A Text-to- Multivoice Converter for Spanish: Real Time Operation Analysis

A. Santos, C. L6pez and E. Mufioz Dpto. Electr6nica y Sistemas Digitales, E. T.S.I. Telecomunica- ci6n, Ciudad Universitaria, 28040 Madrid, Spain

The hardware and software structures for a real time, unlimit- ed vocabulary, text-to-speech converter for Spanish are de- scribed. The structures are based on two processors that per- form the basic processes involved in the text-to-speech by rule conversion. One of the basic tasks is to obtain periodically the set of parameters that control the speech synthesizer. A general-purpose INTEL 8086 processor is used for this pro- cess: it is responsible for the linguistic processes and for the input text source interface. The second basic task is digital speech synthesis which is performed by a specialized digital signal processor (NEC 77P20) that implements a modified Klatt synthesizer. Programs developed for this processor are detailed in the text. An occupation factor for each of the two processors is also given: this factor reflects the abil ity of the processor to perform the proposed tasks in real time.

Keywords: Speech synthesis, Text-to-Speech Conversion, Real-time processing.

1. Introduction

Text-to-Speech Conversion (Tsc) can be of great in- terest when simple man-computer interface is sought, and as an aid for the blind. Making com- puters accessible to the general public is a task that includes the use of speech as a vehicle of interaction and, in this respect, a stand-alond converter, cap- able of performing a high-quality and high-intelligi- bility TSC in real time, would appear to be a good solution.

Our department has developed two versions of a high-quality, unlimited vocabulary text-to-speech converter for the Spanish language. One of the ver- sions was simulated on a general-purpose minicom- puter (a PDP 11/60) [1], while a specialized board, developed for an English text-to-speech converter [2, 3], was used to implement the second. Specific hardware architecture has been designed for the task and is described in this article.

2. Text-to-Speech Conversion in Spanish

Text-to-Speech Conversion is a two-step process. First, the set of synthesizer control parameters is obtained from the input orthographic text. Second, the actual synthesis takes place; this step is basically language independent.

Calculation of the synthesizer parameters can in turn be subdivided into four tasks (F~. 1) or linguistic processes as they are called [2]:

• Input T e x t Preprocessing." This normalizes the input text and comprises four basic tasks: (1) phonetic-stress location in every word; (2) number-to-text conversion with special proces- sing for cardinal numbers (decimals included), ordinals and dates; (3) abbreviation and acronym expansion; (4) punctuation-mark processing.

• L e t t e r - t o - S o u n d Conversion." This is a very simple process in Spanish and it is based on well-deter- mined rules. The system employs 42 phonemes, 14 of which are diphthongs. Conversion for 14 letters is unambiguous and straightforward; cer- tain rules that take into account only the follow- ing and previous sounds are used for the remain- ing letters. Furthermore, there is a special table for those words that do not follow the above rules (mostly foreign words).

• Prosodic Funct ions: To obtain highly-intelligible and natural-sounding speech, it is very important to assign the exact intonation and duration to each phoneme. A set of rules for determining both the correct duration and intonation for each sound has been obtained by an extensive analysis of human speech. An intonation algorithm has been found that identifies the fundamental fre- quency of each sound in a sentence; in this re- spect, factors such as the kind of sentence (enun- ciative or interrogative, basically), the stress and pause positions and certain initial and final phra- se reference levels, are considered. The duration

Page 2: A Text-to-Multivoice converter for Spanish: Real time operation analysis

2 A. Santos eta/. / Text- to- Multivoice Con verter

Input Text Preprocesing

Stresses Numbers

Abrev ia t ions Punctuat ion Marks

I Normalized t ex t

• I Let te r - to -Sound Conversion

Rules Exception d i c t i o n a r y

Phoneme s t r ing

Prosodic Functions

In tona t ion a lgor i thm Durat ion ru les

I Phonemes wi th du ra t ion , p i t ch , stress

Sound-to-Parameter Conversion ]

Phoneme/t ransi t ion c h a r a c t e r i z a t i o n Smoothing and Parameter c a l c u l a t i o n

I Parameter-frame (every 10 ms.)

Synthesizer

Speech sample (every 100 ~s).

Fig. 1. Text-to-Speech conversion processes.

of each sound is also obtained by means of a set of rules derived from human speech analysis. These latter rules take into account several fac- tors, of which some are intrinsic for every sound while others depend on the context. These rules and the process followed to obtain them are de- scribed in [4]. Sound-to-Parameter Conversion." In this process, a frame of acoustic parameter is obtained every 10 ms. so that the synthesizer thus characterized can produce the corresponding sound sequence. The parameters include the first three formants for each sound, sound-source amplitudes and the nasal zero frequency. In order to synthesize intel- ligible voice it is most important that the transi- tion between phonemes at each formant be a smooth one and that the vocal-tract model used be capable of smoothing phoneme transitions by

aligning the frequency and amplitude and the end of one phoneme with the frequency and ampli- tude of the next. A set of equations [5] is used to achieve this complex smoothing and the three

formant frequencies for each phoneme are calcu- lated separately. A new set of 18 parameters is calculated every 10 ms; this time is sufficient for modeling most of the phoneme-to-phoneme tran-

sitions encountered in human speech.

The voice synthesis is obtained by a formant synthe- sizer which uses digital signal processing techniques to simulate the resonance and antiresonance phe- nomena in the vocal tract. The Klatt synthesizer [6] was used in the initial versions of the text-to-speech converter. It has five cascaded digital resonators for producing voiced sounds and six parallel resonators

for producing unvoiced fricative sounds. The sound sources comprise a pulse train (at the fundamental frequency) and fricative and aspiration noises (si-

mulated by a random-number generator). The syn- thesizer uses 18 parameters; these depend on the sound to be produced and, therefore, have to be renewed periodically. Frequencies as high as 5,000 Hz have to be synthesized in order to produce high- quality speech. For this reason, the system's signal processing sampling rate is 10,000 Hz, which is twice the highest frequency, so a sample every 100

#s has to be calculated. Modifications allowing for the production of dif-

ferent voices (male and female) have been made. Several characteristics of the original synthesizer can be modified to simulate the normal speaking voice. The main changes introduced are: ( ! ) the ave- rage fundamental frequency can be modified to ob- tain voices with different pitch; (2) the variation range can also be modified; (3) all formants can be scaled up or down to reflect head size; (4) the fourth and fifth formants can also be modified; (5) a cer- tain amount of noise may be introduced in the syn- thesizer to simulate some people's voices. All these changes are contained in a set of parameters that is sent to the synthesizer whenever a new voice is de- sired. To obtain female voices some more changes have to be introduced, as it is necessary to parame- trize each sound differently. These additional chan- ges are described in [4].

The following sections present the hardware ar-

Page 3: A Text-to-Multivoice converter for Spanish: Real time operation analysis

A. Santos et al. / Text- to- Mult ivoice Con verter 3

chitecture built to implement all these tasks. The

hardware accepts the income text through a stan-

dard serial line and produces speech through an analog output that can be connected to a loud- speaker. The architecture is based on two proces- sors (Fig. 2) since the tasks to be fulfilled can be di- vided into two different kinds. A general purpose processor, the INTEL 8086, has the task of convert-

ing the income text into control parameters for the synthesizer and of handling the interface with the

user (the text source). The other processor is a spe- cialized signal model, the NEC 77P20, whose task is to implement the synthesizer in real-time.

The functions performed by both processors will be described (with special emphasis on the 77P20). A performance evaluation carried out to determine an occupation factor for both processors will also be described.

3. Hardware Description

An INTEL 8086 processor is used, as it is sufficiently fast to support real-time operation; it is also able to handle a relatively great amount of memory (up to I Mbyte) and to deal with several kinds of interrup- tions. It carries out most of the tasks involved (such as the four linguistic processes described above) and also controls the user interface. Since synthesizer simulation involves computing a large number of

arithmetical operations, another processor, the NWC

77P20, is used. This processor is specialized in com-

puting additions and products at high speed. Under normal operating conditions, therefore, the 8086 re- ceives the input text from the user, calculates the corresponding synthesizer coefficients and sends

them to the 77P20 where they are used to produce samples of the speech waveform, which in turn are

sent to an audio output stage.

The 8086 needs 8 Kbytes of RAM memory, 90 Kbytes of EPROM memory (for data tables and pro-

grams), an I/o serial interface to communicate with the text source, and some kind of interrupting mechanism to facilitate real-time operation.

A standard serial asynchronous connection (such as the RS-232-C), capable of communicating with a large number of terminals and equipments, has been chosen. The user sends the income text to be

converted through this line. The interface between this line and the processor is implemented with an USAR~ 8251A (Universal Synchronous/Asynchro- nous Receiver/Transmitter).

The 8086 must respond to three types of interrup- tions if the operation is to be continuous; it has to transmit coefficients to the 77P20 when required and deal with two types of interruptions from the USART: one when it is ready to send a character to the user and another when it has just received a

character. A pxc 8259A (Programmable Interrupt Controller) is used to handle these interruptions.

General I- I Interruption ProcessoE Controller

gf Memor£es z/o

Uni t L TEXT

F

S£gnal ProcesSoE J Audio J SPEECH

Fig. 2. Hardware diagram.

Page 4: A Text-to-Multivoice converter for Spanish: Real time operation analysis

4 A. Santos etal . /Text- to-Mult ivoice Converter

All these requirements are solved by using the iSBC 86/05 board, available from mTEL. This inclu- des the 8086 processor, the clock generation circuit (at 8 MHz), a bus controller, 8 Kbytes RAM, sockets for up to 128 Kbytes EPROM (using 27128 chips), a USART 8251A, a Pie 8259A and a PPI 8255A (Pro- grammable Peripheral Interface).

An improvement to be introduced in the near fu- ture is the replacement of the 8086 by the newer 80186 processor, which will do away with some of the chips presently used for this application; this processor includes an interruption controller, a clock generator, timers, etc.

The 77P20 processor is a digital signal processor, capable of computing fixed point numbers at high speed. It has a parallel structure that allows a number of arithmetic and memory-access opera- tions and fetching and decoding of the next instruc- tion to be carried out simultaneously [7]. The fol- lowing are some of its characteristics: (a) three independent memories (Harvard architecture): an instruction-EPROM, a Data-EPROM and a Data-RAM; (b) fast (250 ns) 16 x 16-bit multiplier; (c) 16-bit ALU with dual accumulators; (d) multi-operation in- structions with fast execution time (250 ns); and (e) several I/O capabilities (serial, parallel and DMA).

This processor is in charge of the synthesizer sim-

ulation and it performs adequately the large number of arithmetical operations needed for pro- ducing samples every 100/~s.

It will be presented in the following section. A powerful communication link between the two

processors must be provided (basically for transmit- ting synthesizer coefficients). The 77P20 also needs to communicate with the analog stage and send samples of the synthesized signal. Fig. 3 shows how this processor relates to its environment.

Communication between 8086 and 77P20 has the following possibilities:

a. The 8086 transfers the synthesizer variable coef- ficients every 10 ms. In order to carry out this data transfer as fast as possible, a simple interfa- ce is connected between the 8086 data bus (16 bit width) and the 7720 parallel port (8 bit width): there is a 16 bit buffer and a control logic that automatically sends data received from the 8086 to the 77P20 port in the form of two bytes.

b. The 8086 knows when the 77P20 is ready to ac- cept new data through its parallel port: the 8086 receives this information from the 77P20 status register.

c. The 77P20 asks for a new coefficient set every

I !

BI

u

'I

PIC PPI

PO RST

SO

77P2G

jSer£al-tO a "IParallel J D/A

Fig. 3. Output stage and 77P20 interfaces.

Page 5: A Text-to-Multivoice converter for Spanish: Real time operation analysis

A. Santos et al. / Text - to- Mu l t i vo ice Converter 5

time updating is required; this is done by inter-

rupting the 8086 through a control line (the

77P20 P0).

4. Software Description

4.1 8086 Programming

d. Every time a voice change is to be performed, the 8086 informs the 77P20 that it is going to send new speaker definition coefficients. This infor-

mation is channeled through the RESET input of the 77P20; the reset routine then receives the cor-

responding coefficients.

The synthesized sample output is sent through the 77P20 serial output; this is an automatic process in which the 8086 does not intervene. A control logic interrupts the 77P20 every 100/~s and the processor then transmits the corresponding sample. The serial

output is connected to the output stage which inclu- des a serial-to-parallel converter, a 12 bit D/A con-

verter, an analog antialiasing filter and an audio amplifier. The filter has been designed with the aid of a computer and has a three-pole and two-zero configuration. It uses four operational amplifiers and six 1% precision capacitors. The signal ob- tained from the audio amplifier can be carried di- rectly to a loudspeaker.

Programs for the 8086 have two main functions: one is to maintain a permanent dialog with the in-

put text source; the other is to send to the 77P20 the

coefficients of sounds to be synthesized. A software structure [8] has been selected which enables these two functions to operate continuously in real time. There are four main routines, one for each of the

linguistic processes described above, namely text

preprocessing, letter-to-sound conversion, prosodic functions and sound-to-parameter conversion.

These four routines (Fig. 4) share the same data structure: a chain composed of several elements or links. Each link contains just one letter or phoneme

from the text and includes several fields with infor- mation related to the said letter or phoneme. It also contains data about the following and previous ele- ment positions. Several pointers allow each linguis-

tic process to have access to a zone of this chain. When they finish processing an element, pointers are modified so that the link is available for the next process. The chain, therefore, acts as input and out-

I

Text. I Preproc~s. l

I

TEXT J Inpur. Buffer

ControZ P~ut£ne

I Letter to I Sound ConY. l

\

\ /

i

I

Functlons Paras. Cony

/ / I

I Chain

/ /

-- -- .-- L

Data

Control

O~eratlon

ICoeff£c. Transfer. ! /

o o e o e Q o o o e o e e e o l o . e ° e

! , .=o

SPEECH ----Ji

: -2 eeeo e * o ' e ° e e o e . e e e o o e

/ / l

Speaker Def£n£t~on I

Fig. 4. Software diagram.

Page 6: A Text-to-Multivoice converter for Spanish: Real time operation analysis

6 A. Santos et aL / Text- to- Mu/tivoice Con verter

put for every process and so data transfers between

processes are avoided.

Two additional buffers have been created to sepa-

rate the text reception from the serial line and send

coefficients to the synthesizer from the remaining

processes. One of the buffers is used to store the text

until it is processed by the linguistic routines. The

other buffer stores the newly calculated coefficients.

The 77P20 interrupt service routine receives a set of

coefficients from the latter buffer and sends it to the

synthesizer. The four linguistic routines are con-

trolled by a further routine that distributes proces-

sor time efficiently. This control routine uses several

Boolean variables to find out if any of the routines

has text to process; it also ensures that there is al-

ways a coefficient set ready to be sent to the 77P20,

4.1.1 Performance Evaluation and Occupation Fac- tor of the 8086 This general-purpose processor (with its family of

peripherals) is clearly able to handle the control and

communication tasks needed in a real-time text-to-

speech system. It is also powerful enough to manip-

ulate the data structure defined. But when first de-

signed, it was not sure whether the system was being over- or under-estimated and the question of

whether it was fast enough to ensure real-time oper-

ation when processing speech at different speaking

rates needed to be answered. A performance evalu-

ation was therefore carried out to find out whether

the processor was in a position to achieve total real- time operation: the evaluation performance was

also used to calculate the time the processor spent

on its main functions as well as the time it remained idle. This present study indicates what the main

processing load is and whether future algorithm im-

provements are compatible with a possible exten-

sion of the processing time.

The need to have a frame of coefficients available to the synthesizer on demand (every 10 ms) determi-

nes real-time operation requirements. In this re- spect, a flag (a RAM-memory position) acts as a control device and is activated whenever the 77P20 interrupt service routine tries to send a frame of co-

efficients and finds that there is none to be sent; in this case, the processor repeats the last frame sent. Another flag tells when this situation happens two or more consecutive times. After several minutes of

continuous speech, the two flags are examined: a small program, activated by a type 3 interruption,

sends their contents through the serial line. The re-

sults show that, up to a speaking rate of 200 words

per minute (w.p.m.), a frame of coefficients was

available to the synthesizer in every case. In some

cases (less than 0.1%), a single frame was missed

when the speaking rate went over the 250 w.p.m, li-

mit (a rather fast speaking speed) but no two con- secutive frames were ever missed. In any case, no

audible effect was detected and the overall result is

negligible: the length of the sound is increased by 10

ms. It can be said, therefore, that the processor fully

achieves real-time operation for speaking rates up

to 250 w.p.m.

The other evaluation carried out deals with time

measurements. Its aim is to calculate the time spent on the four linguistic processes, on the text source,

on coefficient transmission to the 77P20 and on control tasks. The time the processor is idle has also

been calculated, that is, the time spent waiting for a

previously-calculated frame of coefficients to be

sent to the synthesizer to make room for new coeffi-

cients or the time spent waiting for more text to be

processed. Three different speaking rates were used

in this study since the speaking rate is a most impor-

tant factor in this respect. A simple and fast method was used for this study

so that normal operation conditions might be

achieved as far as possible. A counter and a flag (RAM-memory positions) were defined for each of

the seven tasks to be analyzed. The flag is set when

the processor initiates the corresponding task and

reset when the processor changes to a new task. Non-maskable interruptions are periodically pro-

duced in the system during normal operation at a

sufficiently low frequency (200 Hz) and so the ove-

rhead is clearly negligible. The interrupt service rou-

tine identifies the process being executed at the mo- ment of interruption by inspecting the flags; it then increases the corresponding counter. The state of the counters is examined after several minutes of continuous speech production; the amount of time spent by the processor on any task is clearly indicat-

ed by the number of times the said task occupies the processor.

As indicated above, three different speaking rates were used: a slow speaking rate (50 w.p.m.), a very

Page 7: A Text-to-Multivoice converter for Spanish: Real time operation analysis

Coeff£cienc Coe£f£c£enc "Transm£saiou Idle Transmission Text /

~repro

Coef f i c£enc Transm£ss£on

A. Santos et al, / Text- to- Multivoice Con verter 7

50 vords per mlnuce 160 vords per minute 250 words per minute

Fig. 5. 8086 occupation factor.

fast speaking rate (250 w.p.m.) and a normal speak- ing rate (160 w.p.m.). The results (expressed as per- centages of the time spent on each process) are shown in Fig. 5. The most time-consuming process, specially at high speaking rates, is the sound-to-

synthesizer parameter conversion which requires a large number of arithmetical operations. It should be pointed out that, at normal speaking rates (160 w.p.m.), the processor is idle 29% of the time. The

system is able, therefore, to operate in real-time at this speaking rate. More complex algorithms could

be used to improve some aspects of the text-to- speech conversion: intonation could be obtained, for example, by introducing more complex rules or different smoothing algorithms could possibly be used. However, when the speaking rate is very high (250 w.p.m.) idle time is only 1%, so any additional process would not be feasible unless the sound-to- parameter conversion operations were simplified.

4.2 7720 Programming

The main task of this processor is to implement the proposed speech synthesizer and produce an audio speech sample every 100 ~s. So it has basically three tasks: 1. it calculates synthesized signal samples 2. it transmits a sample through its serial output

every 100/is 3. it renews the synthesizer variable coefficients

every 10 ms. During normal operation (Fig. 6) up to a maxi-

mum of six samples are calculated and stored in the memory. This process is temporarily suspended in

two cases: 1. When a sample has to be carried to the serial

output (every 100/ts); an external clock periodi- cally interrupts the processor. The calculation

process is then resumed immediately. 2. When a new, previously-solicited coefficient set

is starting to be received; the processor executes the corresponding routine and suspends the cal- culation process during this time.

This operation mode has been chosen to ensure that sample production never ceases, not even dur- ing the coefficient transfer process. It would be easi- er to calculate one sample each time and wait for an interruption to occur before calculating the next, but the time for each sample calculation would be strictly limited to 100/ts; therefore, samples could not be calculated during coefficient reception. which is a process that lasts for over 200/Ls. On the other hand, this mode allows some single sample calculations to exceed the 100/ts limit, and so no sample is lost, but it needs a FIFO queue which as to be simulated in the RAM memory with the aid of a

Page 8: A Text-to-Multivoice converter for Spanish: Real time operation analysis

8 A. Santos etal. / Text-to-Multivoice Converter

RESET

No

Yes

Ask£ng for 8066 I n t .

Sample ' s ! Co=putat£on

I

0

1

Yes

I Sample ' s [

I Output Routine

I Coef£:Lc£ent' s I

I Reception

pointer. While they are being calculated, the sam-

ples are stored in the queue; the first sample is re- covered from the queue when it has to be sent to the serial output. There is also a way of avoiding a col- lision between the main routine and the interrupt service routine en route to the queue. This is achieved by means of a flag (a memory position) which is activated by the main program using the

queue; the interrupt routine (Fig. 7) does not access the queue while this flag is indicating that it is being occupied by the main routine.

The arithmetical operations use a fixed-point for-

mat: 3 integer bits and 12 decimal bits, plus the signed bit (in binary two's complement notation). The synthesizer simulation includes the calculation of 12 second-order resonators, which perform the following operation:

y(n) = a . x (n) + b . y(n-1) + c.y(n-2). In each resonator, these additions and products are calculated with 32-bit arithmetics (double precis- ion) to increase their accuracy. The remaining oper- ations, apart from those affecting the resonators, use 16 bits; no loss in speech quality has been de- tected.

Programs have been assembler-developed so that the possibilities of parallelism afforded by this pro- cessor can be used. As often as possible, a single in- struction has been used to execute different func- tions, such as ALU and multiplier operations, data transfers and memory pointer modifications. A mean of 2.2 functions per instruction has been ob- tained, which enables an adequate program-execu- tion time to be achieved. Another important consi- deration was the fact that the processor is not very flexible in its ability to address its RAM memory, sin- ce its pointer can only be modified to certain posi- tions in the proximity of the position being used. For this reason, therefore, it was very important to chose adequate data arrangement in memory. Con- sequently, a column order was chosen: for every resonator, all data are always in the same column. This ensures that, in most cases, the next data to be introduced are either in the current column or in the following or previous columns, thus allowing the data pointer to be modified in the same instruction used to process data so that no extra time is needed.

Fig. 6.77P20's main program.

Page 9: A Text-to-Multivoice converter for Spanish: Real time operation analysis

A. Santos et aL / Text- to- Mult ivoice Con verter 9

~n~rupe.£on .~

1

1 s t SamPle ] T=~ms=~.sslon

~O

Reorder Queue .[ and Po£nter

4 .̧ . 0 --b Flag

mined. The following performance times have been

calculated: - A sample calculation takes approximately 82 or

84/~s. Exceptionally, once in each fundamental frequency period, it takes 94.5/~s; the current pa- rameters then need to be renewed. These figures include synthesizer simulation (about 78 /~s), queue handling, new coefficient demands, etc.

- The interrupt service routine takes from 2.5/~s to 10.25/ts; this depends on the occupation state of the queue at the time. On the whole, the simulator showed that 56 sam-

ples can be obtained in about 5 ms. Since, only 50 samples have to be transmitted during this time, there will be 6 stored in the queue. Since then, the processor calculates new samples only to replace those already transmitted. When a new frame of co- efficients is needed (after 10 ms) the queue is always fully occupied by 6 samples. The coefficient transfer, therefore, could take up to 600/zs without losing a sample. An oscilloscope has shown that this time is about 200 ~s, and so it results that the 77P20 is idle about 4% of the time. 123 bytes (of the 128 available) of RAM memory are used. It is seen, therefore, that almost all the processor capacity is used up; any synthesizer improvements in the future have to be based either on the same processing time and memory requirements or on the use of a more powerful processor.

Fig. 7.77P20"s interruptions handling routine. 5. C o n c l u s i o n s and E v a l u a t i o n

4.2.1 The Processor 77P20: Performance Evalua-

tion and Occupation Time As in the case of the 8086, the performance of the 77P20 has been evaluated and an occupation factor determined. Here, however, the evaluation was car- ried out by means of a simulator program, since the interrupt handling and memory access possibilities of the 77P20 are much more limited. Evaluation of the 77P20 is greatly simplified by the fact that every instruction takes 250 ns to be executed, and so if we count the number of instructions any one routine takes we can calculate the exact time taken. The si- mulator enables the number of instructions re- quired for each part of the program to be deter-

A system that implements a continuous text-to- speech conversion in real time has been developed. System efficiency, defined as the capacity to work in real time, has been determined. The results show that the system operation is correct even at high speech production speeds. However, the 77P20 pro- cessor, which simulates the proposed sythesizer, is used at maximum capacity, so any algorithm im- provements requiring higher calculation power will need another processor.

Speech quality has also been evaluated for intelli- gibility and comprehension. Tests (similar to those performed by Pisoni of English converters) have been carried out [9]. The results show that the sys- tem has a comprehension factor very similar to that

Page 10: A Text-to-Multivoice converter for Spanish: Real time operation analysis

10 A. Santos eta/. / Text-to- Multivoice Converter

of the human voice although the quality of intelligi- bility is somewhat inferior [4].

Acknowledgements

This study has been partly supported by Comisidn Asesora, [NSE~SO (M. Sanidad y Seguridad Social) and by M. Educacidn y Ciencia who offered a scholarship.

References

[1] Santos, J.M., and Nombela, J.R.: Text-to-Speech Con- version in Spanish, A Complete Rule-Based Synthesis System. Proc. IEEE-ICASSP (Paris, 1982) 1593-1596.

[2] Olabe, J.C. et al.: Real Time Text-to-Speech Conversion System for Spanish. Proc. IEEE-ICASSP (San Diego, 1984) 2.10.1-3.

[3] Speech Plus, Inc.: PROSE 2000 Text-to-Speech Con- verter User's Manual. Mountain View, Calif. (1982).

[4] Santos, A.: Sistema para la Sintesis Mult iVoz en Tiempo Real a partir de un Texto Escrito. Ph.D. Thesis, ETSIT- UPM (Dec. 1984).

[5] Olabe, J.C.; Sistema para la Conversion de un Texto Or- togrMico a Hablado en Tiempo Real. Ph.D. Thesis, ET- SIT-UPM (Sept. 1983).

[6] Klatt, D.H.: Software for a Cascade/Parallel Formant Synthesizer, J. Acoust. Soc. Am., Vol. 67, No. 3 (Mar. 1980) 971 995.

[7] NEC: pPD 7720 Digital Signal Processor. NEC Micro- computers Inc. (1982).

[8] Telesensory Speech Systems, Inc.: Skeleton Structure of a Text-to-Speech Converter (1981 ).

[9] Pisoni, D.B.: Some Measures of Intell igibi l i ty and Com- prehension. J. Allen (ed.), Conversion of Unrestricted English Text-to-Speech (1980).

AndrOs San tos was born in Valencia (Spain) on Feb. 5, 1958. He received the Eng. degree specializing in telecommu- nication engineering in 1981 and the Ph.D. degree in electri- cal engineering in 1985 from the Polytechnic University of Madrid. Since 1985 he is a Professor in the Electrical Engi- neering Dept. at the same University. He has been engaged in the study of digital systems and microprocessors for real-time applications and his doctoral research was on the develop- ment of a real-time system for text-to-speech conversion in Spanish. His research interests are focused now on VLSI sys- tems for digital processing and CAD tools.

Car los LOpez Ba r r i o received the engineering degree in 1974 and the Ph.D. degree in 1977 from the Polytechnic Uni- versity of Madrid. He is a full Professor and Head of the Digital Systems Dept. at the Polytechnic University of Madrid. He conducts a research group in the field of digital architectures and VLSI design. His current research interests include new methodologies and CAD tools to support the design of Inte- grated Circuits.

Elias M u f i o z M e r i n o received the engineering degree in 1966 from Polytechnic University of Madrid. He is also Mas- ter of Sciences (1968) and Electrical Engineer (1969) from Univ. of Stanford and holds a Ph.D. degree from Polytechnic University of Madrid (1970). Since 1973 he leads the Elec- tronics Dept. at this University. He is author of several interna- tional publications in the fields of speech recognition and synthesis and microelectronics.