ARABIC TEXT-TO-SPEECH SYNTHESIZER

AHMAD QASIM MOHAMMAD AL JAYOUSI

FACULTY OF COMPUTER SCIENCE AND

INFORMATION TECHNOLOGY

UNIVERSITY OF MALAYA

KUALA LUMPUR

DECEMBER 2007

ARABIC TEXT-TO-SPEECH SYNTHESIZER

AHMAD QASIM MOHAMMAD AL JAYOUSI

DISSERTATION SUBMITTED IN FULFILMENT OF THE REQUIREMENTS

FOR THE DEGREE OF MASTER OF SOFTWARE ENGINEERING

FACULTY OF COMPUTER SCIENCE AND

INFORMATION TECHNOLOGY

UNIVERSITY OF MALAYA

KUALA LUMPUR

DECEMBER 2007

ABSTRACT

Text-To-Speech technology has steadily grown over the years to support many languages

by utilizing a number of useful methods and techniques. Despite its overall steady

growth, Arabic TTS technology has gained little attention. Apart from a few commercial

products, an Arabic TTS Synthesizer System has often failed to exceed laboratory

boundaries. This dissertation examines the issues, requirements and methodologies

involved in developing a useful Arabic TTS Synthesizer System. Additionally, this

dissertation describes in detail the construction of an Arabic TTS Synthesizer System

using the Concatenation Synthesis method, which relies on prerecorded speech units that

are utilized by the system to generate concatenated speech. Two types of speech units

were used independently: The first consists of 375 diphones that cover Arabic sounds and

the other has 178 allophones which cover Arabic sounds. The system consists of several

programs and algorithms that are integrated in the form of modules to be easily modified

and developed individually in the future. Evaluating a TTS Synthesizer System is also

important but difficult, because speech quality is highly multidimensional. This has led

to a large number of different tests and methods for evaluating different features of

speech. Our survey results demonstrate that the majority of

the words and sentences are recognizable by the developed Arabic TTS Synthesizer

System. In fact, 80 to 85 % of the words and 70 % of the sentences were correctly and

completely recognized by the listeners in the survey. Lastly, the synthesized Arabic speech is

considered intelligible; however, the system still requires further research and

development in many directions, including the handling of various kinds of Arabic text such as dates,

abbreviations, foreign names and numbers.

ACKNOWLEDGMENT

The effort involved in writing a dissertation is tremendous and anyone who has ever done

this can attest to it. I am fortunate to acknowledge a number of people who contributed

along the way in various ways by offering suggestions, corrections, guidance, ideas,

comments, and advice. The work on this dissertation has been an inspiring, often

exciting, sometimes challenging, but always interesting experience.

First of all, I thank Almighty Allah (SWT) for giving me the

guidance, knowledge and patience that enabled me to successfully complete this

dissertation. I am very grateful to you my Lord.

I would like to express my sincere appreciation and gratitude to my

supervisor, Madam Zarinah Mohd Kasirun, for her invaluable guidance, support,

encouragement and help. Without her tireless efforts and patience, this

dissertation would not have been successfully completed.

I would like to thank my beloved family, especially my beloved grandparents, my mother,

my brothers, my fiancé, and all my dearly beloved uncles, their children and wives, for their

kindness, encouragement, and moral and financial support. Without them I would not be what

I am now; a million thanks to all of them.

I would also like to thank all my dear friends for the ideas and support they gave

me in completing this work.

TABLE OF CONTENTS

DECLARATION
ABSTRACT
DEDICATION
ACKNOWLEDGMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS AND ABBREVIATIONS

CHAPTER 1: INTRODUCTION
1.1 OVERVIEW
1.2 PROBLEM JUSTIFICATION
1.3 SCOPE OF THE PROBLEM
1.4 SYSTEM AIMS AND OBJECTIVES
    1.4.1 General Objective
    1.4.2 Specific Objective
1.5 POTENTIAL BENEFITS AND MOTIVATION OF ARABIC TTS
1.6 PROJECT METHODOLOGY
1.7 ORGANIZATION OF THE DISSERTATION

CHAPTER 2: LITERATURE REVIEW
2.1 OVERVIEW
2.2 WHAT IS SPEECH?
2.3 SPEECH PRODUCTION MECHANISM
2.4 WHAT IS TEXT-TO-SPEECH?
2.5 HISTORY OF TEXT-TO-SPEECH
2.6 TTS METHODS, TECHNIQUES AND ALGORITHMS
2.7 THE IMPORTANCE OF TEXT-TO-SPEECH
2.8 APPLICATIONS OF TEXT-TO-SPEECH SYNTHESIS
    2.8.1 Example Applications
    2.8.2 Other Applications and Future Directions
2.9 THE CHALLENGES BEHIND THE TEXT-TO-SPEECH
    2.9.1 The Diacritization Problem
    2.9.2 Dialects
    2.9.3 Differences in gender
2.10 EXISTING PRODUCTS
    2.10.1 MBROLA – PROJECT
    2.10.2 ACAPELA – GROUP
    2.10.3 ARABTALK
    2.10.4 Sakhr TTS
    2.10.5 Élan TTS
2.11 SUMMARY

CHAPTER 3: CONCATENATIVE SYNTHESIS
3.1 OVERVIEW
3.2 CONCATENATIVE SYNTHESIS
    3.2.1 Phonemes
    3.2.2 Diphones
    3.2.3 Demi-syllables
    3.2.4 Speech Signal Representation for Concatenative Synthesis
3.3 TYPES OF CONCATENATIVE SYNTHESIS
    3.3.1 Concatenation of Stored Allophone
    3.3.2 Diphone Concatenation Synthesis
    3.3.3 Concatenation of Stored Syllables and Demi-syllable
    3.3.4 Concatenation of Stored Waveform
    3.3.5 Concatenation of Stored Words
3.4 SUMMARY

CHAPTER 4: ARABIC LANGUAGE
4.1 OVERVIEW
    4.1.1 English and Arabic
4.2 DEFINITION OF AN ARABIC WORD
4.3 ARABIC IS A DIACRITIZED LANGUAGE
4.4 INTRODUCTION TO VOWELS
    4.4.1 Short Vowels
        4.4.1.1 The First Short Vowel
        4.4.1.2 The Second Short Vowel
        4.4.1.3 The Third Short Vowel
        4.4.1.4 The First Short Vowel Doubled
        4.4.1.5 The Second Short Vowel Doubled
        4.4.1.6 The Third Short Vowel Doubled
    4.4.2 Long Vowels
        4.4.2.1 The First Long Vowel
        4.4.2.2 The Second Long Vowel
        4.4.2.3 The Third Long Vowel
4.5 SYLLABLES OF ARABIC LANGUAGE
    4.5.1 Syllable Structure
    4.5.2 Syllable Patterns
        4.5.2.1 Short and Long Syllables
        4.5.2.2 Closed and Open Syllables
        4.5.2.3 Ending a Syllable with a Consonant
4.6 SUMMARY

CHAPTER 5: SYSTEM ANALYSIS AND DESIGN
5.1 OVERVIEW
5.2 SOFTWARE AND HARDWARE REQUIREMENTS
    5.2.1 Microsoft Visual Basic 6.0
        5.2.1.1 Visual Basic and Arabic Supports
        5.2.1.2 Visual Basic Bi-directional Features
    5.2.2 Microsoft Word
    5.2.3 Sound Forge 8.0
5.3 ARABIC TEXT-TO-SPEECH ARCHITECTURE
    5.3.1 Synthesizing Text Steps
    5.3.2 Recording of the sounds
    5.3.3 Storing Sound Files Using Wave File Format
5.4 THE DESIGNING OF DATABASE FOR SOUND FILES
    5.4.1 Representing Relational Database Using Microsoft Access
5.5 THE DESIGN OF USER INTERFACE
5.6 SUMMARY

CHAPTER 6: ARABIC TTS SYSTEM IMPLEMENTATION
6.1 OVERVIEW
6.2 CREATING THE ARABIC TTS SYSTEM
    6.2.1 Understanding the form and its procedures
    6.2.2 Creating the Command Button
    6.2.3 Creating the Text Box
    6.2.4 Creating the Button associated with the TextBox
    6.2.5 Creating the Drive to See All the Files Inside the Drive
    6.2.6 Creating the Multimedia Controls and their Properties
    6.2.7 Combining the Multimedia Controls with the Command Button
    6.2.8 The User Interface of the TTS Synthesizer System
6.3 SUMMARY

CHAPTER 7: TESTING AND EVALUATING OF ARABIC TTS SYSTEM
7.1 OVERVIEW
7.2 TESTING AND EVALUATING TTS SYSTEMS
7.3 TESTING THE ARABIC VOICE
    7.3.1 Test group
    7.3.2 Method
7.4 TEST AND EVALUATION RESULTS
    7.4.1 Naturalness
    7.4.2 Speed
    7.4.3 Sound quality
    7.4.4 Pronunciation
    7.4.5 Clearness
    7.4.6 Stress/Intonation
    7.4.7 Error
7.5 SUMMARY

CHAPTER 8: CONCLUSION AND FUTURE WORK
8.1 CONCLUSION
8.2 SUGGESTIONS AND FUTURE WORK

REFERENCES AND RESOURCES

APPENDICES
APPENDIX A: Questionnaire
APPENDIX B: Glossary

LIST OF FIGURES

Figure 2.1 The human vocal organs.
Figure 2.2 Kratzenstein’s resonators.
Figure 2.3 Wheatstone’s reconstruction of von Kempelen’s speaking machine
Figure 2.4 The Voder electronic synthesizer.
Figure 2.5 Some milestones in speech synthesis.
Figure 2.6 Linear predictive synthesizer.
Figure 2.7 Block diagram of articulatory speech synthesizer.
Figure 2.8 Viterbi Alignment.
Figure 3.1 Block diagram of a concatenative text-to-speech system.
Figure 3.2 Wave sound for Allophone Concatenation Synthesis.
Figure 3.3 Wave sound for Diphone Concatenation Synthesis.
Figure 4.1 The classification of the Arabic words.
Figure 5.1 Use-Case Diagram for TTS Synthesizer System
Figure 5.2 IPO – Schematic Architecture Design of Arabic TTS
Figure 5.3 Sequence Diagram for Text Normalization
Figure 5.4 Sequence Diagram for Text Segmentation
Figure 5.5 Sequence Diagram for Text Concatenation
Figure 5.6 The procedure taken when user requests for certain data
Figure 5.7 User interface of Arabic and English TTS Synthesizer System.
Figure 6.1 The Basic Form and its properties
Figure 6.2 The Form with the command button “Speak”
Figure 6.3 The Form with the TextBox written inside “Text1”
Figure 6.4 The Form with the TextBox and the Command Button.
Figure 6.5 The Drive C, its directory and all the files inside the drive.
Figure 6.6 The form with the Multimedia Control Commands.
Figure 6.7 The form with the Multimedia Controls and the Command Buttons.
Figure 6.8 User interface of Arabic and English TTS Synthesizer System
Figure 6.9 The word to be pronounced by the system
Figure 7.1 Naturalness of the voice
Figure 7.2 The speed of the speech
Figure 7.3 The sound quality of the voice.
Figure 7.4 The pronunciation’s effect on understanding.
Figure 7.5 The concentration needed to hear the pronunciation
Figure 7.6 The annoying level of the pronunciation
Figure 7.7 Understanding the voice.
Figure 7.8 The level of difficulty in understanding the voice.
Figure 7.9 The intonation of the system.
Figure 7.10 The stress of the system
Figure 7.11 Pronunciation mistakes.

LIST OF TABLES

Table 2.1 Comparison between types of speech synthesis
Table 2.2 Comparison between types of speech synthesis
Table 3.1 Concatenative Synthesis: Pros and Cons
Table 3.2 Advantages and Disadvantages of Units of Concatenative Synthesis
Table 4.1 Arabic dotted and undotted alphabet characters
Table 4.2 The Arabic diacritics and the significance of each one
Table 4.3 Arabic Syllable Patterns
Table 5.1 Description of Open File Use Case.
Table 5.2 Description of Input Text Use Case.
Table 5.3 Description of Clear Text Use Case.
Table 5.4 Description of Synthesize Text Use Case.
Table 5.5 Description of Update and Delete Use Case.
Table 5.6 Description of Record Sound Use Case.
Table 5.7 Description of Normalize Text Use Case.
Table 5.8 Description of Text Parser Use Case.
Table 5.9 Description of Concatenate Text Use Case.
Table 5.10 Description of Compare Sound Use Case.
Table 5.11 Example of Sound segments in wave files
Table 5.12 Arabic Word Table
Table 5.13 Syllable Table
Table 5.14 Abbreviation Table
Table 7.1 Possible evaluating attributes
Table 7.2 Naturalness/ Clearness
Table 7.3 Sound Speed
Table 7.4 Sound quality
Table 7.5 Pronunciation Question 1
Table 7.6 Pronunciation Question 2
Table 7.7 Pronunciation Question 3
Table 7.8 Clearness Question 1
Table 7.9 Clearness Question 2
Table 7.10 Stress/Intonation Question 1
Table 7.11 Stress/Intonation Question 2
Table 7.12 System Error

LIST OF SYMBOLS AND ABBREVIATIONS

ANN Artificial Neural Network

ASCII American Standard Code for Information Interchange

BIDI Bi-Directional

C Consonant letter

DOS Disk Operating System

DSP Digital Signal Processing

GTP Grapheme-To-Phoneme

HMM Hidden Markov Model

ICT Information and Communication Technology

IDE Integrated Development Environment

IPO Input-Process-Output Schematic

IVR Interactive Voice Response

LPC Linear Predictive Coding

MCI Media Control Interface

MFCC Mel Frequency Cepstral Coefficient

MSA Modern Standard Arabic

OCR Optical Character Recognition

PTI Panasonic Technologies Inc.

RDI Research and Development International

SAPI Speech Application Programming Interface

SDK Software Development Kit

SDLC System Development Life Cycle

TTP Text-To-Phonetic

TTS Text-To-Speech

V Vowel Letter

VCR Video Cassette Recorder

CHAPTER ONE

INTRODUCTION

1.1 OVERVIEW

Language is among the most important features that differentiate humans from

other living creatures, and speech is the key medium of language (Edwards, 1991). With

the advent of digital electronic technology, the goal of developing machines that imitate

human sounds has come closer than ever to being achieved. It has to be said that no one has yet

succeeded in synthesizing a voice that is identical to a human voice. Nevertheless, speech

synthesizers are now available which produce speech of a quality adequate for many

applications.

When we hear speech in our own language, we hear individual words and sounds. We

can write down speech using discrete letters with spaces between words. When we hear

speech in a language that we do not know, we cannot do this. All of the words and sounds

seem to run together in a continuous, undivided stream (Robert, 1999). Is speech discrete

or is it continuous? The answer to this question is both. On a purely physical level, speech

is continuous, except where we pause to take breath. On a psychological level, speech is

perceived as composed of discrete sounds and groups of discrete sounds, the former

corresponding more or less to the letters of the alphabet and the latter to the words of the

language.

Humans are capable of dividing a physically continuous speech signal into discrete units

because of their linguistic expertise. Mastering language is perhaps the most exceptional

single intellectual achievement of a person’s life (Robert, 1999). An immense amount

of knowledge is required to speak and understand a language with native fluency. This

dual nature of speech, discrete and continuous, is what makes computer processing of

speech so challenging (Pinker, 1993). Digital computers can represent a continuous signal

only incompletely, and they are devoid of linguistic knowledge except for the small

amount provided by us humans. This perspective is correct in the sense that speech and

language are unique to the human species. Moreover, speech is the primary modality for

communication between humans, so reliable speech synthesis by machines would be

very useful. Still more useful would be speech understanding, that is,

the identification of the meaning of an utterance (Alan Dix et al., 2003).

Speech provides our first contact with the raw, unwashed world of real sensor data. These

data are noisy, quite literally: there can be background noise as well as artifacts

introduced by the digitization process (Alan Dix et al., 2003); there is variation in the

way that words are pronounced, even by the same speaker; different words can sound the

same; and so on.

To many of us, the term speech synthesis evokes memories of mechanical, tedious or

repetitive voices. A Text-To-Speech (TTS) Synthesizer System is simply defined as a system

that transforms written text into speech, such as a reading or dictating machine; it belongs

to the part of speech technology that is concerned with automatically generating speech from a computer.

A TTS Synthesizer System is a computer-based system that has the ability to read any

text aloud, whether it is directly introduced into the computer by an operator or scanned

and submitted to an Optical Character Recognition (OCR) system. Speech synthesis is the process

that transforms a string of phonetic/syllabic and prosodic symbols into a

synthetic signal (i.e. the automatic production of speech through a Grapheme-To-Phoneme

transcription of the sentences to utter). A grapheme is a written symbol, such as a letter,

that represents a sound in a word, while a phoneme is the smallest unit of speech that

differentiates one word from another (Edward, 2003).

TTS technology is becoming indispensable in businesses that need to provide

their customers with the latest essential information in real time. These

businesses usually use Interactive Voice Response (IVR) systems and call centers to

communicate this information to their customers and prospects. Converting

data stored in Web sites, databases and files into human voice using traditional,

expensive and time-consuming studio recordings is becoming a hard and long

process, since the information is usually dynamic. In some cases, it would be impossible

to keep up with these changes using human recordings.

Arabic is a complex language and differs from languages such as English, French, or

Spanish. Those languages, written in Latin script, write their vowels as letters, while Arabic

marks its short vowels with special characters called “diacritics”. These diacritics give Arabic words the

correct meaning within a sentence. Two Arabic words with different

meanings can be written the same way, and only the diacritics help the reader to distinguish

them. For example, the same undiacritized word is pronounced differently in the sentence

meaning “the student submitted the homework” and in the sentence

meaning “I greeted my friend”.

1.2 PROBLEM JUSTIFICATION

Information and communication technology is rapidly evolving as an effective tool for

making information widespread and available online to several communities. The

industrial society is turning into an information society. The increased use of

information technology is enabling people across the world to participate in the

knowledge network; however, people in some developing countries are being deprived of

the benefits of ICT and computer systems. One of the main reasons for this

is the lack of suitable human-computer interfaces for disabled users and of

software designed and developed to meet their needs. Designing and developing a computer

interface for a person who cannot see what the computer displays is one of the most challenging

tasks for many software developers.

Even though many TTS systems exist for different languages, no two systems are

the same. The options and functions of the TTS process vary from one

system to another. Like many other systems, the Arabic TTS Synthesizer System was

developed with the objective of assisting people in gaining knowledge

from texts.

TTS Synthesizer Systems convert written input into spoken output by automatically

generating synthetic, or computer-generated, speech. Typed text is converted into speech

using various algorithms such as formant synthesis, concatenative synthesis or other

methods, which will be explained briefly in the next chapter. Speech synthesis is often

referred to as “Text-To-Speech” (TTS) conversion; the term Text-To-Speech will be

used from this point onwards instead of speech synthesis.

No system can be developed without encountering problems. Primarily, these

are problems faced by the developers in making

the program work efficiently and fulfill the users’ requirements. This part of the

dissertation discusses the major problems that arise during the development of the

system, from the design stage through to the stage where it is

implemented and tested.

The problem area in speech synthesis is very extensive. There are quite a few problems in

text pre-processing, such as numerals, abbreviations and acronyms. Moreover, the

pronunciation of written text is a major problem as well. For example, some

Arabic words cannot be translated exactly into other languages such

as Malay. In a court session, for instance, there are some terms

that may be used by lawyers or judges, and they have to use these

terms in Arabic with the correct pronunciation.

The problem of converting text into speech for a given language can naturally be broken

down into two sub-problems (Ronald, et al., 1997). The first sub-problem involves the

conversion of linguistic parameters (for example, phoneme sequences, and accentual

parameters) into parameters (for example, formant parameters, concatenative unit indices,

and pitch time / value pairs) that can drive the actual synthesis of speech. The second

sub-problem involves the computation of these linguistic parameter specifications from

input text.

The first task faced by any text-to-speech synthesizer system is the conversion of input

text into a linguistic representation, usually called Text-To-Phonetic (TTP) or

Grapheme-To-Phoneme (GTP) conversion. The difficulty of this conversion is highly language

dependent and includes many problems. For Arabic, English and most other languages

the conversion is very complicated. A very large set of rules and their

exceptions is needed to produce correct pronunciation for synthesized speech.

Conversion can be divided into three main phases: text preprocessing, correct

pronunciation, and the analysis of prosodic features for correct intonation, stress, and

duration. This dissertation touches on the first two phases; the last phase is

outside its scope.

Text preprocessing is usually a very complex task and includes several language-dependent

problems. Digits and numerals must be expanded into full words. For example,

in Arabic, the numeral 243 would be expanded into the words meaning “two hundred

and forty-three”, and the numeral 1750 into the words meaning

“one thousand seven hundred and fifty”. Fractions and dates are also problematic.

The figure 2/3 can be expanded into the words meaning “two-thirds” if it is a

fraction, or into the words meaning “second of March” if it is a date.
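
The numeral expansion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it produces English words matching the glosses given in the text rather than Arabic words, and the names ONES, TENS, number_to_words and expand_numbers are hypothetical and are not part of the system described in this dissertation.

```python
import re

# Hypothetical lookup tables for English number words (illustration only).
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer 0 <= n < 1,000,000 in English words."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = ONES[hundreds] + " hundred"
        return head + (" and " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    head = number_to_words(thousands) + " thousand"
    return head + (" " + number_to_words(rest) if rest else "")


def expand_numbers(text: str) -> str:
    """Replace every run of digits in the text with its spelled-out form."""
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)


print(number_to_words(243))
print(number_to_words(1750))
```

A real Arabic front end would additionally have to handle grammatical gender, case and number agreement between the numeral and the counted noun, which this sketch does not attempt.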

Abbreviations may be expanded into full words, pronounced as written, or pronounced

letter by letter (Macon, 1996). There are also some contextual problems. For example,

the Arabic abbreviation for “kg” can be read either as the word meaning “kilogram” or as the word

meaning “kilograms”, depending on the preceding number; as further examples, “د.”

(“Dr.”) is expanded into the word meaning “Doctor”, and the abbreviation for “etc.” into the words meaning “et cetera”. In some

cases, the adjacent information may be enough to find out the correct conversion, but to

avoid wrong conversions the best solution in some cases may be the use of letter-by-letter

conversion. Innumerable abbreviations for company names and other related things exist,

and they may be pronounced in many ways. For example, N.A.T.O. and RAM

are usually pronounced as written, whereas SAS

and ADP are pronounced letter by letter. Some abbreviations, such as MPEG, are

pronounced irregularly.
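
The three treatments of abbreviations described above (expansion, reading as a word, and spelling letter by letter) can be sketched as a table-driven pass over the tokens. This is a minimal sketch using English stand-ins for the examples in the text; the tables EXPANSIONS, UNITS and SPELLED and the function expand_abbreviations are hypothetical names introduced only for illustration, and the singular/plural rule is a simplification of the Arabic number agreement rules.

```python
# Hypothetical lookup tables (assumptions for illustration only).
EXPANSIONS = {"Dr.": "Doctor", "etc.": "et cetera"}   # always expanded
UNITS = {"kg": ("kilogram", "kilograms")}             # depends on the number
SPELLED = {"SAS", "ADP"}                              # read letter by letter
# Anything else (for example "NATO" or "RAM") is left to be read as written.


def expand_abbreviations(text: str) -> str:
    """Expand known abbreviations in whitespace-separated text."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if tok in EXPANSIONS:
            out.append(EXPANSIONS[tok])
        elif tok in UNITS:
            singular, plural = UNITS[tok]
            previous = tokens[i - 1] if i > 0 else ""
            # Context rule: the preceding number selects singular or plural.
            out.append(singular if previous == "1" else plural)
        elif tok in SPELLED:
            out.append(" ".join(tok))   # "SAS" -> "S A S"
        else:
            out.append(tok)
    return " ".join(out)


print(expand_abbreviations("Dr. Ali bought 5 kg"))
print(expand_abbreviations("SAS uses RAM"))
```

Keeping the rules in tables rather than in code means that new abbreviations can be added without touching the expansion logic.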

Special characters and symbols, such as ‘@’, ‘#’, ‘%’, ‘&’, ‘*’, ‘(’, ‘)’, ‘-’, ‘/’, ‘<’, ‘>’,

‘[’ and ‘]’, which are generally spoken as “at symbol”, “pound sign”, “percent”, “ampersand”,

“asterisk”, “left parenthesis”, “right parenthesis”, “dash”, “slash”, “left angle bracket”,

“right angle bracket”, “left square bracket”, and “right square bracket”, respectively,

also cause special kinds of problems. In some situations, the word order must be changed.

For example, $71.50 must be expanded into the words meaning

“seventy-one dollars and fifty cents” and $100 million into the words meaning “one

hundred million dollars”, not “one hundred dollars million”. Special characters and

character strings in, for example, web sites or e-mail messages must also be expanded with

special rules. For example, the character ‘@’ is usually converted to “at”, and e-mail messages

may contain character strings, such as header information, which may be omitted.

Some languages also include special non-ASCII characters, such as accent markers or

special symbols.
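
The word-order change for currency amounts mentioned above can be illustrated with a single regular-expression rewrite. This sketch only reorders the tokens and leaves the digits in place, on the assumption that a later stage spells them out; the pattern CURRENCY and the function reorder_currency are hypothetical names, and only the dollar sign and the scale words “million” and “billion” are covered.

```python
import re

# Matches "$71.50", "$100" or "$100 million" (assumed input shapes).
CURRENCY = re.compile(r"\$(\d+)(?:\.(\d{2}))?(?:\s+(million|billion))?")


def reorder_currency(text: str) -> str:
    """Move the currency word after the amount and its scale word."""
    def spoken(match: re.Match) -> str:
        dollars, cents, scale = match.group(1), match.group(2), match.group(3)
        parts = [dollars]
        if scale:
            parts.append(scale)
        parts.append("dollar" if dollars == "1" and not scale else "dollars")
        result = " ".join(parts)
        if cents:
            result += f" and {int(cents)} cents"
        return result

    return CURRENCY.sub(spoken, text)


print(reorder_currency("$71.50"))
print(reorder_currency("$100 million"))
```

Separating reordering from digit expansion keeps each rule small: the numeral expander never needs to know about currency symbols.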

The second task faced by any text-to-speech synthesizer system is to find the correct

pronunciation for different contexts in the text. Some words, called homographs or

heteronyms, cause perhaps the most difficult problems

in TTS systems. Homographs are spelled the same way but differ in meaning and

usually in pronunciation. For example, the same written word is pronounced

differently in the sentence meaning “the boy went to the school” and in the sentence

meaning “my friend bought gold”. With these kinds of words, some

semantic information is necessary to achieve correct pronunciation.
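
The homograph problem can be made concrete by removing the diacritics from two fully diacritized Arabic words and observing that they collapse to the same written form. In the sketch below, the example word (the forms meaning “he went” and “gold”) is an assumption chosen for illustration, and strip_diacritics is a hypothetical helper; it relies only on Python's standard unicodedata module, which reports a non-zero combining class for the Arabic short-vowel marks.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining marks (such as Arabic short-vowel signs) from a word."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))


# Illustrative example (an assumption, not taken from the garbled source text):
# the verb "dhahaba" (he went) and the noun "dhahabun" (gold).
VERB = "\u0630\u064E\u0647\u064E\u0628\u064E"   # ذَهَبَ
NOUN = "\u0630\u064E\u0647\u064E\u0628\u064C"   # ذَهَبٌ

print(VERB != NOUN)                                      # distinct when diacritized
print(strip_diacritics(VERB) == strip_diacritics(NOUN))  # identical without marks
```

Because ordinary Arabic text is usually written without these marks, a TTS front end has to recover them from context before it can choose the pronunciation.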

The pronunciation of a certain word may also differ due to contextual effects. This

is easy to see when comparing the phrases “the end” and “the beginning”. The pronunciation of

“the” depends on the initial phoneme of the following word. Compound words are also

problematic. For example, the same characters are pronounced differently in the Arabic word

meaning “square” and in the phrase meaning “from their lord”. Some sounds may also be either

voiced or unvoiced in different contexts. For example, in the word

meaning “path” the character “ص” is voiced as “س”, whereas the phoneme /س/ is unvoiced in the word

meaning “straight”.

1.3 SCOPE OF THE PROBLEM

The Arabic TTS Synthesizer System is a stand-alone, event-based system. The scope

of this dissertation covers a few dimensions, as

follows. The TTS Synthesizer System is able to convert text to audio format in two

languages only, Arabic and English.

Additionally, the Arabic TTS Synthesizer System is able to

pronounce the text input by the user either word by word or as a whole

sentence.

The Arabic TTS Synthesizer System is designed to be used by beginners, that is, those who have no

background in Arabic or do not speak it. It is intended mainly for non-Arabic-speaking

audiences; for teachers, who may use the system to teach their

students the correct pronunciation; for disabled or impaired users; and for students.

Besides, many people whose jobs require them to search for knowledge

in documents, for instance in the fields of linguistics and engineering, can use this

project. Students who study word occurrences in text, lexicographers who compile

dictionaries, and translators must all examine large quantities of text, often millions

of words, in order to find evidence of word usage.

1.4 SYSTEM AIMS AND OBJECTIVES

The Arabic TTS Synthesizer System gives a clear way of pronouncing Arabic words, as

users can listen to the speech as they enter the text. This system also benefits

students, especially in learning and improving their Arabic skills and vocabulary.

Interesting user-interface elements such as animations, graphics and

colorful text may attract users, including students, to learn the Arabic language.

The aim of this dissertation is to develop an Arabic TTS Synthesizer System that can be

used for reading input text written in Arabic; an understanding of what

speech is will guide us in developing the system. The system can be used for the following

purposes:

1.4.1 General Objective:

To design, implement and evaluate an Arabic TTS Synthesizer System for novice

non-Arabic speaking audiences.

1.4.2 Specific Objective:

1. To analyze the current state of the art in TTS Synthesizer Systems, particularly for

the Arabic language.

2. To design and implement the Arabic TTS Synthesizer System.

3. To develop a suitable mechanism for converting textual information into audio

form for TTS Synthesizer System application.

4. To assess the implemented TTS Synthesizer System in terms of its functionalities

and quality of speech.

1.5 POTENTIAL BENEFITS AND MOTIVATION OF ARABIC TTS

The motivation behind building and developing such a system is that the TTS interface

can improve the user’s experience on a desktop. It is more relaxing to listen to than to

read large portions of text. It helps blind users and slow readers, and it is less straining for

the eyes. The Arabic TTS Synthesizer System brings benefits especially in the educational

field. It assists research, data collection and text analysis. It is very useful for

students, educators, and language researchers. It provides them with an effective way of

knowing how to pronounce the words. The following are benefits of the Arabic TTS

Synthesizer System:

1. Easy to use – intuitive: Arabic TTS Synthesizer System interface will be designed

to be intuitive and easy to use.

2. Efficient: Arabic TTS Synthesizer System reduces costs and increases efficiency.

3. Standard technology: Arabic TTS Synthesizer System relies on the use of

standard technologies such as Microsoft Visual Basic 6.0, Sound Forge

8.0 and Microsoft Word.

4. Arabic TTS Synthesizer System will help users to learn Arabic language.

5. Provides accessibility: Over the Web and over the phone.

6. Offers adaptability and flexibility: Any time, anywhere.

7. Increases competitiveness: Increases depth of value-added services and develops

competitive advantage by creating differentiation.

1.6 PROJECT METHODOLOGY

The dissertation methodology intends to build on an already existing resource, the Arabic

language, together with additional computing tools. This dissertation is derived from

observation of similar works and initiatives from different organizations and

individuals in the field of word databases. A more specific focus is on modern

word-processing and database-programming technology. In order for the

Arabic TTS Synthesizer System to be developed, the following activities are to be carried

out throughout its development:

1. Literature Search and Collection of Relevant Information: This phase is a

process of gathering and compiling information about the project work from

journals, books, articles, the Internet and other

resources. Prior to the development of the software, a study was made to

understand what a text-to-speech system is and to survey available TTS systems.

After studying the existing TTS systems, the researcher

compares them to find the strengths and weaknesses of

each system, its features, and the technology used in

it. In addition, in this phase the researcher looks into the different

methods or algorithms used to produce a TTS system, such as formant synthesis, linear

prediction, and others. The method used in this system is the

Concatenative Synthesis Method, which uses prerecorded

samples of different lengths derived from natural speech.

2. System Analysis and Design: In this phase, various activities are carried

out, such as recording the sounds using a sound-editing program known as Cool

Edit, building the database of these sounds using Microsoft Access, and designing

the user interface using Visual Basic 6.0. This is an important and critical stage

because without the sounds we cannot develop the software; sounds are the basic

requirement of this software. This phase also covers the system design, such as

the software process model, which here is the System Development Life

Cycle (SDLC). In addition, the researcher touches on some other aspects,

such as requirements analysis, development of the system model, Text-To-Speech

analysis, database resources, methods, techniques, and algorithms. Finally, in

this stage the design of the interfaces of the Arabic TTS Synthesizer System

starts.

3. Development of Text-To-Speech Synthesizer System: The implementation phase
starts here. The software is developed using Visual Basic 6.0. This is the most
time-consuming stage, where the coding activities take place, and it is a critical and
difficult phase in developing the software.

4. System Testing and Evaluation: The last phase involves testing the software. This
is the most important stage, where the system is tested, information is collected and
analyzed, and the results are discussed. It integrates all of the previous phases. In this
phase, several users participated in using and testing the system.

1.7 ORGANIZATION OF THE DISSERTATION

This dissertation is divided into 8 chapters. The organization of the chapters is as follows:

Chapter 1: Introduction

The aim of this chapter is to introduce the field of speech synthesis. The chapter begins
with basic definitions of important terms in section 1.1. The Problem Description and
Background are presented in section 1.2. The Scope of the problem is discussed in
section 1.3. Project Objectives are presented in section 1.4. The motivation for this
dissertation is the application of TTS synthesizers to increase human-computer
interaction for people whose native language is Arabic.

Chapter 2: Literature Review

This chapter reviews the literature on existing TTS Synthesizer Systems that are similar
to the proposed system. It is divided into 8 sections: section 1, Overview; section 2,
What is Text-To-Speech; section 3, History of Text-To-Speech; section 4, the Importance
of Text-To-Speech; section 5, Speech Production Mechanism; section 6, Existing
Products; section 7, Applications for Text-To-Speech; and section 8, The Challenges
Behind The Text-To-Speech.

Chapter 3: Concatenative Synthesis

This chapter explains in detail the methodology of the system. This chapter is divided

into 3 sections, which are: section 1, Overview; section 2, explains the Concatenative

Synthesis; and section 3, explains the Types of Concatenative Synthesis.

Chapter 4: Arabic Language and TTS

This chapter explains in detail the Arabic language. This chapter is divided into 6

sections, which are: section 1, Overview; section 2, Definition of an Arabic Word;

section 3, Arabic is a Diacritized Language; section 4, Introduction to Vowels; section 5,

Joining Letters; and section 6, Syllables and Doubled Letters.

Chapter 5: System Analysis Design and Architecture

This chapter discusses the System Analysis and Design and the Architecture of the

proposed system Arabic Text-To-Speech Synthesizer. This chapter has been divided into

4 sections which are: section 1, Overview; section 2, explains the Arabic Text-To-Speech

Architecture; section 3, The Designing of Database for Sound File; and section 4, The

Design of User Interface and Its Procedures.

Chapter 6: The System Implementation

This chapter discusses the system implementation by providing screen shots of the
system, each accompanied by an explanation. Section 1 gives an overview of the
chapter and section 2 describes the process of Creating the TTS Application.

Chapter 7: Testing and Evaluation

This chapter shows the result of the testing and evaluation of the system.

Chapter 8: Conclusion

This chapter outlines the findings and the recommendation and future enhancement of the

current system.

CHAPTER TWO

LITERATURE REVIEW

2.1 OVERVIEW

Synthetic speech has been the vision of humanity for decades. To know how the current

systems function and how they have been developed to their current form, a

chronological review may perhaps be valuable. In this chapter, the history of synthesized

speech from the first mechanical efforts to systems that form the basis for today’s high-

quality synthesizers is discussed. Some separate milestones in synthesis-related methods

and techniques will also be discussed briefly.

Electronic and computer technology developments are causing an explosive growth in the

use of machines for processing information (Holmes, 1988). In most cases this

information originates from a human being, and is finally to be used by a human being.

Therefore, there is a need for effective ways to transfer information, in both directions

between people and machines. One suitable way for this communication in many cases is

in the form of speech, because speech is the communication method that is most widely

used between humans; it thus seems extremely natural and requires no special training.

There are many situations where speech is not the best method for communicating with

machines. For example, large amounts of text are much more easily received by reading

from a screen, and positional control of features in a computer-aided design system is

easier by direct manual manipulation. However, for interactive dialogue and for input of

large amounts of text or numeric data speech offers great advantages (Holmes, 1988). For

all applications where the machine is only accessible from a standard telephone

instrument there is no practicable alternative.

Considerable research is being carried out in the area of speech synthesis. In the early
days of synthesis, research efforts were devoted mainly to simulating the human speech
production mechanism using basic articulatory models based on electro-acoustic
theories. Even though this way of modeling remains one of the ultimate goals of
synthesis research, advances in computer science have widened the research field to
include text-to-speech processing. Not only human speech generation but also text
processing is modeled. This modeling is generally done by a set of rules derived, for
example, from phonetic theories and acoustic analysis.

2.2 WHAT IS SPEECH?

Speech is a complex signal from which we extract many types of information, from the

message content and meaning, to the nature of the transmission medium, to the identity

and condition of the speaker. How well we perform these tasks depends on the quality of

the speech signal, the efficacy of our hearing, the nature of the listening environment, and

our accumulated auditory experience. Because of its importance in human

communications, and since we do not have to consciously direct our attention (or even be

awake) to detect it, sound and in particular, speech, provides a channel for

communication which combines a degree of immediacy with the release of sight and

touch for other tasks. It could be argued that speech science is 2,500 years old (Robert,
1999).

Human speech is produced by a learned, coordinated process involving: drawing air

through the larynx, varying the tension on the vocal cords, positioning the articulators of

the vocal tract, and performing these actions under the conscious control of learned

language skills. The character and frequency content of the basic sound source are altered

by voluntary muscular control of the tension on the vocal cords. This generates voiced
sounds (e.g., “oo”) having a periodic structure or unvoiced sounds (e.g., “sh”) having an
aperiodic structure. This is the basis of sounds that are then amplified by the

resonant features of the vocal tract and shaped by articulation. Articulation is the process

of interrupting and shaping the sound signal from the larynx to form the basic sounds of

speech (phonemes) and combining them to form words. The articulators used are the jaw,

lips, soft palate, teeth, and tongue.

2.3 SPEECH PRODUCTION MECHANISM

The human speech production system is an interesting and complex mechanism. To
begin with, every person has unique vocal characteristics, so that one can often
recognize an individual by their voice alone. These characteristics relate directly to the
physiology of the talker (Robert, 1999). Features such as age, gender, height, weight,
and the structure of the vocal cords, nasal and oral cavities, teeth and lips all play a
major role in the speech production process.

Figure 2.1 The human vocal organs (Ntsourak's Home Page).

The vocal organs shown in figure 2.1 produce human speech. The main energy source
is the lungs with the diaphragm. When speaking, the airflow is forced through the
glottis between the vocal cords and the larynx to the three main cavities of the vocal
tract: the pharynx and the oral and nasal cavities (Owens, 1993). From the oral and
nasal cavities, the airflow exits through the mouth and nose, respectively. The V-shaped
opening between the vocal cords, called the glottis, is the most important sound source
in the vocal system. The vocal cords may act in several different ways during speech.

The most significant function is to modulate the airflow by rapidly opening and closing,
causing an energetic sound from which vowels and voiced consonants are produced
(Robert, 1999). The fundamental frequency of vibration depends on the mass and
tension of the vocal cords and is about 110 Hz for men, 200 Hz for women, and 300 Hz
for children. With stop consonants, the vocal cords may move suddenly from a
completely closed position, in which they cut the airflow completely, to a fully open
position, producing a light cough or a glottal stop. On the other hand, with unvoiced
consonants, such as /s/ or /f/, they may be completely open. An intermediate position
may also occur with, for example, phonemes like /h/.

The pharynx connects the larynx to the oral cavity. It has almost fixed dimensions, but its

length may be changed slightly by raising or lowering the larynx at one end and the soft

palate at the other end (Flanagan, 1972). The soft palate also isolates or connects the

route from the nasal cavity to the pharynx. At the bottom of the pharynx are the epiglottis

and false vocal cords to prevent food reaching the larynx and to isolate the esophagus

acoustically from the vocal tract. The epiglottis, the false vocal cords and the vocal cords

are closed during swallowing and open during normal breathing.

The oral cavity is one of the most important parts of the vocal tract. The movements of

the palate, the tongue, the lips, the cheeks and the teeth can vary its size, shape and

acoustics. The tongue in particular is very flexible: the tip and the edges can be moved
independently and the entire tongue can move forward, backward, up and down. The lips

control the size and shape of the mouth opening through which speech sound is radiated.

2.4 WHAT IS TEXT-TO-SPEECH?

Speech is the key means of communication between people. TTS Synthesis, the automatic

generation of speech waveforms, has been under development for several decades

(Santen et al., 1997). Recent progress in TTS synthesis has produced synthesizers with

very high intelligibility but the sound quality and naturalness remain a major problem.

Stuart et al. (2002) state that Speech Synthesis, or TTS Synthesis, is complementary to
speech recognition. The idea of being able to converse naturally with a computer is an
attractive one for many users, especially those who do not consider themselves
computer literate, since it reflects their natural, daily medium of expression and
communication. However, there are as many problems in Speech Synthesis as there are
in recognition. The most difficult problem is that we are highly sensitive to variations
and accent in speech, and are therefore intolerant of imperfections in synthesized
speech. We are so used to hearing natural speech that we find it difficult to adjust to the
monotonic, non-prosodic tones that synthesized speech can produce. In fact, most
speech synthesizers can deliver a degree of prosody, but in order to decide what tone to
give to a word, the system must have an understanding of the domain. Therefore, an
effective automatic reader would also need to be able to understand natural language,
which is difficult. However, for ‘canned’ messages and responses, the prosody can be
hand coded, yielding speech that is much more acceptable. The term Speech Synthesis
also refers to the technologies that enable computers or other electronic systems to
output simulated human speech (Weinschenk & Barker, 2000). They provide acoustic
information that is phonologically acceptable and meaningful to human listeners.

2.5 HISTORY OF TEXT-TO-SPEECH

Human fascination with talking machines is not new. For centuries, people have tried to
give machines the capacity to speak; before the machine age humans even hoped to
give speech to non-living objects. Early peoples attempted to show that their idols could
speak, usually by hiding a person behind the figure or channeling voices through air
tubes. Gerbert of Aurillac, Albertus Magnus, and Roger Bacon are credited with early
examples of “speaking heads”.

The first efforts to create synthetic speech were made over two centuries ago (Flanagan,
1972). One of the earliest attempts at speech synthesis was made in 1779, when a
Russian scientist, Christian Kratzenstein, constructed a set of five acoustic resonators
(see figure 2.2) which, when activated by a vibrating reed, produced imitations of the
vowels. He built models of the human vocal tract that could produce the five long vowel
sounds (/a/, /e/, /i/, /o/ and /u/).

Figure 2.2 Kratzenstein’s resonators (Owens, 1993).

In 1791, Wolfgang Von Kempelen, a Hungarian, demonstrated a more successful

machine that could speak whole words and phrases. It consisted of a large bellows to

supply air to a reed that in turn excited a hand-held rubber tube (resonator). Extra tubes

and whistles were added to imitate the nasal and fricative sounds.

In 1837, Charles Wheatstone constructed his well-known version of Von Kempelen’s
speaking machine. It was somewhat more complicated and was capable of producing
vowels and most of the consonant sounds. Some sound combinations and even full
words could also be produced. Vowels were produced with a vibrating reed and all
passages closed. Resonances were changed by deforming the leather resonator, as in
Von Kempelen’s machine. Consonants, including nasals, were produced with turbulent
flow through a suitable passage with the reed off. In addition, in 1857 M. Faber built the
“Euphonia”. Wheatstone’s design was resurrected in 1923 by Paget (see figure 2.3).

Figure 2.3 Wheatstone’s reconstruction of von Kempelen’s speaking machine (Flanagan, 1972).

One of the first electrical synthesizers was a device called the Voder (from voice

demonstrator), see figure 2.4, which was built in 1938. It had a voicing/noise source with

a foot pedal for fundamental frequency control. The source signal was routed

through ten band-pass filters. The device was played like a musical instrument.

Figure 2.4 The Voder electronic synthesizer (Owens, 1993)

The first formant synthesizer was constructed by Lawrence in 1953; it consisted of three
electronic formant resonators connected in parallel. The input signal was either a buzz
or noise. A moving glass slide was used to convert painted patterns into six time
functions to control the three formant frequencies, voicing amplitude, fundamental
frequency, and noise amplitude. At about the same time, Gunnar Fant introduced the
first cascade formant synthesizer, OVE (Orator Verbis Electris), which consisted of
formant resonators connected in cascade.

The first complete TTS synthesizer for English was developed at the Electrotechnical
Laboratory, Japan, in 1968 by Noriko Umeda and colleagues (Klatt, 1987). It was based
on an articulatory model and included a syntactic analysis module with sophisticated
heuristics. The speech was quite intelligible but monotonous and far from the quality of
present synthesizers.

More details on the history of Speech Synthesis, in a chronological order are given in

(Klatt, 1987). Some milestones of speech synthesis development are shown in figure 2.5.

Figure 2.5 Some milestones in speech synthesis (Sami, 1999).

One of the more sophisticated methods applied recently in TTS Synthesizer Systems is
the hidden Markov model (HMM). HMMs have been applied to speech recognition since
the late 1970s. In TTS Synthesizer Systems, they have been used for about two decades
(Schroeder, 1993).

2.6 TTS METHODS, TECHNIQUES AND ALGORITHMS

Because of the complexity and difficulty of producing a high-quality TTS Synthesis
System, researchers have followed many methods to accomplish this goal. These
methods are classified into two main categories:

The first category synthesizes speech without using recorded human sound; it is known
as Synthesized Speech and can be divided into three parts:

Formant Synthesis directly specifies the formant frequencies and bandwidths as well as
the source parameters (Weinschenk & Barker, 2000). Formant Synthesizers are also
referred to as Rule-Based Synthesizers, where generalized rules are extracted from the
filtered information and the input is then tested against the rules. The vocal tract
transfer function can be modeled by simulating formant frequencies and formant
amplitudes. Modeling the transfer function of the vocal tract makes the formants much
more evident. Phonetic rules determine the parameters that are necessary to synthesize
a desired utterance using a formant synthesizer.
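
As an illustration of how formant frequencies and bandwidths can drive a synthesizer,
the following Python sketch passes an impulse train through a cascade of two-pole
digital resonators. It is a minimal sketch, not the method of any system described
here: the resonator difference equation is a commonly used textbook form, and the
function names, the 8 kHz sample rate and the formant values for an /a/-like vowel
are illustrative assumptions.

```python
import math


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] for one formant."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to one
    return a, b, c


def resonate(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter a list of samples through a single two-pole resonator."""
    a, b, c = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Crude voiced source: one unit impulse per pitch period."""
    period = int(round(sample_rate / f0_hz))
    n = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, f0_hz=110.0, duration_s=0.05, sample_rate=8000):
    """Cascade one resonator per (frequency, bandwidth) pair over the source."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Illustrative (assumed) formant values for an /a/-like vowel.
vowel = synthesize_vowel([(700.0, 60.0), (1200.0, 90.0), (2600.0, 120.0)])
```

Changing the (frequency, bandwidth) pairs changes the vowel quality, while `f0_hz`
changes the pitch; a rule-based synthesizer would compute these parameters over time
from the phonetic input.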

Linear Prediction (LP) is a method initially designed for speech coding systems; it is
used in speech synthesis as well. The first LP-based speech synthesizers were developed
from speech coders. Similar to formant synthesis, the basic LP is based on the
source-filter model of speech.

The digital filter coefficients are estimated automatically from a frame of natural speech.

The main deficiency of the ordinary LP method is that it represents an all-pole model,

which means that phonemes that contain anti-formants such as nasals and nasalized

vowels are poorly modeled. The quality is poor too with short plosives because the time-

scale events may be shorter than the frame size used for analysis. With these deficiencies,

the speech synthesis quality with standard LP method is generally considered poor, but

with some modifications and extensions for the basic model the quality may be increased.

Several other variations of linear prediction have been developed to increase the quality

of the basic method (Donovan, 1996). Figure 2.6 shows the basic structure of the linear

predictive synthesizer.

Figure 2.6 Linear predictive synthesizer (Donovan, 1996).
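
The all-pole model described above can be sketched in Python as follows. The sketch
estimates predictor coefficients from one frame with the autocorrelation method and
the Levinson-Durbin recursion, then resynthesizes by filtering an excitation through
the resulting all-pole filter s[n] = e[n] + a1*s[n-1] + ... + ap*s[n-p]. The function
names and the toy second-order signal are assumptions made for illustration; this is
not the implementation of any system cited here.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations.

    Returns (coeffs, error) where coeffs[k-1] is a_k in
    s[n] ~ a_1*s[n-1] + ... + a_p*s[n-p].
    """
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error  # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)
    return a[1:], error


def lp_synthesize(coeffs, excitation):
    """All-pole synthesis filter driven by an excitation signal."""
    out = []
    for n, e in enumerate(excitation):
        s = e
        for k, a_k in enumerate(coeffs, start=1):
            if n - k >= 0:
                s += a_k * out[n - k]
        out.append(s)
    return out


# Toy check: analyse the impulse response of a known, stable second-order filter.
true_coeffs = [1.3, -0.6]
frame = lp_synthesize(true_coeffs, [1.0] + [0.0] * 299)
estimated, residual_energy = levinson_durbin(autocorrelation(frame, 2), 2)
```

Because the model has only poles, a frame containing anti-formants (nasals, for
example) cannot be matched exactly by any choice of coefficients, which is the
deficiency noted in the text.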

Articulatory Synthesis is a method of synthesizing speech by controlling the speech

articulators. This method determines the characteristics of the vocal tract filter by means

of a description of the vocal tract geometry and places the potential sound sources within

this geometry. Articulatory synthesis typically involves models of the human articulators

and vocal cords. The articulators are usually modeled with a set of area functions

between glottis and mouth. The first articulatory model was based on a table of vocal

tract area functions from larynx to lips for each phonetic segment (Klatt, 1987).

Figure 2.7 shows a block diagram of the main components of a typical articulatory
speech synthesizer. It consists of three main components: an articulatory model, an
acoustic-tube model and a vocal-cord model. The articulatory model transforms a set of
perhaps 6 to 10 articulatory parameters, representing the position of the speech
articulators (lips, tongue, jaw, etc.), into a cross-sectional area function, A(x), of the
vocal tract. The acoustic-tube model is driven by an excitation model that simulates the
modulated airflow from the vocal cords and may include detailed modeling of cord
vibration.

Figure 2.7 Block diagram of articulatory speech synthesizer (Owens, 1993).

The second category synthesizes speech using recorded human sound and is called
Concatenative Synthesis.

Concatenative Synthesis is the most widely used technique today; segments of speech
are joined together to form a complete speech chain. The speech output is produced by
joining segments from the database to create the required sequence of segments. This
technique requires some manual preparation of the speech segments. There are three
categories within this method: unit-selection, domain-specific, and diphone.
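
The diphone variant of this idea can be sketched in a few lines of Python. The sketch
assumes a hypothetical in-memory database that maps diphone names such as "s-a" to
lists of audio samples, and it smooths each joint with a short linear cross-fade. The
naming scheme, the "_" silence symbol and the cross-fade are illustrative assumptions,
not a description of the system built in this dissertation.

```python
def to_diphones(phonemes, silence="_"):
    """Turn a phoneme sequence into diphone names, padded with silence."""
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def crossfade_concat(units, overlap):
    """Join sample lists, linearly cross-fading `overlap` samples at each joint."""
    out = list(units[0]) if units else []
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit
            out[start + i] = (1.0 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out


def synthesize(phonemes, database, overlap=2):
    """Look up each diphone and concatenate; raises KeyError if a unit is missing."""
    units = [database[name] for name in to_diphones(phonemes)]
    return crossfade_concat(units, overlap)


# Toy database with made-up sample values standing in for recorded audio.
toy_db = {"_-s": [0.1] * 4, "s-a": [0.5] * 4, "a-_": [0.2] * 4}
samples = synthesize(["s", "a"], toy_db)
```

With only one recording per diphone, any mismatch in pitch or spectrum at a joint has
to be handled by signal processing at the boundary, which is why the joining step
matters more here than in unit selection.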

In addition, trainable concatenative speech synthesis uses computer assembly of
recorded voice sounds to create meaningful speech output. The basic process for
developing a concatenative synthesizer is to have a human reader read units of speech
and to store the recorded units. These units are then assembled on demand according to
given business rules. This approach works best for systems requiring a small vocabulary
(Weinschenk & Barker, 2000). Concatenative synthesis that uses large speech databases
has become popular due to its ability to produce high-quality, natural speech output.
The large footprints of these systems do not present a practical problem for applications
where the synthesis engine runs on a server with enough computational power and
sufficient storage.

It can be concluded that most of the existing commercial text-to-speech systems can be
classified as either formant synthesizers (Klatt, 1980) or concatenation synthesizers
(Donovan, 1996; Hamza, 2000). Formant synthesis was dominant for a long time, but
today the concatenative method is becoming more and more popular. The articulatory
method is still too complicated for high-quality implementations, but may arise as a
potential method in the future. Table 2.1 summarizes the main differences between the
Concatenative and Parameter types of speech synthesis.

Table 2.1 Comparison between Concatenative and Parameter types of speech synthesis

                                        Concatenative Synthesis    Parameter Synthesis (Formant and Articulatory)
Basic                                   Human-voiced fragments     Algorithm
Quality                                 More natural               More synthetic
Prosody                                 Lower                      Higher
Memory requirement                      Higher                     Lower
Algorithm                               Splice phoneme             Model vocal tract
Effort to develop new voice/language    Higher                     Lower

2.7 THE IMPORTANCE OF TEXT-TO-SPEECH

TTS is emerging as a major feature in telecommunication systems. Several factors are
involved, including increased computer power, the deregulation of the telephony
networks, and a general acceptance of TTS as a practical tool for businesses and
consumers.

The increase in computer power has greatly enhanced TTS in all applications. It has
made TTS Systems for telecommunications much less expensive, on a per-port basis, by
allowing host-based solutions to become a reality without having to use a high-cost,
dedicated Digital Signal Processing (DSP) resource board.

Deregulation of the telephony industry has also been a principal driving force for TTS

technology. The telecommunication industry has gone from a series of monopolies to

competitive industries. This competition has created a great need for differentiation

among companies offering similar services. TTS provides an opportunity to offer real

value applications and enhanced services to their customers.

Consumers, including business users, now expect greater automation and access to a
variety of telecommunication services. For many tasks, automation is practical and
enjoyable. Calling for train schedules, airline departures and arrivals, and
entertainment information is more easily and conveniently carried out using automated
systems that incorporate Text-To-Speech. All of this shows the importance of speech
synthesizers in many areas of application.

2.8 APPLICATIONS OF TEXT-TO-SPEECH SYNTHESIS

A Text-To-Speech Synthesizer System is a computer-based system that can convert text
into speech. Over the last few decades, extensive work has been done on text-to-speech
synthesis for the English language. Other languages such as Arabic have received
limited attention, as mentioned earlier. Concatenative speech synthesis can be achieved
in two different ways: either by concatenating units of a fixed number and size, such as
phones or diphones, or by concatenating units of variable size, which is called unit
selection. Compared to unit selection, diphone synthesis is more challenging in terms of
signal processing, since only one example of each unit exists in the database. On the
other hand, diphone synthesis is preferable when building applications for platforms
such as mobile devices, where memory size is the main concern.

Today we have Text-to-Speech Synthesizer Systems with very high intelligibility and a
level of quality adequate for numerous applications. These high-quality TTS Synthesizer
Systems have numerous applications, such as the examples below:

2.8.1 Example Applications

• Applications for the Blind

Blind people can benefit from TTS Synthesizer Systems, which give them access
to written information. With the help of a specially designed keyboard and a
fast sentence-assembling program, synthetic speech can also be produced in a
few seconds to remedy voice handicaps.

• Applications for the Deafened and Vocally Handicapped

Voice handicaps originate in mental or motor/sensation disorders. Machines

can be an invaluable support in the latter case. With the help of a specially
designed keyboard and a fast sentence-assembling program, synthetic speech
can be produced in a few seconds to remedy these impediments.

• Educational Applications

Synthesized speech can also be used in many educational situations. It
provides a helpful tool for learning a new language, known as a computer-aided
learning system. It can also be used with interactive educational applications.

• Applications for Telecommunications

In these systems textual information can be accessed over the telephone.

They are mostly used when the requirement for interactivity is low and the
texts are mainly simple messages. Queries can be given through the user’s voice
(which needs speech recognition) or through the telephone keypad.

• Applications for Multimedia

Speech synthesis supports man-machine communication that helps people with
their work and other tasks, for example voice-activated systems in the car.

• Fundamental and Applied Research

TTS synthesizers possess a very peculiar feature, which makes them

wonderful laboratory tools for linguists. They are completely under control,
so that repeated experiments provide identical results (as is hardly the case
with human beings). Consequently, they allow the efficiency of intonation and
rhythm models to be investigated. A particular type of TTS synthesizer, which is

based on a description of the vocal tract through its resonant frequencies (its

formants) and denoted as formant synthesizer, has also been extensively used

by phoneticians to study speech in terms of acoustical rules.

• Government Services

Government offices receive many calls requesting information. These range

from tax information from the Internal Revenue Service to the time and place

of town meetings. Much of this information can be dispensed by use of speech

synthesis over phone lines. To state a few applications, we have tax

information, road closing information, lottery results, opening and closing

times of public buildings, and unemployment claims processing.

2.8.2 Other Applications and Future Directions

Text-To-Speech Synthesizer Systems can be used in all kinds of human-machine

interactions. For instance, in warning and alarm systems synthesized speech may be used

instead of warning lights or buzzers to give more accurate information of the current

situation. It may also be used to receive some desktop messages from a computer, such as

printer activity or received e-mail.

In the future, synthesized speech may also be used in language interpreters or several

other communication systems, such as videophones, videoconferencing, or talking mobile

phones. With talking mobile phones it is possible to increase the usability considerably

for example with visually impaired users or in situations where it is difficult or even

dangerous to try to reach the visual information. It is clearly less dangerous to listen
to than to read the output from a mobile phone, for example when driving a car.

2.9 THE CHALLENGES BEHIND THE TEXT-TO-SPEECH

The task of developing a TTS Synthesizer System is very challenging, as it requires a lot
of work and understanding. Compared to limited-vocabulary systems, which must
produce only a small predefined set of possible utterances, TTS Synthesizer Systems
must be able to handle any input text intelligently. Consider, for example, the number
1904: in order to pronounce this number correctly, the system must determine that 1904
is to be read as the Arabic for “one thousand nine hundred and four” (as opposed to
“nineteen o four”).
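
The 1904 example can be made concrete with a small text-normalization sketch in
Python. It expands digit strings of up to four digits either as a cardinal number or
as a year-style reading, using English number words to match the glosses above; an
Arabic system would substitute Arabic number words. The function names and the
`style` parameter are hypothetical, introduced only for illustration.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out 0 <= n <= 9999 as a cardinal number."""
    if not 0 <= n <= 9999:
        raise ValueError("only 0..9999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = ONES[hundreds] + " hundred"
        return head + (" and " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    head = ONES[thousands] + " thousand"
    if rest == 0:
        return head
    joiner = " and " if rest < 100 else " "
    return head + joiner + number_to_words(rest)


def year_to_words(n):
    """Read a four-digit number as two pairs, e.g. 1904 -> 'nineteen o four'."""
    if not 1000 <= n <= 9999 or n % 1000 == 0:
        return number_to_words(n)
    high, low = divmod(n, 100)
    if low == 0:
        return number_to_words(high) + " hundred"
    if low < 10:
        return number_to_words(high) + " o " + ONES[low]
    return number_to_words(high) + " " + number_to_words(low)


def normalize(text, style="cardinal"):
    """Replace each run of 1-4 digits in `text` with its spoken form."""
    reader = year_to_words if style == "year" else number_to_words
    return re.sub(r"\b\d{1,4}\b", lambda m: reader(int(m.group())), text)
```

For instance, `normalize("1904")` gives "one thousand nine hundred and four", while
`normalize("1904", style="year")` gives "nineteen o four". Choosing between the two
readings automatically requires context that the digits alone do not provide.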

Analyzing the input text is only half of the challenge. Once the system has determined
the desired pronunciation of the input, it must generate the actual sounds that
accurately produce that pronunciation. This task is complicated by the fact that
perceptually identical sounds, or phonemes, are acoustically quite different in different
phonetic contexts.

The precise duration and frequencies of a sound depend on many factors such as which

segments precede and follow it, its position in the word, syllable, or phrase, whether the

syllable containing it is emphasized, whether the speech is fast or slow, whether the voice

is that of a male or a female, and so on. The challenges of Arabic text-to-speech can be

divided into the following problems:

2.9.1 The Diacritization Problem

Arabic written text does not contain the vowels and other markings that make the
orthography easy to understand. In their article, Mayfield Tomokiyo et al. (2003)
compare the Arabic language with English. They note that the correct pronunciation of
an English word is often not obvious from its spelling and that there are many words
with multiple pronunciations. For English, this problem can be solved by relying on
electronic lexicons that provide the correct pronunciation for an orthographic string.
According to Mayfield Tomokiyo et al., Arabic has no such electronic solution.

To be able to use the language, we must know what the correct vowel is. The authors

mention two approaches to solve the vowelling problem for spoken language; either

inferring the vowels or enumerating the lexicon. Other authors that have approached this

specific problem are Al-Muhtaseb et al. (2003). They describe in their article an Arabic

Text-to-Speech System (ATTS) for classical Arabic where they solved the problem with

vowelization by implementing a processor for automatic vowelization of the text before

applying it to speech rules. To be able to generate vowels automatically, the processor

requires integration of morphological, syntactical, and semantic information (Al-

Muhtaseb et al., 2003).
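
The ambiguity behind the diacritization problem, and the "enumerating the lexicon"
approach, can be illustrated with a short Python sketch. It strips the Arabic
diacritic marks (Unicode U+064B to U+0652) from fully vocalized words and indexes them
by their bare spelling, so that an undiacritized input returns every candidate
vocalization. The three-word lexicon and the function names are illustrative
assumptions; choosing among the candidates is the hard part and is not attempted here.

```python
from collections import defaultdict

# Tanween, fatha, damma, kasra, shadda and sukun: U+064B through U+0652.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(word):
    """Remove Arabic diacritic marks, leaving the bare consonantal spelling."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def build_index(vocalized_words):
    """Map each bare spelling to all vocalized forms that share it."""
    index = defaultdict(list)
    for word in vocalized_words:
        index[strip_diacritics(word)].append(word)
    return dict(index)


def candidates(index, bare_word):
    """Return the possible vocalizations of an undiacritized word."""
    return index.get(bare_word, [])


# Toy lexicon (assumed for illustration), written with escapes for clarity.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # kutub, "books"
KUTIBA = "\u0643\u064F\u062A\u0650\u0628\u064E"  # kutiba, "it was written"

index = build_index([KATABA, KUTUB, KUTIBA])
options = candidates(index, "\u0643\u062A\u0628")  # the bare spelling k-t-b
```

All three words collapse to the same three letters once the marks are removed, which
is why morphological, syntactic and semantic information is needed to pick the right
one.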

2.9.2 Dialects

Arabic is spoken in more than 23 countries and, as mentioned before, by 300 million
people. There are many varieties of the Arabic language, with many dialects that reflect
the social diversity of its speakers. Mayfield Tomokiyo et al. (2003) regard this variety
of dialects as a problem for speech synthesis for several reasons. First, which dialect is
to be generated: Modern Standard Arabic (MSA) or one of the dialects? Second, the
audience would be limited, because MSA is understood only by people with a high level
of education and/or a good level of reading and writing. The third problem concerns the
transcription of spoken Arabic, which is rarely written down. News broadcasts and
newspapers are delivered in Modern Standard Arabic only to some extent. For example,
nunations are not fully pronounced in news broadcasts, which can be attributed to the
influence of the speaker's dialect. Mayfield Tomokiyo et al. note that speakers of the
same dialect can differ significantly in their choice of vowels in the spoken language.
The reason for this is that the vowelling of Modern Standard Arabic is learned only in
school.

2.9.3 Differences in gender

Mayfield Tomokiyo et al. (2003) bring up the issue of gender differences in speech. Arabic has inflectional components that reflect the gender of the speaker and of the listener. For example, consider the word “تكلم” “takallama”, which means “he spoke” or “he has spoken”. If the listener is male, the imperative form “تكلم” “takallam” is used. If the listener is female, “تكلمي” “ta-kal-lami” is used, where the long vowel “i” indicates the female gender. Therefore, when speech is the final product, in systems such as a translation system or a synthesizer, appropriate gender marking becomes more important and should be done correctly.

2.10 EXISTING PRODUCTS

This section introduces some of the commercial TTS Synthesis Systems available today. More than 28 TTS Synthesizer Systems currently exist in the market. Some of the text in this section is based on information collected from the Internet, mostly from the manufacturers' and developers' official homepages.

The first commercial TTS Synthesis Systems were mostly hardware-based, and the development process was very time-consuming and expensive. As computers have become more and more powerful, most synthesizers today are software-based systems. Software-based systems are easy to configure and update, and they are usually much less expensive than hardware systems. However, a stand-alone hardware device may still be the best solution when a portable system is needed.

2.10.1 MBROLA – PROJECT

The MBROLA project is one of the main systems that have an Arabic voice. It was initiated by the TCTS Laboratory in the Faculté Polytechnique de Mons, Belgium. The main goal of the project is to provide speech synthesis for as many languages as possible. MBROLA is available for non-commercial purposes. Another of its purposes is to stimulate academic research, especially in prosody generation.

The MBROLA speech synthesizer is based on diphone concatenation. Given a list of phonemes together with prosodic information as input, MBROLA produces speech samples at 16 bits (linear). MBROLA uses the PSOLA (Pitch Synchronous Overlap Add) method, which was originally developed at France Telecom (CNET). PSOLA is not actually a synthesis method itself, but it allows prerecorded speech samples to be concatenated and provides good control of pitch and duration.

The diphone databases are currently available for US English/UK English/Breton English, Brazilian Portuguese, Dutch, French, German, Romanian, Spanish, Greek, Welsh, Indian languages, Venezuelan Spanish, Hungarian, Turkish and Arabic. Some of these languages are available with both male and female voices (MBROLA).

2.10.2 ACAPELA – GROUP

Acapela Group brings together speech technologies that have been developed over the last 20 years, covering both speech synthesis and speech recognition. Acapela Group emerged from the strategic combination of three major European companies in vocal technologies: "Babel Technologies", created in Mons, Belgium; "Infovox", created in Stockholm, Sweden; and "Elan Speech", created in Toulouse, France. Acapela currently owns three technologies: TTS by diphone, TTS by Unit Selection, and Automatic Speech Recognition. Acapela is currently available for US English, UK English, Arabic, Belgian Dutch, Dutch, French, German, Italian, Polish, Spanish and Swedish.

2.10.3 ARABTALK

The ARABTALK TTS Synthesis System was developed at Research and Development International (RDI) for the Arabic language. ARABTALK is a state-of-the-art corpus-based concatenative TTS System. The system employs statistical prosody models based on Artificial Neural Networks (ANN) for duration, energy, and global pitch contour prediction. In addition, it has a real-time synthesis-by-selection algorithm to explore a large speech corpus. ARABTALK has a Hidden Markov Model (HMM) based procedure to automatically time-align the transcriptions of new voices to their acoustic phoneme boundaries. The system is multi-user and thread-safe for server-based applications.

ARABTALK has a mature Arabic phonology framework. The current system has 41

phonetic letters by adding extra phonemes to consider the effect of the pharyngealized

phonemes.

The system has two databases: one for a male speaker at a 22 kHz sampling rate (one hour) and the other for a female speaker at a 16 kHz sampling rate (four hours). The speech is coded into 12-dimensional MFCCs plus log energy and their derivatives. The EGG signal is recorded with each utterance to support pitch-synchronous analysis and prosodic modification, if necessary, during the synthesis process. The system uses an HMM-based Viterbi alignment procedure that was developed at RDI for this purpose (Wael et al., 2000). The Viterbi alignment procedure can be summarized as the problem of searching for the time boundaries of a known sequence of phoneme HMMs. Since the best state

sequence, which is known to be the Viterbi path, is obtained during decoding process,

time boundaries can be obtained directly. This process is illustrated in figure 2.8.
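The idea of reading time boundaries off the Viterbi path can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the RDI procedure: it assumes one state per phoneme, a strict left-to-right topology, no transition probabilities, and a precomputed table of per-frame log-likelihoods. The function name `forced_align` and the toy scores are hypothetical.

```python
def forced_align(loglik):
    """Align frames to a known phoneme sequence.

    loglik[t][s] is the log-likelihood of frame t under the s-th phoneme of the
    known sequence. Returns one (start_frame, end_frame_exclusive) pair per phoneme.
    """
    n_frames, n_states = len(loglik), len(loglik[0])
    if n_frames < n_states:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = loglik[0][0]  # the path must start in the first phoneme
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                score[t][s], back[t][s] = move + loglik[t][s], s - 1
            else:
                score[t][s], back[t][s] = stay + loglik[t][s], s
    # Backtrace from the last phoneme at the last frame: this is the Viterbi path.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    # A boundary lies wherever the state index changes along the path.
    bounds, start = [], 0
    for t in range(1, n_frames + 1):
        if t == n_frames or path[t] != path[t - 1]:
            bounds.append((start, t))
            start = t
    return bounds


# Toy scores: frames 0-1 favour phoneme 0, frames 2-4 phoneme 1, frame 5 phoneme 2.
toy = [[0, -5, -5], [0, -5, -5], [-5, 0, -5], [-5, 0, -5], [-5, 0, -5], [-5, -5, 0]]
boundaries = forced_align(toy)
```

Multiplying the frame indices by the frame shift (for example 10 ms, an assumed value) converts the boundaries into times.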

The overall architecture and general features of the ARABTALK Text-To-Speech system for the Arabic language have been presented. An online demo is available at http://www.rdi-eg.com/rdi/research/Arabtalk.asp. The system is corpus-based and has many statistical models. It performs real-time unit selection with different caching methods.

Figure 2.8 Viterbi Alignment (Yasser, 2000).

2.10.4 Sakhr TTS

The Sakhr TTS engine converts any Arabic/English text into a human voice. Over the last 5 years, Sakhr has focused on creating an Arabic TTS engine whose quality can match that of the human voice. This technology gives businesses a competitive edge by allowing them to provide their customers with the latest static and dynamic information anytime, anywhere, using ordinary telephones and mobile phones.

Sakhr developed the Diacritizer engine (examples of diacritics are “ ”). This

engine can automatically insert the diacritics needed in Arabic texts. The Diacritizer is the main component in Arabic TTS. Without the Diacritizer, the output of the TTS engine would be inaccurate and unclear. Since native Arabic speakers write Arabic text without diacritics, the TTS engine must handle non-diacritized text. The Diacritizer converts the non-diacritized text into diacritized text, and the TTS engine then converts it into a clear, human-sounding Arabic voice. Moreover, the Text-To-Speech Software Development Kit (SDK) converts any computer-readable text into human-sounding synthetic speech. Arabic is at least one order of magnitude more difficult than other common languages because of the lack of diacritics, i.e. the vowelization needed to properly utter any input text.

A major limitation in the development of Arabic TTS has been the constraint imposed by handling undiacritized Arabic text. Sakhr's automatic Diacritizer is integrated with a speech synthesizer engine to produce a working system with undiacritized Arabic as input and high-quality speech as output. The generation of such output would be impossible without an automatic Diacritizer, because most words have several possible pronunciations when written without diacritics. Sakhr uses its diacritizer to provide the TTS synthesizer with adequate vowelization to produce a natural and intelligible sound.

Sakhr TTS Engine is composed of three basic parts. The Linguistic Module converts the

input text into a phonetic transcription. The Phonetic Module calculates speech

parameters, and the Acoustic Module uses those parameters to generate synthetic speech

signals.

2.10.5 Élan TTS

Élan TTS translates any text to speech. The software has been developed by the Lernout & Hauspie Company, which is a major worldwide supplier of Text-To-Speech software and a leading developer of this technology. This software simply reads IT-generated texts out loud, with the flexibility and richness of natural-sounding speech. The text can be spoken in several languages with both male and female voices, depending on the user's requirements. The company has developed software of this kind for Telecom applications under the name Speech Cube. Other products developed for Telecom applications are the Proverb Speech Platform and the Proverb Speech Unit.

Speech Cube is a high-density, multilingual, multi-channel software Text-To-Speech component for Telecom. It has been designed to run in telephony applications and is compatible with all market standards. The software can automatically convert written text into speech and read it aloud. The user can choose any one of the eight languages provided by the software. It is available for Windows NT, Linux, SCO, QNX, and Solaris. The main process, named syntheses, interacts with the user or with the application via the standard input stream. The sound formats supported are 8 kHz, 11 kHz, and 16 kHz, with or without a Sun header (16 bits per sample).

Several key features of the software are highlighted, including: simultaneous operation in several languages; progressive design, meaning easy upgrade of server capacity; capacity to process large volumes of text; unlimited vocabulary; high-quality voice with smooth, natural intonation; voice speed and pitch control; and male and female voices.

The software can be used as follows. As an engine, it can translate text into speech in real time for a voice board supporting a multimedia audio interface. As a process, it can convert a text database into speech for later use. As a server, it is available to a group of applications on a network through a client/server protocol. Among the benefits of Élan Text-To-Speech is that it simplifies access to a value-added system, providing updated information quickly and easily for improved efficiency.

Table 2.2 summarizes the existing Arabic Text-To-Speech Synthesizer Systems. The table compares the five products in terms of manufacturer, platform, supported languages, voices, controls, requirements, and lastly the synthesis method used in building the system. From the table it is clear that all five systems support the Arabic language; ARABTALK TTS supports only Arabic, while the rest are multilingual systems. It is also evident that four of the five systems use the same synthesis method, namely concatenative synthesis, whereas ARABTALK TTS is listed as using Artificial Neural Networks (ANN). Lastly, in terms of voice, four of the five systems have both male and female voices, while ACAPELA has a male voice only.

2.11 SUMMARY

In conclusion, speech is a complex signal from which we extract many types of information, from the message content and meaning, to the nature of the transmission medium, to the identity and condition of the speaker. A considerable amount of research has been carried out in the area of TTS synthesis. The chapter has also described the mechanism of speech production and given a brief history of TTS. The product range of TTS synthesizers is very wide, and it would be unreasonable to try to present all of the products or systems available.

Table 2.2 Summary of Text-To-Speech Existing Products

Product: MBROLA
Manufacturer or Developer: TCTS Laboratory in the Faculté Polytechnique de Mons, Belgium (http://www.mbrola.com)
Platforms: UNIX; Windows 95/98/XP
Languages: English, French, Spanish, Italian, German, Hungarian, Romanian, Turkish, Arabic
Voices: Male, Female
Controls / Support: Speed, intonation contours, lexical stress, sentence accent, segmental durations, pitch and pitch range, gender, age, vocal tract scaling, glottal source parameters
Requirements: 32 Mb memory, 15 Mb disk; Pentium 75, 2 Mb memory, 10 Mb disk
Method: Concatenative Synthesis

Product: ACAPELA
Manufacturer or Developer: Acapela Group (http://www.acapela.com)
Platforms: Windows 95/NT/98/XP; UNIX
Languages: English, Polish, Spanish, Italian, Arabic, Swedish
Voices: Male
Controls / Support: -
Requirements: Pentium 75 MHz, 160 Mb disk, 8 Mb memory (UNIX: 32 Mb)
Method: Concatenative Synthesis

Product: Arabtalk TTS
Manufacturer or Developer: Research and Development International, RDI (http://www.rdi-eg.com/rdi/research/Arabtalk.asp)
Platforms: Windows 98/NT/2000/XP
Languages: Arabic
Voices: Male, Female
Controls / Support: -
Requirements: -
Method: Artificial Neural Networks (ANN), statistical prosody, Hidden Markov Models

Product: Sakhr TTS
Manufacturer or Developer: Sakhr Software (http://www.sakhr.com/TTS/TTS.asp)
Platforms: Windows 98/NT/2000/XP
Languages: Arabic, English
Voices: Male, Female
Controls / Support: -
Requirements: -
Method: Unit Selection Diphone Concatenative Synthesis

Product: Elan TTS
Manufacturer or Developer: Élan Speech (http://www.tmaa.com/tts/Elan_profile.htm or http://www.elanspeech.com); developed by the Lernout & Hauspie Company
Platforms: Windows NT, Linux, SCO, QNX, and Solaris
Languages: Am. English, Br. English, French, Dutch, German, Portuguese, Romanian, Spanish, Arabic
Voices: Male, Female
Controls / Support: Pitch and speed modification, timbre alteration; chip, embedded, client/server
Requirements: -
Method: Unit Selection Diphone Concatenative Synthesis

CHAPTER THREE

CONCATENATIVE SYNTHESIS

3.1 OVERVIEW

One of the most popular methods of synthesizing speech from text is stringing together, or concatenating, prerecorded words, syllables, or other speech segments (Olive, 1996). This avoids many of the problems encountered in phoneme-to-phoneme synthesis, such as the co-articulatory effects between neighboring speech sounds (Schroeter, 1996). Still, even words do not usually occur in isolation: the words immediately preceding or following a given word influence its articulation, pitch, duration, and stress, often depending on the meaning of the utterance. This chapter describes the Concatenative Synthesis Method and its types in detail.

Synthetic voices are made by concatenating units of sound that have been previously

stored in a reference database. The contents of these units and methods of concatenation

vary, but the principle of concatenation is universal for TTS involving all but the briefest

messages. Nowadays, the use of actual speech waveforms has become increasingly

popular, where stored waveforms of various sizes are fetched as needed, with adjustments

made mostly at unit boundaries, but sometimes more generally throughout the utterance

(Browman, 1980).

The Concatenative Synthesis Method uses a large database of source sounds, segmented into units, and a unit selection algorithm that finds the sequence of units that best matches the sound or phrase to be synthesized, called the target. The selection is performed according to the descriptors of the units, which are characteristics extracted from the source sounds or higher-level descriptors attributed to them. The selected units can then be transformed to fully match the target specification before they are concatenated. However, if the database is sufficiently large, the probability is high that a matching unit will be found, so the need to apply transformations is reduced. The units can be non-uniform, i.e. they can range from a sound snippet or an instrument note up to a whole phrase. Concatenative Synthesis can be more or less data-driven: instead of supplying rules constructed by careful thinking, as in a rule-based approach, the rules are induced from the data itself. The advantage of this approach is that the information contained in the many sound examples in the database can be exploited.
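A minimal sketch of descriptor-based selection is given below. It is an illustration under stated assumptions, not the algorithm of any system cited here: the descriptors are reduced to a label, a pitch, and a duration, the cost function is an arbitrary relative distance, each target is matched independently, and no cost is charged for how well neighbouring units join. The names `Unit`, `Target`, `target_cost`, and `select_units` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A stored unit with the descriptors extracted from its recording."""
    label: str
    pitch_hz: float
    duration_ms: float


@dataclass(frozen=True)
class Target:
    """The specification that the synthesizer asks for."""
    label: str
    pitch_hz: float
    duration_ms: float


def target_cost(unit: Unit, target: Target) -> float:
    """Relative descriptor distance; infinite when the labels differ."""
    if unit.label != target.label:
        return math.inf
    return (abs(unit.pitch_hz - target.pitch_hz) / target.pitch_hz
            + abs(unit.duration_ms - target.duration_ms) / target.duration_ms)


def select_units(targets, database):
    """Pick, for each target, the database unit with the lowest target cost."""
    selected = []
    for target in targets:
        best = min(database, key=lambda unit: target_cost(unit, target))
        if math.isinf(target_cost(best, target)):
            raise LookupError(f"no unit labelled {target.label!r} in the database")
        selected.append(best)
    return selected


database = [Unit("a", 100.0, 80.0), Unit("a", 120.0, 100.0), Unit("b", 110.0, 90.0)]
targets = [Target("a", 118.0, 95.0), Target("b", 100.0, 100.0)]
chosen = select_units(targets, database)
```

With a larger database, the best unit tends to lie closer to the target, which is why less transformation is needed afterwards.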

3.2 CONCATENATIVE SYNTHESIS

The term Speech Synthesis was originally used solely for the generation of speech sounds entirely by a machine that to some extent modeled the human speaking system. The applications were mainly for research in speech production and perception. Recently,

particularly in an engineering environment, speech synthesis has come to mean provision

of information in the form of speech from a machine, in which the messages are

structured dynamically to suit the particular circumstances required. The applications

include simple information services, reading machines for the blind and communication

aids for people with speech disorders. Speech synthesis can also be an important part of

complicated man-machine systems, where various types of structured conversation can be

made using voice output, with either automatic speech recognition or key pressing for the

man-to-machine direction of communication.

It is a sign of the times that parametric synthesizers are becoming old-fashioned. Neither the allophone problem nor the prosody problem has yet been solved satisfactorily. Progress is being made, but a newcomer has arrived by the name of concatenative synthesis. Today it is feasible for the units of synthesis to be digitized human speech. The job of the “Synthesizer” is to arrange these units into the desired output, adjust prosody, and smooth the boundaries between units to prevent infelicities in articulation, all of which are still worthy challenges. TTS Synthesis is the process of converting written text into artificial speech. The system processes the text and reads it aloud through a computer-based program.
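Boundary smoothing can take many forms. As one simple illustration, the Python sketch below joins two units with a linear cross-fade over a short overlap. The choice of a linear cross-fade, the function name `crossfade_concat`, and the representation of a waveform as a list of float samples are assumptions made for this example; the text above does not prescribe a particular smoothing method.

```python
def crossfade_concat(first, second, overlap):
    """Join two sample lists, blending the last `overlap` samples of `first`
    with the first `overlap` samples of `second` using linear weights."""
    overlap = min(overlap, len(first), len(second))
    if overlap <= 0:
        return list(first) + list(second)  # plain concatenation, no smoothing
    head = list(first[:-overlap])
    tail = list(second[overlap:])
    blended = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)   # rises from near 0 towards 1
        a = first[len(first) - overlap + i]
        b = second[i]
        blended.append((1.0 - weight) * a + weight * b)
    return head + blended + tail


# Two constant "units" make the fade easy to inspect.
joined = crossfade_concat([1.0] * 4, [0.0] * 4, overlap=2)
```

The output is shorter than plain concatenation by the overlap length, because the overlapping samples are merged rather than played twice.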

Concatenative Synthesis has been around since the 1950s; however, the writer Jonathan Swift anticipated the essential nature of concatenative synthesizers in Gulliver’s Travels, published in 1726:

These bits of wood were covered on every square with paper pasted on them, and on these papers were written all the words of their language, in their several moods, tenses, and declensions, but without any order. The professor then desired me to observe, for he was going to set his engine at work. The pupils at his command took each of them hold of an iron handle, whereof there were forty fixed round the edges of the frame, and giving them a sudden turn, the whole disposition of the words was entirely changed.

Concatenative Synthesis uses actual short segments of recorded speech that were cut

from recordings, and stored in an inventory voice database as waveforms, or encoded by

a suitable speech coding method.

At first, Concatenative Synthesis consisted only of digitally recorded full utterances, any one of which could be played back on request according to the needs of the situation. If the spoken output consists of a relatively small vocabulary of words, each word can be recorded digitally and played back in the appropriate order on request.
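Word-level playback of this kind amounts to a table lookup followed by concatenation. The sketch below is a hypothetical illustration: the function name `speak`, the toy vocabulary, and the use of short integer lists in place of real recordings are all assumptions.

```python
def speak(words, recordings, pause_samples=0):
    """Concatenate the stored waveform of each word, in order.

    `recordings` maps a word to its list of samples. A run of zero samples of
    length `pause_samples` is inserted between consecutive words.
    """
    output = []
    for index, word in enumerate(words):
        if word not in recordings:
            # The vocabulary is closed: unseen words cannot be spoken at all.
            raise KeyError(f"no recording for {word!r}")
        if index > 0:
            output.extend([0] * pause_samples)
        output.extend(recordings[word])
    return output


recordings = {"flight": [1, 2], "delayed": [3]}  # toy stand-ins for waveforms
message = speak(["flight", "delayed"], recordings, pause_samples=1)
```

The `KeyError` branch shows the main limitation of the approach: any word outside the recorded vocabulary cannot be produced.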

Each unit (syllable, word, etc.) has the intelligibility and naturalness of human speech within the limits of digitization. This holds mainly for the pronunciation of consonants, and similarly for the vowel durations in stressed syllables, which parametric synthesizers cannot get consistently right. Concatenative Synthesis at the word level solves the allophone problem, as well as the part of the prosody problem concerned with the relative durations of segments within the word.

Figure 3.1 below shows a block diagram of a typical concatenative TTS system. The

front-end on the left converts a particular input text string into a string of phonetic

symbols and prosody (fundamental frequency, duration, and amplitude) targets. The

front-end uses a set of rules and/or a pronunciation dictionary. With a string of phonetic

symbols, it constructs target values for fundamental frequency (i.e., pitch), phoneme

durations, and amplitudes. The center block in figure 3.1 gathers the units according to

the list of targets set by the front-end. These units are selected from a store that holds the

inventory of available sound units.

Different types of speech units may be stored in the inventory of a concatenative TTS system. Storing whole-word units is unreasonable for general TTS because of the tremendous demands on a voice talent, who would have to read several hundred thousand words in a consistent voice and manner. Even if the words were recorded successfully in multiple sessions spread over several weeks, the lack of co-articulation and phonetic recoding at word boundaries may result in unnatural-sounding speech. On the other hand, using phones is also unacceptable because of the large co-articulatory effects that exist between adjacent phones.

Figure 3.1 Block diagram of a concatenative text-to-speech system (Olive, 1996).

As a result, transitions from one unit to the next may be audible as glitches that introduce perceptually disruptive discontinuities. Naturally, longer units are more likely to result in higher-quality synthesis, given that the rate of concatenations (how many unit-to-unit transitions occur per second of speech) is lower than in the case of shorter units. In contrast, a larger set of longer units is needed to cover any application domain. Until the mid-1990s, most practical TTS implementations compromised by using one of two types of inventory units, the diphone and the demi-syllable.

Inter-unit effects and prosody over and above the level of the recorded units are two

problems of concatenative speech synthesis. For example, in naturally spoken speech,

words are run together. There are no white spaces except where we pause for breath or

effect. Natural between-word articulation still needs to be simulated. In addition,

appropriate prosody has to be computed because it depends heavily on the context of the

whole utterance, which is not known when the speech units are prerecorded. With little

computing power, early concatenative synthesizers could only speak in a monotone with

discontinuities between each word.

Parametric synthesis long ruled the roost because it could handle a vocabulary of unlimited size. Lately, more powerful computers permit hundreds of thousands of words to be prerecorded digitally. The synthesizer rearranges them as needed, smooths the transitions between words, and attempts to approximate the correct prosody, all in real time.

However, word-based synthesizers cannot produce the kind of unlimited output that parametric synthesizers are capable of producing, because the units of parametric synthesizers are sound-based, whereas word-based systems are limited to the words in their inventory. Apart from the impracticality, word-based synthesis is unable to handle the many applications that require a large, unlimited vocabulary, such as ones that “read” texts not known in advance.

3.2.1 Phonemes

Phonemes are possibly the most commonly used units in speech synthesis, since they are the standard linguistic representation of speech. The inventory of basic units is usually between 40 and 50, which is clearly the smallest compared to other units (Allen et al., 1987). Using phonemes gives maximum flexibility with rule-based systems. However, some phones that do not have a steady-state target position, such as plosives, are difficult to synthesize. The articulation must be formulated as rules as well. Phonemes are sometimes used as the input to a speech synthesizer, for example to drive a diphone-based synthesizer.

3.2.2 Diphones

A diphone is the snippet of speech from the middle of one phone to the middle of the next phone. The middle of a phone tends to be its acoustically most stable region. Therefore, diphones represent acoustic transitions from the stable midsection of one phone to the next. In other words, diphones are defined to extend from the middle point of the steady-state part of one phone to the middle point of the subsequent one, so they include the transitions between contiguous phones. This means that the concatenation point lies in the most steady-state region of the signal, which reduces the distortion at concatenation points. Another benefit of diphones is that co-articulation effects no longer need to be formulated as rules. Theoretically, the number of diphones is the square of the number of phonemes (plus allophones), but not all combinations of phonemes are needed. The number of units is normally from 1500 to 2000, which increases the memory requirements and makes data collection more difficult compared to phonemes. On the other hand, the amount of data is still acceptable, and given its other advantages, the diphone is an appropriate unit for sample-based text-to-speech synthesis. The number of diphones may

be reduced by inverting symmetric transitions, like for example the letter / �ـ / at the

beginning of a word would become / UVW / at the end.
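The relationship between a phoneme string and its diphones, and the theoretical size of the inventory, can be sketched as follows. The function names, the use of "_" as a silence symbol at the utterance edges, and the three-phoneme example are assumptions made for illustration only.

```python
def to_diphones(phonemes, silence="_"):
    """Turn a phoneme sequence into the diphone names needed to synthesize it.

    Silence is added at both ends so that the first and last phones also get a
    transition into and out of the utterance.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def max_diphone_inventory(n_phones):
    """Theoretical upper bound: every phone followed by every phone."""
    return n_phones ** 2


example = to_diphones(["k", "a", "t"])
upper_bound = max_diphone_inventory(40)
```

For the three phonemes above, four diphones are needed: `_-k`, `k-a`, `a-t`, and `t-_`. With 40 phones the upper bound is 1600, which is consistent with the range of 1500 to 2000 units quoted above.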

3.2.3 Demi-syllables

A demi-syllable encompasses half a syllable; that is, either the syllable-initial portion up to the first half of the syllable nucleus, or the syllable-final portion starting from the second half of the syllable nucleus. The number of demi-syllables in English is roughly the same as the number of diphones. Because demi-syllable units are usually longer than diphones and capture co-articulation effects better, they should pose fewer concatenation problems.

One benefit of demi-syllables is that only about 1,000 of them are needed to construct the 10,000 syllables of English (Donovan, 1996). Using demi-syllables, rather than, for example, phonemes or diphones, requires significantly fewer concatenation points. Demi-syllables also take account of most transitions and therefore of a large number of co-articulation effects, and they cover a large number of allophonic variations because initial and final consonant clusters are separated. The memory requirements are still somewhat high, but tolerable. In contrast to phonemes and diphones, the exact number of demi-syllables in a language cannot be defined. With a purely demi-syllable based system, not all possible words can be synthesized correctly. This problem is faced at least with some proper names. However, demi-syllables and syllables may be successfully used in a system that uses variable-length units and affixes, such as the HADIFIX system (Dettweiler et al., 1985).

Longer segmental units, such as triphones or tetraphones, are rarely used. Triphones are like diphones, but contain one phoneme between the steady-state points (half phoneme + phoneme + half phoneme). In other words, a triphone is a phoneme with a specific left and right context. For English, more than 10,000 units are required (Huang et al., 1997).

Building the unit inventory consists of three main phases. First, the natural speech must be recorded so that all units used (phonemes) are included within all possible contexts (allophones). After this, the units must be labeled or segmented from the spoken speech data, and finally, the most appropriate units must be chosen. Gathering the samples from natural speech is usually very time-consuming. Nevertheless, some of this work may be done automatically by carefully choosing the input text for the analysis phase. The rules for selecting the correct samples for concatenation must also be implemented very carefully.
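One way to choose the input text automatically is a greedy coverage search: repeatedly pick the candidate sentence that contains the largest number of units not yet covered. The sketch below illustrates this idea for diphones. It is an assumed technique for illustration, not the procedure used in this work; the function names and the toy candidate sentences (given as phoneme lists) are hypothetical.

```python
def diphones_of(phonemes):
    """Set of adjacent phoneme pairs occurring in one sentence."""
    return set(zip(phonemes, phonemes[1:]))


def choose_recording_script(candidates, required):
    """Greedily select sentences until the required diphones are covered.

    `candidates` maps a sentence name to its phoneme list. Returns the chosen
    names in selection order and the set of diphones that no sentence covers.
    """
    remaining = set(required)
    pool = dict(candidates)
    chosen = []
    while remaining and pool:
        best = max(pool, key=lambda name: len(diphones_of(pool[name]) & remaining))
        gain = diphones_of(pool[best]) & remaining
        if not gain:
            break  # none of the remaining sentences adds anything new
        chosen.append(best)
        remaining -= gain
        del pool[best]
    return chosen, remaining


candidates = {
    "s1": ["k", "a", "t", "a"],
    "s2": ["k", "a"],
    "s3": ["t", "i"],
}
required = {("k", "a"), ("a", "t"), ("t", "a"), ("t", "i")}
script, missing = choose_recording_script(candidates, required)
```

Here the sentence "s2" is never recorded, because everything it contains is already covered by "s1". Greedy selection does not guarantee the shortest possible script, but it reduces the amount of speech to be recorded and labeled.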

For many languages, demi-syllables minimize the co-articulation effects at syllable boundaries because the demi-syllable is obtained from natural utterances by “cutting” in the middle of a steady-state vowel. Thus, in the best of all worlds, only relatively simple concatenation rules might be required. However, the reality of human speech is more complex, and a successful concatenation system may have to rely on a concatenation of demi-syllables, diphones, and suffixes (postvocalic consonant clusters).

3.2.4 Speech Signal Representation for Concatenative Synthesis

A good speech signal representation for concatenative synthesis approximates the following set of requirements (Schroeter, 1991):

1. The speech signal can be stored in a highly compressed (i.e., coded) form so that a

large voice database can be used even under tight memory limitations. Coder and

decoder are of low computational complexity.

2. Coding/ decoding are perceptually transparent. Since there is a need to imitate all

the voice characteristics of a real person, subjecting the speech signal to vocoder-like

degradations will not lead to speech synthesis of high naturalness.

3. Coding algorithms have to allow for random access. Since most speech coders

contain some sort of autoregressive memory, all state variables of the coder have

to be made available at concatenation points since the decoder will have to switch

between units of speech that are very unlikely to have been recorded

consecutively in time.

4. An ideal speech representation must allow for natural-sounding modifications of

pitch, duration, and amplitude. This is particularly important for small inventories

with one, or just a few, typical examples for each unit.

5. For some advanced applications, it even might be desirable to allow for fine-

tuning of the voice, for example, to add more aspiration, maturity, or let the voice

scream when needed. Instead of recording different voice inventories for different

speaking styles, advanced voice conversion might be used to approximate an

angry voice using a happy or neutral voice as a starting point.

Table 3.2 Concatenative Synthesis: Pros and Cons

Pros                                              Cons
Units include difficult sounds and transitions    Long-distance co-articulation not captured
Units capture local co-articulation               DSP: at least smoothing of concatenation points

Neither pros nor cons:
± DSP: Prosodic modifications possible/required.
± Compromise between coverage and inventory size.

3.3 TYPES OF CONCATENATIVE SYNTHESIS

Natural-sounding speech can be produced by concatenating (or stringing together) segments from a database of recorded speech. In general, this approach yields the most natural-sounding synthesized output. However, variations in speech and the automated techniques used for segmenting the speech waveforms sometimes result in audible glitches in the output. There are four basic types of concatenative synthesis:

3.3.1 Concatenation of Stored Allophone

Hypothetically, allophones appear to be the perfect unit for concatenation. A few hundred of them would serve as the basic building blocks for all utterances. The problems only appear upon implementation. It is impossible, for example, to articulate the consonant [k] in English without also articulating a vowel, however short, as the onset (the beginning part) or the offset (the ending part). Figure 3.2 shows the wave sound for allophone concatenation synthesis.

Figure 3.2 Wave sound for Allophone Concatenation Synthesis (Olive, 1996).

3.3.2 Diphone Concatenation Synthesis

Diphones are speech units that begin in the middle of the stable state of a phone and end in the middle of the following one. The number of diphones depends on the possible combinations of phonemes in a language. In diphone synthesis, only one example of each diphone is contained in the speech database. The quality of the resulting speech is generally not as good as that from unit selection, but it is more natural-sounding than the output of formant synthesizers. Nevertheless, diphone synthesis retains some of the robotic-sounding quality of formant synthesis. In order to build a diphone database, the following questions have to be answered: which diphone pairs exist in the language, and which carrier words should be used? The answers to these questions are highly language dependent. Figure 3.3 shows the wave sound for diphone concatenation synthesis.

Figure 3.3 Wave sound for Diphone Concatenation Synthesis (Olive, 1996).
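The relationship between a phoneme string and its diphones can be sketched in a few lines of Python. This is a minimal illustration, not the system described in this dissertation: the silence symbol `_`, the function names and the inventory bound are assumptions made for the example.

```python
from typing import List, Tuple

SILENCE = "_"  # assumed symbol for the silence at utterance boundaries


def to_diphones(phonemes: List[str]) -> List[Tuple[str, str]]:
    """Return the diphone sequence for a phoneme string, padded with silence."""
    padded = [SILENCE] + list(phonemes) + [SILENCE]
    # Each diphone spans the second half of one phone and the first half of the next.
    return list(zip(padded, padded[1:]))


def max_inventory_size(n_phonemes: int) -> int:
    """Upper bound on the inventory: all ordered pairs over the phonemes plus
    silence, excluding the silence-silence pair."""
    n = n_phonemes + 1
    return n * n - 1


if __name__ == "__main__":
    print(to_diphones(["k", "a", "t"]))
    print(max_inventory_size(2))
```

The bound is rarely reached in practice, because many phoneme pairs never occur in a given language; this is why the set of existing diphone pairs has to be established per language.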

The rationale for diphones, whether used as units of speech recognition or speech synthesis, is that they capture the articulation effects between allophones. Moreover, the articulation effects between the diphones themselves are not a serious problem, because the diphone boundaries are the midpoints of the same sound, which tend to match up fairly well. Considering allophonic variation gives us the new concept of allodiphones – diphones made up of the combinations of the various allophones of the language.

To achieve the most natural sounding allodiphones, they must be isolated from naturally spoken speech. Unfortunately, one must process enormous amounts of speech to build a database of allodiphones sufficient for high quality concatenative speech. Today's diphone synthesizers compromise and use a modest number of allodiphones derived from artificially constructed free-flowing speech.

3.3.3 Concatenation of Stored Syllables and Demi-syllable

The syllable appears to be a convenient unit for storage, as it is large enough to reduce the number of joins that are necessary and yet small enough to make each stored unit reusable in a number of word contexts. There is often a choice as to where to separate one syllable from the next; whenever possible a stop consonant should be placed at the start of a syllable. This allows the join to be made at the silence preceding the stop. A seamless join between the first and second syllables of “dis-crim-in-ate” would be easier to achieve than it would be for “disc-rim-in-ate”.

The use of demi-syllables reduces some of the problems that occur with syllables, and several examples have been reported of the concatenation of demi-syllables (Browman, 1980).

Demi-syllables occur as two types: the syllable onset plus the first half of the nucleus vowel sound, and the second half of the nucleus vowel sound plus the syllable coda. There are several thousand demi-syllables, ignoring all but the most common allophones. If we consider allodemisyllables, that is, demi-syllables composed of differing allophones, that number becomes much larger.

Demi-syllables have two strengths as units of concatenative speech synthesis. One is in consonant clusters such as the “skr” in “scrap”. While diphones are better than simple allophones in such clusters, demi-syllables are best of all because the entire cluster is taken from human speech.

The other advantage of the demi-syllable is in achieving natural sounding segment durations. A synthesizer that fails to capture these nuances will be perceived as unnatural. Since the length of a syllable ending in one or more consonants is distributed over the vowel and the final consonant(s), natural sounding syllable lengths are more easily achieved with demi-syllables than with other units.

A drawback of demi-syllables is that not all syllable boundaries fit smoothly together. Recently, some speech engineers have proposed a mixed inventory of both diphones and demi-syllables, taking advantage of their respective strengths and compensating for their respective weaknesses.

3.3.4 Concatenation of Stored Waveform

An obvious way of producing speech messages by machine is to have recordings of a

human being speaking all the various words, and to replay the recordings at the required

times to compose the messages. The first significant application for this technique was a

speaking clock, introduced into the UK telephone system in 1936, and now provided by

telephone administrations all over the world.

The original UK Speaking Clock used optical recording on glass discs for the various phrases, words and part-words that were required to make up the full range of time announcements. Some words can be split into parts for this application because, for example, the same recording can be used for the second syllable of “twenty”, “thirty”, etc. The next generation of equipment used analogue storage on magnetic drums.

The development of large, cheap computer memories has made it practicable to store speech signals in digitally coded form for use with computer-controlled replay, and, provided sufficiently fast memory access is available, this arrangement overcomes the timing problems of analogue waveform storage. Digitally coded speech waveforms of adequate quality for announcing machines generally use bit rates of 16 to 32 kbit/s of stored message, so quite a large memory is needed if many different elements are required to make up the messages.
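As a worked example of the memory requirement mentioned above, the following sketch converts a bit rate and a message duration into storage size. The function name and the 60-second message are illustrative assumptions, not figures from the original system.

```python
def storage_bytes(duration_s: float, bit_rate_kbps: float) -> float:
    """Bytes needed to store duration_s seconds of speech coded at bit_rate_kbps kbit/s."""
    bits = duration_s * bit_rate_kbps * 1000  # 1 kbit = 1000 bits
    return bits / 8                           # 8 bits per byte


if __name__ == "__main__":
    for rate in (16, 32):
        size = storage_bytes(60, rate)
        print(f"60 s at {rate} kbit/s needs {size / 1000:.0f} kB")
```

One minute of stored speech therefore needs between 120 kB and 240 kB at these rates, which scales linearly with the number of recorded elements.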

The decoding of linear predictive coding (LPC) speech is the least complex. However, that is not its chief advantage. The pitch of LPC speech can be adjusted via a single parameter, and the all-important duration of a speech unit can be adjusted by the simple expedient of adding or removing frames during decoding. Thus some, but not all, prosodic adjustments can be effected at the decoding stage.
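The idea of changing duration by adding or removing frames can be sketched as follows. This is a simplified illustration under stated assumptions: each frame is treated as an opaque object (in a real LPC decoder it would hold filter coefficients, gain and pitch), and `stretch_frames` is a hypothetical helper, not part of any LPC library.

```python
from typing import List, TypeVar

Frame = TypeVar("Frame")


def stretch_frames(frames: List[Frame], factor: float) -> List[Frame]:
    """Lengthen (factor > 1) or shorten (factor < 1) a unit by repeating or
    dropping whole frames before they are decoded."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not frames:
        return []
    n_out = max(1, round(len(frames) * factor))
    last = len(frames) - 1
    # Map every output position back to the nearest earlier input frame.
    return [frames[min(last, int(i / factor))] for i in range(n_out)]


if __name__ == "__main__":
    unit = ["f0", "f1", "f2", "f3"]
    print(stretch_frames(unit, 2.0))  # every frame repeated once
    print(stretch_frames(unit, 0.5))  # every second frame dropped
```

Because the frames are parameter sets rather than waveform samples, repeating or dropping them changes duration without shifting pitch, which is what makes this adjustment cheap at the decoding stage.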

Waveform coders require two stages for synthesis. One is the decoding process that produces the reconstructed waveform. The second is the imposition of prosody on the reconstructed waveform, in particular a smoothing of the transitions between concatenated units. Until recently, computing power was insufficient to do this effectively.

There is a tradeoff between LPC and the various methods of waveform coding. With

waveform coding the speech units themselves sound natural, but the transition between

units is stilted. With LPC coding the transitions are smoother, but the speech has an

inescapable mechanical quality that listeners find both unnatural and unpleasant.

3.3.5 Concatenation of Stored Words

It is possible to concatenate words if care is taken to ensure that the recorded words have

intonation and rhythm that are consistent with the eventual use. A potential difficulty

with this technique is that it is extremely difficult to add to the vocabulary at a later date.

Some reasons for this are that the original speaker may not be available or the voice

quality of the speaker may have changed; even the acoustics of the recording studio or the

choice of microphone can make the newly recorded words identifiably different when

heard in the context of previously recorded utterances.

Table 3.2 shows the summary of the advantages and the pitfalls of units of Concatenative

Synthesis in terms of a sentence, a word, a syllable, an allophone, a diphone, and a demi-

syllable.

Table 3.2 Advantages and Disadvantages of Units of Concatenative Synthesis.

| No. | Concatenation Unit | Advantages | Disadvantages |
|---|---|---|---|
| 1 | Sentence | Naturalness throughout | Usually impractical to prerecord every sentence that might be needed. |
| 2 | Word | Naturalness within the word | Inter-word articulation may sound mechanical; usually impractical to prerecord all the words that might be needed. |
| 3 | Syllable | Naturalness within the syllable | Good inter-syllable articulation is difficult to achieve, making individual words unintelligible; usually impractical to prerecord all the syllables that might be needed, though not as seriously as with words or sentences. |
| 4 | Allophone | Naturalness within the allophone, especially vowels; far fewer units need to be prerecorded than with other choices | Good articulation between most allophones is difficult to achieve, making individual words unintelligible; stop consonant allophones are difficult to isolate for prerecording. |
| 5 | Diphone | Naturalness on the allophonic level; much better articulation between diphones than between allophones | Consonant clusters do not always sound natural; syllable lengths do not always sound natural; impractical to extract and prerecord all possible allodiphones from naturally spoken speech. |
| 6 | Demi-syllable | Naturalness on the syllable level; natural sounding consonant clusters; natural sounding syllable lengths | Articulation at syllable boundaries may sound unnatural, making individual words unintelligible, although not as seriously as with syllables; impractical to extract and prerecord all possible allodemisyllables from naturally spoken speech. |

Any concatenative sound synthesis system must perform the following tasks, some of which may be performed implicitly:

• Analysis: The source sound files are segmented into units and analyzed to express their characteristics with sound descriptors.

• Database: Source file references, units and unit descriptors are stored in a database. The subset of the database that is pre-selected for one particular synthesis is called the corpus.

• Target: The target specification is generated from a symbolic score (expressed in notes or descriptors), or analyzed from an audio score (using the same segmentation and analysis methods as for the source sounds).

• Selection: Units are selected from the database that best match the given target descriptors according to a distance function and a concatenation quality function. The selection can be local (the best match for each target unit is found individually) or global (the sequence with the least total distance is found).

• Synthesis: Synthesis is done by concatenating the selected units, possibly applying transformations.
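The difference between local and global selection can be illustrated with a small sketch. It rests on several assumptions made only for the example: each unit is reduced to a single numeric descriptor (for instance a pitch value), the distance function is an absolute difference, and the global search is a standard dynamic-programming (Viterbi-style) pass. All function names are hypothetical.

```python
from typing import List


def target_cost(unit: float, target: float) -> float:
    """Distance between a candidate unit and the target descriptor (assumed)."""
    return abs(unit - target)


def join_cost(left: float, right: float) -> float:
    """Concatenation cost between two adjacent units (assumed)."""
    return abs(left - right)


def select_local(targets: List[float], candidates: List[List[float]]) -> List[float]:
    """Pick the best match for each target independently."""
    return [min(c, key=lambda u: target_cost(u, t)) for t, c in zip(targets, candidates)]


def select_global(targets: List[float], candidates: List[List[float]],
                  join_weight: float = 1.0) -> List[float]:
    """Pick the sequence with the least total target plus join cost."""
    if not targets:
        return []
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j], back-pointer)
    best = [[(target_cost(u, targets[0]), -1) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            costs = [best[i - 1][k][0] + join_weight * join_cost(p, u)
                     for k, p in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_min] + target_cost(u, targets[i]), k_min))
        best.append(row)
    # Trace the cheapest path backwards from the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]


if __name__ == "__main__":
    targets = [100.0, 100.0]
    candidates = [[100.0, 105.0], [110.0]]
    print(select_local(targets, candidates))
    print(select_global(targets, candidates, join_weight=2.0))
```

In this example the local method keeps the unit that matches its own target exactly, while the global method accepts a slightly worse first unit because it joins more smoothly with the only available second unit.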

3.4 SUMMARY

This chapter has explained in detail what the concatenative synthesis method is. It has also described the different types of speech units, such as phonemes, diphones, and demi-syllables. The next chapter describes the Arabic language in detail.

CHAPTER FOUR

ARABIC LANGUAGE

4.1 OVERVIEW

This chapter introduces the Arabic language: its alphabet, and the aspects of Arabic morphology and prosody that are relevant to this dissertation.

The Arabic language, or simply Arabic, is the largest member of the Semitic branch of

the Afro-Asiatic language family and is closely related to Hebrew and Aramaic (Michel,

1970). It is spoken throughout the Arab world and is widely studied and known

throughout the Islamic world. Arabic has been a literary language since at least the 6th

century and is the liturgical language of Islam. Because of its liturgical role, Arabic has

lent many words to other Islamic languages, akin to the role Latin has in Western

European languages. During the Middle Ages Arabic was also a major vehicle of culture,

especially in science, mathematics and philosophy, with the result that many European

languages have also borrowed numerous words from it. The Arabic script is written from

right to left (Nicholas and Putros, 1986). Arabic is ranked as number four among the

world’s major languages, with 300 million native speakers of all dialects.

As humans, our capability to receive any form of communication is based on our sensory

organs: the ears, eyes, skin, nose, and taste buds. The reception of the communication

through a single sensory organ is referred to as a modality. Each of us has five modalities,

directly related to our senses. They are auditory/vocal, visual, tactile, olfactory, and

gustatory. The study of how humans perceive communication through these sensory

organs is called the science of semiotics (Clement, 1993). Specifically, semiotics

provides a theoretical approach and related techniques for analyzing the structure of all

forms and meaning of signals and signs in human communication received via these

modalities.

Most Semitic languages in both ancient and contemporary times are usually written

without short vowels and other diacritic marks, often leading to potential ambiguity.

While such ambiguity only rarely impedes proficient speakers, it can certainly be a

source of confusion for beginning readers and people with learning disabilities (Abu-

Rabia, 1999). Diacritization is even more problematic for computational systems, adding

another level of ambiguity to both analysis and generation of text. Dialectal Arabic refers to the dialects derived from Classical Arabic. These dialects can differ considerably: it may be a challenge for a Lebanese speaker to understand an Algerian speaker, and there are differences even within the same country.

As mentioned before there are many varieties of the Arabic language (Shafi, 1978), many

dialects that reflect social diversity of its speakers. Arabic can be sub-classified as

Classical Arabic, Eastern Arabic, Western Arabic and Maltese. Western Arabic

encompasses the Arabic spoken colloquially in the region of northern Africa, often

referred to as the Maghreb. Eastern Arabic includes the Arabic dialects spoken in the Middle East and parts of North Africa. Arabic speakers use Modern Standard Arabic (MSA) to

communicate across dialect groups. It is used in a situation where the native dialect will

not be understood.

In the Arabic alphabet, there are 29 letters, three of which are long vowels and the rest are

consonants. Each letter is given a name which contains the letter itself (Nicholas and

Putros, 1986). The characters of the Arabic alphabet are neither capital nor small; they

have one form only. Moreover, some of these letters are very similar to English letter

sounds e.g. “baa” is very close to the letter “B” in the English language; this is a useful

way to remember the sounds. However, many Arabic letters have no equivalent sound in

English e.g. “ein”, and some letters have subtle but important differences in

pronunciation, e.g. “haa” which is pronounced with a lot more emphasis in the throat than

the letter “H” in English. Also, please note that the Arabic script is read from right to left.

4.1.1 English and Arabic

These two languages are drastically different in both family and writing system. English

belongs to the Indo-European family, while Arabic is a typical Semitic language (Michel,

1970). The sound systems of the two languages are extensively different in consonants,

vowels, stress placement and the dynamics that govern the function of those linguistic

systems in real speech. In the area of consonants, Arabic has a series of back consonants

– uvular, pharyngeal and emphatic – that have no existence in English, whatsoever. These

sounds heavily color the phonetic setting of Arabic compared to that of English.

Additionally, English vowel system is dominated by vowel quality/quantity reduction as

opposed to only some vowel quantity [length] reduction in Arabic. Concerning stress

placement, the rules in Arabic are more systematic and well defined than in English. In

Arabic, the rule of long syllable, primarily due to a long vowel, is very powerful and is,

therefore, a source of stress misplacement for Arab learners of English (Shafi, 1978).

Concerning the writing system, English has a Latin-based orthography, while Arabic has its own Aramaic-based orthography, which is very different in the way it represents sounds. English has hardly any diacritical marks, whereas in Arabic slightly more than half of the consonants are distinguished by one or more dots, written above or below the letter, which set them apart from the undotted consonants, as shown in Table 4.1 below:

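Table 4.1 Arabic dotted and undotted alphabet characters

| Arabic letter | Arabic name | English sound | English name | Form |
|---|---|---|---|---|
| ا | - | ă | alif | Undotted |
| ب | باء | b | baa | Dotted |
| ت | تاء | t | taa | Dotted |
| ث | ثاء | Th | thaa | Dotted |
| ج | جيم | j | jeem | Dotted |
| ح | حاء | h`` | haa | Undotted |
| خ | خاء | kh | khaa | Dotted |
| د | دال | d | daal | Undotted |
| ذ | ذال | th | zaal | Dotted |
| ر | راء | r | raa | Undotted |
| ز | زاء | z` | zaa | Dotted |
| س | سين | s | seen | Undotted |
| ش | شين | sh | sheen | Dotted |
| ص | صاد | s` | saad | Undotted |
| ض | ضاد | Dh`` | dhaad | Dotted |
| ط | طاء | th`` | thaa | Undotted |
| ظ | ظاء | z`` | zaa | Dotted |
| ع | عين | --- | ein | Undotted |
| غ | غين | Gh | ghein | Dotted |
| ف | فاء | f | faa | Dotted |
| ق | قاف | Q | qaaf | Dotted |
| ك | كاف | k | kaaf | Undotted |
| ل | لام | L | laam | Undotted |
| م | ميم | m | meem | Undotted |
| ن | نون | n | noon | Dotted |
| ه | هاء | h` | haa | Undotted |
| و | واو | w or u | waow | Undotted |
| ء | همزة | A` | hamza | Undotted |
| ي | ياء | Y | yaa | Dotted |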

4.2 DEFINITION OF AN ARABIC WORD

As morphology is concerned with the analysis of words, it is necessary first to define the term word. As words occur in written text, a word is defined here as any text editing program would define it: the alphanumeric string between any two non-alphanumeric characters (Clement, 1987). An Arabic word is a word, as defined above, which meets the following two conditions:

• All its characters are bare or diacritized Arabic letters. The pronunciation of such words cannot be fully determined by their spelling characters alone.

• It belongs to one of the following two categories:

    o The original Arabic words.

    o The Arabized words.
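The first condition of this definition can be sketched with Python's standard `re` module. The character ranges are assumptions chosen for the example: U+0621 to U+064A for the basic Arabic letters and U+064B to U+0652 for the diacritic marks. Diacritics are treated as part of the word, since they are combining marks rather than alphanumeric characters. The second condition (original or Arabized) needs a lexicon and is not covered here.

```python
import re

# Assumed character classes (Unicode Arabic block).
ARABIC_LETTERS = "\u0621-\u064A"     # hamza .. yaa
ARABIC_DIACRITICS = "\u064B-\u0652"  # tanween marks, short vowels, shadda, sokoon

# A word is a maximal run of alphanumeric characters; Arabic diacritics are
# allowed inside the run so that diacritized words are not split apart.
TOKEN_RE = re.compile(rf"[0-9A-Za-z{ARABIC_LETTERS}{ARABIC_DIACRITICS}]+")

# Condition 1: every character is a bare or diacritized Arabic letter.
ARABIC_WORD_RE = re.compile(rf"[{ARABIC_LETTERS}][{ARABIC_LETTERS}{ARABIC_DIACRITICS}]*")


def tokenize(text: str) -> list:
    """Split text into words at non-alphanumeric characters."""
    return TOKEN_RE.findall(text)


def is_arabic_script_word(token: str) -> bool:
    """Check only the first condition of the Arabic-word definition."""
    return ARABIC_WORD_RE.fullmatch(token) is not None


if __name__ == "__main__":
    sample = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u0644\u062F, ok 42"
    for tok in tokenize(sample):
        print(tok, is_arabic_script_word(tok))
```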

Original Arabic words are divided in turn into two sub-categories:

• Derivative Arabic words – These are the verbs and nouns that are built according to the Arabic derivation rules. The vast majority of Arabic words belong to this category.

• Fixed Arabic words – These are a set of words coined by Arabs that do not obey the Arabic derivation rules. Most of these fixed words are neither verbs nor nouns; most are function words such as pronouns, prepositions, conjunctions, question words, and the like. They tie the words of the Arabic sentence together. The category of fixed Arabic words has a limited number of members (approximately 260 significant fixed words, according to one study).

The Arabized words are nouns borrowed from foreign languages (perhaps with some

phonetic adjustments to suit the Arabic pronunciation) and have become common

among the native Arabic speakers. To preserve the purity of the Arabic language, it is

not preferable to consider a word in this category unless its meaning has no

counterpart in the category of the original Arabic words.

Figure 4.1 below summarizes the definition of Arabic words:

Figure 4.1 The classification of the Arabic words (Raja, 1979).

Although the number of derivative Arabic words is much larger than that of the fixed Arabic and Arabized words, the frequency of the latter, especially the fixed words, is so high that any treatment of Arabic must treat all the above categories with the same degree of care. An example of fixed words in an Arabic text:

�� !� �"#$%� &$��!�� ' (")*+ , -.# �/�*�� 012��34 . 617��,89: �"#$; 3��: ' �� <= �$.)>.

Total number of words in the paragraph = 23

Number of fixed words in the paragraph = 8

Frequency of fixed Arabic words in the paragraph = 8/23 ≈ 35%
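The frequency calculation above can be reproduced for any tokenized text with a small helper. The fixed-word set used here is a tiny illustrative sample of common function words, not the list of roughly 260 fixed words referred to in the text, and the function name is hypothetical.

```python
# Illustrative sample only: fi (in), min (from), ala (on), hatha (this).
SAMPLE_FIXED_WORDS = {
    "\u0641\u064A",
    "\u0645\u0646",
    "\u0639\u0644\u0649",
    "\u0647\u0630\u0627",
}


def fixed_word_frequency(words, fixed_words=SAMPLE_FIXED_WORDS) -> float:
    """Fraction of the words in a text that belong to the fixed-word set."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in fixed_words)
    return hits / len(words)


if __name__ == "__main__":
    # The figures quoted in the text: 8 fixed words out of 23.
    print(f"{8 / 23:.1%}")
    # "hatha al-kitab fi al-bayt" (this book is in the house): 2 of 4 words are fixed.
    sentence = ["\u0647\u0630\u0627", "\u0627\u0644\u0643\u062A\u0627\u0628",
                "\u0641\u064A", "\u0627\u0644\u0628\u064A\u062A"]
    print(fixed_word_frequency(sentence))
```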

Arabic word
    Original
        Fixed
        Derivative
    Arabized

4.3 ARABIC IS A DIACRITIZED LANGUAGE

In some languages, like English, the pronunciation of a word is almost always fully determined by its constituent characters, with the written vowels determining the corresponding sound when the word is pronounced. Such languages are called non-diacritized languages. On the other hand, there are languages, like Latin, where the pronunciation of words cannot be fully determined by their spelling characters alone. In such languages, two different words may have identical spelling while their pronunciations and meanings are totally different. To remove this ambiguity, special marks are put above or below the spelling characters to determine the correct pronunciation (Mitchell, 1993). These marks are called diacritics, and a language that uses them is called a diacritized language.

Arabic is also a diacritized language. In fact, Arabic has the most elaborate diacritization

system. Table 4.2 shows the Arabic diacritics and the significance of each one. Each

character in an Arabic word must be assigned two things about diacritics:

• The Shadda state of character. (With Shadda / without Shadda)

• The diacritic of the character.

These are called the diacritic information of the character.

Unfortunately, in today's Arabic writing, people do not explicitly write diacritics. They depend on their knowledge of the language and the context to supply the missing diacritics while reading a non-diacritized text. Diacritics are written only when a severe ambiguity is feared or for educational purposes.

Table 4.2 The Arabic diacritics and the significance of each one

| Diacritic | Name | Sounds like | Example | Comments |
|---|---|---|---|---|
| ـَ | Fateha | a | [illegible] | - |
| ـُ | Damma | u | [illegible] | - |
| ـِ | Kasra | i | [illegible] | - |
| ـْ | Sokoon | A non-vowelized consonant | [illegible] | - |
| ـً | Tanween fateha | Fateha + n | [illegible] | Only the last character may be assigned this diacritic. |
| ـٌ | Tanween damma | Damma + n | [illegible] | Only the last character may be assigned this diacritic. |
| ـٍ | Tanween kasra | Kasra + n | [illegible] | Only the last character may be assigned this diacritic. |
| ا ، و ، ي | Vowel | Long (a), (i) or (u) vowel | [illegible] | - |
| ى | Alif leyna | Long (a) vowel | [illegible] | Only a terminal ى may be assigned this diacritic. |
| [illegible] | Bypassed character | Not pronounced | [illegible] | - |
| | Hidden Alif vowel | Long (a) | [illegible] | - |
| ـّ | Shadda | | [illegible] | In fact, shadda is not a diacritic but a mark of doubling the character while pronouncing it. The character with a shadda needs another diacritic to determine the vowel. |

An automatic morphological analyzer must consider diacritics in its model of Arabic

words and must have some mechanism of figuring out the missing diacritics of a given

Arabic word. The three diacritization states of Arabic word are:

i) Full diacritization: It is the assignment of all the diacritic information for

each character in the word including the last one. In Arabic, the

diacritization of the last character sometimes depends on the syntactic

analysis of the word within its sentence.

ii) Half diacritization: It is the same as full diacritization except that it

does not provide the diacritic mark of the last character if it depends on the

syntactic analysis of the word. As the morphological analysis deals with

words one by one and does not analyze the sentence as a whole, it can

only be hoped to provide half diacritization.

iii) Partial diacritization: Any other diacritization state of the word that

provides less diacritic information than half diacritization is called partial

diacritization.
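The three states can be approximated programmatically. The sketch below is a rough heuristic under stated assumptions: a letter counts as diacritized if it carries at least one mark other than a lone shadda, and the letters alif, waow, yaa and alif leyna are allowed to be bare because they may act as long vowels. It cannot tell whether the missing final mark truly depends on syntax, so it reports "half" whenever only the last letter is unmarked. All names are hypothetical.

```python
import re

LETTERS = "\u0621-\u064A"  # assumed range of basic Arabic letters
MARKS = "\u064B-\u0652"    # assumed range of diacritic marks, including shadda
SHADDA = "\u0651"
# Assumption: these letters may legitimately carry no mark (long vowels).
MAY_BE_BARE = {"\u0627", "\u0648", "\u064A", "\u0649"}

UNIT_RE = re.compile(rf"([{LETTERS}])([{MARKS}]*)")


def strip_diacritics(word: str) -> str:
    """Remove all diacritic marks, leaving the bare spelling."""
    return re.sub(rf"[{MARKS}]", "", word)


def diacritization_state(word: str) -> str:
    """Return 'full', 'half' or 'partial' for a single Arabic word."""
    units = UNIT_RE.findall(word)
    if not units:
        raise ValueError("no Arabic letters found")
    marked = [
        letter in MAY_BE_BARE or any(m != SHADDA for m in marks)
        for letter, marks in units
    ]
    if all(marked):
        return "full"
    if all(marked[:-1]):
        return "half"      # only the final letter lacks its diacritic
    return "partial"


if __name__ == "__main__":
    kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # k-a t-a b-a
    print(diacritization_state(kataba))
    print(diacritization_state(kataba[:-1]))
    print(diacritization_state(strip_diacritics(kataba)))
```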

An Arabic word is in general complex: it may correspond to a single word or to a whole sentence. From a study of a sufficiently large sample of Arabic text, the following simple structure of Arabic words is inferred:

• The main part of the word, a noun or a verb, occurs in the middle. It is called the word's body.

• The body may be prefixed by something like the definite article, a preposition, a gender determiner, a tense determiner and so on, or some combination of them. When a prefix precedes a body, it may slightly modify the body's string and also be slightly modified itself. The prefix cannot be a standalone word.

• The body may also be suffixed by something like a pronoun, a gender determiner, a tense determiner and so on, or some combination of them. When a suffix follows a body, it may slightly modify the body's string and also be slightly modified itself. The suffix cannot be a standalone word.

If the absence of a prefix is treated as a null prefix and the absence of a suffix as a null suffix, the structure of an Arabic word can be generalized as (Salman and Jacob, 1980):

Any Arabic word = Prefix Body Suffix

The prefix or suffix can add an entity to the noun or the verb. For example, the prefix

may be a preposition and the suffix may be a pronoun.
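The Prefix Body Suffix structure suggests a simple way to enumerate candidate analyses of a word. The following sketch is illustrative only: the prefix and suffix lists are tiny hypothetical samples, the empty string plays the role of the null prefix or suffix, and the slight modifications that can occur at the boundaries are ignored.

```python
from typing import List, Tuple

# Hypothetical sample affixes; "" is the null prefix / null suffix.
PREFIXES = [
    "",
    "\u0627\u0644",        # al-   (definite article)
    "\u0648",              # wa-   (and)
    "\u0628",              # bi-   (with, by)
    "\u0648\u0627\u0644",  # wa-al-
    "\u0628\u0627\u0644",  # bi-al-
]
SUFFIXES = [
    "",
    "\u0647",              # -hu   (his)
    "\u0647\u0627",        # -ha   (her)
    "\u0647\u0645",        # -hum  (their)
    "\u0648\u0646",        # -uun  (plural)
    "\u0627\u062A",        # -aat  (plural)
]


def candidate_splits(word: str, min_body: int = 2) -> List[Tuple[str, str, str]]:
    """Enumerate every (prefix, body, suffix) split allowed by the affix lists."""
    splits = []
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in SUFFIXES:
            if not rest.endswith(suffix):
                continue
            body = rest[:len(rest) - len(suffix)]
            if len(body) >= min_body:
                splits.append((prefix, body, suffix))
    return splits


if __name__ == "__main__":
    word = "\u0648\u0627\u0644\u0643\u062A\u0627\u0628"  # wa-al-kitab (and the book)
    for prefix, body, suffix in candidate_splits(word):
        print(repr(prefix), repr(body), repr(suffix))
```

A real analyzer would then keep only the splits whose body is found in a lexicon of nouns and verbs; the enumeration step alone cannot decide which split is correct.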

4.4 INTRODUCTION TO VOWELS

There are two sets of vowels in Arabic (Salman and Jacob, 1980): short vowels and long vowels. It takes about twice as long to produce a long vowel as a short one. There is a tendency in English to obscure vowels in unstressed syllables; this tendency has to be overcome when speaking Arabic. In general, the Arabic vowels are pronounced more crisply, clearly and tensely than the English vowels. Syllables are the building blocks of speech and come in three types: consonant-vowel, vowel-consonant, and consonant-vowel-consonant.

That is to say, a syllable may be formed by a consonant followed by a vowel, as in the word “TO”; a vowel followed by a consonant, as in the word “OF”; or two consonants with a vowel in between, as in the word “FOR”. In Arabic, only types one and three are used. Each vowel has a name, the vowels collectively have a name, the letters that contain a vowel have a name, letters are named specifically depending on which vowel they hold, and doubled vowels are given names. The first row of each chart contains the examples, the second row shows the individual names of the vowels, the third row shows the adjective used to describe the letter which carries the given vowel, the fourth row shows the English equivalents, and the column to the right shows the collective name for the group.

As a corollary to the restricted vowel quality in Arabic, its diphthongs are limited in

number. In fact, some linguists treat the so-called diphthongs as combination of simple

vowels [i.e., abutting vowels] rather than as blended clusters of vocalic elements (Raja,

1979). The five-parameter description of the system is as follows:

1. Place: Has three places [front, center, back]; each place yields one vowel; the

system has no neutral vowel (schwa), [ə].

2. Stricture: Has two tongue heights [high, low] or [close, open]; two vowels are high

and one is low.

3. Tense/ lax: Is contrastive in combination with length.

4. Lip-position: Has three distinct lip positions [round, spread, neutral] with one

degree of rounding and spreading; rounding is restricted to back vowels and

spreading to front ones.

5. Oral/ nasal: No contrasts; has only oral vowels.

4.4.1 Short Vowels

Unlike English, the short vowels in the Arabic writing system are not written as letters within the word. They are indicated by “signs”: / ـَ / and / ـُ /, which are placed over the consonant, and / ـِ /, which is placed below the consonant, according to the pattern of the word. These signs represent the sounds / a /, / u / and / i /, respectively. When a sign is used with a consonant, the vowel is pronounced after the consonant. The vowel marks or signs are always placed over or below the first half of the consonant.
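In Unicode text a short-vowel sign is stored as a combining mark immediately after the consonant that carries it, so the letter “baa” followed by the fathha code point yields “ba”, as in the examples in this section. The sketch below shows this for a small, illustrative subset of consonants; the romanization table and the function name are assumptions made for the example.

```python
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"

# The three short-vowel signs and the sounds they represent.
VOWEL_SOUND = {FATHA: "a", DAMMA: "u", KASRA: "i"}

# Illustrative subset of consonants: baa, taa, raa.
CONSONANT_SOUND = {"\u0628": "b", "\u062A": "t", "\u0631": "r"}


def romanize(text: str) -> str:
    """Replace known consonants and short-vowel signs by Latin letters."""
    out = []
    for ch in text:
        if ch in CONSONANT_SOUND:
            out.append(CONSONANT_SOUND[ch])
        elif ch in VOWEL_SOUND:
            out.append(VOWEL_SOUND[ch])
        else:
            out.append(ch)  # leave anything else untouched
    return "".join(out)


if __name__ == "__main__":
    baa = "\u0628"
    for sign in (FATHA, DAMMA, KASRA):
        print(romanize(baa + sign))
```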

4.4.1.1 The First Short Vowel

/ u / A high back rounded short vowel which is similar to the English “o” in the words “to” and “do”. This vowel mark is placed above the consonant and its shape resembles an English comma / ـُ /. It is called a “dhamma”.

The first short vowel is known as the dhamma. The letter that holds the dhamma is known as madhmoom. The dhamma is one of the three diacritics, and a letter that holds one of these diacritics is known as mutaharrik. The dhamma is the equivalent of the English letters “o” and “u”. For example, the letter “baa” with a dhamma, بُ, is pronounced “bu”.

4.4.1.2 The Second Short Vowel

/ a / A short un-rounded low central vowel which is similar to the English “a” in the word “fat”. However, it is even shorter in duration. It is symbolized by a short diagonal stroke / ـَ / written above the consonant. It is called a “fathha”.

Any letter of the alphabet may be given a fathha. This vowel is placed atop the letter. The letter which holds the fathha is known as maftooh. This vowel is the equivalent of the English letter “a”. For example, the letter “baa” with a fathha, بَ, sounds like “ba”.

4.4.1.3 The Third Short Vowel

/ i / A high front un-rounded vowel which is similar to the English “i” in the words “sin” and “sit”. This vowel, like the / a /, is symbolized by a short diagonal stroke / ـِ /, only it is placed below the consonant and not above. It is called a “kasra”.

The kasra is one of the diacritics. The letter that holds the kasra under it is known as maksoor. This vowel is the equivalent of the English letters “e” and “i”. For example, the letter “baa” with a kasra beneath it, بِ, sounds like “be”. Please become familiar with the sounds associated with these letters when they are maksoor.

4.4.1.4 The First Short Vowel Doubled

Each diacritic may be doubled in order to add the sound of the letter “n” to the end of the word. For example, the letter “baa” with two dhammas, بٌ, sounds like “bun” or “bon”. When a vowel is doubled, it is called a tanween. The letter that holds this tanween is known as munawwan. Since all three harakat have this capability of doubling, they are given separate names: a doubled dhamma is called a dhammatein, a doubled fathha is called a fathhatein, and a doubled kasra is called a kasratein.

This is a feature of the Arabic language that few, if any, other languages share. It is primarily used to differentiate between nouns and other parts of speech: a noun in Arabic may or may not have a tanween, but verbs and particles never have it. Also, the tanween occurs only on the last letter of the word; it may not appear on a letter in the middle or at the beginning.

4.4.1.5 The Second Short Vowel Doubled

A doubled fathha is known as a fathhatein. When a letter such as “raa” is given a fathhatein, it sounds like “run”, as in the word “run” in English. There is an important observation to make about the fathhatein: when a letter is written with this vowel, it is always followed by the letter “alif”. Each letter in the alphabet below will have the letter “alif” after it, and the fathhatein will be written atop this “alif” (i.e., راً). This is only for script purposes. When writing in Arabic, this is how a fathhatein is used.

4.4.1.6 The Third Short Vowel Doubled

The third vowel is the kasra, known as the kasratein when doubled. It is the equivalent of the English letters “e” and “i” with the addition of the letter “n”. For example, the letter “baa” with a kasratein beneath it, بٍ, sounds like “bin”.
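The tanween rules described in the three subsections above (a doubled vowel adds an “n” sound, it occurs only at the end of the word, and a fathhatein is accompanied by a supporting “alif”) can be sketched as a small check. The code points are the Unicode tanween marks; the function name and the decision to accept the fathhatein either before or atop the final “alif” are assumptions made for this example.

```python
FATHATEIN, DHAMMATEIN, KASRATEIN = "\u064B", "\u064C", "\u064D"
ALIF = "\u0627"

TANWEEN_SOUND = {FATHATEIN: "an", DHAMMATEIN: "un", KASRATEIN: "in"}


def tanween_ending(word: str):
    """Return 'an', 'un' or 'in' if the word ends in a tanween, else None."""
    core = word
    stripped_alif = False
    if core.endswith(ALIF):
        # A fathhatein may be written on the letter before its supporting alif.
        core = core[:-1]
        stripped_alif = True
    if not core or core[-1] not in TANWEEN_SOUND:
        return None
    if stripped_alif and core[-1] != FATHATEIN:
        return None  # only the fathhatein takes a supporting alif
    return TANWEEN_SOUND[core[-1]]


if __name__ == "__main__":
    baa, raa = "\u0628", "\u0631"
    print(tanween_ending(baa + DHAMMATEIN))
    print(tanween_ending(raa + ALIF + FATHATEIN))
    print(tanween_ending(baa + KASRATEIN))
```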

4.4.2 Long Vowels

In contrast to the short vowels, the long vowels in the Arabic writing system are

represented by letters of the alphabet and not by signs. Although their pronunciation is

similar to that of the short vowels, the sound is prolonged, that is, it is held longer. With

/ a / and / aa / there is not only a quantity difference but a substantial quality difference as

well. One must be careful to pronounce them as pure long vowels without diphthong

quality. In ordinary speech, when they occur at the end of a word, they are shortened. The

long vowels are explained below. In as much as they are represented by letters and not by

signs, the variations of their written forms will be described in detail later.

4.4.2.1 The First Long Vowel

/ aa / A long un-rounded low central back vowel which is similar in pronunciation to

the Arabic short vowel / a / but rather longer in duration. It is similar to the

English “a” of “father”. The / aa / tends to be lower and further back than the / a /.

Apart from the three short vowels, the Arabic alphabet contains three letters which are often considered to be long vowels, because they elongate and emphasize the short vowel on the letter before them. The three long vowels are the “alif”, “waow”, and “yaa”.

As for the “alif”, it is always empty of vowels. Whenever we see the “alif” with a vowel,

this letter is not an “alif”; rather it is a “hamza”. This letter always necessitates a fathha

before it. That is to say that we will never see an “alif” before which there is a dhamma or

a kasra. The job of this letter is to lengthen the stretch of the fathha on the letter before it,

and it is for this reason that it is known as a long vowel. For example, the letter “baa”

with a fathha atop it sounds like “ba”. But adding the “alif” causes it to sound like “baa”.

The other two long vowels are the “waow” and the “yaa”. These letters are not always

considered to be long vowels. This is because they may have vowels themselves; in this

case, they are treated as consonants. However, if these letters are empty of all vowels,

they may be considered to be long vowels. When they are empty of vowels, there are two

situations; either they are preceded by their appropriate haraka (dhamma for “waow”, and

kasra for “yaa”), or they have a fathha before them. Never will the “waow” be preceded

by a kasra.

In the case of being preceded by a fathha, the letters are not considered long vowels; they

are called “waow” leen and “yaa” leen. But if they are preceded by their appropriate

vowels, they are considered long vowels and they are called “waow” maddah and “yaa”

maddah. They emphasize the sound of the preceding haraka. For example, the letter

“baa” with a dhamma on it sounds like “bu” and with a kasra on it sounds like “be”.

Adding a “waow” to the former and a “yaa” to the latter causes the letter to sound like

“buu” and “bee”. The “alif”, “waow” maddah, and “yaa” maddah may occur in the

middle of words. The same applies for the “waow” leen and “yaa” leen. The leen will not

be discussed as they are like any other letter.
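The rules above for deciding when the “alif”, “waow” and “yaa” act as long vowels can be summarized in a short sketch. The Python function below is illustrative only: the function name, the returned labels, and the use of `None` for a letter that is empty of vowels are assumptions made for this example and are not part of the system described in this dissertation.

```python
# Haraka (short vowel) that each letter lengthens, as described in the text.
APPROPRIATE_HARAKA = {"alif": "fathha", "waow": "dhamma", "yaa": "kasra"}


def classify_letter(letter, own_haraka, preceding_haraka):
    """Classify an alif, waow or yaa from its own and the preceding haraka.

    own_haraka is None when the letter is empty of vowels.
    Returns one of: "long vowel", "leen", "consonant", "hamza".
    """
    if letter not in APPROPRIATE_HARAKA:
        raise ValueError("only alif, waow and yaa are handled here")

    if letter == "alif":
        # An "alif" carrying a vowel is really a hamza.
        if own_haraka is not None:
            return "hamza"
        # An alif is always preceded by a fathha.
        if preceding_haraka != "fathha":
            raise ValueError("an alif must be preceded by a fathha")
        return "long vowel"

    # waow / yaa carrying their own vowel behave as consonants.
    if own_haraka is not None:
        return "consonant"
    # Empty of vowels and preceded by the appropriate haraka: maddah.
    if preceding_haraka == APPROPRIATE_HARAKA[letter]:
        return "long vowel"
    # Empty of vowels and preceded by a fathha: leen.
    if preceding_haraka == "fathha":
        return "leen"
    raise ValueError("combination not described in the text")
```

For instance, `classify_letter("waow", None, "dhamma")` corresponds to the “buu” example, while `classify_letter("yaa", None, "fathha")` corresponds to a “yaa” leen.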

4.4.2.2 The Second Long Vowel

The “waow” may have a vowel of its own; in this case, it is not a long vowel. Also, it

may be empty of the three vowels but preceding it may be a fathha; in this case too, it is

not a long vowel. The final situation is where the “waow” is empty of the three vowels

and preceding it is its appropriate vowel; the dhamma; this “waow” is a long vowel.

/ uu / A high back rounded long vowel which is similar to the English “oo” in the word

“tool”. However, it is much longer in duration.

4.4.2.3 The Third Long Vowel

The final long vowel is the letter “yaa”. This letter corresponds to the kasra and it

enhances and emphasizes the sound of this vowel just as the “alif” emphasized the fathha

on the preceding letter and the “waow” emphasized the dhamma on the preceding letter.

/ ii / A high front un-rounded vowel which is similar to the English “ee” in the words

“seen” and “bee”, but which is longer in duration and has no diphthongization.

As mentioned, the “yaa” may be mutaharrik itself and thus will not be considered to be a

long vowel. Also, it may be without a vowel but preceded by a maftooh letter. In this

situation, too, the “yaa” is not a long vowel. The “yaa” maddah is that “yaa” which has

no vowel of its own and is preceded by a maksoor letter.

4.5 SYLLABLES OF ARABIC LANGUAGE

4.5.1 Syllable Structure

The sounds of Arabic are divided into syllabic and non-syllabic entities (Wright, 1974).

The three short vowels and their long counterparts always form the syllable nucleus. The

syllable nucleus is the segment that stands out and has more prominence in an utterance.

The vowels always form the syllable nucleus and all of the consonants represent the

marginal elements including / y / and / w / of the syllable structure.

Inasmuch as there is a clear-cut division between vowels and consonants, there is no need to mark syllabicity. In accordance with this division, the number of syllables in a word is identical to the number of vowels. Every Arabic word begins with a single consonant, and every syllable within a word also begins with a single consonant. Therefore, whenever a consonant cluster or a double consonant occurs in the middle of a word, the syllable division falls either between the consonants of the cluster or between the two identical consonants. Syllable division is thus predictable and automatic.
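Because every syllable begins with a single consonant and contains exactly one vowel, the division rule above can be sketched in a few lines. The Python function below is a minimal illustration, not the parser used in this system; it assumes the word has already been converted to a list of transliterated phonemes, and the vowel inventory and the sample words are chosen only for the example.

```python
# Short vowels and their long counterparts, in the transliteration of this chapter.
VOWELS = {"a", "i", "u", "aa", "ii", "uu"}


def syllabify(phonemes):
    """Split a list of phonemes into syllables.

    Each syllable starts at the single consonant that immediately
    precedes a vowel; trailing consonants stay with the last syllable.
    """
    starts = []
    for i, phoneme in enumerate(phonemes):
        if phoneme in VOWELS:
            if i == 0 or phonemes[i - 1] in VOWELS:
                raise ValueError("every syllable must begin with a consonant")
            starts.append(i - 1)
    if not starts or starts[0] != 0:
        # Covers words with no vowel and words starting with two consonants.
        raise ValueError("a word must begin with a single consonant plus vowel")
    bounds = starts + [len(phonemes)]
    return [phonemes[a:b] for a, b in zip(bounds, bounds[1:])]
```

With this sketch, a medial cluster is split between its two consonants, so the phoneme list `["m", "a", "k", "t", "a", "b"]` yields the two syllables `mak` and `tab`, and the number of syllables always equals the number of vowels.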

4.5.2 Syllable Patterns

There are five syllable patterns (Michel, 1970). In their representation, C stands for all of

the consonants, V for short vowels and VV for long vowels. The long vowels (VV) are

always considered as mono-phthongs since there are no vowel clusters in the Arabic

sound stream. The five patterns are:

Table 4.3 Arabic Syllable Patterns

Pattern   Example in Arabic   Pronunciation   Meaning in English
CV        بِ                   Bi              “in”
CVC       تُبْ                  Tub             “Repent”
CVV       يَا                  Yaa             “O” (vocative particle)
CVVC      بَاب                 Baab            “door”
CVCC      وَثْب                 Waθb            “jumping”

4.5.2.1 Short and Long Syllables

A short syllable can be defined as any short vowel immediately preceded by a single

consonant. The CV pattern, listed in the above classification, is to be considered as a

short syllable. The remaining four syllable patterns are considered as long syllables.

4.5.2.2 Closed and Open Syllables

An open syllable ends with a vowel and includes the patterns CV and CVV. A closed

syllable is any syllable that ends with a consonant(s). Closed syllables include the CVC,

CVVC and CVCC patterns.
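The two classifications above (short versus long, open versus closed) follow directly from the pattern string, as the following sketch shows. The function name and the returned labels are assumptions made for this illustration.

```python
# The five syllable patterns listed in Table 4.3.
PATTERNS = {"CV", "CVC", "CVV", "CVVC", "CVCC"}


def describe_pattern(pattern):
    """Return (length, openness) for one of the five syllable patterns."""
    if pattern not in PATTERNS:
        raise ValueError("unknown syllable pattern: " + pattern)
    # Only CV is short; the remaining four patterns are long.
    length = "short" if pattern == "CV" else "long"
    # Open syllables end with a vowel, closed ones with consonant(s).
    openness = "open" if pattern.endswith("V") else "closed"
    return length, openness
```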

4.5.2.3 Ending a Syllable with a Consonant

There are three ways to construct a syllable. They are (1) consonant vowel, (2) vowel

consonant, and (3) consonant vowel consonant. In the first instance, the syllable ends in a

vowel and this is represented in the Arabic language by the consonant letter of the

syllable having one of the three vowels. The second instance is not an option in the

Arabic language because vowels do not precede consonants; rather they follow them. As

for the third instance, we do not yet know how to construct that syllable whose last letter

is a consonant.

When a syllable ends in a consonant, the consonant is free of all vowels. In Arabic, this is

represented by means of a special symbol atop the letter. This symbol is called a sukoon

and the letter which holds this symbol is called saakin. Therefore, the letter ‘M’ in the

word “FROM” would be transliterated with a sukoon atop the letter “meem”. Only the

sound of the letter would be pronounced just as we pronounce the sound of the letter ‘M’

when we recite this word. Therefore, saakin letters can occur in the middle or at the end

of a word.

Notice that the word “from” is strange in that the first letter of this word is saakin. Only

the sound of the ‘F’ is pronounced and there are no vowels surrounding it. It is as if the

‘F’ is a syllable of its own. In Arabic, this is an intolerable situation; we cannot initiate

pronunciation with a saakin letter.

In English, the ‘f’ is incorporated into the other syllable. There is no problem in saying

that this letter combined with the ‘R’ together act as the first consonant of this syllable. In

Arabic, however, the ‘F’ would be preceded by a consonant-vowel pair. This consonant-

vowel pair would connect with the ‘f’ forming a new syllable. The word ‘FROM’ would

be changed to something like “IFROM” with a hamza maksoor preceding the “faa” in

order to complete the incomplete syllable (the ‘f’); the vowel of the “hamza” varies. The

broken syllable was ‘f’; it has now been changed to a sound syllable type three because

that is the only type of syllable in Arabic which ends in a consonant. It is type three

because there is a consonant (the ‘hamza’), a vowel (the kasra beneath the ‘hamza’), and

the “faa” of the original word.

This is a common problem when Arabs speak or write. Often, a syllable ends in a

consonant, and the next syllable starts with a consonant. This is not a problem for us,

however, as our goal is to learn how to read and scribe Arabic, not to speak it.

Below are a few examples of saakin letters in the middle of syllables, at the end, and

syllables which are created in order to complete incomplete syllables. Notice that the

third example in the first table has “type 3” listed twice. This is because there is a syllable

between the first two letters and the vowel in between and there is a syllable between the

last letter, its vowel, and the ‘N’ sound which is made by the tanween. Also notice that in

the last two examples, there was an incomplete syllable which was completed by adding a

‘hamza’ in the beginning of the word.

[Arabic example words not recoverable] type 3, type 3 | type 3 | type 3

[Arabic example words not recoverable] incomplete, type 3 | incomplete, type 1 | type 3, type 1, type 1

4.6 SUMMARY

In summary, the Arabic language, or simply Arabic, is the largest member of the Semitic branch of the Afro-Asiatic language family. Arabic has three vowels, /a/, /i/ and /u/, and each has two forms: the short vowel and the short vowel doubled, which is the long vowel. The next chapter presents the architectural design of the Arabic TTS Synthesizer System.

CHAPTER FIVE

SYSTEM ANALYSIS AND DESIGN

5.1 OVERVIEW

A system is always developed for a purpose; this purpose is to provide functionality, or

behavior, that will satisfy the needs and wishes of clients and users. In this chapter, the

researcher discusses the analysis and the design of the TTS system. Analysis is the first part of system development, in which we begin to understand in depth the needs of the system, and it involves a substantial amount of effort. The second part of system development is software design, which is discussed in detail in later sections of this chapter. Figure 5.1 shows the use case diagram of the Arabic TTS Synthesizer System.

Figure 5.1 Use-Case Diagram for TTS Synthesizer System

[Diagram. Actors: Admin, User. Use cases: Open File, Input Text, Synthesize Text, Clear Text, Update and Delete, Compare Sound, Normalize Text, Concatenate Text, Text Parser, Record Sounds. Six <<extend>> relationships link the use cases.]

Tables 5.6 to 5.15 below present the specifications of all the use cases:

Table 5.6 Description of Open File Use Case.

Use case: 1 Open File
Actors: Admin and User
Pre-conditions: None
Post-conditions: The Open File dialogue box will appear
Basic flow: The user and/or the admin first open the TTS Synthesizer System and choose either Arabic or English
Alternative flows: None
Special requirements: The operating system that runs the TTS Synthesizer System must support the Arabic language, so that the user can type the text using Arabic letters
Use case relationships: None

Table 5.7 Description of Input Text Use Case.

Use case: 2 Input Text
Actors: Admin and User
Pre-conditions: None
Post-conditions: The text input by the user and/or the admin will appear in the text box provided
Basic flow: The user and/or the admin type the text to be converted into speech in the text box provided
Alternative flows: None
Special requirements: The operating system that runs the TTS Synthesizer System must support the Arabic language, so that the user can type the text using Arabic letters


Table 5.8 Description of Clear Text Use Case.

Use case: 3 Clear Text
Actors: Admin and User
Pre-conditions: None
Post-conditions: The text input by the user and/or the admin will be cleared from the text box
Basic flow: The user clicks the clear button
Alternative flows: None
Special requirements: None


Table 5.9 Description of Synthesize Text Use Case.

Use case: 4 Synthesize Text
Actors: Admin and User
Pre-conditions: None
Post-conditions: The user or admin hears the sound of the input text
Basic flow: After the user and/or the admin input a text in the text box, the user or admin clicks the speak button to hear the pronunciation of the input text
Alternative flows: None
Special requirements: None


Table 5.10 Description of Update and Delete Use Case.

Use case: 5 Update and Delete
Actors: Admin
Pre-conditions: Admin must access the code of the system
Post-conditions: The system must display the updated version of the system
Basic flow: The admin updates the data, design, sound files, and other files
Alternative flows: None
Special requirements: None


Table 5.11 Description of Record Sound Use Case.

Use case: 6 Record Sounds
Actors: Admin
Pre-conditions: Choosing the sound format
Post-conditions: Sounds recorded must be added to the system’s database
Basic flow: The admin records the necessary sounds and adds them to the database by accessing the database file



Table 5.12 Description of Normalize Text Use Case.

Use case: 7 Text Normalization
Actors: None
Pre-conditions: The Normalized sound database must exist
Post-conditions: Retrieve sound to be spoken
Basic flow: After the user inputs the text to be spoken, the system compares the input text with the sound files, finds the match, and then produces the output
Alternative flows: If the text does not match a sound file in the Normalized database, it is compared with the other two databases (the Parse and Concatenation databases)
Special requirements: None
Use case relationships: Extend with Record Sound use case

Table 5.13 Description of Text Parser Use Case.

Use case: 8 Text Parser
Actors: None
Pre-conditions: The Parser sound database must exist
Post-conditions: Retrieve sound to be spoken
Basic flow: After the user inputs the targeted text to be ...
Alternative flows: If the text does not match a sound file in the Parser database, it is compared with the other two databases (the Normalized and Concatenation databases)



Table 5.14 Description of Concatenate Text Use Case.

Use case: 9 Text Concatenation
Actors: None
Pre-conditions: The Concatenation sound database must exist
Post-conditions: Retrieve sound to be spoken
Basic flow: After the user inputs the targeted text to be ...
Alternative flows: If the text does not match a sound file in the Concatenation database, it is compared with the other two databases (the Normalized and Parse databases)



Table 5.15 Description of Compare Sound Use Case.

Use case: 10 Compare Sound
Actors: None
Pre-conditions: All the database tables must exist
Post-conditions: Compare and retrieve targeted sound from the database
Basic flow: After the user inputs the targeted text to be ...
Use case relationships: Extend relationship with the Text Normalization, Text Parser and Text Concatenation use cases

5.2 SOFTWARE AND HARDWARE REQUIREMENTS

Many tools are available for database programming and program development. For this TTS project, an appropriate tool had to be selected to support the application, and different tools were compared to find the one most compatible with the project. The first step in choosing a tool is to review the purpose of the project and the functions of the application. This step also establishes the objectives and scope of the Arabic TTS Synthesizer System and the tasks that need to be undertaken.

The available tools were evaluated against all the criteria of the program, and the most appropriate one was chosen. The decision was based on several factors, for example the budget, the level of vendor support, compatibility with other software, and whether the product runs on particular hardware. This project used Microsoft Visual Basic 6 and Microsoft WordPad as the development tools.

5.2.1 Microsoft Visual Basic 6.0

Visual Basic is a Microsoft Windows programming language. Visual Basic programs are created in an Integrated Development Environment (IDE), which allows the programmer to create, run, and debug them conveniently. An IDE allows a programmer to create working programs in a fraction of the time that it would normally take to code them without one.

5.2.1.1 Visual Basic and Arabic Support

Visual Basic is Bi-directional (also known as “BIDI”)-enabled. Bi-directional is a generic

term used to describe software products that support Arabic and other languages, which

are written right-to-left. More specifically, bi-directional refers to a product's ability to

manipulate and display text for both left-to-right and right-to-left languages. For example,

displaying a sentence containing words written in both English and Arabic requires bi-

directional capability. Microsoft Visual Basic includes standard features to create and run

Windows applications with full bi-directional language functionality.

5.2.1.2 Visual Basic Bi-directional Features

Although the Microsoft Visual Basic 6.0 user interface (menus, dialog boxes, and Help) is in English, its bi-directional features are convenient and easy to use, and they cover the following bi-directional programming needs:

1. Create bi-directional applications quickly and easily:

Many bi-directional features appear as new properties in the Properties window

for easy program development.

2. Combine languages:

Mix right-to-left and left-to-right language text (for example, Arabic and English)

in program code as the user builds applications for a bi-directional 32-bit Microsoft

Windows environment.

3. Create right-to-left visual features:

Add right-to-left visual features to forms, menus, and more than 15 custom

controls such as grids, list boxes, and combo boxes.

4. Database support:

Develop database solutions with support for Arabic and other right-to-left

language sort orders.

5.2.2 Microsoft Word

Microsoft Word is a word processing application from Microsoft. It was originally written by Richard Brodie for IBM PCs running DOS in 1983. Later versions were created for the Apple Macintosh (1984), SCO UNIX, OS/2 and Microsoft Windows (1989). It later became part of the Microsoft Office suite, and Microsoft refers to it as Microsoft Office Word in this context to indicate its inclusion in the suite, although it is still also sold as a standalone product or bundled with Microsoft Works. Microsoft Word is used to store a collection of Arabic text in rich text format. Three test files were created in WordPad. These files are provided with the Text-To-Speech System and can be used to try and test the program functions.

5.2.3 Sound Forge 8.0

Sound Forge 8.0 is used to record the sounds in the development stage of the Arabic

speech synthesizer software. Sound Forge software provides the ultimate set of tools for

recording professional audio. Sound Forge 8.0 is a digital audio editor for Windows

98SE, Me, 2000, and XP.

Sound Forge offers audio editing, multitask background rendering, 32-bit/64-bit float/192 kHz file support, enhanced DirectX plug-in management, QuickTime and Windows Media format import, and user interface enhancements.

Sound Forge 6.0 includes an extensive set of customizable processes, effects and tools for manipulating audio, and supports a wide range of audio and video file formats, including Windows Media, RealMedia, QuickTime and MPEG 1&2. Users can save time by continuing to open, play, preview, cut, copy, paste, and delete files while other project files render in the background.

Sound Forge 8.0 supports full resolution 32-bit files for high audio quality. It enables the

user to import and save high-resolution 32-bit files, even to record 32-bit files if the

hardware supports 32-bit recording.

The tools provided in Sound Forge are track-at-once CD burning, CD Ripping, Spectrum

Analysis, Auto Region (using beats and measures, or peak detection), Crossfade Loop,

Extract Regions, Find Tool, Enhanced Preset Manager, Sampler Tool, Statistics Tool

(Max, RMS, DC offset, Zero Crossings), Simple Synthesis, FM Synthesis and DTMF/MF

Tone Synthesis. Recording tools provided are Auto calibration for DC Offset, Generate

SMPTE/MIDI Time Code, Glitch/Gap Detection, Punch In option, Real-time record

meters, and Remote record function. The figure below shows the tools for recording

voice in Sound Forge 8.0.

5.3 ARABIC TEXT-TO-SPEECH ARCHITECTURE

Figure 5.2 shows the architecture of the Arabic TTS Synthesizer System. It is composed of three major components, which are depicted as rectangles in Figure 5.2. Raw text is the input of the system. Any typed text is normalized in the Text Normalization module. The Syllable Parser then segments the normalized text into syllable units according to Arabic rules. Lastly, the Syllable Concatenation module combines the sound files of the syllable units to produce synthesized speech. All the sound files are stored in one folder, named Sound, to be accessed by the system. The architecture is based on the Input, Processing and Output (IPO) schematic.

Figure 5.2 IPO – Schematic Architecture Design of Arabic TTS

[Diagram layers: Input Layer, Processing Layer, Output Layer]

5.3.1 Synthesizing Text Steps

The Arabic TTS Synthesizer uses a concatenative approach with syllable units. There are three stages in producing synthesized speech:

• Text Normalization

The input text may contain more than words. It may include other types of characters, such as numbers, abbreviations and acronyms, which are treated as symbols in this system. This module converts the input symbols into readable text. Figure 5.3 shows the sequence diagram of the Text Normalization component.

Figure 5.3 Sequence Diagram for Text Normalization

• Text Segmentation

Figure 5.4 shows the sequence diagram of the Text Segmentation component. Input text may be in the form of paragraphs, sentences, or words. Thus, it is necessary to segment text in hierarchical order: higher-level structures into paragraphs, paragraphs into sentences, sentences into words, and words into manageable units. For this TTS Synthesizer System, the manageable units are in the form of phonemes.

In this research, we limited the input text to paragraph form. A paragraph is segmented into sentences by finding sentence punctuation marks such as ‘.’, ‘!’ and ‘?’. However, there are exceptions. For example, if abbreviations such as [7] are used in the input text, the system must be aware that the periods are abbreviation marks rather than sentence punctuation marks. To solve this dilemma, the system checks whether the letters preceding the period are included in our abbreviation database. If they are not, the period is considered a mark that ends a sentence. To segment sentences into words, the blank spaces are located in the text that has been classified as a sentence. For the text identified as words, the phonemic representations equivalent to the letters of each retrieved word are generated. In our implementation, the segmentation of words into phonemes allows a maximum of 16 letters per word.

Figure 5.4 Sequence Diagram for Text Segmentation

• Text Concatenation

This process receives a list of syllable segments that have been arranged according to the raw text. Based on this list of syllables, the Syllable Concatenation module concatenates the sounds in sequence and finally plays the result, which we know as synthesized speech. The system is capable of converting Arabic text into Arabic synthesized speech. Figure 5.5 shows the sequence diagram of the Syllable Concatenation component.

Figure 5.5 Sequence Diagram for Text Concatenation
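The Text Segmentation stage described above (splitting a paragraph into sentences with an abbreviation check, then splitting sentences into words) can be sketched as follows. This is a minimal Python illustration and not the Visual Basic code of the system: the function names are hypothetical, and the small `ABBREVIATIONS` set is only a stand-in for the abbreviation database, with English entries used purely for readability.

```python
SENTENCE_MARKS = ".!?"            # sentence punctuation marks named in the text
ABBREVIATIONS = {"Dr", "Mr"}      # hypothetical stand-in for the abbreviation database
MAX_WORD_LETTERS = 16             # limit stated for word-to-phoneme segmentation


def split_sentences(paragraph, abbreviations=ABBREVIATIONS):
    """Split a paragraph into sentences at '.', '!' and '?'.

    A period does not end a sentence when the letters before it
    form a known abbreviation.
    """
    sentences, current = [], []
    for token in paragraph.split():
        current.append(token)
        if token[-1] in SENTENCE_MARKS:
            if token[-1] == "." and token[:-1] in abbreviations:
                continue  # abbreviation mark, not the end of a sentence
            sentences.append(" ".join(current))
            current = []
    if current:  # text left over without a closing mark
        sentences.append(" ".join(current))
    return sentences


def split_words(sentence):
    """Split a sentence into words at blank spaces, dropping punctuation."""
    words = [w.strip(SENTENCE_MARKS) for w in sentence.split()]
    words = [w for w in words if w]
    for word in words:
        if len(word) > MAX_WORD_LETTERS:
            raise ValueError("word exceeds the 16-letter limit: " + word)
    return words
```

Under these assumptions, the paragraph `"Dr. Ali came. He spoke!"` is split into the two sentences `"Dr. Ali came."` and `"He spoke!"`, because the period after `Dr` is recognized as an abbreviation mark.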

5.3.2 Recording of the sounds

The next step is recording the sounds, one of the most time-consuming tasks in the development of the software. The following activities are carried out prior to recording the sounds:

1. Determining the fields

Fields are small units of application data recognized by system software. In this

system, the fields are set as sounds.

2. Determining the data type

Data type is a detailed coding scheme, which is recognized by system software for

representing data. The data are represented in the files as wave.

3. File organization

A file organization is a technique for physically arranging the records of the file

on the secondary storage device. The sounds in the files are stored alphabetically

from أ to ي, which is known as sequential file organization.

5.3.3 Storing Sound Files Using Wave File Format

WAVE File Format is a file format for storing digital audio (waveform) data. It supports

a variety of bit resolutions, sample rates, and channels of audio. This format is very

popular on IBM PC (clone) platforms, and is widely used in professional programs that

process digital audio waveforms. It takes into account some peculiarities of the Intel CPU

such as Little Endian byte order (Craig, 2000).

WAVE (.wav) is the standard form for uncompressed audio on a PC. Since a wave file is

uncompressed data - as close a copy to the original analog data as possible - it is therefore

much larger than the same file would be in a compressed format such as mp3 or

RealAudio. Audio CDs store their audio in, essentially, the wave format. Any audio will

need to be in this format in order to be edited using a wave editor, or burned to an audio

CD that will play in a stereo. Table 5.16 below shows some examples of sound segments in a wave format.

Table 5.16 Example of Sound segments in wave files

[Two panels, one with segments for the letter ت (ta) and one with segments for the letter ب (ba). Each panel lists example .wav file names in three columns: 1 letter, 2 letter and 3 letter.]

5.4 THE DESIGNING OF DATABASE FOR SOUND FILES

In this dissertation, the type of database used is a relational database, which is a collection of data items organized as a set of formally described tables from which data can be accessed or reassembled in many different ways without having to reorganize the database tables.

5.4.1 Representing Relational Database Using Microsoft Access

In this dissertation, Microsoft Access is used as the database to store the sound files. Microsoft Access is one of the most popular database packages for storing data and information for future retrieval. Using Microsoft Access, all of the data and information can be managed from a single database file. Without a database, all the wave files would have to be treated separately, and there are hundreds of wave files to manage.

With a database, all the sound files are treated as a single database file, which saves time when searching for particular files. The sounds are first stored on a hard disk drive and then copied to a CD. Then the database is created for these sound files. The main reason for creating the database is to minimize the retrieval time. Retrieving a file or information directly from the hard disk takes longer than accessing it from memory. Once we activate the database, all the related information is loaded into memory from the hard disk.

As a result, the time needed to retrieve the data and information is relatively short. Without the database, reading data takes longer, because the system first has to search the hard disk, then load the data into memory, and only then fetch the desired data from memory. This takes twice as long as in the first case. Figure 5.6 shows the procedure taken by the system when the user requests particular data or information.

Figure 5.6 The procedure taken when the user requests certain data

Here is a simple description of the above process:

• Process A – when the user requests the sound ب.wav, the system searches for it in memory.

• Process B – if the sound is not in memory, the system goes to the hard disk and searches for the sound ب.wav there.

• Process C – once the sound is found, the system brings the targeted sound file ب.wav into memory.

• Process D – the sound is fetched from memory and returned to the user as requested.
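Processes A to D amount to a memory-first lookup with a fall-back to the hard disk. The sketch below illustrates that flow in Python; it is not the mechanism used by Microsoft Access or by the system itself. The class name `SoundCache`, its attributes, and the folder layout are assumptions made for this example.

```python
from pathlib import Path


class SoundCache:
    """Serve sound files from memory, reading from disk only on a miss."""

    def __init__(self, sound_dir):
        self.sound_dir = Path(sound_dir)
        self._memory = {}      # sound name -> raw file bytes
        self.disk_reads = 0    # counts how often the hard disk was used

    def get(self, name):
        # Process A: look for the requested sound in memory first.
        if name in self._memory:
            return self._memory[name]          # Process D
        # Process B: not in memory, so search the hard disk.
        data = (self.sound_dir / (name + ".wav")).read_bytes()
        self.disk_reads += 1
        # Process C: bring the sound file into memory.
        self._memory[name] = data
        # Process D: return the sound to the caller.
        return data
```

In this sketch, a second request for the same sound is answered from memory, so `disk_reads` grows only on the first request for each file.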

Tables 5.17 to 5.19 below show the list of the database tables included in the Arabic TTS

Synthesizer System.


Table 5.17 Arabic Word Table

Field   Type          Attributes    Null   Default   Extra
Id      Int(11)       Primary key   No               Auto_increment
Sound   Varchar(60)                 No

Table 5.18 Syllable Table



Table 5.19 Abbreviation Table



All the sound files with the .wav extension are stored in one folder called the Sound folder, while the database tables contain only the names of the files. For example, the file ت.wav in the Sound folder appears in the Syllable Concatenation Table as ت, without the extension.
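To illustrate how unit names stored without the extension could be resolved to files in the Sound folder and then joined into one utterance, the following sketch uses Python's standard `wave` module. It is an assumption-laden illustration rather than the system's Visual Basic implementation: the function names are hypothetical, and the sketch simply appends the raw audio frames of each unit, requiring all units to share the same channel count, sample width and sample rate.

```python
import wave
from pathlib import Path


def unit_path(sound_dir, unit_name):
    """Map a unit name from the database to its file in the Sound folder."""
    return Path(sound_dir) / (unit_name + ".wav")


def concatenate_units(sound_dir, unit_names, out_path):
    """Join the wave files of the given units, in order, into out_path."""
    params = None
    frames = []
    for name in unit_names:
        with wave.open(str(unit_path(sound_dir, name)), "rb") as unit:
            current = (unit.getnchannels(), unit.getsampwidth(), unit.getframerate())
            if params is None:
                params = current
            elif current != params:
                raise ValueError("unit " + name + " has a different audio format")
            frames.append(unit.readframes(unit.getnframes()))
    if params is None:
        raise ValueError("no units to concatenate")
    with wave.open(str(out_path), "wb") as out:
        out.setnchannels(params[0])
        out.setsampwidth(params[1])
        out.setframerate(params[2])
        for chunk in frames:
            out.writeframes(chunk)
    return out_path
```

Plain appending keeps the example short; it does not smooth the joins between units, which a production-quality concatenative synthesizer would normally have to address.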

5.5 THE DESIGN OF USER INTERFACE

Numerous methods and interfaces that make it easier to implement synthesized speech in applications have been developed during this decade. It is quite clear that it is not possible to create a standard for speech synthesis methods, because most systems act as stand-alone devices, which means they are incompatible with each other and do not share common parts. However, it is possible to standardize the interface of data flow between the application and the synthesizer.

Generally, the interface contains a set of control characters or variables for controlling the

synthesizer output and features. The output is usually controlled by normal play, stop,

pause, and resume type commands and the controllable features are usually pitch baseline

and range, speech rate, volume, and in some cases even different voices, ages, and

genders are available. In most frameworks, it is also possible to control other external

applications, such as a talking head or video.

Microsoft Visual Basic 6.0 is used as the programming tool for designing the user interface of the Arabic Text-To-Speech Synthesizer System. Visual Basic provides a visual interface with a design window in which to work. This design window comes complete with a toolbox, toolbars, and menus. The design of the user interface is carried out in several stages before the real features are designed, in order to gain a full understanding of designing a user interface in Visual Basic. The stages of the design are described below.

When we create a Visual Basic program, we often need to make choices about how to organize the program code. Visual Basic recognizes three different types of modules, or files, of program code:

• Form modules

• Class modules

• Modules, or code modules

Each type contains variable declarations, subs and functions. Both form modules and

class modules contain the code required to describe the contents and behavior of a

particular class of object. A form module also contains a physical description of a form or

screen window, indicating what controls appear on it, what they look like, and some

aspects of how the form and controls will behave at run time.

Code modules, or simply modules, are rather different, in that the statements that go into

one of these do not describe anything about a class of objects, but instead define an

individual set of subs, functions and data descriptions. These can be thought of as the

methods and properties of individual objects in the program – each code module defines

one object instead of a class of objects. Objects defined as classes or forms have to be

created or loaded when a program runs before we can access their methods or properties,

but the code in a code module is available immediately.

Any number of code modules can be added to a Visual Basic program. Each code module that we add to a project defines another set of program-wide data items, subs and functions. Each module must have a unique name. To summarize:

• Form modules contain the data items, methods and event-handlers for each instance of a type of form. Every form of a particular class that is created and used in a program will have its own set of data items, called instance variables, which describe the state of that form. Every form will also have the use of all of the subs, functions and event-handlers (or methods) defined for the class.

• Class modules contain the data items, methods and event-handlers for each instance of a type of object. Every object of a particular class that is created and used in a program will have its own set of data items, called instance variables, which describe the state of that object. Every object will also have the use of all of the subs, functions and event-handlers (or methods) defined for the class.

• Code modules contain the data items, subs and functions for use in a program. Only a single instance of each data item will exist throughout the program, but these may be made accessible to every object in the program.

The following is a list of procedures used in implementing and designing the Arabic TTS

Synthesizer System Interfaces and the Conversion Engine. These procedures are as

follows: Understanding the form and its procedures, Creating the Command Button,

Creating the Text Box, Creating the Button associated with the TextBox, Creating the

Drive to see all the Files inside the Drive, Creating the Multimedia Controls and their

Properties, Combining Multimedia Controls with Command Button, and lastly The User

Interface of the TTS Synthesizer System. The detailed description of each of the above-

mentioned procedures will be explained in the next chapter, the Implementation Chapter.

The User Interface of the TTS Synthesizer System

The user interface for the TTS Synthesizer System is created by combining the functions

and controls described in the previous sections. The interface contains TextBox,

CommandButton, DataControl, Label, Multimedia Control, DirListBox, and so on.

Figure 5.7 below shows the user interface for the TTS Synthesizer System. The system

contains two sub-systems: the first is the Arabic TTS and the second is the English TTS,

both shown in the figure below.

Figure 5.7 User interface of Arabic and English TTS Synthesizer System.

5.6 SUMMARY

The first part of this chapter has shown the use case diagram of the Arabic TTS, and the

next part has described in detail the architectural design of the Arabic TTS. It also listed

the software and hardware requirements. The last part of the chapter has explained the

database construction and the user interface design. The next chapter will explain the

implementation.

CHAPTER SIX

ARABIC TTS SYSTEM IMPLEMENTATION

6.1 OVERVIEW

This chapter is the embodiment of the theoretical ideas presented in the previous chapters

into a practical and working system. This chapter discusses the activities needed to

successfully build the Arabic TTS System. This chapter will explain the implementation

of the system step-by-step.

Programming is time-consuming and costly but, except in unusual circumstances, it is the

simplest activity for the systems analyst because it is well understood. After maintenance, the

implementation phase of the systems development life cycle is the most expensive and

time-consuming phase of the entire life cycle. Implementation is expensive because so

many people are involved in the process; it is time-consuming because of all the work

that has to be completed during implementation. Physical design specifications must be

turned into working computer code, the code must be tested until most of the errors have

been detected and corrected, the system must be installed, user sites must be prepared for

the new system, and users must come to rely on the new system rather than the existing

one to get their work done. System implementation is made up of many activities, such as

coding, testing, installation, documentation, training and support. The purpose of these

steps is to convert the physical system specification into working and reliable software

and hardware, document the work that has been done, and provide help for current and

future users of the system.

6.2 CREATING THE ARABIC TTS SYSTEM

Text-To-Speech applications have recently become easier to develop with the introduction

of the ActiveX Text-To-Speech and Direct Speech Synthesis controls found in the Speech

Application Programming Interface Software Developer’s Kit (SAPI SDK). The SAPI is a

set of functions that enable us to incorporate TTS and Speech Recognition into our

applications. These controls give us considerable flexibility in creating applications

that can convert written text to speech.

Below is a list of procedures used in implementing the Arabic TTS Synthesizer System

Interfaces and the Conversion Engine. These procedures are as follows: Understanding

the form and its procedures, Creating the Command Button, Creating the Text Box,

Creating the Button associated with the TextBox, Creating the Drive to see all the Files

inside the Drive, Creating the Multimedia Controls and their Properties, Combining

Multimedia Controls with Command Button, and lastly The User Interface of the TTS

Synthesizer System. Below is the detailed description of each.

6.2.1 Understanding the form and its procedures

When we begin a Standard EXE project, a form will appear in the design window (see

Figure 6.1). During this process, we are able to use the tools provided on the menus,

toolbar and toolbox. The objects that appear as icons in the toolbox are called “controls”.

These controls can be added to the form to interact with users. The form design window

presents us with a gray background. We can change the color and size of the form.

Controls are placed on the form by selecting them from the toolbox. The Properties window

displays the property list either alphabetically or by category. We can select one of

these options by clicking the respective tab. The categories include Appearance,

Behavior, Dynamic Data Exchange, Font, Position, Scale, and so on. The Appearance

category, for example, groups the properties that deal with color, caption, pictures, border and

others.

Figure 6.1 The Basic Form and its properties

6.2.2 Creating the Command Button

An important feature of the software is the command button: when the user

presses the button, the program responds accordingly. When the user clicks on the button, an

event occurs. In simple terms, an event is something that happens. The command button

is created on the form and has its own properties, which we can set

according to our requirements. The Caption property allows us to label the button as we want, for

example “Speak”. Figure 6.2 below shows the form with the command button.

Figure 6.2 The Form with the command button “Speak”

6.2.3 Creating the Text Box

The TextBox control (see Figure 6.3) is typically used to collect user input. It usually

allows the data to be edited. By default, the amount of text that can be entered in the TextBox

is 2048 characters, but with the MultiLine property changed to True, the user can enter

up to about 32 KB. The TextBox is the most important feature in the proposed system. Inside the

text box, users can type whatever they want. The properties of the text box can be

manipulated as well. We can change its height and width based on our own preference.

The font type and the font size of the text can be changed too.

Figure 6.3 The Form with the TextBox written inside “Text1”

6.2.4 Creating the Button associated with the TextBox and their procedures

The Form, the Command Button, and the TextBox need to be integrated

together, for example to define how the TextBox will respond when the user clicks on the

Command Button. Again, the TextBox and the Command Button are created. The

Command Button is given the name “Clear”. The user types some text inside the TextBox

and, once the “Clear” Button is clicked, the TextBox is cleared of any text written

inside by the user. Figure 6.4 below shows the Button associated with the TextBox.

Here is the procedure written for the “Clear” Button, inside the code page:

Private Sub Clear_Click()
    ' Empty the text box when the "Clear" button is clicked
    Text1.Text = ""
End Sub

Figure 6.4 The Form with the TextBox and the Command Button.

6.2.5 Creating the Drive to See All the Files Inside the Drive

The DriveListBox enables the user to select a valid drive at run time. Figure 6.5 shows

the DriveListBox, which identifies the drives available to the user on a particular system.

The box shows all mapped drives. Everything in My Computer will be accessible to

the user for selection. The DirListBox displays all the available directories on the

selected drive. The FileListBox further enables users to select files from the selected

directory. If the Pattern property has been changed to a particular file extension, then only

those files will appear in the FileListBox. Why do we need this tool in the proposed system?

Because we are going to play sound files, the Pattern property is changed to “*.wav”.
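The effect of restricting the file list to sound files can be illustrated outside Visual Basic. The following is a minimal Python sketch, not the system's own code; the function name `list_wav_files` is hypothetical, and the directory passed to it is whatever folder holds the recorded units.

```python
from pathlib import Path


def list_wav_files(directory):
    """Return the names of the .wav files found directly in `directory`.

    This mimics a file list filtered by a "*.wav" pattern: files with any
    other extension are simply not listed.
    """
    return sorted(p.name for p in Path(directory).glob("*.wav") if p.is_file())
```

On a case-sensitive file system this pattern would not match names ending in `.WAV`, so the recorded units are assumed to use a lowercase extension.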

Figure 6.5 The Drive C, its directory and all the files inside the drive.

6.2.6 Creating the Multimedia Controls and their Properties

The MCI control manages the recording and playback of multimedia files on MCI

devices, as seen in Figure 6.6. The control looks like a set of VCR controls, containing a

set of buttons that issue MCI commands to devices such as sound cards, MIDI

sequencers, CD-ROM drives, audio CD players and so on. The Multimedia control uses a

set of sophisticated commands, known as MCI commands, which control a range of

multimedia devices. Many of these commands correspond directly to a button on the

Multimedia control. For example, the Play command carries out the same instruction as the Play

button on a VCR panel.

Figure 6.6 The form with the Multimedia Control Commands.

6.2.7 Combining the Multimedia Controls with the Command Button

The Multimedia Controls are combined with Command Buttons such as “Stop”,

“Play”, and “Beep”, so that we can see how the Multimedia Controls respond

when any of the buttons is pressed. Figure 6.7 shows the result:

Figure 6.7 The form with the Multimedia Controls and the Command Buttons.

6.2.8 The User Interface of the TTS Synthesizer System

The user interface for the TTS Synthesizer System is created by combining the functions

and controls described in the previous sections. The interface contains TextBox,

CommandButton, DataControl, Label, Multimedia Control, DirListBox, and so on.

Figure 6.8 below shows the user interface for the TTS Synthesizer System. The system

contains two sub-systems: the first sub-system is the Arabic TTS and the second sub-system

is the English TTS, as shown in the figure below.

Figure 6.8 User interface of English and Arabic TTS Synthesizer System.

The user interface combines the functions as described below:

• Text Box

A text box is created to allow users to input their text. The

maximum number of characters in the text box is also specified. The users are

allowed to type multi-line text, so the MultiLine property is set to True.

• Command Button

Several command buttons are created for this software, including the “Clear”, “Say

It” and “Exit” buttons. Each button has its own function, as shown in the coding

procedures below:

o Clear Button

Private Sub Command1_Click()
    TxtFileName.Text = ""      ' empty the text box
    TxtFileName.SetFocus       ' return the cursor to the text box
End Sub

o Exit Button

Private Sub Command2_Click()
    End                        ' terminate the program
End Sub

o Open Button

Private Sub cmdopen_Click()
    On Error Resume Next
    With CommonDialog1
        .CancelError = True          ' raise an error if the user presses Cancel
        .DialogTitle = "Open File"   ' set the dialog's title
        .Filter = "Text Files(*.txt)|*.txt|Batch Files(*.bat)|*.bat|Module Files(*.bas)|*.bas" ' file types offered
        .ShowOpen                    ' show the Open dialog
    End With
    If Err.Number <> 0 Then Exit Sub ' dialog cancelled or failed: nothing to load
    Open CommonDialog1.FileName For Input As #1
    Text1.Text = Input$(LOF(1), 1)   ' load the whole file into the text box
    Close #1
End Sub

o Speak Button

This is the most important part of the user interface: the conversion

procedure is executed when the user clicks on this button, and

the text entered in the text box by the user is read aloud.

o Home Page Button

Private Sub Command3_Click()
    Unload Me                  ' close the current form
    Load Home
    Home.Show                  ' return to the home page form
End Sub

• The Data Control

The data control is a control created on the form to see all the sound files in the

database and check if the sounds exist or not. The data control is connected to the

database. The RecordSource property is set to “Sounds” which is the name of the

table in the database. In the user interface, the data control is set to invisible.

• DirListBox, FileListBox, and DriveListBox

All these three controls are also created in the form to check the existence of the

files that we are looking for in a selected drive. These controls are also made

invisible.

• Multimedia Control

The Multimedia Control is used to play the sounds when the user clicks on the “Say

It” button; it is through this control that the system responds

to the click with audio output. The Multimedia control is set

to invisible in its properties. The procedure for converting text to speech is

shown below.

• Specify the number of characters in the arrays

Dim A(20)
Dim B(20)
Cls
X = txtFilename.Text
X = X + " "

Cls: the Cls method, which stands for Clear Screen like the DOS command,

clears the text and graphics drawn at run time on a Form or PictureBox control.

• The length of the text

Len (X): for many operations, we may need to know how many characters are in

a string. We might need this information to know whether the string with which

we are working will fit in a fixed-length database field. Alternatively, if we are

working with big strings, we may want to make sure that the combined size of the

two strings does not exceed the capacity of the string variable. To determine the

length of any string, we use the Len ( ) function, as in the code:

L = Len(X)

p = 1

q = 1

For i = 1 To L

• Cut the text, word by word

Mid ( ) is a function that is used to retrieve a substring from a string. We can use

the Mid ( ) function to retrieve a letter, word, or phrase from the middle of a

string. The Mid ( ) function takes two required arguments and one optional

argument, as used in the following code:

If Mid(X, i, 1) = " " Then
    oneword = Mid(X, q, i - q)
    q = i + 1

In Mid(X, i, 1), i represents the character position at which the retrieved string begins. If i is

greater than the length of the string, an empty string is returned. The optional

argument 1 represents the number of characters to be returned from X. If it is

omitted, the function returns all characters in the source string from the starting

position to the end.

• Specify the number of sounds in the database

Data1.Recordset.MoveFirst

For k = 1 To Data1.Recordset.RecordCount

• Identify string in one word

InStr ( ): the function that enables us to search a string for a character or group of

characters is the InStr ( ) function. This function has two required and two

optional parameters. The required parameters are the string to be searched and the

text to search for. If the search text appears in the string being searched, InStr ( )

returns the index of the character where the search string starts. If the search text

is not present, InStr ( ) returns 0, as used in the following code:

Pos = InStr(oneword, Data1.Recordset.Fields(0))

SegL = Len(Data1.Recordset.Fields(0))

• Go to next record if string not there

If Pos = 0 Then

Data1.Recordset.MoveNext

GoTo 10

End If

If A(Pos) = 0 Then

For n = Pos To Pos + SegL - 1

If B(n) = 1 Then


GoTo 10

End If

Next n

• Specify the array of the segment

A(Pos) = SegL

For m = Pos To Pos + SegL - 1

B(m) = 1

Next m

'Print "A("; Pos; ")="; A(Pos)

End If


p = p + 1

10 Next k

Data1.Recordset.MoveFirst is a method that moves the record pointer from the

current record to the first record in the opened recordset.

Data1.Recordset.MoveNext is a method that moves the record pointer from the

current record to the next record (the record following the current record) in the

opened recordset. If no record exists (that is, if we are already at the last record),

the end-of-file (EOF) flag is set, and there will be no current record.

• Play the sounds of the word in an array

For t = 1 To 20

If A(t) <> 0 Then

MMControl1.FileName = "F:\Arabic TTS\Sound\" + Mid(oneword, t, A(t)) + ".wav"
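The word-cutting and unit-matching steps listed above can be summarized in a short sketch. It is written in Python purely for illustration and is not the system's Visual Basic code. The function names `split_words` and `match_units` are hypothetical, the list `units` stands in for the first field of the “Sounds” table, and, unlike the original loop, empty words produced by repeated spaces are skipped.

```python
def split_words(text):
    """Cut the text at each space, like the Len/Mid loop above."""
    text = text + " "          # sentinel space so the last word is also cut
    words, start = [], 0
    for i, ch in enumerate(text):
        if ch == " ":
            if i > start:      # assumption: ignore empty words
                words.append(text[start:i])
            start = i + 1
    return words


def match_units(word, units):
    """Pick the stored units that occur in `word` without overlapping.

    seg_len plays the role of array A (start position -> unit length) and
    covered plays the role of array B (positions already claimed).
    Positions are 1-based, as returned by InStr.
    """
    seg_len = [0] * (len(word) + 1)
    covered = [False] * (len(word) + 1)
    for unit in units:                   # one pass over the database records
        pos = word.find(unit) + 1        # 0 means "not found", like InStr
        if pos == 0 or not unit or seg_len[pos] != 0:
            continue
        span = range(pos, pos + len(unit))
        if any(covered[n] for n in span):
            continue                     # overlaps a unit already chosen
        seg_len[pos] = len(unit)
        for n in span:
            covered[n] = True
    # Read the chosen units back in left-to-right order, ready for playback
    return [word[t - 1:t - 1 + seg_len[t]]
            for t in range(1, len(word) + 1) if seg_len[t]]
```

For example, with the romanized stand-in units `["magh", "rib", "ma"]`, `match_units("maghrib", ...)` selects `magh` and `rib`; each selected unit would then correspond to one `.wav` file in the sound folder. Because the first matching record wins, the order of the records affects the result.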

Constructing the Database

A database may contain phonemes, diphones, syllables, or words, or for better prosodic

synthesis, a mixture of these. As a consequence, however, a decision algorithm needs to be

implemented to decide which acoustical unit best suits a given text. Also, the use of

diphones and/or words requires more memory space. Phonemes were used as the

basic acoustical unit because of their simple implementation and low memory requirement.

• The first step in constructing a diphone database for Arabic is to determine all

possible diphone pairs of Arabic. In general, the number of possible diphones is the square

of the number of phones for any language.

• Arabic has 28 consonant phonemes, four of which are emphatic. In addition, there are

two semi-vowels (as /ay/ in (بيت) “bayt”, meaning “house”, or /aw/ in (يوم) “yawm”,

meaning “day”), six vowels and three additional consonants (/p/ ‘پ’, /g/ ‘چ’

and /v/ ‘ڤ’). This results in 39 possible phonemes. Since we are interested in

possible phoneme combinations, i.e. diphones, we get 39 times 39 = 1521 diphone

pairs.
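The count above can be checked with a few lines of Python (an illustration only, separate from the system's Visual Basic code). The phoneme labels `P1` to `P39` are placeholders, not the actual Arabic phoneme inventory.

```python
from itertools import product

# Phoneme counts as stated in the text
PHONEME_GROUPS = {
    "consonants": 28,
    "semi_vowels": 2,
    "vowels": 6,
    "additional_consonants": 3,
}

phoneme_count = sum(PHONEME_GROUPS.values())

# Placeholder labels; every ordered pair of phonemes is one candidate diphone
phonemes = [f"P{i}" for i in range(1, phoneme_count + 1)]
diphones = list(product(phonemes, repeat=2))

print(phoneme_count, len(diphones))  # 39 1521
```

This is the theoretical upper bound; pairs that never occur in the language do not need to be recorded.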

Text Analysis

Text analysis is composed of three processes: text normalization, text segmentation

and text concatenation. Text normalization spells out numerical values and

abbreviations in the input text. Text segmentation segments text into basic acoustical

units identical to the ones stored in the database. Text concatenation then

joins these acoustical units, phonemes and syllables, to produce the targeted

sound.

Text Normalization

The process of reformatting the text or unwrapping the strict token sequence from the

visual presentation style, and encoding the useful parts of the style in a defined and

explicit way is called text normalization. Input text to a TTS system usually does not

have constraints. It can contain abbreviations, symbols, foreign words and numbers.

Thus, a pre-processing unit is required to convert this unlimited text into a format,

which can be processed. For example, [ ()*ا] should be read as [ ,-). /)*ا]. In TTS

Synthesizer System, a dictionary of abbreviations was constructed that the system

can consult during text normalization.
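A dictionary-based normalization step of this kind can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the system's implementation: the two entries in `ABBREVIATIONS` are hypothetical examples rather than the contents of the actual dictionary, and numbers are simply spelled out digit by digit.

```python
# Hypothetical abbreviation dictionary (the real one is not listed in the text)
ABBREVIATIONS = {
    "د.": "دكتور",   # "Dr."  -> "doctor"
    "م.": "مهندس",   # "Eng." -> "engineer"
}

# Arabic names of the digits 0-9, used to spell out numbers digit by digit
DIGIT_WORDS = {
    "0": "صفر", "1": "واحد", "2": "اثنان", "3": "ثلاثة", "4": "أربعة",
    "5": "خمسة", "6": "ستة", "7": "سبعة", "8": "ثمانية", "9": "تسعة",
}


def normalize(text):
    """Expand known abbreviations and spell out digit strings."""
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif all(ch in DIGIT_WORDS for ch in token):
            out.extend(DIGIT_WORDS[ch] for ch in token)
        else:
            out.append(token)      # ordinary word: leave unchanged
    return " ".join(out)
```

A full normalizer would read numbers as quantities (for example 25 as “twenty-five”) rather than digit by digit; the simplification keeps the sketch short.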

Text Segmentation

Input text may be in the form of paragraphs, sentences, or words. Thus, it is necessary

to segment text in hierarchical order: higher-level structures to paragraphs, paragraphs

to sentences, sentences to words and words to manageable units. For this TTS

Synthesizer System, the manageable units are in the form of phonemes.

In this dissertation, the input text is limited to paragraph form. A paragraph was

segmented into sentences by finding the sentence punctuation marks such as ‘.’, ‘!’

and ‘?’. However, there are exceptions. For example, if abbreviations were used in the

input text such as [7] the system must be aware that the periods are not sentence

punctuation marks but rather abbreviation marks. To solve this problem,

the system checked whether the letters preceding the period are included in our abbreviation

database. If not, the period is considered a mark that ends a sentence. To

segment sentences into words, blank spaces were located in the text that has been

classified as a sentence. From the text that has been identified as words, the phonemic

representations equivalent to the set of letters of the retrieved word were generated. In

our implementation, segmentation of words into phonemes only allows a maximum

of 16 letters.
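The sentence and word segmentation just described can be sketched in Python (for illustration only; the system itself is written in Visual Basic). The abbreviation set below uses English stand-ins and is hypothetical, as are the function names.

```python
SENTENCE_MARKS = ".!?"
ABBREVIATIONS = {"Dr", "Mr"}   # hypothetical stand-ins for the abbreviation database


def split_sentences(paragraph):
    """Split at '.', '!' and '?', unless the period follows an abbreviation."""
    sentences, start = [], 0
    for i, ch in enumerate(paragraph):
        if ch not in SENTENCE_MARKS:
            continue
        if ch == ".":
            words = paragraph[start:i].split()
            if words and words[-1] in ABBREVIATIONS:
                continue           # abbreviation mark, not a sentence end
        sentence = paragraph[start:i + 1].strip()
        if sentence:
            sentences.append(sentence)
        start = i + 1
    tail = paragraph[start:].strip()
    if tail:                       # text after the last punctuation mark
        sentences.append(tail)
    return sentences


def split_words(sentence):
    """Split a sentence into words at blank spaces."""
    return sentence.strip(SENTENCE_MARKS + " ").split()
```

For instance, `split_sentences("Dr. Ali arrived. Is he here? Yes!")` keeps `Dr.` inside the first sentence and returns three sentences.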

Examples

The word: Maghrib (مغرب) ‘when the sun sets on the horizon’

• Text converter:

o CVCCVC (<gh> is bound as one; /ɣ/)

• Text segmenter:

o CVC.CVC

• Synthesizer:

o /maɣ/ + /rib/

The word: dhahab (ذهب) ‘gold’

• Text converter:

o CVCVC (<dh> is bound as one; /ð/)

• Text segmenter:

o CV.CVC

• Synthesizer:

o /ða/ + /hab/
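The two examples can be reproduced with a small Python sketch. It works on romanized input for readability, which is an assumption of this illustration: the system itself operates on Arabic script, and the digraph list and the splitting rule (a new syllable starts at every consonant that is directly followed by a vowel) are inferred from the two examples rather than taken from the system's code.

```python
VOWELS = set("aiu")
DIGRAPHS = ("gh", "dh", "sh", "kh", "th")   # two letters bound as one consonant


def tokenize(word):
    """Split a romanized word into sound tokens, binding digraphs."""
    tokens, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            tokens.append(word[i:i + 2])
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens


def cv_pattern(tokens):
    """Text converter: map each token to C (consonant) or V (vowel)."""
    return "".join("V" if t in VOWELS else "C" for t in tokens)


def syllabify(word):
    """Text segmenter: return the CV pattern and the syllables of `word`."""
    tokens = tokenize(word.lower())
    pattern = cv_pattern(tokens)
    syllables, current = [], []
    for idx, tok in enumerate(tokens):
        starts_syllable = (pattern[idx] == "C"
                           and idx + 1 < len(tokens)
                           and pattern[idx + 1] == "V")
        if starts_syllable and current:
            syllables.append("".join(current))
            current = []
        current.append(tok)
    if current:
        syllables.append("".join(current))
    return pattern, syllables
```

With this rule, `syllabify("Maghrib")` gives the pattern `CVCCVC` with syllables `magh` and `rib`, and `syllabify("dhahab")` gives `CVCVC` with syllables `dha` and `hab`, matching the examples.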

How does the system function?

This section provides the implementation details of the Arabic TTS Synthesizer. The Arabic

TTS Synthesizer works as follows:

Once the system's interface is displayed, the user types or inputs any text

to be converted to audio format, then clicks the Speak button to hear the sound.

The system reads the targeted text backward (from left to right), then compares it against

the database and against the sound folder. If the targeted text is found, it converts the text

to audio format; otherwise, it cuts the text letter by letter and compares it against the

database and the sound folder. Figure 6.9 shows the word " "E��F#" , for instance, this word

can be pronounced as described below:

Figure 6.9 The word "�G to be pronounced by the system.

o The system searches for the whole word against the database, then against the

sound folder.

o If the word exists in the database and in the sound folder.

o The system generates or converts the word from written form to audio form.

o Otherwise, the word is not there.

"��F#

o The system cuts the word letter by letter backward, such as "��F#, it becomes H

/ E��F#, then searches for the phone H and the syllable E��F# against the database

and the sound folder.

o If found it converts the word from written form to audio form.

o The system cuts the word letter by letter backward, such as "��F#, it becomes "�#

/ EEF#, then searches for the syllable E�#" and the syllable I EEF# against the

database and the sound folder.


o The system cuts the word letter by letter backward, such as "E��F#, it becomes

"��EJ / EE#, then searches for the syllable "��EEJ and the phone EE# Iagainst the

database and the sound folder.


o And so on for the rest of the written text.
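The lookup strategy described above (try the whole word first, then cut it into smaller pieces and look each piece up) can be sketched in Python. This is a hedged illustration, not the system's Visual Basic code: the set `available` stands in for units that exist both in the database and in the sound folder, the function name `find_units` is hypothetical, and the order in which cut positions are tried is an assumption.

```python
def find_units(word, available):
    """Return a list of stored units that spell `word`, or None.

    The whole word is tried first. Otherwise the word is cut after its
    first letter, then after its second letter, and so on; the left part
    must be a stored unit and the right part is searched in the same way.
    """
    if word in available:
        return [word]
    for cut in range(1, len(word)):
        head, tail = word[:cut], word[cut:]
        if head in available:
            rest = find_units(tail, available)
            if rest is not None:
                return [head] + rest
    return None    # no combination of stored units covers the word
```

For example, with the romanized stand-in units `{"magh", "rib", "m", "a"}`, the word `maghrib` is not stored as a whole, so the search falls back to the two syllables `magh` and `rib`; each returned unit corresponds to one sound file to be played in sequence.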

6.3 SUMMARY

This chapter has covered the implementation of the Arabic TTS Synthesizer System. It

first explained the process of creating the TTS interface, and then

described in detail the algorithm and the coding of the system and how the Arabic TTS

Synthesizer System works. The next chapter will explain the testing and the

evaluation results of the Arabic TTS Synthesizer System, which was evaluated by a group of

participants.

CHAPTER SEVEN

TESTING AND EVALUATING OF ARABIC TTS SYSTEM

7.1 OVERVIEW

To test the intelligibility, naturalness and overall quality of the Arabic Text-to-Speech

system developed in this dissertation, a test for the Arabic voice was designed. In this

chapter, test parameters plus design of the test and results are discussed.

7.2 TESTING AND EVALUATING TTS SYSTEMS

Once the system components have been coded, it is time to test them. Several testing

approaches exist that lead to delivering a quality system to the end users or customers

(clients). Testing is not the first place where fault finding occurs, but testing is focused on

finding faults. There are several steps in testing a system: Function test, Performance test,

Acceptance test, and Installation test. Each step has a different focus, and a step’s success

depends on its goal or objective.

A function test checks that the integrated system performs its functions as specified in

the requirements. For example, a function test of a TTS system verifies that the system

can correctly convert text to speech, and so on.

The performance test compares the integrated components with the nonfunctional

system requirements. These requirements, including accuracy, speed, and reliability,

constrain the way in which the system functions are performed. For instance, a

performance test of the TTS system evaluates the speed with which the conversion

process is made and the response time to the users.

An acceptance test assures the customers that the system is the system that was built for

them. The acceptance test is sometimes run in its actual environment but often is run at a

test facility different from the target location. For this reason, a final installation test

may be run to allow users to exercise system functions and document additional problems

that result from being at the actual site.

Users want systems that are easy to learn and use as well as effective, efficient, safe, and

satisfying. Evaluating is the process of determining the usability and acceptability of the

product or design that is measured in terms of a variety of criteria including the number

of errors users make using it, how appealing it is, how well it matches the requirements,

and so on. Carrying out an effective evaluation of any TTS System is not always a simple

task. Some of the main contributing factors that Klatt (1987) believes affect the overall

quality of any TTS System are:

• Clearness, that is, how much of the spoken output the user grasps, as well as how

quickly a listener becomes fatigued simply by listening.

• Pleasantness / Naturalness: the most subjective evaluation criteria are the degree of

pleasantness and the degree of naturalness. The two are slightly different. It is

possible for a voice to sound natural and unpleasant, and it is possible for a voice

to be judged pleasant and still have a machine-like quality, which is the case for

modern, top-of-the-line speech synthesizers.

• Suitability for the intended application: different applications have differing needs for a

TTS system. For example, a system for blind users places a higher priority on

intelligibility than on naturalness.

Depending on what kind of information is needed, the evaluation can be made on several

levels: phoneme, word, or sentence level. Many tests help to address these three issues

and others. Several individual test methods for synthetic speech have been developed

during the last decades. Some researchers even complain that there are too many existing

methods, which makes comparison and standardization difficult.

At the same time, it is clear that there is still no single test method that gives a final, correct

result. This chapter gives a short introduction to the most commonly used methods. This

will be the foundation of the test and the evaluation questionnaire introduced in a later

section of this chapter.

Table 7.1 below shows the possible evaluation attributes that can be used in

testing and evaluating the Arabic TTS Synthesizer System.

Table 7.1 Possible evaluating attributes

Attributes          Rating levels (+ … -)
Naturalness         Very natural … Very unnatural
Speed               Much too fast … Much too slow
Sound Quality       Very good … Very bad
Pronunciation       Not annoying … Very annoying
Clearness           Very easy … Very hard
Stress/Intonation   Not annoying … Very annoying

7.3 TESTING THE ARABIC VOICE

To test the clearness, naturalness and overall quality of the Arabic TTS Synthesizer

System developed in this dissertation, a test for the Arabic voice was designed. In this

section, test parameters plus design of the test and results are discussed.

7.3.1 Test group

The only concern when choosing the test group was that the participants should be non-native

speakers with a good command of Arabic. To define what a good command is, it was decided that the

participants should have Arabic as their second language. The group

consists of 27 people. The majority of the participants are students at the International

Islamic University Malaysia, in the Department of Arabic Linguistics. The level of

fluency varies among the participants: some are fairly fluent and

others are not very fluent.

7.3.2 Method

The main goal of this evaluation test is to determine how much of the spoken output a listener

can understand. The test is divided into three parts. The first part is to evaluate the

system with respect to naturalness, speed, sound quality, pronunciation, clearness and

stress/intonation. The second part is to assess the usage of the system. The last part is to

find the level of errors in the system.

The participant is asked a few questions about these aspects and is asked to mark how

well the voice performs. These simple exercises provide an overall assessment of this

TTS Synthesizer System.

7.4 TEST AND EVALUATION RESULTS

Now that the test is done, a summary of the results is presented in this section. The results

are presented in diagrams and tables with percentage values.

7.4.1 Naturalness

Regarding the question whether the voice is nice to listen to or not, 33.3 % (9

respondents out of 27) considered the voice natural, 40.7 % (11 respondents out of 27)

thought that the naturalness of the voice was acceptable and 25.9 % (7 respondents out of

27) considered the voice unnatural. Table 7.2 below shows the outcomes of the

questionnaire in detail.

Table 7.2 Naturalness/Clearness

                     Very Natural   Natural   OK     Unnatural   Very unnatural   Total
No. of Respondents   0              9         11     7           0                27
% of Respondents     0.0            33.3      40.7   25.9        0.0              100 %
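The percentages reported in the results tables follow directly from the respondent counts. As an illustrative check (a Python sketch, not part of the evaluation procedure itself), the naturalness row can be recomputed as follows; the function name `percentages` is hypothetical.

```python
def percentages(counts, digits=1):
    """Convert respondent counts to percentages of the total."""
    total = sum(counts)
    return [round(100 * c / total, digits) for c in counts]


# Very natural, Natural, OK, Unnatural, Very unnatural
naturalness_counts = [0, 9, 11, 7, 0]
print(percentages(naturalness_counts))  # [0.0, 33.3, 40.7, 25.9, 0.0]
```

Because each value is rounded to one decimal place, the rounded percentages in a row may add up to 99.9 rather than exactly 100.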

The results are shown in figure 7.1 below.

[Bar chart “Clearness 1: How much the voice is clear?”; x-axis “Scales”: Very much, Much, Neither much nor little, Little, Very little; y-axis: % of listeners]

Figure 7.1 Naturalness of the voice

7.4.2 Speed

The speed of a system is a major concern: if the system speaks too fast or too slow, this

may have a negative effect on the concentration of the subjects. They might give up

listening if it is too fast, and the speech would not sound natural if it

is too slow. 14.8 % (4 respondents out of 27) of the listeners considered that the

system speaks too fast and 18.5 % (5 respondents out of 27) thought that it speaks fast.

37.0 % (10 respondents out of 27) thought that the system speaks adequately fast; in other

words, they considered the voice to have normal speech speed. 7.4 % (2 respondents out of

27) thought it is slow and another 22.2 % (6 respondents out of 27) thought it is too slow.

Table 7.3 below shows the outcomes of the questionnaire in detail.

Table 7.3 Sound Speed

                     Too fast   Fast   Normal   Slow   Too slow   Total
No. of Respondents   4          5      10       2      6          27
% of Respondents     14.8       18.5   37.0     7.4    22.2       100 %

Figure 7.2 below shows the results for listening to the sound.

[Bar chart “Speed: Does the system speak adequately fast (Normal)?”; x-axis “Scales”: Too fast, Fast, Normal, Slow, Too slow; y-axis: % of listeners]

Figure 7.2 The speed of the speech

7.4.3 Sound quality

The question for this part is “Do you consider the system to be of good sound quality?”

After listening to the sound, 11.1 % (3 respondents out of 27) considered that the voice has

very good sound quality, 44.4 % (12 respondents out of 27) considered that the voice has

good sound quality, 33.3 % (9 respondents out of 27) thought the sound quality of the

voice is neither bad nor good, and the remaining 11.1 % (3 respondents out of 27)

considered the sound quality of the system bad. Table 7.4 below shows the outcomes

of the questionnaire in detail.

Table 7.4 Sound quality

                     Very Good   Good   Neither Good nor Bad   Bad    Very Bad   Total
No. of Respondents   3           12     9                      3      0          27
% of Respondents     11.1        44.4   33.3                   11.1   0.0        100 %

The results are shown in figure 7.3 below.

[Bar chart “Sound Quality: Do you consider the system has a good sound quality?”; x-axis “Scales”: Very good, Good, Neither Good nor Bad, Bad, Very bad; y-axis: % of listeners]

Figure 7.3 The sound quality of the voice.

7.4.4 Pronunciation

The pronunciation part consists of three questions addressed to the participants. They are intended

to give an idea of how difficult the speech uttered by the system is to grasp, and to

identify which sounds are the most difficult to catch so that these sounds can gradually be

processed and improved. The first question in this category is whether the

listeners found it very hard to grasp some of the words. I hoped to get some

information about which words were considered hard to grasp and which sounds these

words contained. 7.4 % (2 respondents out of 27) of the listeners thought it was very hard

to grasp some of the words and 18.5 % (5 respondents out of 27) thought it was hard, while

25.9 % (7 respondents out of 27) thought it was neither hard nor easy. 40.7 % (11 respondents

out of 27) thought it was easy and 7.4 % (2 respondents out of 27) thought it was very

easy. Table 7.5 below shows the outcomes of the questionnaire in detail.

Table 7.24 Pronunciation Question 1

                   | Very Hard | Hard | Neither Hard nor Easy | Easy | Very Easy | Total
No. of Respondents | 2         | 5    | 7                     | 11   | 2         | 27
% of Respondents   | 7.4       | 18.5 | 25.9                  | 40.7 | 7.4       | 100 %

Figure 7.22 shows the results of the first and the second time of listening.

[Bar chart: Pronunciation 1, “Was it very hard to grab/get some of the words?” x-axis: Scales (Very hard, Hard, Neither hard nor easy, Easy, Very easy); y-axis: % of listeners.]

Figure 7.22 The pronunciation’s effect on understanding.

The second question in the pronunciation part investigates whether the participants had to concentrate hard to understand the speech uttered by the system. It indicates how difficult the voice is to understand and how much concentration it demands. The results are summarized according to the subjects’ own estimations. After listening to the sound, 11.1 % (3 respondents out of 27) of the participants did not have to concentrate at all, while 29.6 % (8 respondents out of 27) considered that the system required normal concentration. 37.0 % (10 respondents out of 27) had to concentrate a little. For 18.5 % (5 respondents out of 27), some concentration was needed for specific sounds. The remaining 3.7 % (1 respondent out of 27) had to concentrate a lot. Table 7.25 below shows the outcomes of the questionnaire in detail.

Table 7.25 Pronunciation Question 2

                   | A lot of concentration | Some concentration | Normal concentration | Little concentration | No concentration | Total
No. of Respondents | 1                      | 5                  | 8                    | 10                   | 3                | 27
% of Respondents   | 3.7                    | 18.5               | 29.6                 | 37.0                 | 11.1             | 100 %

The results are shown in Figure 7.23.

[Bar chart: Pronunciation 2, “Did you have to concentrate a lot to grab/get the speech told by the voice?” x-axis: Scales (A lot of concentration, Some concentration at some words, Normal concentration, Little concentration, No concentration); y-axis: % of listeners.]

Figure 7.23 The concentration needed to hear the pronunciation

The third question on pronunciation asked how annoying the participants found the voice. 33.3 % (9 respondents out of 27) of the participants thought the voice was not annoying, and another 33.3 % (9 respondents out of 27) found it slightly annoying. 25.9 % (7 respondents out of 27) found it annoying, while 3.7 % (1 respondent out of 27) found it very annoying and 3.7 % (1 respondent out of 27) chose “too much annoying”. Table 7.26 below shows the outcomes of the questionnaire in detail.

Table 7.26 Pronunciation Question 3

                   | Not annoying | Little annoying | Annoying | Very annoying | Too much annoying | Total
No. of Respondents | 9            | 9               | 7        | 1             | 1                 | 27
% of Respondents   | 33.3         | 33.3            | 25.9     | 3.7           | 3.7               | 100 %

The results of the first and the second time of listening are shown in Figure 7.24.

[Bar chart: Pronunciation 3, “How did you find the pronunciation?” x-axis: Scales (Not annoying, Little annoying, Annoying, Very annoying, Too much annoying); y-axis: % of listeners.]

Figure 7.24 The annoyance level of the pronunciation

7.4.5 Clearness

Two questions were asked concerning the intelligibility of the system. The first asked how much of what the voice said the participants understood. 40.7 % (11 respondents out of 27) of the participants understood much (well), and 25.9 % (7 respondents out of 27) understood very much (very well). 14.8 % (4 respondents out of 27) understood neither much nor little, another 14.8 % (4 respondents out of 27) understood a little, i.e. not very well, and 3.7 % (1 respondent out of 27) understood very little. As mentioned before, these are the subjects’ own estimations. Table 7.27 below shows the outcomes of the questionnaire in detail.

Table 7.27 Clearness Question 1

                   | Very much | Much | Neither much nor little | Little | Very little | Total
No. of Respondents | 7         | 11   | 4                       | 4      | 1           | 27
% of Respondents   | 25.9      | 40.7 | 14.8                    | 14.8   | 3.7         | 100 %

The results are shown in Figure 7.25.

[Bar chart: Clearness 1, “How much the voice is clear?” x-axis: Scales (Very much, Much, Neither much nor little, Little, Very little); y-axis: % of listeners.]

Figure 7.25 Understanding the voice.

The second question of this part is “Was the voice easy to grab/get?” The reason for this question was to establish whether the difficulty in understanding lies in the voice or in the listeners’ lack of knowledge and vocabulary. After listening to the sound, 44.4 % (12 respondents out of 27) of the listeners found the voice easy to understand, while 25.9 % (7 respondents out of 27) found it neither hard nor easy. 14.8 % (4 respondents out of 27) considered the voice very easy to understand. However, 11.1 % (3 respondents out of 27) considered it hard to understand, and only 3.7 % (1 respondent out of 27) considered it very hard to understand. Table 7.28 below shows the outcomes of the questionnaire in detail.

Table 7.28 Clearness Question 2

                   | Very Hard | Hard | Neither Hard nor Easy | Easy | Very Easy | Total
No. of Respondents | 1         | 3    | 7                     | 12   | 4         | 27
% of Respondents   | 3.7       | 11.1 | 25.9                  | 44.4 | 14.8      | 100 %

Figure 7.26 shows the results of the first and the second time of listening.

[Bar chart: Clearness 2, “Was the voice easy to grab/get?” x-axis: Scales (Very hard, Hard, Neither hard nor easy, Easy, Very easy); y-axis: % of listeners.]

Figure 7.26 The level of difficulty in understanding the voice.

7.4.6 Stress/Intonation

Although no processing of stress and intonation has been undertaken in the system, it was decided to survey the participants on these aspects of the voice. The first question in the stress and intonation part asked what the participants thought of the intonation of the voice. The results after listening to the sound are as follows: 40.7 % (11 respondents out of 27) considered the intonation good, 25.9 % (7 respondents out of 27) thought it neither good nor bad, and 18.5 % (5 respondents out of 27) considered it very good. However, 7.4 % (2 respondents out of 27) thought the intonation bad, and a further 7.4 % (2 respondents out of 27) considered it very bad. Table 7.29 below shows the outcomes of the questionnaire in detail.

Table 7.29 Stress/Intonation Question 1

                   | Very Good | Good | Neither Good nor Bad | Bad | Very Bad | Total
No. of Respondents | 5         | 11   | 7                    | 2   | 2        | 27
% of Respondents   | 18.5      | 40.7 | 25.9                 | 7.4 | 7.4      | 100 %

The results of listening to the sound are shown as percentages in Figure 7.27.

[Bar chart: Stress/Intonation 1, “What do you think of the intonation of the voice?” x-axis: Scales (Very good, Good, Neither good nor bad, Bad, Very bad); y-axis: % of listeners.]

Figure 7.27 The intonation of the system.

“How do you find the stress?” is the second question in this part of the evaluation questionnaire. 40.7 % (11 respondents out of 27) found the stress slightly annoying, and 29.6 % (8 respondents out of 27) found it not annoying at all. 18.5 % (5 respondents out of 27) found the stress annoying, 7.4 % (2 respondents out of 27) found it very annoying, and only 3.7 % (1 respondent out of 27) chose “too much annoying”. Table 7.30 below shows the outcomes of the questionnaire in detail.

Table 7.30 Stress/Intonation Question 2

                   | Not annoying | Little annoying | Annoying | Very annoying | Too much annoying | Total
No. of Respondents | 8            | 11              | 5        | 2             | 1                 | 27
% of Respondents   | 29.6         | 40.7            | 18.5     | 7.4           | 3.7               | 100 %

The results are shown in Figure 7.28 as percentages of the number of subjects.

[Bar chart: Stress/Intonation 2, “How did you find the stress?” x-axis: Scales (Not annoying, Little annoying, Annoying, Very annoying, Too much annoying); y-axis: % of listeners.]

Figure 7.28 The stress of the system

This is a good result, considering that the stress and intonation of the system were not processed at all. One could arguably say that the voice has good stress and intonation.

7.4.7 Error

On the question of whether the participants thought that the voice makes many pronunciation mistakes, 29.6 % (8 respondents out of 27) considered that the system makes neither many nor few mistakes. 25.9 % (7 respondents out of 27) of the listeners believed that the mistakes were few, and 25.9 % (7 respondents out of 27) believed that the mistakes were very few. However, 11.1 % (3 respondents out of 27) considered that there were many mistakes, and only 7.4 % (2 respondents out of 27) thought that there were too many mistakes. Table 7.31 below shows the outcomes of the questionnaire in detail.

Table 7.31 System Error

                   | Too many | Many | Neither many nor few | Few  | Very few | Total
No. of Respondents | 2        | 3    | 8                    | 7    | 7        | 27
% of Respondents   | 7.4      | 11.1 | 29.6                 | 25.9 | 25.9     | 100 %

Figure 7.29 shows the results.

[Bar chart: Errors, “Does the system make many pronunciation mistakes?” x-axis: Scales (Too many, Many, Neither many nor few, Few, Very few); y-axis: % of listeners.]

Figure 7.29 Pronunciation mistakes.

7.5 SUMMARY

In this chapter, the evaluation tests for the intelligibility, naturalness, and quality of the Arabic TTS synthesizer are presented. Testing the TTS helped in understanding how variation in the input text can affect the intelligibility, naturalness, and quality of the synthesized speech. The tests reveal that the synthesizer performs satisfactorily.

From the above results and analysis, it follows that, when it comes to intelligibility, the Arabic TTS Synthesizer System is successful. The participants can hear what is being said and recognize changes in the synthesized speech. However, there is a plan for improvements, which will be described in the next chapter. The majority of both words and sentences were correctly recognized and perceived by the majority of the listeners, and the evaluation of the overall quality of the system is satisfactory at this stage.

CHAPTER EIGHT

CONCLUSION AND FUTURE WORK

8.1 CONCLUSION

Text-To-Speech synthesis has been developed gradually over the last few decades and has been integrated into several new applications. For most applications, the intelligibility and comprehensibility of TTS synthesizers have reached an acceptable level. Nevertheless, in the fields of prosody, text preprocessing, and pronunciation there is still much work to be done to achieve more natural-sounding speech. Natural speech has so many dynamic changes that perfect naturalness may be impossible to achieve.

However, since the market for TTS-related applications is growing steadily, the effort and funding devoted to this research area are increasing as well. Current TTS Synthesizer Systems are so complicated that one researcher cannot handle the whole system. With good modularity, it is possible to divide the system into a number of individual modules that can be developed independently, provided the communication between the modules is designed carefully.
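The idea of independently developed modules with a carefully defined interface can be sketched as follows. This is a minimal illustration only: the three stage names and their signatures are assumptions made for the example and do not describe the modules of the system developed in this dissertation.

```python
from typing import Callable, List

# Hypothetical stage interfaces: each module only needs to agree on these types.
Normalizer = Callable[[str], str]                 # raw text -> cleaned text
UnitSelector = Callable[[str], List[str]]         # cleaned text -> speech-unit names
Concatenator = Callable[[List[str]], List[float]]  # unit names -> audio samples


def build_pipeline(normalize: Normalizer,
                   select_units: UnitSelector,
                   concatenate: Concatenator) -> Callable[[str], List[float]]:
    """Chain three independently developed modules into one synthesizer."""
    def synthesize(text: str) -> List[float]:
        return concatenate(select_units(normalize(text)))
    return synthesize


# Toy stand-ins so the sketch runs; a real module would replace each one.
def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())


def characters_as_units(text: str) -> List[str]:
    return [ch for ch in text if not ch.isspace()]


def silent_concatenator(units: List[str]) -> List[float]:
    return [0.0 for _ in units]  # one silent sample per unit


synthesize = build_pipeline(collapse_whitespace, characters_as_units,
                            silent_concatenator)
samples = synthesize("  ab   c ")
```

Because each stage depends only on the agreed input and output types, any one of them could be replaced, for example by a diphone-based unit selector, without changing the others.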

There are three methods used in TTS synthesis technology, which were introduced in Chapter 2. The most commonly used techniques in current systems are based on Formant and Concatenative Synthesis. The Concatenative Synthesis method is becoming more widely accepted, since the methods for reducing discontinuity effects at concatenation points are becoming more effective. The Concatenative method provides more natural and individual-sounding speech, but the quality of some consonants may vary considerably, and controlling pitch and duration may be difficult in some cases, especially with longer units.
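Discontinuities at concatenation points can be reduced in several ways. As a simple illustration, and not the smoothing used in this system or in Festival, the sketch below joins two units with a linear crossfade. The function name crossfade_concat and the overlap length are assumptions made for the example.

```python
from typing import List


def crossfade_concat(left: List[float], right: List[float],
                     overlap: int) -> List[float]:
    """Join two sample sequences, blending `overlap` samples linearly."""
    # The overlap cannot be longer than either unit.
    overlap = max(0, min(overlap, len(left), len(right)))
    if overlap == 0:
        return left + right  # plain concatenation, possible audible click

    head = left[:len(left) - overlap]   # untouched part of the left unit
    tail = right[overlap:]              # untouched part of the right unit
    blended = []
    for i in range(overlap):
        # Weight of the right-hand unit rises steadily across the overlap.
        w = (i + 1) / (overlap + 1)
        l_sample = left[len(left) - overlap + i]
        blended.append((1.0 - w) * l_sample + w * right[i])
    return head + blended + tail


# Two constant "units" with a large jump between them.
joined = crossfade_concat([1.0] * 4, [0.0] * 4, overlap=2)
```

With an overlap of two samples the jump from 1.0 to 0.0 is spread over intermediate values instead of occurring in a single step, at the cost of shortening the output by the overlap length.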

Naturally, some combinations and modifications of these basic methods have been used with varying success. An interesting approach is a hybrid system in which the Formant and Concatenative methods are applied in parallel, each to the phonemes for which it is most suitable. In general, combining the best parts of the basic methods is a good idea, but in practice, controlling the synthesizer may become difficult.

The TTS Synthesizer System developed here is based on the concatenation method. The challenges that the Arabic language poses when building TTS systems are addressed. Examples of these challenges are the diacritization problem, the existence of many dialects in the Arabic language, and the differences in gender. This mapping of the problems should be helpful for others who wish to build a TTS Synthesizer System for Arabic or for other languages that have not been extensively studied and processed.

Apart from the aforementioned advantages, the developed TTS Synthesizer System is not without limitations. To run this system, the computer must support the Arabic version of Microsoft Office. In addition, the system contains a limited number of segments in the wave sound file. Arabic has thousands of words with different suffixes and prefixes; to hold all of them in the database, the file would be too large, with thousands of segments. Therefore, this system keeps only the frequently used segments in the database. Each language has its own rules, which should be considered in order to obtain better performance from the TTS Synthesizer System. This is a major factor affecting the quality of the output and the performance.

This dissertation shows that the creation of a TTS Synthesizer System covers a whole range of processes and that extensive work has to be done in order to build a voice in Festival. The availability of free and semi-free synthesis systems, such as the Festival Speech Synthesis System, makes building a speech synthesizer easier and the costs lower.

Lastly, this dissertation has fulfilled its purpose by creating a fully working Arabic TTS Synthesizer System. The system provided satisfactory results in testing, but extensive and continued work is required to develop it further and to obtain a high-quality TTS Synthesizer System. The results of this system are very promising, with a high level of intelligibility. Although the questionnaire used for the test and the evaluation of the system is quite simple, it yields guidelines and information on the intelligibility, naturalness, speed and overall quality of the system. Observing a larger and more diverse group of participants would have enabled a better evaluation of the system. The small number of participants has negatively affected the test results and the evaluation of this system. It also does not allow the subjects and the results to be compared by level of fluency, which would have been interesting in order to see whether there are any differences. Therefore, a better division of the group and a larger number of people are recommended.
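The effect of the small sample can be illustrated with a confidence interval. The sketch below is not part of the original evaluation: it assumes a 95 % Wilson score interval (z = 1.96) and, as an example, uses the 12 of 27 respondents who rated the sound quality as good. The function name wilson_interval is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Return the (low, high) Wilson score interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# 12 of 27 respondents rated the sound quality as "Good".
low, high = wilson_interval(12, 27)

# The same proportion observed in a sample ten times larger.
low_large, high_large = wilson_interval(120, 270)
```

Under these assumptions the interval for 27 respondents spans roughly 0.28 to 0.63, which is too wide to distinguish a minority from a clear majority; the interval for the larger hypothetical sample is considerably narrower.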

8.2 SUGGESTIONS AND FUTURE WORK

The generation of synthetic speech from text involves many stages of processing, which gradually build an acoustic description of a speech signal. The results can be judged by whether the speech is:

• Effective in conveying information,

• Believable as a human voice, and

• Pleasant to listen to for long periods.

At the time of writing, it must be concluded that there is still work to be done in improving each of these factors. Some of the possible improvements that can be made are:

• Record more sounds in the sound database. More sounds can be recorded to obtain better performance and a larger vocabulary, so that users can learn more words without much limitation.

• Build more user-friendly interfaces, such as a command to select different voices, for example, a man's voice and a woman's voice. An interface that allows users to click on Arabic words rather than typing them would also be useful for users who do not have an Arabic keyboard.

• Add an animated character (agent). An agent, or mouth-utterance character, can be included to encourage users to continue using this software. Humans are more attracted to animated and attractive interfaces, which can create interest and fun in learning. The character would speak the input text, accompanying the output sound with mouth movements and gestures.

• This system also provides an opportunity to develop new TTS synthesizers for different languages. The TTS Synthesizer itself was developed based on existing online TTS Synthesizer Systems on the Internet. Languages that could be developed and added to the current TTS Synthesizer System include, for example, French and Malay.

• Future considerations to improve the quality of the system should be addressed. An initial task is to check the speech or diphone segmentation of the problematic sounds, since they are considered hard to understand. Manual checking and correction of the labels is required. One has to trace back and check the entry in the diphone index and compare it to the label for the fabricated word.

• Another important issue is signal processing to obtain the required prosody. Speaker-specific intonation and speaker-specific duration have to be considered when building a new voice. The major components of prosody that can be recognized are pitch, amplitude and the duration of the concatenated speech.
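The three prosodic components named in the last item can each be measured from a segment of samples. The following is a rough sketch under stated assumptions: amplitude is taken as the root-mean-square value, and pitch is estimated with a plain autocorrelation peak search, which is a swapped-in textbook technique and not the signal processing used by this system or by Festival. All function names and the 60 to 400 Hz search range are assumptions.

```python
import math
from typing import List


def duration_seconds(samples: List[float], sample_rate: int) -> float:
    """Length of the segment in seconds."""
    return len(samples) / sample_rate


def rms_amplitude(samples: List[float]) -> float:
    """Root-mean-square amplitude of the segment."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch_hz(samples: List[float], sample_rate: int,
                      min_hz: float = 60.0, max_hz: float = 400.0) -> float:
    """Estimate pitch from the strongest autocorrelation peak in range."""
    min_lag = max(1, int(sample_rate / max_hz))
    max_lag = min(int(sample_rate / min_hz), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the segment and a copy shifted by `lag` samples.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # Returns 0.0 when no positive peak was found (e.g. silence).
    return sample_rate / best_lag if best_lag else 0.0


# Synthetic test segment: a 200 Hz sine wave, 0.1 s long at 8000 Hz.
rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / rate) for i in range(800)]
pitch = estimate_pitch_hz(tone, rate)
```

Real speech is far less regular than a sine wave, so a practical voice-building workflow would rely on a dedicated pitch tracker; the sketch only shows what each of the three quantities refers to.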

REFERENCES AND RESOURCES AcuVoice Inc. (1998). Homepage. http://www.acuvoice.com.

Alan D. & Barbara H., (2003). System Analysis and Design. 2nd Edition, John Wiley and

Sons.

Alan D., Janet E., Gregory D., & Russell B. (2003). Human-Computer Interaction, 3rd

Edition. Prentice Hall.

Alan E. & Ryan M. (1999). Visual Basic 6.0: Environment, Programming, and

Applications. Que™ Education and Training. An Imprint of Macmillan Computer

Publishing.

Allen J., Hunnicutt S., & Klatt D. (1987). From Text to Speech: The MITalk System.

Cambridge University Press, Inc.

Amr Y. & Ossama E. (2004). An arabic TTS system based on the IBM trainable speech

synthesizer. Le traitement automatique de l’arabe, JEP-TALN.

Bell Laboratories TTS (1998). Homepage. http://www.bell-labs.com/project/tts/.

Binu k. Mathew (2004). The Perception Processor. Ph.D. Dissertation.

Breen A., Bowers E., Welsh W. (1996). An Investigation into the Generation of Mouth

Shapes for a Talking Head. Proceedings of ICSLP 96, vol. 4.

Browman, C. (1980). Rules for demi-syllable synthesis using LINGUA language

interpreter, Proceedings of International Conference on Acoustics, and signal

Processing, IEEE.

Chris, R. (1992) (Editor). Speech Processing,McGRAW-HILL, London.

Clement H., (1987). A History of Arabic Literature. Draf Publishers Limited. London.

Cornelius T. (2003). Intelligent Systems: Technology and Applications, Signal, Image,

and Speech Processing, vol. 3, pp. 1 – 48. CRC Press, New York.

Craig A. (2000). Digital Audio with JAVA, Prentice Hall PTR, USA.

David A., Guy F., (2003). Information System Development - Methodologies, techniques

and tools. 3rd Edition, Mc-Graw Hill.

Deitel H., Deitel P. & Nieto T. (1999). Visual Basic 6.0 How to Program. Prentice Hall,

Upper Saddle River. New Jersey.

Dettweiler H., Hess W. (1985). Concatenation Rules for Demisyllable Speech Synthesis.

Proceedings of ICASSP 85 (2): pp. 752-755.

Don J. Connexionx - sharing knowledge and building communities.

http://cnx.rice.edu/content/m0088/latest/.

Donovan R. (1996). Trainable Speech Synthesis. PhD. Thesis. Cambridge University

Engineering Department, England. ftp://svr-

ftp.eng.cam.ac.uk/pub/reports/donovan_thesis.ps.Z.

Dutoit T. & Leich H. (1993). MBR-PSOLA: Text-to-speech synthesis based on an

MBEre-synthesis of the segments database. Speech Communication, vol. 13, pp. 432 –

440.

Edward Y. (2003). Techniques of Teaching Pronunciation in ESL, Bilingual & Foreign

Language Classes. Lincom Europa.

Edwards A. (1991). Speech Synthesis: Technology for Disabled People. London: Paul

Chapman Ltd.

F. J. Owens. (1993). Signal Processing of Speech. The Macmillan Press Ltd.

Flanagan J. & Rabiner L. (1973) (Editors) Speech Synthesis. Dowden, Hutchinson &

Ross, Inc., Pennsylvania.

Flanagan J. (1972). Speech Analysis, Synthesis, and Perception. Springer-Verlag, Berlin-

Heidelberg-New York.

HADIFIX (1997). Speech Synthesis Homepage. University of Bonn. http://www.ikp.uni-

bonn.de/~tpo/Hadifix.en.html.

Hamza W. (2000). Arabic Speech Synthesis Using Large Speech Database. PhD. Thesis.

Cairo University, Electronics and Communications Engineering Department.

Haywood J. & Ahmad H. (2003). A new Arabic grammar. Lund Humphries, London.

Holmes J. (1988). Speech Synthesis and Recognition. Van Nostrand Reinhold (UK) Co.

Ltd. England.

http://svr-www.eng.cam.ac.uk/~ajr/SpeechAnalysis/node5.html

http://www.acoustics.hut.fi/.../lemmetty_mst/chap2.html

http://www.csun.edu/cod/conf/1991/proceedings/voice.htm.

http://www.enhance.phon.A.ac.uk/public/examples/copysyn/index.html.

Huang X., Acero A., Hon H., Ju Y., Liu J., Mederith S., & Plumpe M.(1997). Recent

Improvements on Microsoft’s Trainable Text-to-Speech System - Whistler. Proceedings

of ICASSP 97 (2): pp. 959-934.

Husni A., Moustafa E., & Mansour A. (2003). Techniques for high quality Arabic speech

synthesis. College of Computer Science and Engineering, King Fahd University of

Petroleum and Minerals.

Husni Al-Muhtaseb, Moustafa Elshafei, and Mansour Al-Gamdi. (2003). Techniques for

High Quality Arabic Speech Synthesis. College of Computer Science and Engineering,

King Fahd University of Petroleum and Minerals.

INC Speech Works International. http://www.tmaa.com/tts/.

IPA. The international phonetic association. http://www.arts.gla.ac.uk/IPA

Ishizaka K. & Flanagan J. (1972). Synthesis of voiced sounds from a two-mass model of

the vocal cords. Bell System Technology Journal, vol. 51, No. 6, pp. 133-1268.

Jeffery L., Lonnie D., Kevin C., (2004). System Analysis and Design Methods. 6th

Edition, Mc-Graw Hill.

Klatt D. (1987). Review of Text-to-Speech Conversion for English. Journal of the

Acoustical Society of America, JASA. vol. 82, No. 3, pp.737-793.

Klatt D. (1987). Review of text-to-speech conversion for English. Journal of the Acoustic

Society of America. Vol. 82, No. 3, pp. 737 - 793.

Kleijn W. and Paliwal K. (1995) (Editors). Speech Coding and Synthesis. Amsterdam:

Elsevier.

Koay C. (2000). Learning Visual Basic 6.0: Step-by-Step. Venton Publishing. Kuala

Lumpur. Malaysia

Laura Mayfield Tomokiyo, Alan W Black, and Kevin A Lenzo. (2003). Arabic In My

Hand: Small-Footprint Synthesis of Egyptian Arabic. Cepstral LLC, Pittsburgh, USA.

Lernout & Hauspies (L&H) (1998). Speech Technologies Homepage.

http://www.lhs.com/speechtech/.

Li D. & Douglas N. (2003). Speech Processing a dynamic and Optimization Oriented

Approach. Marcel Dekker Inc. New York.

Lowell M. (1999). SAMS Teach Yourself Visual Basic 6 in 10 minutes. SAMS. A division

of Macmillan Computer Publishing, USA.

Maria M. (2004). A Prototype of an Arabic Diphone Speech Synthesizer in Festival.

Master Thesis in Computational Linguistics. Uppsala University

MBROLA. The MBROLA project towards a freely available multilingual speech

synthesizer. http://tcts.fpms.ac.be/synthesis/mbrola.html.

Michael E. & William N. (1999). Programming with Microsoft Visual Basic 6.0 An

Object-Oriented Approach. Course Technology, Inc. ITP.

Michel K., (1970). Arabic Phonology: Implications for Phonological Theory and

Historical Semitic. PhD thesis. Massachusetts Institute of Technology.

Mitchell T., (1993). Pronouncing Arabic. Clarendon Oxford.

Moulines E. & Charpentier F. (1990). Pitch-synchronous waveform processing

techniques for text-to-speech synthesis using diphones. Speech Communication, vol. 9,

No. 5-6, pp. 453-467.

Newell A., Barnett J., Forgie J., Green C., Klatt D., Licklieder J., Munson J., Reddy R., &

Woods W. (1973). Speech Understanding System. Final Report of a study Group. North

Holland, Amsterdam.

Nicholas A. and Putros S. (1986). The Arabic Alphabet: How to Read and Write. London.

Ntsourak's Home Page. http://www.telecom.tuc.gr/.../tutorial_acoustic.htm.

Olive, J.P.. (1996). Concatenative Syllables. Progress in Speech Synthesis. Springer, New

York. pp. 261 – 262.

Panasonic CyberTalk (1998). Homepage.

http://www.research.panasonic.com/pti/stl_web_demo/demo.html.

Peter R. (1992). Computing Linguistic and Phonetics: Introductory Readings, Academic

Press, London.

Pinker S. (1993). The Language Instinct: How the Mind Creates Language. New York:

W. Morrow and Company.

Quartieri T. & McAulay R. (1992). Shape Invariant time-scale and pitch modification of

speech. IEEE Trans. Signal Process, vol. 40, No. 3, pp. 497-510.

Raja T., (1979). The Structure of Arabic: From Sound to Sentence. Beirut.

Robert D. (1999). Computer Speech Technology. Boston, London: Artech House.

Ronald C., & Antonio Z. (1997). Survey of the state of the art in human language

technology. Giardini Editori e Stampatori in Pisa.

Sagisaka Y., Campbell N., & Higuchi N. (1997) (Editors). Computing Prosody -

Computational Models for Processing Spontaneous Speech, Berlin: Springer.

Salman H. & Jacob Y., (1980). Arabic Phonology and Script. International Book Center.

Michigan.

Sami L., (1999). Review of Speech Synthesis Technology. Master thesis,

Santen J., Sproat R., Olive J., Hirschberg J. (editors), Progress in Speech Synthesis,

Springer-Verlag New York Inc., 1997.

Schroeter J. (1996). Articulatory Synthesis and Visual Speech. Progress in Speech

Synthesis. Springer, New York. pp. 179 – 184.

Schroeter M. (1993). A Brief History of Synthetic Speech. Speech Communication. vol.

13, pp. 231-237.

Shafi S., (1978). A Course in Spoken Arabic. Oxford University Press. Bombay.

Stuart R. & Peter N. (2002). Artificial Intelligence: A Modern Approach, 2nd Edition.

Pearson Education Inc. Upper Saddle River, New Jersey.

Stylianou Y. (2001). Applying the harmonic plus noise model in concatenative speech

synthesis. IEEE Transaction on Speech Audio Process, vol. 9, No. 1, pp. 21-29.

The DISC Best Practice Guide. A survey of existing methods and tools for developing and

evaluation of speech synthesis and of commercial speech synthesis systems.

http://www.disc2.dk/tools/SGsurvey.html.

Thierry D. (1996). An Introduction to Text-to-Speech Synthesis. Kluwer Academic

Publishers, Dordrecht.

Todd K. & Stephen C. (2000). Microsoft Visual Basic. South-western Educational

Publishing.

Wael H. & Mohsen R. (2000). Concatenative Arabic speech synthesis using large

database, In Proceedings of ICSLP2000, vol. 2, pp. 182-185, Beijing, China.

Weinschenk, S., Barker, D. (2000). Designing Effective Speech Interfaces, 1st Edition,

Wiley.

Witten I. (1982). Principles of Computer Speech. Academic Press Inc.

Wright W., (1974). A Grammar of Arabic Language. 3rd Edition. Beirut.

Yasser H., Shady Q., Salah H., & Mohsen R. (2000). ARABTALK® An Implementation

for Arabic Text To Speech System. www.nemlar.org/ARAB-TALK-RDI.doc.

APPENDIX A: QUESTIONNAIRE

The aim of this questionnaire is to help me test and evaluate the Arabic TTS System. I would appreciate it if you answered freely and as honestly as possible. I would like to thank you for your help and participation.

PART ONE: ASSESSING THE QUALITY OF THE SYSTEM

• Clearness/Naturalness

  - Is the voice nice listening to?
    [ ] Very natural [ ] Natural [ ] OK [ ] Unnatural [ ] Very unnatural

• Speed

  - Does the system speak adequate fast?
    [ ] Too much fast [ ] Too fast [ ] Fast/normal [ ] Too slow [ ] Too much slow

• Sound Quality

  - Do you consider the system has a good sound quality?
    [ ] Very good [ ] Good [ ] Neither good nor bad [ ] Bad [ ] Very bad

• Pronunciation

  - Was it very hard to grab/get some of the words?
    [ ] Very hard [ ] Hard [ ] Neither hard nor easy [ ] Easy [ ] Very easy

  - Did you have to concentrate a lot to grab/get the speech told by the voice?
    [ ] A lot of concentration [ ] Some concentration at some words [ ] Normal concentration [ ] Little concentration [ ] No concentration was needed

  - How did you find the pronunciation?
    [ ] Not annoying [ ] Little annoying [ ] Annoying [ ] Very annoying [ ] Too much annoying

• Clearness

  - How much the voice is clear?
    [ ] Very much [ ] Much [ ] Neither much nor little [ ] Little [ ] Very little

  - Was the voice easy to grab/get?
    [ ] Very hard [ ] Hard [ ] Neither hard nor easy [ ] Easy [ ] Very easy

• Stress/Intonation

  - What do you think of the intonation of the voice?
    [ ] Very good [ ] Good [ ] Neither good or bad [ ] Bad [ ] Very bad

  - How did you find the stress?
    [ ] Not annoying [ ] Little annoying [ ] Annoying [ ] Very annoying [ ] Too much annoying

PART TWO: USAGE OF THE SYSTEM

• Occupation ________________________.

• Does the system help you in your job/work?
  [ ] Yes [ ] No

• In what way does the system help you? Please specify: _______________________________________________________________

PART THREE: FINDING ERRORS

• Does the system make many pronunciation mistakes?
  [ ] Too many [ ] Many [ ] Neither many nor few [ ] Few [ ] Too few

APPENDIX B: GLOSSARY

Allophone: An allophone is a phonetic variant of a phoneme in a particular language.

Alveolar: A phone produced when the tongue touches the tooth ridge behind the teeth

(alveolus). See the diagram of a head for the location of the tooth ridge. The “t sound” in

English is an alveolar stop, produced by stopping and then releasing the air flow out of

the mouth by closing the tongue onto the tooth ridge.

Band-pass filter: Filter with a single transmission band or pass-band with relatively low

attenuation extending from a lower band-edge frequency greater than zero to a finite

upper band-edge frequency.

Bilabial: A phone produced by the closure or partial closure of both lips. See the

diagram of a head. The English sounds represented by the letters p in pit and b in bad are

bilabial stops, produced by stopping and then releasing the air flow out of the mouth by

closing the lips. Bilabial and labio-dental phones are together classed as labial.

Consonant: A consonant is a sound made by a partial or complete closure of the vocal

tract.

Dialect: Generally dialects of a language are more similar than different languages.

However, what is a dialect and what is a language is often a political rather than a

linguistic question. The division of Serbo-Croat, the common language of former

Yugoslavia, into two languages, Serbian and Croatian, shows this rather sharply. A

further example of very similar languages which might be called dialects of the same

language are Dutch (spoken in the Netherlands) and Flemish (spoken in north-western

Belgium). On the other hand, in China there are languages which are mutually unintelligible when spoken but are often called dialects of one Chinese language. It is important to note that although some dialects have more social prestige in a country than others, this says nothing about their linguistic qualities.

Diphthong: A diphthong is a phonetic sequence, consisting of a vowel and a glide that is

interpreted as a single vowel.

Fricative: If during the production of a phone, air is made to pass through a narrow

passage, a “friction” sound or fricative is produced (i.e. a more-or-less “hissing” sound).

English examples are the “f sound” in fee or the “sh sound” in she.

Glottis: The glottis is the space between the vocal folds.

Grapheme: A grapheme is a “spelling unit”. For example, in Spanish the combination ll

represents a different sound from a single l. Thus these are two graphemes. In English,

graphemes may be quite complex. For example -tion behaves more-or-less as a single

grapheme in words like function.

IPA: The International Phonetic Alphabet or IPA is a set of symbols which can be used

to represent the phones and phonemes of natural languages. A subset which can be used

to represent “Standard English” (roughly the dialect of middle-class people from the

south east of England) is given in a separate table.

Intonation: Intonation is the system of levels (rising and falling) and variations in pitch

sequences within speech.

Labio-dental: A phone produced by the partial closure of the lower lip against the upper

teeth. See the diagram of a head. The English sounds represented by the letters f in fit and

v in van are labio-dental fricatives, produced by restricting the air flow out of the mouth

by touching the lower lip on the upper teeth. Bilabial and labio-dental phones are together

classed as labial.

Nasal: A nasal is a phone made by allowing air to flow out of the nose while possibly

stopping it in the mouth. Allowing air to flow out of the nose is achieved by lowering

the velum (see the diagram of a head). English has three such phones: the nasal stops

which end the words rum, run and rung.

Onset: An onset is the part of the syllable that precedes the vowel of the syllable.

Palatal: A phone produced when the top of the tongue touches the hard palate. See the

diagram of a head for the location of the hard palate. The English sounds represented by

the letters sh in ship and s in measure are palatal (more precisely, palato-alveolar) fricatives, produced by partially

stopping the air flow out of the mouth by touching the top of the tongue on the hard

palate.

Phone: A phone is an unanalyzed sound of a language. It is the smallest identifiable unit

found in a stream of speech that is able to be transcribed with an IPA symbol.

Phoneme: A phoneme is the smallest contrastive unit in the sound system of a language.

Phonetics: Phonetics is the study of human speech sounds.

Pitch: Pitch is the perceived height of a sound, determined mainly by the rate of vibration of the vocal folds.
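
Because pitch is tied to a rate of vibration, it can be estimated from a recorded waveform. The following is a minimal sketch, assuming a clean, strictly periodic signal and using plain autocorrelation; the function name estimate_f0, the 50 to 500 Hz search range, and the synthetic 200 Hz test tone are all assumptions made for this illustration, not part of the system described in this dissertation.

```python
import math

def estimate_f0(samples, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    # Only consider lags that correspond to frequencies in [f_min, f_max].
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the signal with itself shifted by lag.
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # The best lag is one period in samples; convert it to a frequency.
    return sample_rate / best_lag

SAMPLE_RATE = 8000  # samples per second (assumed for the example)
tone = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(400)]
f0 = estimate_f0(tone, SAMPLE_RATE)  # the tone repeats every 40 samples
```

Real speech is noisier and only quasi-periodic, so practical pitch trackers add normalization, voicing decisions and smoothing on top of this basic idea.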

Stress: Stress is the relative prominence given to a syllable, produced by an increase in the activity of the vocal apparatus of the speaker.

Syllable: A syllable is a unit of sound composed of (1) a central peak of sonority (usually

a vowel), and (2) the consonants that cluster around this central peak.
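
The structure of a syllable (an onset, a central peak, and any consonants that follow it) can be sketched as follows. This is a simplified illustration that works on spelling rather than on phones; the vowel set VOWELS and the helper split_syllable are hypothetical names introduced here, and the consonants after the peak are called the coda.

```python
VOWELS = set("aeiou")  # assumed vowel letters, for illustration only

def split_syllable(syllable):
    """Split one written syllable into (onset, peak, coda)."""
    # The onset is everything before the first vowel letter.
    start = 0
    while start < len(syllable) and syllable[start] not in VOWELS:
        start += 1
    if start == len(syllable):
        raise ValueError("no vowel found in syllable: %r" % syllable)
    # The peak is the run of vowel letters that starts there.
    end = start
    while end < len(syllable) and syllable[end] in VOWELS:
        end += 1
    # Whatever follows the peak is the coda.
    return syllable[:start], syllable[start:end], syllable[end:]

kit = split_syllable("kit")      # onset "k", peak "i", coda "t"
strap = split_syllable("strap")  # onset "str", peak "a", coda "p"
at = split_syllable("at")        # empty onset
```

A real system would perform this split on a phonemic transcription, since spelling and sound do not always line up.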

Velar: A phone produced when the top of the tongue touches the soft palate or velum.

See the diagram of a head for the location of the soft palate. The English sounds

represented by the letters k in kit and g in got are velar stops, produced by stopping and

then releasing the airflow out of the mouth by touching the top of the tongue on the soft

palate.

Vowel: A vowel is a sound made when the impedance to the airflow through the vocal tract is

minimal and the vocal tract is open.
