Gujarati Text-to-Speech Presentation

Text-to-Speech System for Gujarati: Project Presentation by Samyak Bhuta

description

A presentation on the development of a text-to-speech system for Gujarati. The input is arbitrary Gujarati Unicode text and the output is the equivalent speech sound.

Transcript of Gujarati Text-to-Speech Presentation

Page 1: Gujarati Text-to-Speech Presentation

Text-to-Speech System for Gujarati: Project Presentation by Samyak Bhuta

Page 2: Gujarati Text-to-Speech Presentation

* PROJECT PROFILE *

Objective: Developing a Text-to-Speech System for Gujarati

Page 3: Gujarati Text-to-Speech Presentation

* PROJECT PROFILE *

Under the guidance of

Prof. Ram Mohan and Shri Jignesh Dholakia

Page 4: Gujarati Text-to-Speech Presentation

* PROJECT PROFILE *

At the Resource Centre for Indian Language Technology Solutions in Gujarati, Faculty of Arts, The M. S. University of Baroda, Baroda.

Page 5: Gujarati Text-to-Speech Presentation

Next 25 minutes …

> Sound and Speech Sound
> ABC of TTS Systems
> Pilot Project
> GTTS from scratch
> Speech, Syllable and Partneme
> Speech Sounds in detail
> Core Engine
> Language Dependent Components

Page 6: Gujarati Text-to-Speech Presentation

Sound: a flow of air

[Diagram: air flows from the source to the ear, carrying the sound]

Page 7: Gujarati Text-to-Speech Presentation

What makes different sounds?

The factors responsible for the perceptual difference between one kind of sound and another are:

• Amplitude (or volume), which tells how much power the air-flow carries
• Frequency (or pitch), which tells at what rate the air-flow repeats itself
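As a rough illustration of these two factors, the sketch below generates two pure tones that differ only in amplitude and frequency. The function name `sine_wave`, the sample rate and the chosen values are assumptions made for this example; they are not part of the presented system.

```python
import math

def sine_wave(amplitude, frequency_hz, duration_s, sample_rate=8000):
    """Return the samples of a pure tone as a list of floats."""
    n_samples = round(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * i / sample_rate)
        for i in range(n_samples)
    ]

# Same duration, different amplitude and frequency:
# two perceptually different sounds.
quiet_low = sine_wave(amplitude=0.2, frequency_hz=100, duration_s=0.01)
loud_high = sine_wave(amplitude=0.8, frequency_hz=400, duration_s=0.01)
```

The first tone never exceeds 0.2 in magnitude and completes one cycle in 80 samples; the second reaches 0.8 and completes a cycle every 20 samples.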

Page 8: Gujarati Text-to-Speech Presentation

The “Source” doesn’t matter

An air-flow of kind A will sound the same whether it was generated by source X or source Y.

Page 9: Gujarati Text-to-Speech Presentation

Speech Sound

A kind of sound whose source is the Human Vocal Organism and which finds its place in human speech, e.g. ક્, સ્, અ, ઈ.

A standard called the International Phonetic Alphabet (IPA) is used to depict such sounds.

Page 10: Gujarati Text-to-Speech Presentation

IPA

IPA comprises almost all the speech sounds of all the languages in the world.

Speech sounds are more formally known as phones.

IPA uses a set of symbols to represent them, e.g. k, s, ə, i, ʤ.

IPA Chart …

Page 11: Gujarati Text-to-Speech Presentation

IPA Chart

Page 12: Gujarati Text-to-Speech Presentation

Synthesized Speech Sound

If we can produce the same pattern of air-flow as the Human Vocal Organism produces for a speech sound, we can say that we have synthesized that speech sound.

Page 13: Gujarati Text-to-Speech Presentation

Speech Synthesizer

A mechanism capable of producing synthesized speech sound in a controlled manner.

Page 14: Gujarati Text-to-Speech Presentation

Text-to-Speech Systems

A Speech Synthesizer that is smart enough to produce speech output equivalent to the given text.

The smartness lies in making the output as natural and intelligible as possible.

Page 15: Gujarati Text-to-Speech Presentation

Text-to-Speech Systems

Usually, a TTS System is specific to one human language and takes input text from that language only.

Page 16: Gujarati Text-to-Speech Presentation

Basic structure of TTS Systems

The function of any TTS System is generally divided into three subtasks or phases:

I. Preprocessing
II. Phonetic-Prosodic Translation
III. Speech Production

The text input travels through these phases, one by one, and eventually ends up as speech.
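The three-phase flow can be pictured as a simple function pipeline. The sketch below is only a schematic: `run_tts` and the three toy stages are hypothetical names invented for this example, and the toy stages do nothing realistic.

```python
def run_tts(text, preprocess, translate, produce):
    """Pass the input through the three phases, one after another."""
    words = preprocess(text)     # I.   raw text -> pronounceable words
    phones = translate(words)    # II.  words -> phonetic and prosodic data
    return produce(phones)       # III. phonetic and prosodic data -> speech

# Toy stand-ins for the three phases (hypothetical, for illustration only).
def toy_preprocess(text):
    return text.replace("Dr.", "Doctor").split()

def toy_translate(words):
    return [word.lower() for word in words]

def toy_produce(phones):
    return "<speech: " + " ".join(phones) + ">"

result = run_tts("Dr. Shah", toy_preprocess, toy_translate, toy_produce)
```

Keeping each phase behind its own function means one phase can be replaced without touching the others.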

Page 17: Gujarati Text-to-Speech Presentation

Preprocessing

“Dr. Ajay Shah will come to clinic on 23 ,Jan.”

We read it as …

“DOCTOR Ajay Shah will come to clinic on TWENTY THIRD OF JANUARY”.

Preprocessing converts the input text from its raw condition into pronounceable words.
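A minimal sketch of this kind of expansion, using the slide's own English sentence, might look as follows. The lookup tables are deliberately tiny and the names (`TITLES`, `MONTHS`, `ORDINALS`, `normalize`) are assumptions made for this example; a real preprocessor would need far larger tables and Gujarati-specific rules.

```python
import re

# Tiny illustrative tables; real ones would be much larger.
TITLES = {"Dr.": "Doctor"}
MONTHS = {"Jan": "January"}
ORDINALS = {1: "first", 2: "second", 3: "third", 23: "twenty third"}

# A day number, an optional stray comma, then an abbreviated month: "23 ,Jan."
DATE = re.compile(r"(\d{1,2})\s*,\s*(" + "|".join(MONTHS) + r")\.")

def expand_date(match):
    day = int(match.group(1))
    month = MONTHS[match.group(2)]
    # Fall back to the digits when the day is missing from the tiny table.
    return f"{ORDINALS.get(day, match.group(1))} of {month}"

def normalize(text):
    """Turn raw text into pronounceable words."""
    text = DATE.sub(expand_date, text)
    for abbreviation, expansion in TITLES.items():
        text = text.replace(abbreviation, expansion)
    return text

print(normalize("Dr. Ajay Shah will come to clinic on 23 ,Jan."))
```

Dates are expanded before titles so that the trailing full stop of "Jan." is consumed by the date rule.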

Page 18: Gujarati Text-to-Speech Presentation

Phonetic-Prosodic Translation

This phase can be logically divided into two phases:

• Phonetic Translation
• Prosodic Translation

Real TTS Systems may implement these phases separately or as a unit, but together they provide the data for the next phase of TTS.

Page 19: Gujarati Text-to-Speech Presentation

Phonetic Translation

In human languages, the script in use doesn’t necessarily possess a one-to-one mapping with speech.

e.g. “enough” is pronounced INAF, /inəf/ in IPA

છોકરો is pronounced /ʧokɾo/ in IPA

Page 20: Gujarati Text-to-Speech Presentation

Phonetic Translation

Phonetic Translation tells the next phase exactly which speech sounds (phones) are to be produced for the given text.

Phonetic Translation is also referred to as Letter-to-Sound rules.

Page 21: Gujarati Text-to-Speech Presentation

Prosodic Translation

Mapping through letter-to-sound rules only provides information about the kind of speech sound to be generated. To convey the emotions and expressions residing in the input text, Prosody needs to be applied.

By Prosody we mean

Amplitude + Pitch + Duration
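One way to picture the output of this phase is a list of phones, each carrying the three prosodic values. The sketch below is an assumption about how such data could be represented; the class name `ProsodicPhone`, the field names and the default numbers are invented for illustration and are not taken from the project.

```python
from dataclasses import dataclass

@dataclass
class ProsodicPhone:
    symbol: str        # IPA symbol of the phone
    amplitude: float   # relative loudness, 0.0 to 1.0
    pitch_hz: float    # fundamental frequency
    duration_ms: int   # how long the phone lasts

def with_flat_prosody(phones, amplitude=0.5, pitch_hz=120.0, duration_ms=80):
    """Attach the same (arbitrary) prosodic values to every phone."""
    return [ProsodicPhone(p, amplitude, pitch_hz, duration_ms) for p in phones]

utterance = with_flat_prosody(["ɾ", "a", "m"])
total_ms = sum(phone.duration_ms for phone in utterance)
```

A real prosodic translation would vary these values from phone to phone; the flat defaults here only show where such values would live.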

Page 22: Gujarati Text-to-Speech Presentation

Speech Production

This phase is responsible for the actual output of speech.

It uses the phonetic and prosodic information provided by the previous phase.

Various approaches exist for the production of speech.

Page 23: Gujarati Text-to-Speech Presentation

Different ways for Speech Production

Three widely used approaches for speech production are:

• Articulatory Synthesis
• Source-Filter Synthesis
• Concatenative Synthesis

The speech production part of a TTS System is generally referred to as the speech engine.

Page 24: Gujarati Text-to-Speech Presentation

Use cases

Having understood the structure of TTS Systems, we realized that all three phases are required in order to develop a complete TTS for Gujarati.

At the topmost level of abstraction, a single use case can be conceived: fulfilling the requirement of having a TTS System for Gujarati.

Page 25: Gujarati Text-to-Speech Presentation

Use cases

The topmost use case can then be divided into three further use cases, each fulfilling the requirement of one of the three phases.

During the project we tried to realize each use case one by one.

Page 26: Gujarati Text-to-Speech Presentation

Pilot Project

As we approached the various requirements and use cases to be realized, we found that developing a Preprocessor was not as significant as developing the other two phases, so we decided to develop it later.

We decided to develop the Phonetic-Prosodic Translation phase first, as it can easily be plugged into any already built …

Page 27: Gujarati Text-to-Speech Presentation

Pilot Project

… speech engine that takes its input in terms of IPA. FreeTTS, IBMJS, Dhvani and Narad were studied.

We used the Java Speech API with IBMJS as the speech engine.

The input to the engine was provided through Java Speech Markup Language (JSML).

Page 28: Gujarati Text-to-Speech Presentation

Pilot Project : Objective

To develop a TTS System using an already available Speech Engine, supplying the engine with the IPA transcription equivalent to the target Gujarati Unicode text.

Page 29: Gujarati Text-to-Speech Presentation

Pilot Project : S/W Requirement

A Speech Engine Component which takes IPA and speaks it out.

Page 30: Gujarati Text-to-Speech Presentation

Pilot Project : Design

A number of use cases were conceived, and their implementations were provided as separate Java classes.

Page 31: Gujarati Text-to-Speech Presentation

Pilot Project : Conclusion

We cannot continue developing a TTS System with an “outsider” speech engine, as the accent and other characteristics need to be Gujarati in nature.

Page 32: Gujarati Text-to-Speech Presentation

Starting of GTTS from Scratch

From the result of the Pilot Project we concluded that the Speech Engine must be developed with Gujarati in mind.

The Concatenative approach was chosen since it provides naturalness and has a proven track record.

Page 33: Gujarati Text-to-Speech Presentation

Concatenation

In the Concatenative approach, already stored segments of sound are joined together to produce the complete speech.

Such segments are known as concatenation units.

We used Partnemes as our concatenation unit.

Page 34: Gujarati Text-to-Speech Presentation

Partnemes

A Partneme is a very small segment of sound whose typical length ranges from 8 ms to 100 ms. We get partnemes by cutting recorded speech.

But before understanding what a partneme is, we have to understand human speech in greater detail, especially the relation between speech and the syllable.

Page 35: Gujarati Text-to-Speech Presentation

How we speak ?

During normal breathing, the period we devote to breathing in is longer than the period of breathing out in a complete breath cycle.

But when we start speaking, the breath-in period becomes shorter, paving the way for a longer breath-out period.

This is because to speak (anything) we need some air-flow. We use the air-flow …

Page 36: Gujarati Text-to-Speech Presentation

How we speak ? : Human Vocal Tract

… powered by the lungs, during breath-out.

This air-flow is modified at various points of the Human Vocal Tract, ending up as one or another kind of speech sound (phone).

The Human Vocal Tract comprises various organs which, in one way or another, change the air-flow.

Human Vocal Tract …

Page 37: Gujarati Text-to-Speech Presentation

[Figure: Human Vocal Tract]

Page 38: Gujarati Text-to-Speech Presentation

Page 39: Gujarati Text-to-Speech Presentation

How we speak ? : Syllable and Speech

During one complete breath cycle we can speak more than one phone.

All the phones spoken in just one breath cycle constitute a syllable.

A continuous sequence of such syllables forms speech.

Page 40: Gujarati Text-to-Speech Presentation

How we speak ? : Syllable Structure

It is important to know the structure of the syllable in order to understand partnemes.

Typically a syllable is made up of a vowel as its nucleus with consonants around it.

Gujarati employs the following syllable structure.

< C + C + C + V + V̯ + C + C >

Page 41: Gujarati Text-to-Speech Presentation

How we speak ? : Syllable Structure

< C + C + C + V + V̯ + C + C >

where C - consonant

V - vowel

V̯ - non-syllabic vowel

An utterance (spoken word) is made up of a series of such syllables.

Page 42: Gujarati Text-to-Speech Presentation

How we speak ? : Syllable Structure

રામ - ɾam is made up of a single syllable. Here the structure becomes < ɾC + aV + mC >.

પત્ર - pətɾ is also made up of a single syllable. Here the structure becomes < pC + əV + tC + ɾC >.

લશ્કર - ləʃkəɾ is made up of two syllables. Here the structure becomes < lC + əV + ʃC > < kC + əV + ɾC >.
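The syllable template can be checked mechanically. The sketch below assumes that every slot except the vowel nucleus is optional (which agrees with the three examples, but is not stated outright on the slides), uses a small illustrative vowel set, and treats any phone written with the IPA non-syllabic mark (U+032F) as the V̯ slot.

```python
import re

VOWELS = {"a", "ə", "i", "u", "e", "o"}   # small illustrative set
NON_SYLLABIC_MARK = "\u032f"              # combining inverted breve below

# C{0,3} V G? C{0,2}, where G stands for the non-syllabic vowel slot.
TEMPLATE = re.compile(r"C{0,3}VG?C{0,2}")

def skeleton(phones):
    """Map each phone to C, V or G (non-syllabic vowel)."""
    out = []
    for phone in phones:
        if phone.endswith(NON_SYLLABIC_MARK):
            out.append("G")
        elif phone in VOWELS:
            out.append("V")
        else:
            out.append("C")
    return "".join(out)

def fits_template(phones):
    return TEMPLATE.fullmatch(skeleton(phones)) is not None

print(skeleton(["p", "ə", "t", "ɾ"]), fits_template(["p", "ə", "t", "ɾ"]))
```

With this check, ɾam and pətɾ each pass as one syllable, while ləʃkəɾ passes only when split into ləʃ and kəɾ.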

Page 43: Gujarati Text-to-Speech Presentation

How we speak ? : Consonants and Vowels

Consonants and vowels are two different kinds of speech sound with different acoustic parameters.

To know the exact difference between consonants and vowels, we have to understand how a single vocal tract is capable of producing so many different sounds.

Page 44: Gujarati Text-to-Speech Presentation

How we speak ? : Articulation

Modification of the air-flow is achieved by articulation of the various speech organs of the vocal tract.

The exact nature of the speech sound that emerges during breath-out is determined by

1 Place of Articulation

2 Manner of Articulation

Page 45: Gujarati Text-to-Speech Presentation

How we speak ? : Place of articulation

Place of articulation refers to the exact point in the human vocal tract where articulation happens.

e.g. [p] - two lips

[k] - back of tongue with velum

[ɾ] - tip of tongue with alveolar ridge

Page 46: Gujarati Text-to-Speech Presentation

How we speak ? : Manner of articulation

Manner of articulation refers to the degree of constriction made during articulation.

e.g. [p] - stop or plosive

[ʧ] - affricate

[ɾ] - tapped

[ j ] - glide

[ o ] - vowel ( no constriction )

Page 47: Gujarati Text-to-Speech Presentation

How we speak ? : Voicedness

If the vocal cords are vibrating (and thus changing the air-flow) as the air-flow travels through the glottis, we get a voiced sound.

e.g. [g] - voiced

[k] - unvoiced

Page 48: Gujarati Text-to-Speech Presentation

How we speak ? : Aspiration

Aspiration refers to the state of the vocal cords during the final stage of speaking a phone. When we speak an aspirated phone, the vocal cords approach the vibrating state as time passes (irrespective of the phone's voicedness).

e.g. [kʰ ] - aspirated

[ k ] - unaspirated
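The four properties covered so far (place, manner, voicedness and aspiration) can be collected into one small feature record per phone. The sketch below is an illustration only: the class name `Phone`, the table `PHONES` and the helper are hypothetical, and the table lists just a handful of the phones used as examples in these slides.

```python
from typing import NamedTuple

class Phone(NamedTuple):
    place: str       # place of articulation
    manner: str      # manner of articulation
    voiced: bool     # are the vocal cords vibrating?
    aspirated: bool

# A handful of phones from the examples above.
PHONES = {
    "p":  Phone("bilabial", "stop", voiced=False, aspirated=False),
    "k":  Phone("velar",    "stop", voiced=False, aspirated=False),
    "g":  Phone("velar",    "stop", voiced=True,  aspirated=False),
    "kʰ": Phone("velar",    "stop", voiced=False, aspirated=True),
    "ɾ":  Phone("alveolar", "tap",  voiced=True,  aspirated=False),
}

def differing_features(a, b):
    """Name the features in which two phones differ."""
    return [
        field for field in Phone._fields
        if getattr(PHONES[a], field) != getattr(PHONES[b], field)
    ]

print(differing_features("k", "g"))
print(differing_features("k", "kʰ"))
```

Seen this way, [k] and [g] differ only in voicedness, and [k] and [kʰ] only in aspiration.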

Page 49: Gujarati Text-to-Speech Presentation

Segmentation and Partneme

Segmentation of partnemes is achieved by cutting up the recorded syllable.

Shown is the sound waveform for ગમન, built from partnemes. Red lines mark the separations.

Page 50: Gujarati Text-to-Speech Presentation

Partnemes

As shown, a syllable is logically divided into:

• null sound to consonant transition
• core consonant
• consonant to vowel transition
• core vowel
• vowel to consonant transition
• core consonant
• consonant to null sound transition

Page 51: Gujarati Text-to-Speech Presentation

Partnemes

If we can provide the partnemes for each vowel and consonant, we can join them accordingly to produce any complete syllable and hence any utterance.

e.g.

કરણ - kəɾəɳ

0_k;k;k_ə;ə;ə_ɾ;ɾ;ɾ_ə;ə;ə_ɳ;ɳ;ɳ_0

Page 52: Gujarati Text-to-Speech Presentation

ભારત - bʰaɾət

0_bʰ;bʰ;bʰ_a;a;a_ɾ;ɾ;ɾ_ə;ə;ə_t;t;t_0
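The partneme sequences in these examples follow a regular pattern: a transition into each phone, then the core of that phone, with "0" standing for the null sound at both ends. A minimal sketch that reproduces the pattern is given below; the function name `partneme_sequence` is an assumption for this example, not the project's own code.

```python
def partneme_sequence(phones):
    """Build the partneme sequence for a list of IPA phones."""
    units = []
    previous = "0"                            # "0" is the null sound
    for phone in phones:
        units.append(f"{previous}_{phone}")   # transition into the phone
        units.append(phone)                   # core of the phone
        previous = phone
    units.append(f"{previous}_0")             # transition back to the null sound
    return ";".join(units)

print(partneme_sequence(["k", "ə", "ɾ", "ə", "ɳ"]))
print(partneme_sequence(["bʰ", "a", "ɾ", "ə", "t"]))
```

Each phone contributes two units and one closing transition is added, so n phones give 2n + 1 partnemes.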

Page 53: Gujarati Text-to-Speech Presentation

Core Engine

The speech engine we developed to concatenate such a partneme sequence, based on the given IPA, uses a pair of files. One, called the Voice File, contains the audio data of all the partnemes. The other serves as a reference to the Voice File and is called the Voice Info File. It contains the position and length of each partneme in the Voice File.
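The slides do not give the actual layout of the two files, so the sketch below makes one up: the Voice File is modelled as a byte string and the Voice Info File as a mapping from partneme name to an (offset, length) pair. All names and the toy data are hypothetical.

```python
def concatenate(sequence, voice_data, voice_info):
    """Join the audio of each partneme named in a ';'-separated sequence."""
    audio = bytearray()
    for name in sequence.split(";"):
        offset, length = voice_info[name]        # where the partneme lives
        audio += voice_data[offset:offset + length]
    return bytes(audio)

# Toy stand-ins: letters play the role of audio samples.
voice_data = b"AAABBCCCC"
voice_info = {
    "0_k": (0, 3),   # 3 bytes starting at offset 0
    "k":   (3, 2),
    "k_0": (5, 4),
}

speech = concatenate("0_k;k;k_0", voice_data, voice_info)
```

Keeping the index separate from the audio means the engine itself needs no knowledge of any particular language; a different pair of files would give a different voice.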

Page 54: Gujarati Text-to-Speech Presentation

Core Engine

The Core Engine realizes the use case of having a speech engine.

Page 55: Gujarati Text-to-Speech Presentation

Language Dependent Components

Since the Core Engine only understands an IPA sequence, we have to provide a component which translates Gujarati text into an IPA sequence.

Preprocessing capabilities also need to be developed for a complete TTS System.

Unlike the Core Engine, both of these components are specific to a particular language and …

Page 56: Gujarati Text-to-Speech Presentation

Language Dependent Components

… are therefore kept aside as language dependent components.

Preprocessor: As preprocessing should be highly customizable by the end user, we have provided a text file which can be edited to control the functionality of the preprocessor.
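The format of that text file is not described in the slides. Purely as an illustration, the sketch below assumes one rule per line in the form `pattern => replacement`, with `#` starting a comment; the format and the function names are hypothetical.

```python
# Hypothetical rule file contents: one "pattern => replacement" per line.
RULES_TEXT = """\
# abbreviation => expansion
Dr. => Doctor
Jan. => January
"""

def load_rules(text):
    """Parse the rule text into a list of (pattern, replacement) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        pattern, replacement = (part.strip() for part in line.split("=>", 1))
        rules.append((pattern, replacement))
    return rules

def apply_rules(text, rules):
    """Apply every rule in order as a plain substring replacement."""
    for pattern, replacement in rules:
        text = text.replace(pattern, replacement)
    return text

print(apply_rules("Dr. Shah", load_rules(RULES_TEXT)))
```

Because the rules live in plain text, an end user can add or change expansions without touching the program.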

Page 57: Gujarati Text-to-Speech Presentation

IPATranscriptor: This component currently provides only phonetic translation of the given Gujarati text, as complete rules for prosodic translation are not available.
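A very reduced sketch of such a phonetic translation is shown below. It is not the project's IPATranscriptor: the tables cover only the letters needed for the example words in these slides, and the only rule applied is that a consonant carries an inherent ə unless it is followed by a vowel sign or a virama, or stands at the end of the word. Word-internal schwa deletion, as in છોકરો /ʧokɾo/, is not handled.

```python
# Partial tables, covering only the letters used in the example words.
CONSONANTS = {
    "ક": "k", "ગ": "g", "ણ": "ɳ", "ત": "t", "ન": "n", "પ": "p",
    "ભ": "bʰ", "મ": "m", "ર": "ɾ", "લ": "l", "શ": "ʃ",
}
VOWEL_SIGNS = {"ા": "a", "િ": "i", "ો": "o"}
VIRAMA = "્"   # suppresses the inherent vowel
SCHWA = "ə"

def to_ipa(word):
    """Naive letter-to-sound conversion for one Gujarati word."""
    phones = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else None
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
            # Inherent schwa, unless a vowel sign, a virama or the
            # end of the word follows.
            if nxt is not None and nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                phones.append(SCHWA)
        elif ch in VOWEL_SIGNS:
            phones.append(VOWEL_SIGNS[ch])
        elif ch != VIRAMA:
            raise ValueError(f"letter not in the illustrative tables: {ch!r}")
    return "".join(phones)

for word in ["રામ", "પત્ર", "લશ્કર", "કરણ", "ભારત"]:
    print(word, to_ipa(word))
```

Even this small rule set reproduces the transcriptions quoted in the slides for these five words.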

Page 58: Gujarati Text-to-Speech Presentation

Thanks

Prof. Bhartiben Modi
Mr. Ajay Sarvaiya
Mr. Irshad Shaikh
Mr. Mihir Trivedi

Page 59: Gujarati Text-to-Speech Presentation

Sloka

Having grasped the meanings through the intellect, the soul joins the mind with the desire to utter. The mind kindles the bodily fire, and that (bodily fire) impels the vital breath. That impelled breath, striking the crown of the head, reaches the mouth and, passing through the respective places of articulation, brings forth speech sounds (varṇa) of five kinds through the contribution of tone (svara), duration (kāla), place (sthāna), and external and internal effort (prayatna).

- Pāṇinīya Śikṣā, tenth chapter, kārikās 6 and 9.