MobAppDev (Fall 2014): Basics of Speech Synthesis

MobAppDev Basics of Speech Synthesis Vladimir Kulyukin www.vkedco.blogspot.com


Page 1: MobAppDev (Fall 2014): Basics of Speech Synthesis

MobAppDev

Basics of Speech Synthesis

Vladimir Kulyukin

www.vkedco.blogspot.com

Page 2: MobAppDev (Fall 2014): Basics of Speech Synthesis

Outline

● Background on Speech Synthesis

● Text-to-Speech Synthesis (TTS)

● TTS on Android

● Practical Approaches to TTS: Phonetic Spelling & Human Recording

Page 3: MobAppDev (Fall 2014): Basics of Speech Synthesis

Background on

Speech Synthesis

Page 4: MobAppDev (Fall 2014): Basics of Speech Synthesis

What is Speech Synthesis

● Speech synthesis refers to the artificial production of human speech

● Speech synthesizers take a symbolic linguistic representation and generate audio streams

● The input to speech synthesizers can be a script in some language or a phonetic transcription

Page 5: MobAppDev (Fall 2014): Basics of Speech Synthesis

Phonetic Transcription

● Phonetic transcription is a formal system that describes how words are pronounced in a specific language (the language's phonology)

● There are many formalisms used for phonetic transcriptions

● International Phonetic Alphabet (IPA) is one of the better known systems of phonetic notation

Page 6: MobAppDev (Fall 2014): Basics of Speech Synthesis

International Phonetic Alphabet

Page 7: MobAppDev (Fall 2014): Basics of Speech Synthesis

Phones, Allophones, & Phonemes

● A phone is a unit of speech sound

● A phoneme is the smallest unit of sound that distinguishes one word from another in a language

● An allophone is one of several spoken phones that can realize a single phoneme

● In English, /p/ is a phoneme with two allophones: [pʰ] and [p]

● [pʰ] is the aspirated allophone of /p/ (e.g., paper)

● [p] is the unaspirated allophone of /p/ (e.g., speak)
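The choice between the two allophones of /p/ can be sketched in a few lines of Java. This is a deliberately rough illustration, not a phonological model: the class and method names are hypothetical, the rule works on spelling rather than on phonemes, and it assumes only one rule (unaspirated directly after "s", aspirated otherwise).

```java
public class AllophoneSketch {

    /**
     * Returns the allophone of /p/ for the letter 'p' at index i of a
     * lowercase word. Simplified rule (an assumption for illustration):
     * /p/ is unaspirated right after "s" and aspirated everywhere else.
     */
    public static String allophoneOfP(String word, int i) {
        if (word.charAt(i) != 'p') {
            throw new IllegalArgumentException("no 'p' at index " + i);
        }
        boolean afterS = i > 0 && word.charAt(i - 1) == 's';
        // \u02B0 is the IPA superscript h that marks aspiration.
        return afterS ? "[p]" : "[p\u02B0]";
    }

    public static void main(String[] args) {
        System.out.println("paper: " + allophoneOfP("paper", 0));
        System.out.println("speak: " + allophoneOfP("speak", 1));
    }
}
```

Both results are realizations of the same phoneme /p/; swapping one for the other changes how a word sounds but not which word it is.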

Page 8: MobAppDev (Fall 2014): Basics of Speech Synthesis

Intonation, Tone, & Prosody

● Intonation is the variation of pitch used to convey the speaker's emotion (joyful, calm, angry, etc.) and the utterance type (question vs. statement)

● Tone is the use of pitch to distinguish individual words, as in tonal languages such as Mandarin

● Prosody is the combination of rhythm, stress, and intonation

Page 9: MobAppDev (Fall 2014): Basics of Speech Synthesis

Text to Speech Synthesis (TTS)

Page 10: MobAppDev (Fall 2014): Basics of Speech Synthesis

TTS: Text To Speech

● The General Problem: Take a sequence of characters and generate a waveform

● Words are pronounced as a sequence of individual units called phones

● Phonetic alphabets describe how phones are pronounced

● Phonological rules specify how phones combine into speech

Page 11: MobAppDev (Fall 2014): Basics of Speech Synthesis

TTS Engine Anatomy

● A typical TTS engine consists of three components: a text analyzer, a linguistic analyzer, and a waveform generator

● Text Analysis – parse text (after transliterating it if necessary) and identify words and utterances

● Linguistic Analysis – identify phrases and assign prosodies (accents, emphasis, duration, pauses, etc.)

● Waveform Generation – generate a waveform from a fully specified linguistic description
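The three stages can be sketched as a toy pipeline in plain Java. Everything here is an assumption made for illustration: the class and method names are hypothetical, the prosody rule (questions rise, everything else falls) is a simplification, and the "waveform" is a placeholder tone whose pitch glides up or down, not synthesized speech.

```java
import java.util.ArrayList;
import java.util.List;

public class TtsPipelineSketch {

    /** Stage 1, text analysis: split raw text into utterances at . ? ! */
    public static List<String> analyzeText(String text) {
        List<String> utterances = new ArrayList<>();
        for (String part : text.trim().split("(?<=[.?!])\\s+")) {
            if (!part.isEmpty()) {
                utterances.add(part);
            }
        }
        return utterances;
    }

    /** Stage 2, linguistic analysis: assign a toy pitch contour. */
    public static String assignProsody(String utterance) {
        return utterance.endsWith("?") ? "rising" : "falling";
    }

    /** Stage 3, waveform generation: a placeholder tone, 400 samples per character. */
    public static double[] generateWaveform(String utterance, String prosody) {
        int sampleRate = 8000;                       // samples per second
        int n = utterance.length() * 400;            // 50 ms per character
        double startHz = 200.0;
        double endHz = prosody.equals("rising") ? 260.0 : 140.0;
        double[] samples = new double[n];
        double phase = 0.0;
        for (int i = 0; i < n; i++) {
            // Glide linearly from startHz to endHz over the utterance.
            double hz = startHz + (endHz - startHz) * i / Math.max(1, n - 1);
            phase += 2 * Math.PI * hz / sampleRate;
            samples[i] = Math.sin(phase);
        }
        return samples;
    }

    public static void main(String[] args) {
        for (String u : analyzeText("Is it raining? Yes, it is.")) {
            String prosody = assignProsody(u);
            double[] wave = generateWaveform(u, prosody);
            System.out.println(u + " -> " + prosody + ", " + wave.length + " samples");
        }
    }
}
```

Each stage consumes only the output of the previous one, which is why a real engine can swap its waveform generator without touching text analysis.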

Page 12: MobAppDev (Fall 2014): Basics of Speech Synthesis

TTS Approaches

● Full Automation – machine does everything

● Mixed Initiative – human records a set of known texts; machine learning is used to extract the rules

● Human-Based Recording – human records words/sentences/texts; machine plays them as needed

Page 13: MobAppDev (Fall 2014): Basics of Speech Synthesis

Android TTS

● Android TTS is a multilingual speech synthesis engine

● Android TTS can be used as a black box: text in, speech out

● Android TTS can be parameterized
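As a rough sketch of what parameterizing looks like, the helper below sets the language, pitch, and speech rate before speaking. The class name TtsParams and the method speakSlowly are hypothetical; the TextToSpeech calls are from the Android SDK. It assumes the engine has already been initialized successfully and it only compiles inside an Android project.

```java
import android.speech.tts.TextToSpeech;
import java.util.Locale;

public final class TtsParams {
    private TtsParams() {}

    /**
     * Configures an already initialized engine and speaks one utterance.
     * Returns false if US English is missing or unsupported, or if the
     * utterance could not be queued.
     */
    public static boolean speakSlowly(TextToSpeech tts, String text) {
        int availability = tts.setLanguage(Locale.US);
        if (availability == TextToSpeech.LANG_MISSING_DATA
                || availability == TextToSpeech.LANG_NOT_SUPPORTED) {
            return false;
        }
        tts.setPitch(1.0f);       // 1.0 is the normal pitch
        tts.setSpeechRate(0.8f);  // 1.0 is the normal rate; lower is slower
        // QUEUE_FLUSH drops anything still waiting in the playback queue.
        // This three-argument speak() was deprecated in API 21 in favor
        // of a four-argument overload.
        int result = tts.speak(text, TextToSpeech.QUEUE_FLUSH, null);
        return result == TextToSpeech.SUCCESS;
    }
}
```

The values 1.0f and 0.8f are arbitrary choices for the example; suitable settings depend on the voice and the audience.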

Page 14: MobAppDev (Fall 2014): Basics of Speech Synthesis

Starting TTS

● It is best practice to check whether TTS is available on the device

● The check is done by sending an Intent whose action is ACTION_CHECK_TTS_DATA

● If the check is successful, an instance of TTS can be created

● An Activity that uses TTS implements the TextToSpeech.OnInitListener interface

Page 15: MobAppDev (Fall 2014): Basics of Speech Synthesis

Checking TTS Availability

In Activity.onCreate():

Intent ttsi = new Intent();

ttsi.setAction(TextToSpeech.Engine.ACTION_CHECK_TTS_DATA);

startActivityForResult(ttsi, CHECK_TTS_REQ);

Page 16: MobAppDev (Fall 2014): Basics of Speech Synthesis

Creating TTS Instance

In Activity.onActivityResult():

switch ( result_code ) {

case TextToSpeech.Engine.CHECK_VOICE_DATA_PASS:

mTTS = new TextToSpeech(this, this);

break;

}

Page 17: MobAppDev (Fall 2014): Basics of Speech Synthesis

Handling Missing TTS Data

In Activity.onActivityResult(), when the check reports that TTS data is missing:

Intent insti = new Intent();

insti.setAction(TextToSpeech.Engine.ACTION_INSTALL_TTS_DATA);

startActivity(insti);

Page 18: MobAppDev (Fall 2014): Basics of Speech Synthesis

Handling TTS Unavailability

In Activity.onActivityResult():

switch ( result_code ) {

case TextToSpeech.Engine.CHECK_VOICE_DATA_FAIL:

// Let the user know that TTS is not available – Toast, Log Message, Notification, etc.

}

Page 19: MobAppDev (Fall 2014): Basics of Speech Synthesis

What To Do When TTS Is Ready

Implement OnInitListener.onInit() in the Activity:

@Override

public void onInit(int status) {

if ( status == TextToSpeech.SUCCESS ) {

// do something when you know that TTS is available

}

}

Page 20: MobAppDev (Fall 2014): Basics of Speech Synthesis

Overriding onPause() and onDestroy()

● When your Activity is paused (e.g., it loses focus), have TTS stop synthesizing

● When your Activity is destroyed, shut TTS down to notify Android that the resources can be released and given to other activities or applications

Page 21: MobAppDev (Fall 2014): Basics of Speech Synthesis

Overriding onPause() and onDestroy()

public void onPause() {

super.onPause();

if ( mTTS != null ) mTTS.stop();

}

Page 22: MobAppDev (Fall 2014): Basics of Speech Synthesis

Overriding onPause() and onDestroy()

public void onDestroy() {

// Release the engine before the Activity is torn down

if ( mTTS != null ) mTTS.shutdown();

super.onDestroy();

}

Page 23: MobAppDev (Fall 2014): Basics of Speech Synthesis

Overcoming TTS Limitations

● Every TTS engine mispronounces some words

● There are two ways of overcoming this limitation:

● Phonetic spelling: spell mispronounced words the way they sound, generate waveforms, and associate words with waveforms

● Human recording: have a human record mispronounced words and use the files
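The phonetic-spelling approach can be sketched as a small lexicon that rewrites text before it is handed to the engine. The class and method names are hypothetical, and the example entry assumes an engine that mispronounces "quay" (which is pronounced like "key"); which words a given engine actually gets wrong has to be found by listening.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneticSpellingSketch {
    // Alternating runs of word characters and non-word characters.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z']+|[^A-Za-z']+");

    private final Map<String, String> respellings = new HashMap<>();

    /** Registers a sound-alike spelling for a word the engine mispronounces. */
    public void addRespelling(String word, String phonetic) {
        respellings.put(word.toLowerCase(Locale.ROOT), phonetic);
    }

    /** Replaces known whole words (case-insensitively); leaves the rest intact. */
    public String rewrite(String text) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            String replacement = respellings.get(token.toLowerCase(Locale.ROOT));
            out.append(replacement != null ? replacement : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        PhoneticSpellingSketch lexicon = new PhoneticSpellingSketch();
        lexicon.addRespelling("quay", "key");
        System.out.println(lexicon.rewrite("Meet me at the quay."));
    }
}
```

The rewritten string is what gets spoken, while the original text can still be shown on screen. For the human-recording approach, Android's TextToSpeech.addSpeech(String text, String filename) associates a piece of text with a recorded sound file, so the engine plays the recording instead of synthesizing that text.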

Page 24: MobAppDev (Fall 2014): Basics of Speech Synthesis

References & Reading Suggestions

● http://developer.android.com/reference/android/speech/tts/TextToSpeech.html