

Introduction Objective Motivation Literature Review Story Corpus Story Synthesis Framework Predicting the Position of Pause Prediction of Pause Duration Future Work Acknowledgments References

Pause Modeling for Storytelling Style Speech Synthesis

First Seminar by

Parakrant Sarkar [Roll No: 12IT72P08]

Under the Supervision of Dr. K. Sreenivasa Rao

School of Information Technology, Indian Institute of Technology Kharagpur

October 7, 2015


OUTLINE

Introduction

Objective

Motivation

Literature Review

Story Corpus

Story Synthesis Framework

Predicting the Position of Pause

Prediction of Pause Duration

Future Work

Acknowledgments

References


INTRODUCTION


What is storytelling?

“Storytelling is the art of using language, vocalization and/or physical movement and gesture to reveal the elements of a story to a specific, live audience.” – National Storytelling Association

Source: http://www.eldrbarry.net/roos/st defn.htm


OBJECTIVE


(Figure: story text, neutral TTS, incorporating the story-specific information)

- Development of a story Text-to-Speech (TTS) system for Hindi built on the neutral TTS.

- Modeling of pauses for storytelling-style speech.


MOTIVATION


- Pauses play a vital role in synthesizing storytelling-style speech.

- Variation in pause duration enhances the quality of the narrated story.

- Pauses separate phrases and emphasize keywords, which introduces suspense and climax into the story.


LITERATURE REVIEW


Authors | Year | Model | Database | Contribution
--- | --- | --- | --- | ---
Paul Taylor et al. | 1998 | Hidden Markov models | Spoken English Corpus | Phrase breaks are assigned based on the sequence of part-of-speech (POS) tags.
Kyuchul Yoon | 2006 | Decision tree | Korean Newswire Corpus | Korean ToBI labels are used as features.
Seungwon Kim et al. | 2006 | Maximum entropy | SynthFemale01 Corpus | POS tags, a lexicon and eojeol lengths are used as features.
P. Zervas et al. | 2003 | Bayesian induction | Modern Greek Corpus | A similar set of features is used.
K. Ghosh et al. [4] | 2012 | Neural network | Neutral Bengali Corpus | Morphological, positional and structural features are used.
A. Parlikar et al. [2] | 2012 | Grammar-based approach | ARCTIC-A, Europarl, F2B, Obama and Emma corpora | A grammar is derived to learn the phrase breaks.
A. Vadapalli et al. [1] | 2013 | CART | IIIT-H Indic, IIIT-MCIT | Explored the use of word-terminal syllables to model phrase breaks.


STORY SPEECH CORPUS


SPEECH CORPUS

- 100 story texts were collected from the Panchatantra and Akbar-Birbal stories.

- Sentences per story: 20-25
- Total words: 24,400
- Duration of the speech corpus: approx. 3 hours


STORY SYNTHESIS FRAMEWORK


- Story-specific emotion detection (SSED) module
- Story-specific prosody generation (SSPG) module
- Story-specific prosody incorporation (SSPI) module


STORY SYNTHESIS INTEGRATED WITH SSPP MODULE

(Figure: story synthesis framework integrated with the SSPP module; blocks: story text, SSED module, story-specific emotion-tagged text, prosody rule-set, SSPG module, SSPP module, neutral TTS, neutral speech, SSPI module, storytelling-style speech)

- Story-specific emotion detection (SSED) module
- Story-specific prosody generation (SSPG) module
- Story-specific prosody incorporation (SSPI) module
- Story-specific pause prediction (SSPP) module
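The data flow between the modules listed above can be sketched as follows. This is a minimal, hypothetical illustration: it covers only the text side (emotion tags, pause values and per-word prosody targets), and the `Word` class, the `(pitch scale, duration scale)` rule-set format, the stage ordering and the toy SSED/SSPP stand-ins are all assumptions, not the framework's actual implementation. In the real system an SSPI stage would impose the resulting plan on the neutral TTS speech.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Word:
    text: str
    emotion: str = "neutral"       # set by the emotion-detection stage (SSED)
    pause_after_ms: float = 0.0    # set by the pause-prediction stage (SSPP)


# Hypothetical prosody rule-set: emotion -> (pitch scale, duration scale).
RuleSet = Dict[str, Tuple[float, float]]
Stage = Callable[[List[Word]], List[Word]]


def sspg(words: List[Word], rules: RuleSet) -> List[Tuple[str, float, float, float]]:
    """Attach per-word prosody targets (text, pitch, duration, pause) from the rule-set."""
    plan = []
    for w in words:
        pitch, dur = rules.get(w.emotion, (1.0, 1.0))  # neutral prosody by default
        plan.append((w.text, pitch, dur, w.pause_after_ms))
    return plan


def run_framework(text: str, ssed: Stage, sspp: Stage, rules: RuleSet):
    """Text-side data flow; an SSPI stage would apply the plan to neutral TTS speech."""
    words = [Word(t) for t in text.split()]
    words = ssed(words)
    words = sspp(words)
    return sspg(words, rules)


def toy_ssed(words: List[Word]) -> List[Word]:
    # Toy stand-in: words ending in '!' are tagged as "surprise".
    for w in words:
        if w.text.endswith("!"):
            w.emotion = "surprise"
    return words


def toy_sspp(words: List[Word]) -> List[Word]:
    # Toy stand-in: a single 200 ms pause after the last word.
    if words:
        words[-1].pause_after_ms = 200.0
    return words


plan = run_framework("the lion roared!", toy_ssed, toy_sspp, {"surprise": (1.2, 0.9)})
print(plan)
```

Passing each module in as a function keeps the sketch agnostic about how SSED and SSPP are actually built (rules, CART, or anything else).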


PREDICTING THE POSITION OF PAUSE


METHODS FOLLOWED

- Rule-based models
  - Simple rule-based model
  - Grammar-based model
- Data-driven model


SIMPLE RULE-BASED MODEL

Rule: A word whose terminal syllable has a high conditional probability of being followed by a pause should have a pause inserted after it.

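$$P(\text{pause} \mid \text{terminal syllable}) = \frac{N(\text{pause}, \text{terminal syllable})}{N(\text{terminal syllable})}, \qquad \forall\, N(\text{terminal syllable}) > 50 \tag{1}$$

where N(·) denotes an occurrence count; only terminal syllables that occur more than 50 times are considered.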
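Equation (1) is a relative-frequency estimate, so it can be computed by counting. The sketch below assumes the training data is available as one `(terminal_syllable, pause_follows)` pair per word; that input format, the helper names and the 0.5 decision threshold are hypothetical (the slides only say "high conditional probability"), and the toy counts are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple


def pause_probabilities(samples: Iterable[Tuple[str, bool]], min_count: int = 50) -> Dict[str, float]:
    """Estimate P(pause | terminal syllable) = N(pause, ts) / N(ts), as in Eq. (1).

    Each sample is (terminal_syllable, pause_follows) for one word. Syllables seen
    min_count times or fewer are dropped, mirroring the N(ts) > 50 condition.
    """
    total: Counter = Counter()
    with_pause: Counter = Counter()
    for syllable, pause in samples:
        total[syllable] += 1
        if pause:
            with_pause[syllable] += 1
    return {s: with_pause[s] / n for s, n in total.items() if n > min_count}


def words_to_pause(terminal_syllables: Sequence[str], probs: Dict[str, float],
                   threshold: float = 0.5) -> List[int]:
    """Indices of words after which the rule inserts a pause (threshold is hypothetical)."""
    return [i for i, s in enumerate(terminal_syllables) if probs.get(s, 0.0) >= threshold]


# Toy counts (made up, not taken from the story corpus).
samples = ([("juur", True)] * 56 + [("juur", False)] * 8
           + [("kar", True)] * 20 + [("kar", False)] * 31
           + [("na", True)] * 3)
probs = pause_probabilities(samples)   # "na" is dropped: only 3 occurrences
print(probs)
print(words_to_pause(["kar", "juur", "na"], probs))
```

Syllables filtered out by the count condition fall back to probability 0.0, so the rule never inserts a pause on evidence from rare syllables.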


Table: Terminal syllables with conditional probabilities

Terminal syllable | Conditional probability
--- | ---
juur | 0.875
chii | 0.75
chaat | 0.77
kho | 0.69
ho | 0.60
ta | 0.56
yo | 0.53
me | 0.51
trii | 0.40
kar | 0.39


RESULTS: SIMPLE RULE-BASED MODEL

Table: F1 measure for the simple rule-based model

Class | Recall | Precision | F1 score
--- | --- | --- | ---
Non-pause (NP) | 0.62 | 0.68 | 0.65
Pause (P) | 0.28 | 0.33 | 0.30
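Assuming the F1 score is the usual harmonic mean of precision and recall, F1 = 2PR / (P + R), it can be recomputed from the recall and precision columns. The short check below does this for the simple rule-based table; the function name is illustrative.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F1) from the simple rule-based table.
rows = {"non-pause": (0.62, 0.68, 0.65), "pause": (0.28, 0.33, 0.30)}
for name, (r, p, reported) in rows.items():
    print(name, round(f1_score(r, p), 2), reported)
```

For both classes the recomputed value rounds to the reported one, which is a quick consistency check on the table.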


GRAMMAR-BASED APPROACH

Rule: A chunk (phrase) with a high conditional probability of being followed by a pause should have a pause inserted after it.

$$P(\text{pause} \mid \text{chunk}) = \frac{N(\text{pause}, \text{chunk})}{N(\text{chunk})} \tag{2}$$


Table: Production rules of the grammar based on POS tags

Production rule | Conditional probability
--- | ---
NP → QC RP NN PSP | 0.32
NP → QC NN PSP | 0.18
VGF → VM VAUX | 0.25
VGNF → VM VAUX | 0.12
JJP → JJ | 0.44
CCP → CC | 0.14
BLK → VAUX | 0.2
RBP → RB PSP | 0.49
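Applying the grammar-based rule amounts to looking up each chunk's production rule and thresholding its P(pause | chunk) from Eq. (2). The sketch below copies the probabilities from the table; it assumes the chunks and POS tags come from an external chunker as `(chunk_label, pos_tags)` pairs, and the 0.3 threshold is a hypothetical choice, not a value from the slides.

```python
from typing import Dict, List, Sequence, Tuple

# P(pause | chunk) for each production rule, copied from the table above.
RULE_PROBS: Dict[Tuple[str, Tuple[str, ...]], float] = {
    ("NP", ("QC", "RP", "NN", "PSP")): 0.32,
    ("NP", ("QC", "NN", "PSP")): 0.18,
    ("VGF", ("VM", "VAUX")): 0.25,
    ("VGNF", ("VM", "VAUX")): 0.12,
    ("JJP", ("JJ",)): 0.44,
    ("CCP", ("CC",)): 0.14,
    ("BLK", ("VAUX",)): 0.2,
    ("RBP", ("RB", "PSP")): 0.49,
}


def predict_chunk_pauses(chunks: Sequence[Tuple[str, Sequence[str]]],
                         threshold: float = 0.3) -> List[bool]:
    """For each (chunk_label, pos_tags) pair, predict whether a pause follows it.

    Chunks whose rule is not in the table get probability 0.0 (no pause).
    The default threshold is hypothetical.
    """
    return [RULE_PROBS.get((label, tuple(tags)), 0.0) >= threshold for label, tags in chunks]


chunks = [("NP", ["QC", "RP", "NN", "PSP"]), ("JJP", ["JJ"]), ("VGF", ["VM", "VAUX"])]
print(predict_chunk_pauses(chunks))
```

Keying the table on the full right-hand side keeps rules such as VGF → VM VAUX and VGNF → VM VAUX distinct even though they expand to the same tags.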


RESULTS: GRAMMAR-BASED APPROACH

Table: F1 measure for the grammar-based approach

Class | Recall | Precision | F1 score
--- | --- | --- | ---
Non-pause (NP) | 0.69 | 0.82 | 0.74
Pause (P) | 0.67 | 0.49 | 0.57


DATA-DRIVEN MODEL

Features used for training the CART model:
- A five-gram window of words is used.
- All features are computed at the word level.

1. Positional:
   - Position of the current word.
   - Total number of words.

2. Structural:
   - Total number of phones in the word.
   - Total number of syllables in the word.
   - Total number of phones in the utterance.

3. Morphological:
   - Part of speech of the word.
   - Phonetic strength of the current word.

4. Emotion:
   - Emotion of the current word.
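A five-gram window means each word's feature vector also carries features of its neighbours. The sketch below assumes the window is the current word plus two words on either side, padded at utterance boundaries; the per-word dictionary layout, the feature keys and the `<pad>` token are hypothetical. The resulting dictionaries could then be encoded and passed to any CART-style learner (for example, scikit-learn's `DecisionTreeClassifier`).

```python
from typing import Dict, Sequence

PAD = "<pad>"  # hypothetical filler for positions outside the utterance


def window_features(words: Sequence[Dict[str, object]], i: int,
                    keys: Sequence[str]) -> Dict[str, object]:
    """Collect word-level features for positions i-2 .. i+2 (a five-gram window)."""
    feats: Dict[str, object] = {}
    for offset in range(-2, 3):
        j = i + offset
        for k in keys:
            # Feature names look like "pos[-1]", "pos[+0]", "n_syl[+2]".
            feats[f"{k}[{offset:+d}]"] = words[j][k] if 0 <= j < len(words) else PAD
    # Positional features of the current word.
    feats["position"] = i + 1
    feats["total_words"] = len(words)
    return feats


words = [{"pos": "NN", "n_syl": 2}, {"pos": "VM", "n_syl": 3}, {"pos": "PSP", "n_syl": 1}]
print(window_features(words, 0, ["pos", "n_syl"]))
```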


RESULTS: DATA-DRIVEN MODEL

Table: F1 measures for the data-driven model

Class | Recall | Precision | F1 score
--- | --- | --- | ---
Non-pause | 0.89 | 0.94 | 0.91
Pause | 0.68 | 0.81 | 0.74


ANALYSIS OF RESULTS FOR PREDICTING THE POSITION OF PAUSE

1. Among the rule-based models, the grammar-based approach predicts pauses better than the simple rule-based approach (F1 score for the pause class: 0.57 vs. 0.30).

2. Limitations of the rule-based models:
   - Deriving the rules manually is time-consuming.
   - The derived rules apply only to a particular corpus.
   - The rules may therefore not generalize to unseen text.

3. Machine learning approaches mitigate these shortcomings: the data-driven CART model reaches an F1 score of 0.74 for the pause class.


PREDICTION OF PAUSE DURATION


FEATURES USED FOR TRAINING: CART

1. Morphological:
   - Terminal syllables of the current word and of the two previous and two following words.

2. Structural:
   - Position of the vowel in the terminal syllable.
   - Number of segments (i.e., consonants) before and after the nucleus (i.e., vowel) in the terminal syllable.

3. Positional:
   - Total number of phones in the terminal syllable of the current word and of the two previous and two following words.
   - Position of the current word from the beginning and from the end of the utterance.
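The structural features of a terminal syllable can be read off its phone sequence once the vowel nucleus is located. The sketch below assumes the syllable is already segmented into phones and contains a single vowel; the romanized vowel inventory and the feature names are hypothetical stand-ins for whatever phone set the Hindi TTS uses.

```python
from typing import Dict, Sequence

# Hypothetical vowel inventory for a romanized Hindi phone set.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ai", "o", "au"}


def terminal_syllable_features(phones: Sequence[str]) -> Dict[str, int]:
    """Structural features of one syllable given as a list of phones."""
    nucleus = next((k for k, p in enumerate(phones) if p in VOWELS), None)
    if nucleus is None:
        raise ValueError(f"no vowel in syllable {list(phones)!r}")
    return {
        "n_phones": len(phones),
        "vowel_position": nucleus + 1,          # 1-based index of the nucleus
        "n_onset": nucleus,                     # consonants before the vowel
        "n_coda": len(phones) - nucleus - 1,    # consonants after the vowel
    }


print(terminal_syllable_features(["ch", "aa", "t"]))   # terminal syllable "chaat"
print(terminal_syllable_features(["t", "r", "ii"]))    # terminal syllable "trii"
```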


RESULTS: SINGLE CART MODEL

Table: Performance of the single CART model based on µ, σ and γx,y

Model | x (ms) | y (ms) | µ (ms) | σ (ms) | γx,y
--- | --- | --- | --- | --- | ---
Single CART | 241.60 | 247.21 | 107.12 | 155.87 | 0.67
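The slides report x, y, µ, σ and γx,y without defining them. A common reading in duration-modeling work, assumed here, is: x and y are the mean actual and mean predicted pause durations, µ is the mean absolute prediction error, σ is the standard deviation of the absolute errors, and γx,y is the linear (Pearson) correlation between actual and predicted durations. The sketch computes these quantities from paired duration lists; the exact definitions used in the experiments may differ.

```python
import math
from typing import Dict, Sequence


def duration_metrics(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Objective measures for pause-duration prediction (definitions are assumptions)."""
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mu = sum(errors) / n                                    # mean absolute error
    sigma = math.sqrt(sum((e - mu) ** 2 for e in errors) / n)
    mean_x = sum(actual) / n
    mean_y = sum(predicted) / n
    cov = sum((a - mean_x) * (p - mean_y) for a, p in zip(actual, predicted)) / n
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in actual) / n)
    sy = math.sqrt(sum((p - mean_y) ** 2 for p in predicted) / n)
    gamma = cov / (sx * sy) if sx > 0 and sy > 0 else float("nan")  # Pearson correlation
    return {"x": mean_x, "y": mean_y, "mu": mu, "sigma": sigma, "gamma": gamma}


m = duration_metrics([100, 200, 300], [110, 190, 320])  # toy durations in ms
print(m)
```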


Is it possible to reduce the average prediction error µ?


PROPOSED PAUSE PREDICTION MODEL

Figure: Three-stage pause prediction model. The models are trained on the story-speech corpus and applied to story text. First stage: break/non-break classification. Second stage: short/medium/long pause classification. Third stage: separate short, medium and long pause duration predictors.
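The three stages chain together as a cascade: the duration predictor for a word is only consulted if the first stage predicts a break, and which predictor is used depends on the second stage's class. The sketch below shows only this control flow; the trained models are replaced by hypothetical stand-in functions and feature keys.

```python
from typing import Callable, Dict, List, Mapping, Sequence

Features = Mapping[str, object]


def three_stage_predict(
    words: Sequence[Features],
    is_break: Callable[[Features], bool],
    pause_class: Callable[[Features], str],
    duration_models: Dict[str, Callable[[Features], float]],
) -> List[float]:
    """Predicted pause duration (ms) after each word; 0.0 means no pause."""
    durations: List[float] = []
    for f in words:
        if not is_break(f):                           # first stage: break / non-break
            durations.append(0.0)
            continue
        label = pause_class(f)                        # second stage: short / medium / long
        durations.append(duration_models[label](f))   # third stage: class-specific predictor
    return durations


# Toy stand-ins for the trained models (hypothetical feature keys and outputs).
words = [{"brk": False}, {"brk": True, "cls": "long"}, {"brk": True, "cls": "short"}]
out = three_stage_predict(
    words,
    is_break=lambda f: bool(f["brk"]),
    pause_class=lambda f: str(f["cls"]),
    duration_models={
        "short": lambda f: 90.0,
        "medium": lambda f: 210.0,
        "long": lambda f: 450.0,
    },
)
print(out)
```

Splitting by class lets each duration predictor work within a narrower range of durations, which is consistent with the lower µ of the per-class CART models compared with the single CART model.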


ANALYSIS OF PAUSES BASED ON DURATION

Pauses are classified as:
- Long pause (> 250 ms)
- Medium pause (150-250 ms)
- Short pause (< 150 ms)

Table: Pause pattern in the story-speech corpus

Pause type | Mean (ms) | Std. dev. (ms) | % in original
--- | --- | --- | ---
Long pause | 455.07 | 125.99 | 17
Medium pause | 210.12 | 32.33 | 25
Short pause | 92.97 | 29.81 | 38
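The pause classes and the per-class statistics in the table can be computed as sketched below. The duration boundaries come from the slide; treating exactly 150 ms and exactly 250 ms as medium, and using the population standard deviation, are assumptions. The toy durations are made up.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pause_type(duration_ms: float) -> str:
    """Bin a pause: long > 250 ms, medium 150-250 ms (inclusive), short < 150 ms."""
    if duration_ms > 250:
        return "long"
    if duration_ms >= 150:
        return "medium"
    return "short"


def pause_pattern(durations: Iterable[float]) -> Dict[str, Tuple[float, float, int]]:
    """Mean, population standard deviation and count of pauses per type."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for d in durations:
        groups[pause_type(d)].append(d)
    return {k: (statistics.mean(v), statistics.pstdev(v), len(v)) for k, v in groups.items()}


print(pause_pattern([100, 80, 200, 220, 400]))  # toy durations in ms
```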


RESULTS: SECOND STAGE

Table: F1 measure for short, medium and long pause prediction

Pause class | Recall | Precision | F1 score
--- | --- | --- | ---
Long pause | 0.73 | 0.62 | 0.67
Medium pause | 0.50 | 0.52 | 0.51
Short pause | 0.53 | 0.63 | 0.58


RESULTS: THIRD STAGE

Table: Performance of CART trees for long, medium and short pauses using objective measures (µ, σ and γx,y)

Model | x (ms) | y (ms) | µ (ms) | σ (ms) | γx,y
--- | --- | --- | --- | --- | ---
CART long | 347.96 | 368.71 | 76.13 | 55.05 | 0.65
CART medium | 208.43 | 199.30 | 26 | 17.03 | 0.66
CART short | 87.33 | 91.01 | 34.05 | 22.19 | 0.77
Overall | 144.44 | 147.08 | 32.38 | 22.04 | 0.57


SUMMARY AND CONCLUSION

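1. A story synthesis framework was developed on top of the neutral Hindi TTS system to synthesize storytelling-style speech.

2. The pause pattern was modeled for the story TTS: pause positions are predicted with rule-based and data-driven (CART) models, and pause durations with a three-stage model that lowers the average prediction error µ from 107.12 ms (single CART) to 32.38 ms overall.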


FUTURE WORK

1. Subjective listening tests need to be carried out.
2. Non-linear classifiers such as neural networks and support vector machines can be explored.
3. Unsupervised word-level features can be used for modeling the pauses.
4. Analysis and modeling of the pause pattern based on discourse modes.
5. Analysis and modeling of the prosodic parameters (i.e., pitch, duration, intensity) for the various modes of discourse in storytelling.
6. Prosody modeling of the TTS system built on the story speech corpus.


Acknowledgments

The authors would like to thank the Department of Information Technology, Government of India, for funding the project Development of Text-to-Speech Synthesis for Indian Languages Phase II, Ref. no. 11(7)/2011HCC(TDIL). The authors would also like to thank all the participants in the listening tests.


DISSEMINATION OF RESEARCH

Conference:

1. P. Sarkar, A. Haque, A. Kr. Dutta, G. Reddy M, Harikrishna DM, P. Dhara, R. Verma, Narendra NP, Sunil SB, J. Yadav, K. S. Rao, “Designing prosody rule-set for converting neutral TTS speech to storytelling style speech for Indian languages: Bengali, Hindi and Telugu,” in Seventh International Conference on Contemporary Computing (IC3), JIIT Noida, 7-9 Aug. 2014, pp. 473-477.

2. R. Verma, P. Sarkar, K. S. Rao, “Conversion of neutral speech to storytelling style speech,” in Eighth International Conference on Advances in Pattern Recognition (ICAPR), ISI Kolkata, 4-7 Jan. 2015, pp. 1-6.

3. P. Sarkar, K. S. Rao, “Data-driven pause prediction for speech synthesis in storytelling style speech,” in Twenty First National Conference on Communications (NCC), IIT Bombay, 27 Feb. - 1 Mar. 2015, pp. 1-5.


REFERENCES

[1] A. Vadapalli, P. Bhaskararao, and K. Prahallad, “Significance of word-terminal syllables for prediction of phrase breaks in Text-to-Speech systems for Indian Languages,” in 8th ISCA Speech Synthesis Workshop, Barcelona, Spain: ISCA, August 31 - September 2, 2013, pp. 189-194.

[2] A. Parlikar and A. W. Black, “A grammar based approach to style specific phrase prediction,” in Interspeech, 2011, pp. 2149-2152.

[3] N. S. Krishna and H. A. Murthy, “A New Prosodic Phrasing Model for Indian Language Telugu,” in INTERSPEECH, ISCA, 2004.

[4] K. Ghosh and K. Sreenivasa Rao, “Data-Driven Phrase Break Prediction for Bengali Text-to-Speech System,” in Contemporary Computing - 5th International Conference, IC3 2012, Noida, India, August 6-8, 2012. Proceedings, ser. Communications in Computer and Information Science, vol. 306. Springer Berlin Heidelberg, 2012, pp. 118-129.