
Hidden Markov Model-based speech synthesis

Junichi Yamagishi, Korin Richmond, Simon King and many others
Centre for Speech Technology Research, University of Edinburgh, UK

www.cstr.ed.ac.uk

Note

• I did not invent HMM-based speech synthesis!

• Core idea: Tokuda (Nagoya Institute of Technology, Japan)

• Developments: many other people

• Speaker adaptation: Junichi Yamagishi (Edinburgh) and colleagues

Background

Speech synthesis mini-tutorial

• Text to speech

• input: text

• output: a waveform that can be listened to

• Two main components

• front end: analyses text and converts to linguistic specification

• waveform generation: converts linguistic specification to speech



From words to linguistic specification

sil^dh-ax+k=ae, "phrase initial", "unstressed syllable", ...

sil dh ax k ae t s ae t sil

((the cat) sat)

DET NN VB

phrase initial / pitch accent / phrase final

"the cat sat"

Full context models used in synthesis

aa^b-l+ax=s@1_3/A:1_1_3/B:0-0-3@2-1&3-3#2-2$2-3!1- .....

The label encodes both phonetic and prosodic context.

Example linguistic specification

pau^pau-pau+ao=th@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$.....
pau^pau-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$.....
pau^ao-th+er=ah@2_1/A:0_0_0/B:1-1-2@1-2&1-7#1-4$.....
ao^th-er+ah=v@1_1/A:1_1_2/B:0-0-1@2-1&2-6#1-4$.....
th^er-ah+v=dh@1_2/A:0_0_1/B:1-0-2@1-1&3-5#1-3$.....
er^ah-v+dh=ax@2_1/A:0_0_1/B:1-0-2@1-1&3-5#1-3$.....
ah^v-dh+ax=d@1_2/A:1_0_2/B:0-0-2@1-1&4-4#2-3$.....
v^dh-ax+d=ey@2_1/A:1_0_2/B:0-0-2@1-1&4-4#2-3$.....

“Author of the ...”
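As a concrete illustration of where the quinphone core of these labels comes from, here is a minimal Python sketch (not the Festival/HTS front end) that assembles ll^l-c+r=rr contexts from the phone sequence for "the cat sat"; the padding symbol and everything beyond the quinphone are illustrative simplifications.

```python
# Sketch only: builds the quinphone part of a full-context label.
phones = ["sil", "dh", "ax", "k", "ae", "t", "s", "ae", "t", "sil"]

def quinphone_labels(phones, pad="pau"):
    """Yield one ll^l-c+r=rr label per phone, padding at the utterance edges."""
    padded = [pad, pad] + phones + [pad, pad]
    for i, _ in enumerate(phones):
        ll, l, c, r, rr = padded[i:i + 5]
        yield f"{ll}^{l}-{c}+{r}={rr}"

for label in quinphone_labels(phones):
    print(label)        # e.g. sil^dh-ax+k=ae
```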

From linguistic specification to speech

• Two possible methods

• Concatenate small pieces of pre-recorded speech

• Generate speech from a model


HMM mini-tutorial

• HMMs are models of sequences

• speech signals

• gene sequences

• etc

HMMs

• an HMM consists of

• sequence model: a weighted finite state network of states and transitions

• observation model: multivariate Gaussian distribution in each state

• can generate from the model

• can also use for pattern recognition (e.g., automatic speech recognition)

HMMs are generative models
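A minimal sketch of what "generative" means here: sampling observations from a toy left-to-right HMM with one Gaussian per state. The numbers are illustrative, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

means     = np.array([[0.0], [2.0], [1.0]])   # one Gaussian mean per state
variances = np.array([[0.1], [0.1], [0.1]])
self_loop = 0.6                               # probability of staying in a state

def generate(means, variances, self_loop, rng):
    """Emit one observation per time step, advancing left-to-right at random."""
    observations = []
    state = 0
    while state < len(means):
        observations.append(rng.normal(means[state], np.sqrt(variances[state])))
        if rng.random() > self_loop:          # leave the state with prob. 1 - self_loop
            state += 1
    return np.array(observations)

obs = generate(means, variances, self_loop, rng)
```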

HMM-based speech synthesis mini-tutorial

• HMMs are used to generate sequences of speech (in a parameterised form)

• From the parameterised form, we can generate a waveform

• The parameterised form contains sufficient information to generate speech:

• spectral envelope

• fundamental frequency (F0) - sometimes called ‘pitch’

• aperiodic (noise-like) components (e.g. for sounds like ‘sh’ and ‘f’)
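For concreteness, a sketch of how such a per-frame observation vector is often laid out; the exact dimensions below are assumptions (typical of HTS-style voices), not values given in these slides.

```python
import numpy as np

# Illustrative per-frame parameter layout (dimensions are assumptions):
mgc = np.zeros(40)   # spectral envelope (mel-cepstral coefficients)
lf0 = np.zeros(1)    # fundamental frequency (log F0)
bap = np.zeros(5)    # band aperiodicity (noise-like components)

static = np.concatenate([mgc, lf0, bap])
frame  = np.concatenate([static,                 # statics
                         np.zeros_like(static),  # deltas
                         np.zeros_like(static)]) # delta-deltas
# frame.shape[0] == 3 * 46 == 138
```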

Trajectory HMMs

• Using an HMM to generate speech parameters

• under the standard HMM independence assumptions, the most likely output is simply the sequence of the means of the Gaussians in the states visited

• this is piecewise constant, and ignores important dynamic properties of speech

• Trajectory HMM algorithm (Tokuda and colleagues)

• solves this problem, by correctly using statistics of the dynamic properties during the generation process

Generation

• Generate the most likely observation sequence from the HMM

• but use the statistics of not only the static coefficients but also the deltas and delta-deltas

• Maximum Likelihood Parameter Generation Algorithm
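A minimal NumPy sketch of the maximum likelihood parameter generation (MLPG) idea for a single one-dimensional parameter stream: stack the static, delta and delta-delta statistics for the chosen state sequence and solve for the smooth static trajectory. This is an illustration of the algorithm, not the HTS implementation.

```python
import numpy as np

def mlpg(means, variances):
    """means, variances: (T, 3) arrays of per-frame statistics for the
    static, delta and delta-delta coefficients (copied out along the
    chosen state sequence). Returns the static trajectory c of length T
    that maximises the likelihood of o = W c."""
    T = means.shape[0]
    windows = [
        np.array([0.0, 1.0, 0.0]),    # static
        np.array([-0.5, 0.0, 0.5]),   # delta
        np.array([1.0, -2.0, 1.0]),   # delta-delta
    ]
    # W maps the static trajectory to stacked (static, delta, delta-delta) observations.
    W = np.zeros((3 * T, T))
    for t in range(T):
        for k, win in enumerate(windows):
            for offset in (-1, 0, 1):
                tau = t + offset
                if 0 <= tau < T:
                    W[3 * t + k, tau] = win[offset + 1]
    mu = means.reshape(-1)              # (3T,)
    prec = 1.0 / variances.reshape(-1)  # diagonal precisions (3T,)
    WtP = W.T * prec                    # W^T Sigma^{-1}
    # Solve (W^T Sigma^{-1} W) c = W^T Sigma^{-1} mu
    return np.linalg.solve(WtP @ W, WtP @ mu)
```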

Trajectory HMMs

[Figure: generated speech parameter trajectory plotted against time]

Constructing the HMM

• Linguistic specification (from the front end) is a sequence of phonemes, annotated with contextual information

• There is one 5-state HMM for each phoneme, in every required context

• To synthesise a given sentence,

• use front end to predict the linguistic specification

• concatenate the corresponding HMMs

• generate from the HMM
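A minimal sketch of this synthesis-time construction step, assuming a hypothetical dictionary mapping full-context labels to per-state Gaussian statistics (how unseen labels are handled via the decision tree is sketched later):

```python
# Sketch only: `models` is a hypothetical store of trained statistics,
# not a real HTS data structure.
def build_utterance_hmm(labels, models, num_states=5):
    """labels: full-context label strings from the front end.
    models: dict label -> list of num_states (mean, variance) pairs.
    Returns the flat state sequence of statistics used for generation."""
    state_sequence = []
    for label in labels:
        hmm = models[label]            # a real system maps unseen labels via the tree
        assert len(hmm) == num_states
        state_sequence.extend(hmm)
    return state_sequence
```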


Sparsity problem!


HMM-based speech synthesis

• Differences from automatic speech recognition include

• Synthesis uses a much richer model set, with a lot more context

• For speech recognition: triphone models

• For speech synthesis: “full context” models

• “Full context” = both phonetic and prosodic factors

• Observation vector for HMMs contains the necessary parameters to generate speech, such as spectral envelope + F0 + multi-band noise amplitudes

Sparsity

• In practically all speech or language applications, sparsity is a problem

• Distribution of classes is usually long-tailed (Zipf-like)

• We also ‘create’ even more sparsity by using context-dependent models

• thus, most models have no training data at all

• Common solution is to merge classes or contexts

• i.e., use the same model for several classes or contexts

• for HMMs, we call this ‘parameter tying’

Decision-tree-based clustering

[Figure: a binary decision tree of yes/no questions (e.g. "Voiced?", "Vowel?") clustering the context-dependent HMM states; splits are chosen by a description-length (MDL) criterion computed from the state occupancy probability, the covariance matrix and the dimension at each node]
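A minimal sketch of how such a tree ties parameters at synthesis time: yes/no questions about the label route any context, seen or unseen, to a tied leaf cluster. The questions, phone sets and leaf names below are illustrative, not taken from a real tree.

```python
# Sketch only: a tiny hand-built tree over ll^l-c+r=rr labels.
def centre_phone(label):
    return label.split("-")[1].split("+")[0]

def is_vowel(label):
    return centre_phone(label) in {"aa", "ae", "ah", "ax", "eh", "ih", "iy", "uw"}

def is_voiced(label):
    return is_vowel(label) or centre_phone(label) in {"b", "d", "g", "v", "dh", "z", "m", "n"}

tree = {
    "question": is_voiced,
    "yes": {"question": is_vowel, "yes": "leaf_vowel", "no": "leaf_voiced_cons"},
    "no": "leaf_unvoiced",
}

def find_leaf(node, label):
    """Walk the tree until a leaf (tied state cluster) is reached."""
    while isinstance(node, dict):
        node = node["yes"] if node["question"](label) else node["no"]
    return node

print(find_leaf(tree, "sil^dh-ax+k=ae"))   # -> leaf_vowel
```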

Model parameter estimation from ‘labelled’ data

• Actually, we only have word labels for the training data

• Convert these to full linguistic specification using the front end of our text-to-speech system (text processing, pronunciation, prosody)

• these labels will not exactly match the speech signal (we do a few tricks to try to make the match closer, but it’s never perfect)

• We still only know the model sequence, but have no information about the state alignment

• So, we use EM (we could call this ‘semi-supervised’ learning)

Model adaptation

• Training the models needs 1000+ sentences of data from one speaker

• What if we have insufficient data for this target speaker?

• Adaptation:

• Train the model on lots of data from other speakers

• Adapt the trained model’s parameters using a small amount of target speaker data

• estimate linear transforms to maximise the likelihood (MLLR); a minimal sketch follows after this list

• also in combination with MAP
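A minimal sketch of the MLLR idea for means: a shared affine transform mu' = A mu + b maps average-voice means towards the target speaker. Proper MLLR estimates A and b by maximising the likelihood of the adaptation data given the model; purely for illustration, the sketch below fits them by least squares to paired mean vectors.

```python
import numpy as np

def estimate_transform(source_means, target_means):
    """source_means, target_means: (N, D) arrays of paired mean vectors
    (average voice vs. target speaker). Least-squares stand-in for MLLR."""
    N, D = source_means.shape
    X = np.hstack([source_means, np.ones((N, 1))])     # append bias term
    W, *_ = np.linalg.lstsq(X, target_means, rcond=None)
    A, b = W[:D].T, W[D]
    return A, b

def adapt(mean, A, b):
    """Apply the shared transform to one average-voice mean vector."""
    return A @ mean + b
```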

Training, adaptation, synthesis

[Figure: speech and labels from several speakers (awb, clb, rms, ...) are used to train an average voice model; speech and labels from the target speaker (bdl), with labels optionally obtained by recognition, are used to estimate adaptation transforms; at synthesis time, test sentence labels are passed through the adapted model to produce synthetic speech]

Evaluation

• Objective measures that compare synthetic speech with a natural example (e.g., spectral distortion) have their uses, but don’t necessarily correlate with human perception

• main problem: there is more than one ‘correct answer’ in speech synthesis

• a single natural example does not capture this

• So, we mainly rely on playing examples to listeners

• opinion scores for quality and naturalness, typically on 5-point scales

• objective measures of intelligibility (type-in tests)
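For reference, a minimal sketch of the word error rate used in such type-in tests: the Levenshtein distance between the typed response and the reference transcript, divided by the number of reference words.

```python
def word_error_rate(reference, hypothesis):
    """Return WER (%) between a reference transcript and a typed response."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sat down"))   # 33.3...
```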

Intelligibility (WER), English

A = natural speech, B = Festival benchmark, C = HTS 2005 benchmark, V = HTS 2008 (aka HTS 2007')

[Figure: word error rate (%) per system, voices A and B, all listeners; roughly 245-248 responses per system for voice A and 198-199 for voice B]

No significant difference between A, V and T

No significant difference between A, C, V and T

HTS is as intelligible as human speech

Recent extensions

Articulatory-controllable HMM-based speech synthesis

• can manipulate articulator positions explicitly

• ability to synthesise new phonemes, not seen in training data

• requires parallel articulatory+acoustic corpus, which we have in CSTR

Articulatory-controllable HMM-based speech synthesis

[Figure: tongue height (cm) varied in 0.5 cm steps from -1.5 to +1.5 around the default, illustrated on the word "set"]

Dirichlet process HMMs

[Figure: duration (ms) distributions for Japanese vowels, English vowels and Mandarin finals]

• Fixed number of states may not be optimal

• Cross-validation, information criteria (AIC, BIC, or MDL) or variational Bayes can be used to determine the number of states (see the sketch after this list)

• Or use Dirichlet process (HDP-HMM or infinite HMM)
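A minimal sketch of the information-criterion route mentioned above, using BIC; train_hmm is a hypothetical helper that returns the log likelihood and the number of free parameters for a given number of states, not a real API.

```python
import numpy as np

def select_num_states(data, train_hmm, candidates=range(3, 11)):
    """Pick the number of states minimising BIC = -2 log L + k log N."""
    n = len(data)                       # number of observation frames
    best = None
    for num_states in candidates:
        log_lik, num_params = train_hmm(data, num_states)
        bic = -2.0 * log_lik + num_params * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, num_states)
    return best[1]
```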

Summary

• HMM-based speech synthesis has many opportunities for using machine learning:

• learning the model from data

• parameters (alternatives to maximum likelihood such as minimum generation error)

• model complexity (context clustering, number of mixture components, number of states, ...)

• semi-supervised and unsupervised learning (labels for data are unreliable or missing)

• adapting the model, given limited new data

• generation algorithms