GSLT speech synthesis 08 b
Olov Engwall, Speech synthesis, 2008
Speech synthesis
Presentations
Work in pairs in 6-minute mini-interviews (3 minutes each).
Ask questions around the topics:
• What is your previous experience of speech synthesis?
• Why did you decide to take this course?
• What do you expect to learn?
Write down the answers of your partner. Present them during the presentation round. Submit the answers to me.
Why?
• To let me know more about your background and expectations, so that I can adapt the course content.
• To get to know each other.
• To "start you up"…
The course
This is what the course book will look like… Until then, refer to http://svr-www.eng.cam.ac.uk/~pat40/book.html
Course pages: www.speech.kth.se/courses/GSLT_SS
Lecture content (impossible to cover the entire book):
1) History, Concatenative synthesis, Unit selection, HMM synthesis, Text issues, Prosody
2) Vocal tract models, Formant synthesis, Evaluation
3) Term paper presentations, assignment correction
To do until next time:
1) Assignment 1: Unit selection calculations
2) Term paper topic selection
Definition & Main scope
The automatic generation of synthesized sound or visual output from any phonetic string.
Synthesis approaches
By Concatenation
Elementary speech units are stored in a database and then concatenated and processed to produce the speech signal.
By Rule
Speech is produced by mathematical rules that describe the influence of phonemes on one another.
History
von Kempelen
Wolfgang von Kempelen’s book Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine (1791).
The essential parts:
• a pressure chamber = lungs,
• a vibrating reed = vocal cords,
• a leather tube = vocal tract.
The machine
• was hand operated,
• could produce whole words and short phrases.
Wheatstone’s version
Why is it of interest to us?
Charles Wheatstone’s version of von Kempelen's speaking machine
Parametric features!
First electronic synthesis
• Homer Dudley presented VODER (Voice Operating Demonstrator) at the World Fair in New York in 1939
• The device was played like a musical instrument, with voicing/noise source on a foot pedal and signal routed through ten bandpass filters.
First formant synthesizers
1950’s: PAT (Parametric Artificial Talker), Walter Lawrence
• 3 electronic formant resonators
• input signal (noise)
• 6 functions to control the 3 formant frequencies, voicing amplitude, fundamental frequency, and noise amplitude
1950’s: OVE (Orator Verbis Electris) by Gunnar Fant
From 1950’s: other synthesizers, including the first articulatory synthesis DAVO (Dynamic Analog of the Vocal tract)
An excellent historical trip of speech synthesis: Dennis Klatt's History of Speech Synthesis at http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html
Let us take a look at OVE
• OVE I (1953)
• On your computer today, and the original next time + OVE II (1962)
OVE Instructions
http://www.speech.kth.se/courses/GSLT_SS/ove.html
1. Test how the five different source models change the output. What is the difference in the formant pattern between different sources? Look at the number of formants, the peak amplitude, the bandwidth.
2. Alter a) the frequency and b) the shape of the source signal. What happens with the formant frequencies in the two cases? Relate these changes to human speech production.
3. Change the Frequency values F1-F4. Start with a neutral vowel (F1=500 Hz, F2=1500 Hz, F3=2500 Hz, F4=3500 Hz). Explain the attenuation in formant peak amplitude for higher frequencies (hint: try a rectangle source and change the shape to 99).
Now move one of the formant peaks so that it is about 200 Hz from the closest peak. What happens with the neighbour peak? Change the bandwidth of the formants. What is the relation between the bandwidth and the formant peak amplitude?
4. Move around the cursor in the vowel space and see how the shape of the output waveform (green curve in the bottom panel) changes.
If you have time, try to generate the sentences "How are you?" and "I love you!".
Formant amplitudes
Speech analysis & manipulation
Why signal processing?
• Need to separate the source from the filter for modelling (linear predictive analysis)
• Need to model the sound source (prosody, speaker characteristics)
• Need to alter speech units in concatenative synthesis (amplitude, cepstrums)
• Need to make concatenations smooth in concatenative synthesis (PSOLA)
The source-filter theory
The signal (c-d) is the result of a linear filter (b) excited by one or several sources (a).
[Figure: source (glottis), filter (vocal tract) and radiation (lips), shown in the time domain and in the frequency domain.]
More to come on the vocal tract filter in lecture 7
Source functions
• The voiced quasi-periodic source (glottis pulses) – vowels
  Parameters: on/off, fundamental frequency F0, intensity, shape
• Frication source – fricatives
• Transient noise – plosives
• No source – voiceless occlusions
Voice source types
Modal
  Articulatory: Normal, efficient. Complete glottis closures.
  Acoustic: Standard source. Steep spectral slope.
Breathy
  Articulatory: Lack of tension. Never closes completely.
  Acoustic: Audible aspiration. Slow “glottal return”. Glottal pulse symmetry. Higher F0 intensity.
Creaky
  Articulatory: High adductive tension and medial compression. Little longitudinal tension.
  Acoustic: Very low F0. Irregular F0 & amplitude.
Whispery
  Articulatory: Low glottal tension. Triangular glottal opening. High medial compression. Medium longitudinal tension.
  Acoustic: High aspiration levels. Greater pulse asymmetry. Less time in open state.
The quasi-periodic source
[Figure: the source in the time domain (pulses with period T0) and in the frequency domain (harmonics at f = n·F0 = n/T0).]
Why is there a damping slope in the transfer function?
Simple vowel synthesis
[Figure: Source (waveform) followed by Filter (F1, F2, F3, F4).]
Triangle source and formant filters in cascade: bandpass filters with frequency, bandwidth, level
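The cascade described on this slide can be sketched in a few lines. This is only an illustration under assumptions that are not in the slides: a 16 kHz sampling rate, the bandwidth values, a standard two-pole digital resonator for each formant, and the function names. The level of each filter is fixed to unity gain at 0 Hz instead of being a free parameter.

```python
import numpy as np

def triangle_source(f0, duration, fs):
    """Triangular waveform with fundamental frequency f0 (Hz)."""
    t = np.arange(int(duration * fs)) / fs
    phase = (t * f0) % 1.0
    return 1.0 - 2.0 * np.abs(2.0 * phase - 1.0)

def resonator(x, freq, bandwidth, fs):
    """Two-pole resonator modelling one formant (unity gain at 0 Hz)."""
    r = np.exp(-np.pi * bandwidth / fs)
    b1 = 2.0 * r * np.cos(2.0 * np.pi * freq / fs)
    b2 = -r * r
    gain = 1.0 - b1 - b2
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = gain * x[n]
        if n >= 1:
            y[n] += b1 * y[n - 1]
        if n >= 2:
            y[n] += b2 * y[n - 2]
    return y

def synthesize_vowel(formants=(500, 1500, 2500, 3500),
                     bandwidths=(60, 90, 120, 150),  # assumed values
                     f0=100.0, duration=0.25, fs=16000):
    """Pass the triangle source through the formant filters in cascade."""
    signal = triangle_source(f0, duration, fs)
    for freq, bw in zip(formants, bandwidths):
        signal = resonator(signal, freq, bw, fs)
    return signal

vowel = synthesize_vowel()  # neutral vowel: F1-F4 = 500, 1500, 2500, 3500 Hz
```

Changing `formants` moves the spectral peaks, while changing `f0` only moves the harmonics of the source, which is the separation the source-filter theory relies on.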
So, how do we find the source from a speech signal?
Linear Prediction (LP)
A method to separate the source from the filter
Predicts the next sample as a linear combination of the past p samples:
$$\tilde{x}[n] = \sum_{k=1}^{p} a_k \, x[n-k]$$
The coefficients a1 … ap describe a filter that is the inverse of the transfer function
• Minimization of the prediction error results in an all-pole filter which matches the signal spectrum
• This inverse filter removes the formants and can hence be used to find the source.
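As a rough sketch of how the coefficients a1 … ap can be estimated, the code below uses the autocorrelation method and solves the normal equations directly. The function names are hypothetical, and a real system would normally window the frame and use the Levinson-Durbin recursion instead of a general solver.

```python
import numpy as np

def lpc_autocorrelation(x, p):
    """Estimate a_1..a_p so that x~[n] = sum_k a_k x[n-k] (autocorrelation method)."""
    x = np.asarray(x, dtype=float)
    # autocorrelation values r[0]..r[p]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
    # Toeplitz system R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

def prediction_residual(x, a):
    """Inverse filtering: e[n] = x[n] - sum_k a_k x[n-k], an estimate of the source."""
    x = np.asarray(x, dtype=float)
    e = x.copy()
    for k, ak in enumerate(a, start=1):
        e[k:] -= ak * x[:-k]
    return e
```

Applied to a voiced frame, `prediction_residual` removes most of the formant structure and leaves a signal dominated by the glottal excitation.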
Spectral Fourier analysis
• A Fourier transform of the filter coefficients a1 … ap gives the frequency response of the inverse filter
• A periodic waveform can be described as a sum of harmonics
• The harmonics are sine waves with different phases, amplitudes and frequencies.
• The frequencies are multiples of the fundamental frequency.
• A periodic signal has a discrete spectrum
Fourier Transforms
• Fourier transform (FT): A non-periodic signal has a continuous frequency spectrum
• Discrete FT (DFT): Fourier transform of a sampled signal
  – N samples in both the time and frequency domain.
  – The spectrum is mirrored around the sampling frequency
• Fast FT (FFT): Clever algorithm to calculate the DFT
  – Reduces the number of multiplications: DFT: ~N², FFT: ~(N/2)·log2(N)
Windowing
• The analysis of a long speech signal is made on short frames: analysis window 10 – 50 ms (20 ms in example)
• The truncation of the signal results in artefacts (sidelobes)
• The artefacts are reduced if the signal is multiplied with a window that gives less weight to the sides.
Power spectrum
• The FFT analysis gives complex values: amplitude and phase for each frequency component
• The phase is often not interesting, only the signal’s energy at different frequencies.
• The power spectrum is shown for a short section of the signal
Processing chain: Windowing → FFT → Square → Logarithm
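The chain windowing, FFT, square, logarithm can be sketched with NumPy as follows. The Hamming window, the dB scaling and the function name are assumptions made for the example, not choices stated on the slide.

```python
import numpy as np

def log_power_spectrum(frame, fs):
    """Windowing -> FFT -> square -> logarithm for one short frame."""
    window = np.hamming(len(frame))              # less weight to the sides
    spectrum = np.fft.rfft(frame * window)       # complex: amplitude and phase
    power = np.abs(spectrum) ** 2                # discard the phase
    log_power_db = 10.0 * np.log10(power + 1e-12)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return freqs, log_power_db
```

For a 20 ms frame at 16 kHz (320 samples) the frequency resolution is 50 Hz per bin.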
Cepstrum Analysis
• The dominating method for ASR, used in HMM synthesis
• Inverse Fourier transform of logarithmic frequency spectrum: “Spectral analysis of spectrum”
• The coarse structure of the spectrum is described with a small number of parameters
• Orthogonal coefficients (uncorrelated)
• Anagram: Spectrum-cepstrum, filtering-liftering, frequency-quefrency, phase-saphe
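A minimal sketch of the definition above (inverse Fourier transform of the logarithmic spectrum), assuming a Hamming window and the natural logarithm; the function name is hypothetical.

```python
import numpy as np

def real_cepstrum(frame):
    """Inverse Fourier transform of the log magnitude spectrum of one frame."""
    spectrum = np.fft.fft(frame * np.hamming(len(frame)))
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)
    return np.fft.ifft(log_magnitude).real
```

The low-quefrency coefficients describe the coarse structure of the spectrum (the filter), while a periodic source shows up as a peak at the quefrency of the pitch period.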
Cepstrum from filterbank
[Figure: the spectrum of /a:/ and the spectrum of /s/ are each multiplied by the weight functions W1 to W4, giving the cepstrum of /a:/ and the cepstrum of /s/ (coefficients C1 to C4).]
$$C_j = \sqrt{\frac{2}{N}} \sum_{i=1}^{N} A_i \cos\left(\frac{\pi j \,(i - 0.5)}{N}\right)$$
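The cosine weighting of the filterbank outputs can be sketched as below. The sqrt(2/N) normalisation is an assumption (it is the common HTK-style convention), and the function name is hypothetical; A holds the log amplitudes of the N filterbank channels.

```python
import numpy as np

def filterbank_cepstrum(log_amplitudes, num_coeffs):
    """C_j = sqrt(2/N) * sum_i A_i * cos(pi * j * (i - 0.5) / N), j = 1..num_coeffs."""
    A = np.asarray(log_amplitudes, dtype=float)
    N = len(A)
    i = np.arange(1, N + 1)
    return np.array([
        np.sqrt(2.0 / N) * np.sum(A * np.cos(np.pi * j * (i - 0.5) / N))
        for j in range(1, num_coeffs + 1)
    ])
```

A flat spectrum gives coefficients of zero, since each weight function W_j sums to zero over the channels.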
Mel Frequency Cepstral Coefficients (MFCC)
Processing chain: FFT → Mel filter bank → Cepstrum transform → MFCC
The Mel scale is perceptually motivated: linear < 1000 Hz, log > 1000 Hz
[Figure: Mel-spectrum of /a:/ (dB against Mel, up to ~6000 Hz) and Mel-cepstrum of /a:/ (coefficients C1 to C4).]
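The slides do not give a formula for the Mel scale. One widely used approximation, assumed here, is mel = 2595·log10(1 + f/700), which is roughly linear below 1000 Hz and logarithmic above. The helper names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Common approximation of the Mel scale (an assumption, not from the slides)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(num_filters, f_max_hz):
    """Centre frequencies (Hz) of filters spaced evenly on the Mel scale."""
    step = hz_to_mel(f_max_hz) / (num_filters + 1)
    return [mel_to_hz(step * (k + 1)) for k in range(num_filters)]
```

Filters spaced evenly in Mel end up densely packed at low frequencies and sparse at high frequencies, which is the perceptual motivation mentioned above.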
Concatenative synthesis
Nothing new under the sun…
• Peterson et al. (1958)
• Dixon and Maxey (1968)
• “Dyadic Units”, (Olive, 1977)
Let’s get the terms straight
Concatenative synthesis
Definition: All kinds of synthesis based on the concatenation of units, regardless of type (sound, formant trajectories, articulatory parameters) and size (diphones, triphones, syllables, longer units).
(Everyday use: Concatenation of same-size sound units.)
Unit selection
Definition: All kinds of synthesis based on the concatenation of units where there are several candidates to choose from, regardless of whether the candidates have the same, fixed size or the size is variable.
(Everyday use: Concatenation of variable sized sound units.)
Why has concatenation conquered?
• Storing the segment database is no longer an issue
• Advances in ensuring smoothness in concatenations
• Rule-based synthesis output used to be smoother
• Unit selection provides (piece-wise) high quality speech.
• Change of applications.
• Certain sounds are too hard to be produced by rule
  – Vowels are easy to create by rule
  – Bursts, voiceless stops are too difficult, we do not fully understand their production mechanisms
Concatenative Synthesis is the state-of-the-art
Database preparation
• Choose the speech units (Phone, Diphone, Sub-word unit, Cluster based unit selection)
• Compile and record utterances
• Segment signal and extract speech units
• Store segment waveforms (along with context) and information in a database: dictionary, waveform, pitch marks
  e.g. “ch-l r021 412.035 463.009 518.23”
  (diphone, file, start time, middle time, end time)
• Pitch mark file: a list of each pitch mark position in the file
• Extract parameters; create parametric segment database (for data compaction and prosody matching)
• Perform amplitude equalization (prevents mismatches)
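A database entry of the kind shown above can be read into a small record. The class and function names are hypothetical, and the time unit is left as in the example since the slide does not state it.

```python
from typing import NamedTuple

class DiphoneEntry(NamedTuple):
    diphone: str
    file: str
    start: float
    middle: float
    end: float

def parse_entry(line):
    """Split one index line into diphone name, file and the three time stamps."""
    name, file, start, middle, end = line.split()
    return DiphoneEntry(name, file, float(start), float(middle), float(end))

entry = parse_entry("ch-l r021 412.035 463.009 518.23")
```

The middle time marks the boundary between the two phones inside the diphone, which is useful when durations are modified separately for each half.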
Diphone & Triphone synthesis
[Figure: the words /s ɑː l/ and /r ɑː k/ are split into units and recombined into /s ɑː k/ (* marks silence).
Diphone: *s1 | s2ɑ1 | ɑ2l1 | l2* and *r1 | r2ɑ1 | ɑ2k1 | k2* give *s1 | s2ɑ1 | ɑ2k1 | k2*.
Triphone: *sɑ1 | ɑ2l* and *rɑ1 | ɑ2k* give *sɑ1 | ɑ2k*.]
Diphone synthesis
Sequences of a particular sound/phone in all its environments of occurrence or all/most two-phone sequences occurring in a language: auto ’car’ -> _a, au, ut, to, o_
• Rationale: the ’center’ of a phonetic realization is the most stable region, whereas the transition from one segment to another contains the most interesting phenomena, and is thus the hardest to model.
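The decomposition of a phone sequence into diphones, as in the "auto" example, can be sketched as follows. The function name and the use of "_" for silence at the word edges follow the example above; treating each letter as one phone is a simplification.

```python
def to_diphones(phones, boundary="_"):
    """Pair each phone with its successor, padding with silence at both ends."""
    padded = [boundary] + list(phones) + [boundary]
    return [a + b for a, b in zip(padded, padded[1:])]

print(to_diphones("auto"))  # ['_a', 'au', 'ut', 'to', 'o_']
```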
Diphone synthesis
• 1200 diphones can already create a quite good sounding synthesis
– Speaker dependence (one set from one speaker)
– Various digital signal processing techniques -> ’robotic’ sound
– Segmental quality, transition between diphones
– Only partial coverage of co-articulation
Examples: MBROLA, BT Laureate, Festival
Diphone "synthesis" lab
http://www.speech.kth.se/courses/GSLT_SS/lab1.html
1. Record the "database", the word list: "Dockad, yttern, töm, flöde, möta, lätt, blomster, lyssnarna." in one go, in that order and without pausing.
2. Segment the wordlist into diphones: Cut out each diphone and put them in a new Wavesurfer window, but with pauses separating each diphone.
3. Identify the diphones that you need to create the sentence "Dom flyttade möblerna."
4. Copy and paste diphones from the database window into a new synthesis window.
5. Play the sentence, fine tune durations and concatenations.
Equalization
• Segments extracted from different words, with different phonetic contexts, have amplitude and timbre mismatches.
• Equalization: related endings of segments are given similar amplitude spectra.
• Amplitude equalization: smooth modification of the energy levels at the beginning and at the end of segments. The energy of all the phones of a given phoneme is set to the average value. The difference is distributed over the neighbourhood.
• Timbre conflicts are tackled at run-time, by smoothing individual pairs of segments when necessary, so that some of the phonetic variability is still maintained.
Concatenation with PSOLA
• Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA)
• High speech quality
• Very low computational cost (7 operations/sample).
• A window (2 pitch periods long) is multiplied with the signal
• The signal is broken into a set of localized signals (non-zero only at the window intervals)
Altering pitch with PSOLA
• Relative shifting of localized signals
• Spacing reflects pitch duration
• Good result for modification factor [0.6 – 1.5]
[Figure: localized signals spaced further apart, giving a lower pitch.]
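The idea of windowing two pitch periods and re-spacing the localized signals can be sketched as below. This is a strongly simplified illustration, not the full TD-PSOLA algorithm: it assumes a constant pitch period with evenly spaced pitch marks, uses a Hann window, and does not duplicate or drop localized signals, so the duration changes together with the pitch. The function name is hypothetical.

```python
import numpy as np

def td_psola_pitch(signal, period, pitch_factor=1.0):
    """Re-space two-period windowed grains; pitch_factor > 1 raises the pitch."""
    win_len = 2 * period
    n = np.arange(win_len)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / win_len)  # periodic Hann
    marks = np.arange(period, len(signal) - period, period)  # assumed pitch marks
    new_period = int(round(period / pitch_factor))
    out = np.zeros(new_period * (len(marks) - 1) + win_len)
    for i, m in enumerate(marks):
        grain = signal[m - period:m + period] * window       # localized signal
        start = i * new_period
        out[start:start + win_len] += grain                  # overlap-add
    return out
```

With `pitch_factor=1.0` the overlapping windows sum to one, so the interior of the signal is reproduced unchanged; a factor below 1 spaces the grains further apart.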
Altering duration & amplitude
Duration:
• Increase number of PSOLA iterations (overlaps) to increase duration (frame duplication)
• Decrease number of PSOLA iterations (overlaps) to decrease duration
Amplitude:
• Multiplying the signal by a constant
• If constant > 1, amplitude increase
• If constant < 1, amplitude decrease
MBROLA
• Algorithm: Multi-Band Resynthesis OverLap and Add
• A time-domain PSOLA-like algorithm with efficient smoothing of the spectral envelope
• Very high data compression ratios (up to 10)
• Synthesizer: Concatenation of diphones.
  In: List of phonemes and prosodic info (duration of phonemes and a piecewise linear description of pitch),
  Out: speech samples on 16 bits (linear), at the sampling frequency of the diphone database.
• Project goal: generate a set of speech synthesizers for as many languages as possible, free for non-commercial applications.
Unit Selection
• Larger database of recorded units: e.g. diphones, phones, syllables, words, etc.
• Multiple occurrences of the units cover a wide space of the spectral and prosodic parameters
• Units nearest in this space to the targets will be chosen and will require only minor modification
• The corpus is segmented into phonetic units, indexed, and used as-is
• Selection is made on-line
• The trend is towards longer and longer units
[Figure: timeline from 1999 to 2005.]
Best Unit Selection
Target cost
– Prosodic and spectral closeness to target
Concatenation cost
– Units occurring beside each other in the recorded database are given a zero cost
Cost function:
– Target + Concatenation cost (weighted sum)
Viterbi algorithm used to find the overall minimum cost path.
Assignment 1: Practical exercises with the calculation of target and concatenation cost.
Target & Concatenation cost
Target cost = The difference in each frame between the target and candidates for
– target pitch
– power
– duration
Concatenation cost = The difference between the end of diphone 1 and the start of diphone 2
Distance measures:
• Manhattan (City block) distance
$$D = \sum_{i} |x_i - y_i|$$
• Euclidean distance
$$D = \sqrt{\sum_{i} (x_i - y_i)^2}$$
• Mahalanobis distance
$$D = \sqrt{\sum_{i} \frac{(x_i - y_i)^2}{\sigma_i^2}}$$
• Kullback-Leibler distance
$$D = \sum_{i=1}^{N} (x_i - y_i) \log\frac{x_i}{y_i}$$
Oh no! A different number of frames!
BEWARE OF PITFALL
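The four distance measures can be sketched as below, assuming the usual definitions (square roots in the Euclidean and Mahalanobis distances, a diagonal covariance for Mahalanobis, and the symmetric form of the Kullback-Leibler distance). The function names and the example feature values are made up for illustration. All of them need vectors of equal length, which is the pitfall above: units with a different number of frames must first be brought to a common length.

```python
import numpy as np

def manhattan(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum(np.abs(x - y)))

def euclidean(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def mahalanobis(x, y, sigma):
    """Diagonal-covariance form: each dimension is scaled by its deviation sigma_i."""
    x, y, sigma = (np.asarray(v, float) for v in (x, y, sigma))
    return float(np.sqrt(np.sum((x - y) ** 2 / sigma ** 2)))

def kullback_leibler(x, y):
    """Symmetric Kullback-Leibler distance; x and y must be strictly positive."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum((x - y) * np.log(x / y)))

# Hypothetical target and candidate: [pitch (Hz), power (dB), duration (ms)]
target = [120.0, 60.0, 80.0]
candidate = [110.0, 63.0, 84.0]
print(manhattan(target, candidate))                       # 17.0
print(mahalanobis(target, candidate, [10.0, 3.0, 4.0]))   # sqrt(3)
```

The Mahalanobis form is convenient for target costs because pitch, power and duration have different units and ranges.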
Viterbi – best path search
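[Figure: search trellis with time (the utterance) on the horizontal axis and the states of Phone 1 and Phone 2 on the vertical axis; a transition from state i to state j is marked.]
• All possible sequences are hypothesized in parallel
• Threshold excludes improbable hypotheses
Based on
• previous path probability (getting to state i)
• transition probability (getting from i to j)
• observation likelihood (state j matches input)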
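For unit selection the same best-path search is run on costs instead of probabilities. The sketch below is an illustration with hypothetical names and toy data: each unit is a (pitch, database index) pair, the target cost is the pitch difference, and the concatenation cost is zero for units that were adjacent in the recorded database.

```python
def select_units(targets, candidates, target_cost, concat_cost,
                 w_target=1.0, w_concat=1.0):
    """Return (units, cost) for the candidate sequence with minimum weighted cost."""
    # best[i][j]: lowest cost of any path ending in candidate j at position i
    best = [[w_target * target_cost(targets[0], c) for c in candidates[0]]]
    back = []
    for i in range(1, len(targets)):
        row, pointers = [], []
        for c in candidates[i]:
            costs = [best[-1][k] + w_concat * concat_cost(p, c)
                     for k, p in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + w_target * target_cost(targets[i], c))
            pointers.append(k_min)
        best.append(row)
        back.append(pointers)
    # trace back from the cheapest final candidate
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)], min(best[-1])

# Toy example: units are (pitch in Hz, position in the recorded database)
targets = [100, 110]
candidates = [[(100, 5), (98, 1)], [(111, 9), (108, 2)]]
pitch_cost = lambda target, unit: abs(target - unit[0])
join_cost = lambda prev, nxt: 0 if nxt[1] == prev[1] + 1 else 1

print(select_units(targets, candidates, pitch_cost, join_cost))
```

Raising `w_concat` makes the search prefer units that were neighbours in the database, even if they match the target less well.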
Pros & cons of Unit selection
Advantages:
• Piece-wise very high waveform quality, thanks to minimal signal manipulation
• Non-linguistic features of the speaker's voice built in
Disadvantages:
• Discontinuities between units
• Hit or miss for target selection
• Quality differences between different sized units
• Fixed voice
• Fixed non-linguistic features
Are there any valid alternatives?
HMM synthesis
An example of voice conversion
[Figure: voice conversion pipeline. Model estimation: source speech and target speech each pass through an LP model and formant tracking to give formant trajectories and a source speaker HMM model and a target speaker HMM model. Formant mapping between the two models gives warping factors. Speech reconstruction: LPC spectrum warping / pole rotation produces the mapped speech. Examples: American male, American female, transformed (American male to female).]
HMM synthesis
A speech synthesis technique based on HTK (Hidden Markov Model Toolkit)
Developed by the HTS working group at
• the Department of Computer Science, Nagoya Institute of Technology
• the Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology
http://hts.sp.nitech.ac.jp
Hidden Markov Models
• An HMM is a machine with a limited number of possible states.
• The transition between two states is regulated by probabilities.
• Every transition results in an observation with a certain probability.
• The states are hidden, only the observations are visible.
[Figure: left-to-right HMM with states i, j, k, l, self-transition probabilities (Pii, Pjj, Pll), forward transition probabilities (Pij, Pjk, Pkl) and observations Oi, Oj, Ok, Ol.]
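How the hidden states and the visible observations interact can be illustrated with the forward algorithm, which sums the probability of an observation sequence over all hidden state paths. The sketch assumes the common formulation where each state (not each transition) emits a discrete observation, and the two-state model values are invented for the example.

```python
import numpy as np

def sequence_likelihood(initial, transition, emission, observations):
    """Forward algorithm: P(observations), summed over all hidden state paths."""
    alpha = initial * emission[:, observations[0]]
    for o in observations[1:]:
        alpha = (alpha @ transition) * emission[:, o]
    return float(alpha.sum())

# Hypothetical two-state left-to-right model with two observation symbols
initial = np.array([1.0, 0.0])
transition = np.array([[0.7, 0.3],    # Pii, Pij
                       [0.0, 1.0]])   # Pjj
emission = np.array([[0.9, 0.1],
                     [0.2, 0.8]])

print(sequence_likelihood(initial, transition, emission, [0, 0, 1]))
```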
HMM in speech synthesis
1. Transcription & segmentation of speech databases
2. Construction of inventory of speech segments
3. Run-time selection of speech segments
High quality speech can be synthesized using waveform concatenation algorithms (e.g., PSOLA).
However, to obtain various voice qualities, a large amount of speech data is necessary.
→ Speech synthesis from HMMs themselves.
Voice quality can be changed by transforming HMM parameters appropriately.
The output is vocoded, but it is always smooth and stable.
Basic idea
Mel-Log-Spectrum-Approximation
Start the training of the HMMs with a good guess on the parameters.
The guess is improved through comparison with training observations.
In the synthesis we should find the optimal sequence of states, through concatenation of HMMs.
The training part
• The training is automatic. You need:
  – The text + recordings of about 1000 sentences
• The training of 1000 sentences
  – takes 24 hours and generates a voice of less than 1 MB
• Separate HMMs for: Spectrum, F0, Duration
• Training in two steps:
  1. Context independent models
  2. Use these models to create context dependent models.
• Clustering:
  – Storing all contexts requires much space
  – It may be difficult to find alternatives for missing models
  – Many models are very similar = redundancy
Clustering
• Groups a large database into clusters
• Three trees: Duration, F0 and Spectrum
• Division based on yes/no questions
  – Grouping acoustically similar phonemes
  – Features.
  – Context.
Synthesis
For each phoneme we need:
• Mel-Cepstrum, with first and second derivative (mcep, Δ, Δ²)
• (F0, Δ, Δ²) + information about voicing
• Duration. Can be generated implicitly by F0 and spectrum HMMs, but the result is more natural with explicit modeling.
• Δ and Δ² are used to smooth the parameter sequences.
Delta, delta-delta...
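Delta and delta-delta features are local differences of a parameter track over time. The sketch below assumes the simple window coefficients (-0.5, 0, 0.5) for Δ and (1, -2, 1) for Δ², with the first and last frames repeated at the edges; other window choices exist and the function name is hypothetical.

```python
import numpy as np

def add_deltas(track):
    """Return (delta, delta_delta) for a one-dimensional parameter track."""
    c = np.asarray(track, dtype=float)
    padded = np.pad(c, (1, 1), mode="edge")            # repeat the edge frames
    delta = 0.5 * (padded[2:] - padded[:-2])           # slope
    delta_delta = padded[2:] - 2.0 * padded[1:-1] + padded[:-2]  # curvature
    return delta, delta_delta

delta, delta_delta = add_deltas([0.0, 1.0, 2.0, 3.0, 4.0])
```

For a steadily rising track the interior deltas are constant and the delta-deltas are zero, which is the information the parameter generation uses to keep the output trajectories smooth.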
Use of HMM synthesis
• Various voices:
  – Speaker adaptation
  – Speaker interpolation
  – Eigenvoices
• Very low bit rate speech coder
• Security of speaker identification systems
Speaker adaptation
Speaker interpolation
www.sp.nitech.ac.jp/~tokuda/HTS_demo/speaker_inter/index.html
Test of speaker verification
Very low bit-rate speech coding
Swedish HMM synthesis
Master thesis by Anders Lundgren
Language specific parts:
• Text to phoneme transcription (RulSys or Festival)
• Translation of the phonemic transcription to HTK SAMPA-Festival
• Module to generate contextual information (syllable division, word accent placement)
• Decision tree paths for the clustering of HMMs
  – Features
  – contextual information
Listening test
• Separate evaluation of prosody and spectrum
• Six voice variants:
  – HTS
  – Prosody from HTS, spectrum from MBROLA
  – Prosody from RULSYS, spectrum from HTS
• TMH’s synthesis reference system
  – Prosody from RULSYS, spectrum from MBROLA
Clarity
[Figure: stacked bar chart (0% to 100%) of the ratings Much worse, Worse, No difference, Better, Much better for the voices M HTS, M HTS prosody, M HTS spectrum, F HTS, F HTS prosody and F HTS spectrum.]
Naturalness
[Figure: stacked bar chart (0% to 100%) of the ratings Much worse, Worse, No difference, Better, Much better for the voices M HTS, M HTS prosody, M HTS spectrum, F HTS, F HTS prosody and F HTS spectrum.]
Previous TTS experience
[Figure: two stacked bar charts (0% to 100%) of the ratings Much worse, Worse, No difference, Better, Much better, for listeners with previous TTS experience (Yes) and without (No); bars labelled p and s.]
More on how to evaluate in lecture 9!
From text
The automatic generation of synthesized sound from any text string.
Text-to-speech
"The automatic generation of synthesized sounds..."
Processing chain from text (e.g. “hello”) to sound:
1. Linguistic analysis: morphologic analysis, lexicon and rules, syntax analysis
2. Prosodic analysis: rules and lexicon
3. Phonetic description: rules and choice of units
4. Sound generation: joining parts, rules
Text Analysis Challenges
• Homographs
  – My latest project is to learn how to better project my voice.
  – The girl with the bow in her hair was told to bow deeply when greeting her superiors.
• Numbers (models, dates)
  – On May 5 2005, the university bought 2005 computers
  – a Boeing model 747 can contain 747 people
• Abbreviations
  – Yesterday it rained 3 in. Take 1 out, then put 3 in.
  – St. John St.
Let us try!
Preprocessor
• Sentence end detection (punctuation is ambiguous: a colon may mark a ratio or a time, a period a decimal point or a sentence ending)
• Abbreviations (e.g. = for instance): changed to their full form with the help of lexicons
• Acronyms (I.B.M. can be read as a sequence of characters, whereas NASA can be read in the default way, as a word)
• Numbers (once detected, first interpreted as rational, time of the day, date or ordinal depending on their context)
• Idioms (e.g. “In spite of”, “as a matter of fact”: these are combined into a single FSU using a special lexicon)
Morphological Analysis
Task is to propose all possible part-of-speech categories for each word taken individually, on the basis of its spelling.
Function words (determiners, pronouns, prepositions, conjunctions...) – limited number.
• Can be stored in lexicon
• Word he:
  <spel> = he
  <syn cat> = pronoun
  <syn num> =
  <syn gen> = masc
  <phon> = /hΙ/
Content words – infinite in number
• Need morphology, which describes words using a reduced set of abstract meaning-bearing units called morphemes.
• Inflectional, derivational and compound words are decomposed into morphemes
• Uses regular grammars with lexicons of stems and affixes
Contextual Analysis
• Considers words in their context
• Reduces the list of their part-of-speech categories to a very restricted number of highly probable hypotheses, given the corresponding possible parts of speech of neighboring words.
• Achieved by N-grams, multi-layer perceptrons (neural networks), local stochastic grammars (provided by expert linguists) etc.
Letter-to-phonemes
• Module responsible for the automatic determination of the phonetic transcription of the incoming text
• Cannot just look up in a pronunciation dictionary
  – Spelling does not follow the rule “one character = one phoneme”
  – A single character can correspond to two phonemes: x as /ks/
  – Several characters can produce one phoneme: th in thought
  – A single character can be pronounced in different ways: c in ancestor, ancient, epic
• Rule based – applied based on spelling, sentence analysis
• Dictionary based – a large dictionary of spellings with their pronunciations
• Hybrid Approach – combines the above, usually used
Dictionary or Rule Based
Dictionary:
• Store a maximum of phonological knowledge into a lexicon.
• Compounding rules describe how the morphemes of dictionary items are modified.
• Hand-corrected, expensive
• The lexicon is never complete: needs an out-of-vocabulary pronouncer, transcribed by rule.
Rules:
• A set of letter to sound (grapheme to phoneme) rules.
• Words pronounced in such a particular way that they have their own rule are stored in an exceptions dictionary.
• Fast & easy, but lower accuracy
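A hybrid of the two approaches can be sketched as a lexicon lookup with a rule-based fallback for out-of-vocabulary words. The toy lexicon, the rule table and the phoneme symbols are invented for the example; real rule sets are context dependent and far larger.

```python
def to_phonemes(word, lexicon, rules):
    """Look the word up in the lexicon; otherwise apply letter-to-sound rules."""
    word = word.lower()
    if word in lexicon:
        return lexicon[word]
    phonemes, i = [], 0
    while i < len(word):
        for graphemes, phones in rules:       # longest graphemes listed first
            if word.startswith(graphemes, i):
                phonemes.extend(phones)
                i += len(graphemes)
                break
        else:
            i += 1                            # no rule matched: skip the letter
    return phonemes

# Hypothetical exception lexicon and context-free rules
lexicon = {"epic": ["e", "p", "I", "k"]}
rules = [("th", ["T"]), ("x", ["k", "s"]), ("a", ["a"]), ("t", ["t"]),
         ("i", ["I"]), ("n", ["n"])]

print(to_phonemes("tax", lexicon, rules))   # ['t', 'a', 'k', 's']
```

Listing multi-letter graphemes such as "th" before single letters makes sure that several characters can map to one phoneme, while "x" shows one character mapping to two.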
Letter-to-sound difficulties
• Consonants reduced or deleted in clusters (e.g. /t/ in softness)
• Assimilation leads to a change of some phonological features of a given phoneme (e.g. obstacle)
• Homographs pronounced differently (e.g. record, contrast)
• Phonetic liaisons (e.g. in French, a word immediately followed by a vocalic sound has final characters pronounced that otherwise disappear)
• Unstressed vowels transformed into schwas (short central phonetic elements) or deleted (e.g. interesting)
• New words and proper nouns depend on the language of origin (e.g. in Swedish “jeans”, “comme il faut”)
Creating rules
• Writing rules by hand is difficult
• Automatic process built from lexicon
  – Find alignments:
    Letters: c  h  e  c  k  e  d
    Phones:  ch -  eh -  k  -  t
• Provides phone string plus stress

Lexicon     Letters correct   Words correct
OALD        95.80%            74.56%
CMUDICT     91.99%            57.80%
BRULEX      99.00%            93.03%
DE-CELEX    98.79%            89.38%
Thai        95.60%            68.76%
Phrasing
Determines where phrase boundaries occur
– insert pauses on phrase boundaries
– determined by CART tree trained on big data corpus
Intonation: Word accent
Word Accent: Decided depending on word class, position in the sentence and in the phrase, word classes of preceding and following words.
For each syllable of each word: whether it is accented and with which accent (e.g. Swedish ‘tomten’, ‘stegen’).
Intonation: F0 contour
[Figure: F0 contour of an example utterance with annotations: large pitch range (female); authoritative (final fall); emphasis for "Finance" (H*); final has a rise, more information to come.]
• Word stress and sentence intonation
  – each word has at least one syllable which is spoken with higher prominence
  – in each phrase the stressed syllable can be accented depending on the semantics and syntax of the phrase
• Prosody relies on syntax, semantics, pragmatics: personal reflection of the reader.
Pitch contour modeling
• Tonetics (the British school)
  – tone groups composed of syllables {unstressed, stressed, accented or nuclear}.
  – nuclear syllables have nuclear tones {fall, rise, fall-rise, rise-fall}
• ToBI (Tones and Break Indices)
  – Phrases split into intermediate phrases composed of syllables.
  – Relative tone levels: high (H) or low (L) (plus diacritics) at every intonational or intermediate phrase boundary (%) and on every accented syllable
• Stylization method (prosodic pattern measured from natural speech)
Prosody modeling
• Prosody targets (to put emphasis, stress) typically include:
  – Pitch
  – Phone durations
  – Energy
• Prosody parameters can be trained
• Fixed durations, flat F0.
• Decline F0
• “hat” accents on stressed syllables
• accents and end tones
• statistically trained
Prosody is critical for obtaining the right intonation (or else speech may sound unnatural or unintelligible)
Prosody modeling
[<SABLE> <SPEAKER NAME="male1">
The boy saw the girl in the park <BREAK/> with the telescope.The boy saw the girl <BREAK/> in the park with the telescope.
Some English first and then some Spanish.<LANGUAGE ID="SPANISH">Hola amigos.</LANGUAGE><LANGUAGE ID="NEPALI">Namaste</LANGUAGE>
Good morning <BREAK /> My name is Stuart, which is spelled<RATE SPEED="-40%"> <SAYAS MODE="literal">stuart</SAYAS> </RATE>though some people pronounce it <PRON SUB="stoo art">stuart</PRON>. My
telephone number is <SAYAS MODE="literal">2787</SAYAS>.
I used to work in <PRON SUB="Buckloo">Buccleuch</PRON> Place, but no one can pronounce that.
By the way, my telephone number is actually<AUDIO SRC="http://att.com/sounds/touchtone.2.au"/> …
Synthesis markup
SABLE: marking emphasis
What will the weather be like today in Boston?
It will be <emph>rainy</emph> today in Boston.
When will it rain in Boston?
It will be rainy <emph>today</emph> in Boston.
Where will it rain today?
It will be rainy today in <emph>Boston</emph>
Vocal tract models
Articulatory synthesis
Benefits:
• Produces speech in the same way as humans
• Can be made with few parameters
• The changes are intuitive (raise the tongue tip, round the lips)
Disadvantages:
• Computationally demanding
• Problems with consonants
• Articulatory measurements required
• State-of-the-art articulatory synthesis still sounds bad
[Figure: Articulation as filter]
Articulatory models
Functional
Geometric parameters control the different parts of the tongue, jaw, lips etc.
Physiological
Muscle model. Articulations are created through activation of different muscles.
Articulatory basis
Measurements (X-rays, MRI etc.) are used to model the dimensions of the tube in the midsagittal plane, and to get the relation between midsagittal distance and area in each plane.
3D articulatory synthesis
Why?
• Two-dimensional models simulate the third dimension as area = a·(distance)^d, where a and d are decided empirically and vary through the tube.
A three-dimensional model gives
• the cross-sectional area directly
• lateral modeling (/l/)
• visual synthesis (pronunciation training)
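The two-dimensional approximation area = a·(distance)^d can be written out as a short sketch. The function name and all numeric values are hypothetical; in practice a and d are fitted empirically for each section of the tube.

```python
def area_function(distances, a_coeffs, d_exps):
    """Cross-sectional areas from midsagittal distances, section by section."""
    return [a * dist ** d for dist, a, d in zip(distances, a_coeffs, d_exps)]

# Made-up midsagittal distances (cm) and coefficients for three tube sections
areas = area_function([1.0, 2.0, 0.5], [1.5, 2.0, 1.8], [1.4, 1.5, 1.3])
```

A three-dimensional model avoids this step, since the cross-sectional area can be measured directly.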
3D MRI measurements
3*18 slices orthogonal to the midsagittal plane in 43 s.
Supine position
Corpus: One neutral reference and 43 Swedish articulations.
13 vowels: /ɑ:, e:, æ:, i:, y:, u:, ʉ:, o:, ø:, œ:, a, u, ɔ/
10 consonants: /p, t, k, l, r, s, f, ʂ, ɧ, ɕ/ in VCV contexts with V = /a, ɪ, ʊ/
3D Reconstruction
One contour per image.
Reconstruct a 3D shape for each articulation.
[Figure: reconstructed 3D shapes for three example VCV articulations]
Articulatory model
Six articulatory parameters defined using a component analysis of the 3D tongue shapes: jaw height, tongue body, tongue dorsum, tongue tip, tongue advance, tongue width.
Add vocal tract walls: Symmetric walls, extracted from the MR images. Collision handling for the tongue against walls, palate and jaw.
Multimodal articulatory synthesis
Movetrack Electromagnetic Articulograph: 6 coils (upper lip, upper & lower incisors, and three tongue coils 8, 20 and 52 mm from the tip).
Qualisys optical motion tracking: 4 IR cameras, 28 reflectors, 3 reference reflectors on headmount.
Data processing & training (∀i∈C, i≠k):
• Speech signal Si → LPC analysis → LSP Li
• 3D articulatory data Ai → model fitting → resampling → 14 articulatory parameters APi
• Pitch Pi
• Training: linear estimator or neural network

Synthesis:
• 14 articulatory parameters APk → linear estimator or neural network → LSP Lk*
• Pitch Pk and LSP Lk* → LSP filter → synthetic speech Sk*
Multimodal synthesis
From articulation to acoustics
Electric circuit equivalent
Vocal tract model
Tubes
2D airflow dynamics
Waveform
Cross-sections
3D air flow calculations
Area function
Area & transfer functions
Parameter settings → Articulatory model → Area function (area vs. distance) → Transfer function (amplitude vs. frequency) → Formants
Formants
Vocal tract models lab
http://www.speech.kth.se/courses/GSLT_SS/lab2.html
• Synthesize /aa, ii, uu/ with
a) a two-tube model
b) a three-parameter model
c) an area function model
d) an articulatory model
• Investigate what happens if a nasal tract is added for each model.
• Compare the four methods regarding flexibility, complexity and intuitiveness. What are the advantages and disadvantages of each of them?
• Use the articulatory model to investigate how the seven parameters influence the vocal tract shape and acoustics. Start from a neutral vocal tract (set all values to 0) and vary each parameter.
• Move or place your own articulators in the same way; do your intuitive thoughts about the effect of your articulatory movements correspond to the results in the model?
• Experiment with the parameters in 'Tract Configuration' and 'Physical Constants'. What influence do they have on the synthesis?
Formant values (F1, F2, F3 in Hz) are:
/aa/: 650 1000 2500
/ii/: 290 2050 2400
/uu/: 300 700 2100
Equivalent circuit
The tube has an acoustic mass ~ $L = \rho/A$
The air functions as a spring ~ $C = A/(\rho c^2)$
There are friction losses ~ R, G
A is the cross-sectional area of the tube, ρ is the air density and c is the speed of sound in air.

Acoustically   Mechanically   Electrically
flow           speed          current
pressure       force          voltage
ac. mass       mass           inductance
ac. spring     mech. spring   capacitance
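As a quick numerical check of the analogy, the sketch below computes L and C for one tube cross-section and derives two standard consequences for a lossless line: the characteristic impedance sqrt(L/C), which equals ρc/A, and the wave speed 1/sqrt(LC), which equals c. The air density value and the 5 cm² area are assumptions chosen for illustration.

```python
import math

RHO = 1.2        # air density in kg/m^3 (assumed value)
C_SOUND = 350.0  # speed of sound in m/s

def line_constants(area_m2):
    """Acoustic mass L and acoustic compliance C per unit length of a tube."""
    inductance = RHO / area_m2
    capacitance = area_m2 / (RHO * C_SOUND ** 2)
    return inductance, capacitance

area = 5e-4  # 5 cm^2, illustrative cross-section
L, C = line_constants(area)
z0 = math.sqrt(L / C)                # characteristic impedance, rho*c/A
wave_speed = 1.0 / math.sqrt(L * C)  # propagation speed, c
```

A narrower tube has a larger acoustic mass and a smaller compliance, so its characteristic impedance rises while the wave speed stays at c.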
Assumptions
• The tube has rigid walls.
• Since the cross-sections are small compared to the tube length we have a plane wave.
• Two-directional waves with reflections between the tubes. At a junction between the areas A1 and A2 the reflection coefficient is $r = (A_1 - A_2)/(A_1 + A_2)$.
• The "current" and "voltage" are sinusoidal:
$$U(x) = U_+ e^{-\gamma x} + U_- e^{\gamma x}$$
$$I(x) = I_+ e^{-\gamma x} + I_- e^{\gamma x}$$
with $\gamma = \sqrt{(R + j\omega L)(G + j\omega C)}$. The indices + and − indicate the direction.
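The reflection coefficient r = (A1-A2)/(A1+A2) can be evaluated for a few limiting cases. This is only an illustrative sketch; the function name and the area values are made up.

```python
def reflection_coefficient(a1, a2):
    """r = (A1 - A2) / (A1 + A2) at the junction between two tube sections."""
    return (a1 - a2) / (a1 + a2)

r_same = reflection_coefficient(4.0, 4.0)    # no area change: nothing is reflected
r_narrow = reflection_coefficient(8.0, 1.0)  # strong constriction: most is reflected
r_closed = reflection_coefficient(3.0, 0.0)  # complete closure: total reflection
```

The magnitude of r grows with the area mismatch, which is why abrupt constrictions shape the resonances so strongly.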
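In an electric circuit:
[Figure: T-network with two series impedances Za and a shunt impedance Zb; input Uin, Iin; output Uout, Iout; the shunt current is Iin – Iout]
$$U_{in} = I_{in}(Z_a + Z_b) - I_{out} Z_b$$
$$U_{out} = I_{in} Z_b - I_{out}(Z_a + Z_b)$$
The transfer function is the ratio
$$H = \frac{I_{out}}{I_{in}}$$
Long calculations give:
$$Z_a = Z_0 \tanh\frac{\gamma l}{2}, \qquad Z_b = Z_0 \frac{1}{\sinh \gamma l}$$
If we assume that the tube is lossless, Z0 and γ are simplified:
$$Z_0 = \sqrt{\frac{R + j\omega L}{G + j\omega C}} = \sqrt{\frac{L}{C}} = \sqrt{\frac{\rho/A}{A/(\rho c^2)}} = \frac{\rho c}{A}$$
$$\gamma = \sqrt{(R + j\omega L)(G + j\omega C)} = j\omega\sqrt{LC} = \frac{j\omega}{c}$$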
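Example: the neutral vowel
[Figure: T-network between glottis (Iin) and lips (Iout), with series impedances Za and shunt impedance Zb]
The lips are open, so Uout = 0:
$$Z_a I_{out} + Z_b (I_{out} - I_{in}) = 0 \;\Rightarrow\; H = \frac{I_{out}}{I_{in}} = \frac{Z_b}{Z_a + Z_b}$$
$$Z_a = Z_0 \tanh\left(\frac{\gamma l}{2}\right) = Z_0\,\frac{\cosh(\gamma l) - 1}{\sinh(\gamma l)}, \qquad Z_b = \frac{Z_0}{\sinh(\gamma l)}$$
$$\Rightarrow\; H = \frac{Z_0/\sinh(\gamma l)}{Z_0\,\dfrac{\cosh(\gamma l) - 1}{\sinh(\gamma l)} + \dfrac{Z_0}{\sinh(\gamma l)}} = \frac{1}{\cosh(\gamma l)}$$
Lossless:
$$\gamma = \frac{j\omega}{c} \;\Rightarrow\; H = \frac{1}{\cosh\left(\dfrac{j\omega l}{c}\right)} = \frac{1}{\cos\left(\dfrac{\omega l}{c}\right)}$$
Poles:
$$\cos\left(\frac{\omega l}{c}\right) = 0 \;\Rightarrow\; \frac{\omega_n l}{c} = \frac{\pi}{2} + n\pi,\; n = 0, 1, 2, \ldots \;\Rightarrow\; F_n = \frac{c}{4l}(2n - 1),\; n = 1, 2, 3, \ldots$$
For a typical male speaker, l = 17.5 cm and c = 350 m/s, we get F = {500 Hz, 1500 Hz, 2500 Hz, 3500 Hz, ...}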
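The pole frequencies of the neutral vowel, F_n = (2n-1)·c/(4l), are easy to tabulate. The sketch below reproduces the values for a 17.5 cm tract with c = 350 m/s; the function name and the count parameter are illustrative choices.

```python
def uniform_tube_formants(length_m, c=350.0, count=4):
    """Resonances of a lossless uniform tube, closed at the glottis and open at the lips."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]

formants = uniform_tube_formants(0.175)  # about 500, 1500, 2500 and 3500 Hz
```

A shorter tract scales all the resonances up by the same factor, since every F_n is inversely proportional to the length.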
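Two tubes
Connect two homogeneous tubes
[Figure: two cascaded T-networks with currents Iin, I1 and Iout, series impedances Za1, Za2 and shunt impedances Zb1, Zb2]
Tube 2 is the tube on the glottis side (input current Iin) and tube 1 the tube on the lip side (output current Iout).
Node 1: $$(I_{in} - I_1) Z_{b2} = I_1 \left( Z_{a2} + Z_{a1} + \frac{Z_{a1} Z_{b1}}{Z_{a1} + Z_{b1}} \right)$$
Node 2: $$(I_1 - I_{out}) Z_{b1} = I_{out} Z_{a1} \;\Rightarrow\; I_{out} = I_1 \frac{Z_{b1}}{Z_{a1} + Z_{b1}}$$
Even longer calculations give:
$$H = \frac{I_{out}}{I_{in}} = \frac{1}{\cosh\dfrac{j\omega l_1}{c}\,\cosh\dfrac{j\omega l_2}{c}\left(1 + \dfrac{A_2}{A_1}\tanh\dfrac{j\omega l_1}{c}\,\tanh\dfrac{j\omega l_2}{c}\right)}$$
Poles when
$$\frac{A_2}{A_1}\tan\frac{\omega l_1}{c}\,\tan\frac{\omega l_2}{c} = 1$$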
Example with two tubes
A male speaker produces a vowel with a constricted pharynx (the pharyngeal area is one eighth of that in the oral cavity). Calculate the first two formant frequencies.
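With the pharynx as tube 2 and the oral cavity as tube 1:
$$l_1 = l_2 = \frac{l}{2}, \qquad \frac{A_2}{A_1} = \frac{1}{8}, \qquad l = 17.5\ \text{cm for a male speaker}, \qquad c = 350\ \text{m/s}$$
$$\frac{A_2}{A_1}\tan\left(\frac{\omega l_1}{c}\right)\tan\left(\frac{\omega l_2}{c}\right) = 1 \;\Rightarrow\; \tan^2\left(\frac{\omega l}{2c}\right) = \frac{A_1}{A_2} \;\Rightarrow\; \tan^2\left(\frac{2\pi F \cdot 0.175}{2 \cdot 350}\right) = 8$$
$$F_n = \frac{2000}{\pi}\left[\pm\arctan\sqrt{8} + n\pi\right] \;\Rightarrow\; F_1 = 784\ \text{Hz},\; F_2 = 1216\ \text{Hz}$$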
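The two-tube example can be checked numerically. The sketch below assumes two tubes of equal length l/2, so the pole condition reduces to tan²(ωl/(2c)) = area ratio, where the ratio is the oral area divided by the pharyngeal area (8 in this example). The function and parameter names are illustrative.

```python
import math

def two_tube_formants(length_m, oral_to_pharynx_area_ratio, c=350.0, count=2):
    """Lowest pole frequencies (Hz) for two equal-length tubes, lossless case."""
    base = math.atan(math.sqrt(oral_to_pharynx_area_ratio))
    scale = c / (math.pi * length_m)  # converts the angle w*l/(2c) into Hz
    candidates = []
    for n in range(count + 1):
        candidates.append(scale * (n * math.pi + base))
        if n > 0:
            candidates.append(scale * (n * math.pi - base))
    return sorted(candidates)[:count]

f1, f2 = two_tube_formants(0.175, 8.0)  # about 784 Hz and 1216 Hz
```

With an area ratio of 1 the function falls back to the uniform tube and returns 500 Hz and 1500 Hz, which is a useful sanity check.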
Consonants
The source is somewhere else than at the glottis
Some cavities may be closed:
e.g. the mouth cavity for nasals
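Consonants
Two homogeneous tubes in series
[Figure: two cascaded T-networks (Za1, Zb1 and Za2, Zb2) with a power source U and output current Iout]
$$\frac{I_{out}}{I_{in}} \propto \frac{\sinh\theta_2}{\cosh\theta_1\cosh\theta_2\left(1 + \dfrac{Z_1}{Z_2}\tanh\theta_1\tanh\theta_2\right)}$$
where θi = γ·li and Zi is the characteristic impedance of tube i.
The same poles as before, but now zeros as well, when |sinh θ2| = 0!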
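In the lossless case θ2 = jω·l2/c, so |sinh θ2| = 0 when sin(ω·l2/c) = 0, which places the zeros at multiples of c/(2·l2). The sketch below lists the first few for an assumed cavity length of 10 cm; that length and the function name are hypothetical.

```python
def zero_frequencies(l2_m, c=350.0, count=3):
    """Frequencies (Hz) where sin(w * l2 / c) = 0, leaving out the trivial zero at 0 Hz."""
    return [n * c / (2.0 * l2_m) for n in range(1, count + 1)]

zeros = zero_frequencies(0.10)  # about 1750, 3500 and 5250 Hz
```

A shorter tube 2 pushes the zeros upwards in frequency.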
Formant Synthesis
OVE II
Model the poles directly instead!
Digital resonators
Formants and bandwidths
An all-pole model: resonances when the denominator is zero.
Bandwidths: Function of energy losses due to heat conduction, viscosity, cavity-wall motions, radiation of sound from the lips and the real part of the glottal source impedance.
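One common way to realize a formant with a given frequency and bandwidth is a second-order digital resonator, y[n] = a·x[n] + b·y[n-1] + c·y[n-2]. The sketch below uses the coefficient formulas familiar from Klatt-style formant synthesizers; it is an assumption that this matches the resonators shown on the slides, and the example values (500 Hz, 50 Hz, 10 kHz) are illustrative.

```python
import math

def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients (a, b, c) of a two-pole resonator with unity gain at 0 Hz."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonate(signal, freq_hz, bandwidth_hz, sample_rate_hz):
    """Filter a list of samples through one formant resonator."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0  # the two previous output samples
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

# Step input: the output rings at the formant frequency and settles at 1.0.
step_response = resonate([1.0] * 4000, 500.0, 50.0, 10000.0)
```

A wider bandwidth moves the poles away from the unit circle, so the ringing dies out faster; this is the digital counterpart of larger energy losses.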
Synthesis by rule
Lab exercise 3
Formant synthesis lab
http://www.speech.kth.se/courses/GSLT_SS/lab3.html
• The task is to adapt the synthesis of "Dom flyttade möblerna" to sound like a target speaker.
• Start Wavesurfer and open the reference sentence in the "Speech Analysis" configuration.
• Create a transcription pane: right-click > "Create Pane > Transcription". Download the automatic transcription and load it by right-clicking in the transcription pane > "Load transcription".
• Start a new Wavesurfer (File > New), choose "Formant Synthesis sw".
• Type "Dom flyttade möblerna" into the Text slot and Synthesize.
• Edit the parameters that are displayed: F0, F1-F4.
- Left-click and drag the parameter track.
- Insert new control points: right-click on a parameter track.
• To make a phoneme longer/shorter: click in the transcription window and drag to the left/right.
• How close do you get by just editing pitch, formants and duration?
Data-driven formant synthesis
Keeps the flexibility of the formant synthesis
More natural sounding than rule-driven synthesis
Speaker adaptation
[Figure: frequency (0 to 4000 Hz) against time (0 to 0.9 s), with segment labels M O B I: L sil]
Formant unit selection
Formants are chosen through unit selection from a formant diphone library of about 2000 diphones.
Formant trajectories are scaled and interpolated to fit rule-generated durations.
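The scaling step can be pictured as resampling a stored formant track to the number of frames that the duration rules ask for. The sketch below uses plain linear interpolation; the function name, the frame-based representation and the example values are assumptions, not the method of the system described here.

```python
def fit_to_duration(track, n_frames):
    """Linearly resample a formant track (Hz values per frame) to n_frames frames."""
    if n_frames == 1 or len(track) == 1:
        return [track[0]] * n_frames
    out = []
    for i in range(n_frames):
        pos = i * (len(track) - 1) / (n_frames - 1)  # position in the source track
        lo = int(pos)
        hi = min(lo + 1, len(track) - 1)
        frac = pos - lo
        out.append(track[lo] * (1.0 - frac) + track[hi] * frac)
    return out

f1_unit = [300.0, 500.0, 700.0]          # made-up F1 values of a stored diphone
stretched = fit_to_duration(f1_unit, 5)  # 300, 400, 500, 600, 700
```

The same call with a smaller frame count compresses the unit instead of stretching it.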
Synthesis comparison
[Figure: frequency (0 to 4000 Hz) against time (0 to 0.9 s), with segment labels M O B I: L sil; curves: Data-driven, Rule-driven]
Listening test evaluations
• 15 subjects, 20 sentences, continuous scale from "Unnatural" to "Natural".
• 4 types of stimuli:
1. Rule-based synthesis
2. Data-driven formant synthesis
3. + Data-driven fricative synthesis
4. + Replace the voiceless fricatives (/f/, /s/, /sj/, /tj/, /rs/) and plosives (/k/, /p/, /t/, /rt/) with recorded versions.
• 12 subjects, 10 sentences, binary scale
• Data-driven synthesis with manually corrected formant data was preferred in 73 % of the cases over rule-driven synthesis
Evaluation results
Rule-driven and Data-driven
[Figure panels: Overall; Sentences without critical errors; Sentences with critical errors; Hand-corrected sentences]
Evaluation
Evaluation: Why?
• Monitoring the development
– Initial: choosing a "good" voice, a good inventory.
– Progress evaluation
– Diagnostic evaluation: find out where things go wrong and why.
• Performance Evaluation
intelligibility, comprehensibility, quality, adequacy, usability
For developers: Overall quality evaluation
For users: Comparative evaluation
Diagnostic evaluation: Different levels
• Segmental: intelligibility tests on the ability to distinguish individual sounds.
– Diagnostic Rhyme Test (DRT)
– Modified Rhyme Test (MRT)
– Minimal Pair Intelligibility Tests (MPIT)
– Phonetically Balanced Word List (PB)
– Nonsense words
• Sentence: comprehension of words or short sentences
• Comprehension: more than one sentence.
• Prosody: assessment of intonation and emotions
• Subjective opinions
Standard procedures are available only for segmental evaluation
Diagnostic Rhyme Test
• Consonant intelligibility in word initial position.
• 96 word-pairs to test 6 characteristics:
Voicing: veal-feel
Nasality: need-deed
Sustension: vee-bee, sheet-cheat
Sibilation: sing-thing
Graveness: weed-reed
Compactness: key-tea
• Forced choice
• Intelligibility = number of correct identifications compared to all words.
• Diagnostic information given in confusion matrices.
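The intelligibility measure and the confusion counts can be computed directly from the listeners' forced-choice answers. The sketch below follows the simple definition given here (correct identifications divided by all words, with no correction for guessing); the response data are made up for illustration.

```python
from collections import Counter

def drt_score(responses):
    """responses: list of (presented_word, chosen_word) pairs from a forced-choice test."""
    correct = sum(1 for presented, chosen in responses if presented == chosen)
    confusions = Counter((p, c) for p, c in responses if p != c)
    return correct / len(responses), confusions

# Hypothetical answers from one listener.
responses = [
    ("veal", "veal"), ("veal", "feel"), ("feel", "feel"),
    ("key", "key"), ("tea", "key"),
]
score, confusions = drt_score(responses)  # score is 0.6
```

Grouping the confusion counts by characteristic (voicing, nasality and so on) gives the diagnostic view that points to where the synthesis fails.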
Pros and cons of DRT
Pros
• Limited number of stimuli, not too time consuming
• Naive listeners can take part
• Easy to interpret the results
• Confusion matrices help to localise the problems
Cons
• Consonantal intelligibility only in word initial position
=> Modified Rhyme Test: initial and final. Choice: din, sin, fin, pin, win, tin.
• isolated contexts
• closed response format
Minimal pairs intelligibility test
• Nonsense sentences. Forced choice (two alternatives).
a) "the uniform towels snitch a sniffer" / b) "the uniformed towels snitch a sniffer"
– Forced choice between: uniform - uniformed
• Phonetic features:
– Consonant and vowel substitution: copper-chopper, tutor-teeter
– Consonant insertion/deletion: attitudes-altitudes
– One-feature substitutions: ringers-riggers
– Two-features substitutions: burnish-furnish
– Word initial: gasket-basket
– Word internal: musty-musky
– Word final: familiar-familial
• Segment location: stressed, unstressed
• Word location: initial, medial, final
Evaluation problems
• It is unrealistic to test one level at a time: they are not independent.
• Can we really evaluate the intelligibility of TTS at segmental level?
• Is intelligibility more important than naturalness?
• Limitations of subjective tests
– Learning effects
– Concentration problems
– Choice of listeners: naive or expert?
• Is it possible to build objective tests?
Comprehension test
Single-task performance measure: listen and understand 2 passages and answer ten multiple-choice questions.
Subjects: 2 groups, one listening to synthetic speech, one listening to natural speech.
Results: no significant differences between understanding synthetic and natural speech.
Multiple-task performance measure: listen and understand 1 passage and at the same time detect clicks occurring in the passage.
Subjects: same
Results: Subjects who listened to synthetic speech took longer to identify the clicks.
Check mental load
De Logu et al. 1998
Comprehension tests are difficult to construct due to the intervention of cognitive factors.
Subjective opinion tests
• Listeners are presented with a set of stimuli to be rated on:
– Overall impression & acceptance (quality)
– Listening effort, comprehension (intelligibility)
– Pronunciation, speaking rate & voice quality (naturalness)
Mean Opinion Score: Evaluates the general speech quality
5 excellent, 4 good, 3 fair, 2 poor, 1 bad
Degradation Mean Opinion Score: Evaluates how disturbances are perceived.
5 Inaudible, 4 Audible, not annoying, 3 Slightly annoying, 2 Annoying, 1 Very annoying
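Computing a Mean Opinion Score amounts to averaging the listeners' ratings on the five-point scale. A minimal sketch follows, with invented ratings and an illustrative function name.

```python
from statistics import mean

MOS_LABELS = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "bad"}

def mean_opinion_score(ratings):
    """Average of ratings given on the five-point MOS scale."""
    if any(r not in MOS_LABELS for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)

ratings = [4, 5, 3, 4, 4]           # hypothetical ratings from five listeners
mos = mean_opinion_score(ratings)   # 4, i.e. "good" on average
```

The Degradation MOS can be computed in the same way; only the labels attached to the five steps differ.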
Comparing systems
• No standard procedures are available to carry out comparative evaluations of systems.
• Most common is to use preference scores:
(- System A much better)
- System A better
- No difference
- System B better
(- System B much better)
[Figure: preference scores (0 to 100 %) for M HTS, M HTS prosody, M HTS spectrum, F HTS, F HTS prosody and F HTS spectrum; categories: Much worse, Worse, No difference, Better, Much better]
Synthesis of the future
Speaker adaptation
• Why?
– Make the synthesis more human-sounding, more diverse, more personalized.
– Synthesize ordinary speech of ordinary people!
• What?
– The non-linguistic (?) features of the acoustic signal: voice quality, gender, age, dialect, sociolect.
• How?
– Record the speaker as target or adapt the synthesis (by statistics or rules)
• Various contexts: low, raspy voices, strong, commanding voices, children’s and old persons’ voices, promotional voices, emotional voices, etc.
Speaker characteristics
Linguistic vs. Individual components
• The linguistic component: semantic information that is part of the speaker’s language (e.g. question intonation)
• The paralinguistic component: the speaker’s attitudinal or emotional states, sociolect and regional dialect.
• The extralinguistic component: the individuality, gender and age of a certain speaker. It can be judged independently of the language.
To adapt a speech synthesizer to a certain speaker, we need both the para- and extralinguistic components.
Speaker Variability: Dialect
• Different dialects use different phonemes for the same word
– e.g.: British vs. American "better"
– British vs. Australian "say"
• Different dialects use different allophones for the same phoneme:
– Swedish: Öga/Öra, Äga/Ära (Värmland-Östergötland)
• Differences in prosody and accent.
Speaker Variability: Individual Differences
Within-Speaker Variability
• Can change F0 and voice quality
Between-Speaker Variability
• Cannot change basic physiology (lungs, vocal folds, vocal tract…), which limits ranges of F0 and voice qualities
• Difficult to change the
– Sociolect: Level of education/social environment
– Personal history
Sociolect
New York City department store study: [r] in 'fourth floor'

SHOP           [r]%   STATUS
Saks 5th Av.   62%    high
Macy's         51%    middle
Klein's        20%    low
Swedish: Liiidingö, sju
[Figure: mean normalized F0 in vowels (in Bark) for different languages: RP English, French, Am English, Swedish, Dutch; axis from 0,5 to 1,2]
Sounding Gay
Fricative duration
• Crist (1997) - 5 out of 6 speakers exhibited longer /s/ in gay stereotyped speech
• Linville (1998) - gay speakers had longer /s/
• Rogers, Smyth, and Jacobs (2000) - both /s/ and /z/ were longer in gay-sounding speech
• Levon (2004) - altering sibilant duration alone insufficient to change perception of gayness
Emotions
Creating emotional speech synthesis of a text requires:
a. Signal Processing: algorithms for altering the acoustic prosodic parameters of the speech.
b. Prosody Modeling: Creating typical patterns corresponding to different emotions.
c. Text Analysis: finding textual cues to prosody and the expressive intention of a text.
Two approaches
1. To design a general method of assigning a given expressive intention to any text.
– An ongoing and challenging task, involving research on signal processing, speech acoustics and human communication.
2. Enriching synthetic messages with expressive phrases and sounds, which convey expressive intentions.
– A commercially available solution: e.g., Loquendo, IBM.
<speak version="1.0" xml:lang="en-US">Yes sir, the package will be on your desk tomorrow. And I say that with the utmost confidence. I will take care of it. How will I take care of it? I don’t know how I’m going to take care of it. If I knew how to take care of it </speak>
<prosody emotion=“calm” attitude=“confident”>Yes sir <mm_hmm/> the package will be on your desk tomorrow. And I <prosody ToBI=“H*”/> say that with the utmost confidence. <emphasis> I </emphasis> will take <emphasis> care </emphasis> of it. </prosody><prosody emotion=“despair”> <creakiness=“high”> <groan/> How will I take care of it? I don’t<prosody ToBI=“H*”/> know how I’m going to take care of it. <sigh/> If I knew how to take care of it <sobbing/></prosody>
Emotion analysis
How to determine synthesis parameters for different emotions?
– Professional acting
– Amateur acting
– Read a text with different emotions
• Acted and read speech is widely used, but… Does it reflect the way emotions are expressed in spontaneous speech?
Alternatives:
• Wizard of Oz scenarios
• Customer calls to call centres
– Lots of real emotional speech
– But, permissions?
• TV shows (Oprah, Ricki Lake, Dr. Phil etc.)
Emotion databases
Again, two approaches:
1. Create large databases for each emotion you want to synthesize and use the entries as such
• E.g. diphones
• Duplicate the database for each emotion…
2. Modify the default output signal from the synthesizer using emotion rules
• Small set of phonetically balanced sentences (25 or so)
• Sentences without emotional content, e.g. "The competitor has made twenty five offers, closing only five contracts"
• Compare with a Neutral style.
Emotion correlates (Loquendo)
[Figure: four panels, each showing values in % for angry, happy and sad speech, with one bar per syllable type (FS, S, LS, U): Syllable duration, Mean F0, F0 Range, RMS Energy]
FS: the first stressed syllable of the sentence or after a speech pause
S: stressed syllable
LS: the last stressed syllable of the sentence or before a speech pause
U: unstressed syllable
Emotion synthesis scheme
• Input text → Acoustic unit selection → signal
• Analysis prosodic parameters: Energy, Duration, Pitch
• Expressive style → "E" rules, "D" rules, "P" rules
• Synthesis prosodic parameters: Energy, Duration, Pitch
• Time and Pitch Scaling + Gain function (PSOLA) → Output Waveform
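The rule step of such a scheme can be pictured as a table of relative changes that turns the analysis prosodic parameters into synthesis prosodic parameters. In the sketch below the rule table, the parameter units and all numbers are invented for illustration; they are not Loquendo's rules or measured emotion correlates.

```python
# Hypothetical "E", "D" and "P" rules: relative change in % for each expressive style.
STYLE_RULES = {
    "neutral": {"energy": 0.0, "duration": 0.0, "pitch": 0.0},
    "sad": {"energy": -5.0, "duration": 8.0, "pitch": -10.0},
    "angry": {"energy": 6.0, "duration": -4.0, "pitch": 15.0},
}

def apply_style(analysis_params, style):
    """Scale energy, duration and pitch of one unit according to the style rules."""
    rules = STYLE_RULES[style]
    return {name: value * (1.0 + rules[name] / 100.0)
            for name, value in analysis_params.items()}

unit = {"energy": 60.0, "duration": 0.20, "pitch": 120.0}  # made-up analysis values
sad_unit = apply_style(unit, "sad")  # lower pitch, lower energy, longer duration
```

The resulting targets would then be handed to the time and pitch scaling and the gain function, which modify the selected units accordingly.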
Examples
[Figure: frequency in Hz (50 to 300) against time in seconds (0 to 1,4) for neutral, sad, happy and angry speech]
Many more on http://emosamples.syntheticspeech.de/
Loquendo’s Susan
Evaluation results
• Texts without emotional content.
- "The competitor has made twenty five offers, closing only five contracts"
• Volunteers listened to samples in random order and rated from 0 to 5 how sad, angry, happy or neutral each stimulus sounded.
[Figure: ratings (axis from 0 to 4.5) of how neutral, angry, happy and sad the stimuli sounded, for neutral (TTS), angry (TTS), happy (TTS) and sad (TTS)]
Emotional questions
But, the closer we get to “real” emotions, the more difficult it is to recognize them!
Up to 95% correct identification on acted speech
Up to 79% on read speech
Up to 73% on lab-recorded dialogue data
What is the goal of expressive synthesis?
Convey an emotion?
Make the synthesized emotion sound natural?
And, how many emotions do we have?
Four?
Seven? (Ekman: neutral + sadness, happiness, anger, fear, disgust, surprise)
Why 'real speech’ synthesis?
• 'Yeah!', 'Right on!', 'Fantastic!', 'Hi!'
• Why can’t we synthesize ‘real speech’?
– Because we assume that words alone carry most of the meaning in speech
– But the '85%' (?) of speech which is non-verbal is largely monosyllabic
– Monosyllables can be very repetitive - unless they vary in another dimension
• Voice quality: spectral features, voice source features and temporal features (e.g., voice on-/offsets, jitter, creak, etc.).
But how to synthesize?
The NATR approach
• 1000 hours of everyday conversation
• Recorded with head-mounted mic to DAT and Minidisc
• Analyzed acoustically, manually transcribed, & perceptually labeled
• No studio use, no recording constraints
• Japanese native-language speakers, mixed ages, in everyday situations
=> A paralinguistic speech corpus
Acoustic Analysis
• Boundaries of quasi-syllabic nuclei
• Quasi-syllable boundaries
• F0 contour
• Sonorant energy contour
• (a) Variance in delta-cepstrum
• (b) Formant / FFT cepstral distance
• Composite (a & b) measure of reliability
• Glottal AQ: pressed to breathy
• Estimated vocal-tract area-functions
• Phonetic labels (if available)
Discourse Act Labelling
a あいさつ greeting
b 会話終了 closing
c 自己紹介 introduce-self
d 話題紹介 introduce-topic
e 情報提供 give-information
f 意見、希望 give-opinion
g 応答肯定 affirm
h 応答否定 negate
i 受け入れ accept
j 拒絶 reject
k 了解、理解、納得 acknowledge
l 割り込み, 相づち interject
m 感謝 thank
n 謝罪 apologize
o 反論 argue
p 提案、申し出 suggest, offer
q 気づき notice
s つなぎ connector
r 依頼、命令 request-action
t 文句 complain
u 褒める flatter
w 独り言 talking-to-self
x 言い詰まり disfluency
y 演技 acting
z 繰り返し repeat
r* 要求 request (a~z)
v* 確認を与える verify (a~z)
Speaking style (and voice) vary greatly, depending upon
(a) the situation
(b) who we are speaking to ...
(c) how we feel about what we are saying!
Concept-to-speech: why?
105 question about my bill 63 question on my bill 57 calling about my bill 43 talk to somebody about my bill 41 talk to someone about my bill 32 questions about my bill 30 problem with my bill 23 speak to someone about my bill 22 calling about a bill 20 calling about my phone bill 16 questions on my bill 16 question about a bill 15 talk about my bill 11 question about my phone bill 11 question about my billing 11 discuss my bill 10 speak with someone about my bill
10 calling about my billing 9 problem with my phone bill 9 calling about my telephone bill 8 speak to someone in billing 8 question about the bill 7 speak to somebody about my bill 7 speak to a billing 7 question on my phone bill 7 calling regarding my bill 7 calling concerning my bill 6 talk to somebody in billing 6 questions about my billing 6 question on my billing
5 talk to someone about a bill 5 talk to somebody about my billing 5 talk to somebody about a bill 5 speak to someone in the billing 5 speak to someone about a bill 5 questions on my billing 5 question on the bill 5 question on a bill 5 question my bill 5 calling in regards to my bill 5 calling about the bill 4 talk to someone about my telephone bill 4 talk to somebody about my account 4 talk to billing 4 speak with someone in billing 4 question about my telephone bill 4 information on my bill 4 calling regarding my statement .............. 1 talk to someo- to someone about my moms telephone bill 1 question about the new A T and T billing
Total 1083 variations in 1912 matches
Ways to say “question about my bill” to AT&T:
Humans do not read a text aloud, we talk!
6 problem with my billing 6 information about my bill 6 calling about my A T and T bill 5 talk to someone about my phone bill
So is future text-to-speech synthesis just cut and paste from an enormous database?
Concept-to-speech: what?
• Input: Abstract presentation goal or a machine-generated message.
• Output: Syntactic structure for concept-to-speech synthesis
• Language-independent text planning component
• Language-specific domain grammars
• Enriched information passed to synthesis
Concept-to-speech: how?
Slot filling or generation?
Either: Put key information into different carrier phrases
Or: Generate utterances based on content.
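The slot-filling alternative can be sketched in a few lines: key information is inserted into fixed carrier phrases. The carrier phrases, concept names and slot values below are invented for illustration.

```python
# Hypothetical carrier phrases for a timetable domain.
CARRIER_PHRASES = {
    "departure": "The train to {destination} leaves at {time}.",
    "delay": "The train to {destination} is delayed by {minutes} minutes.",
}

def fill_slots(concept, **slots):
    """Insert the key information into the carrier phrase chosen for the concept."""
    return CARRIER_PHRASES[concept].format(**slots)

utterance = fill_slots("departure", destination="Uppsala", time="10:15")
```

Generation from content would replace the fixed phrase table with a text planning component and a domain grammar, which can also pass enriched information on to the synthesizer.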