IBM Labs in Haifa © 2004 IBM Corporation
Embedded Concatenative Text-to-Speech
Ron Hoory, Zvi Kons, Dan Chazan, Slava Shechtman
Media Services and Technologies Group
October 14, 2004
Why Text-to-Speech? Why Concatenative?
• Text-to-speech eliminates the need to prerecord all possible messages
• The alternative, recorded prompts, is much less flexible:
  • Cannot synthesize words or phrases outside the inventory
  • Adding new prompts is expensive
  • No expression: prosody cannot be controlled, especially when combining prompts
• Text-to-speech (TTS) can synthesize arbitrary text
• In concatenative text-to-speech (CTTS), small segments of speech are selected from a large speech database and concatenated together
The role of TTS in a conversational system
• Critical component of the conversational interface
• Only way to present information in eyes-busy/unavailable situations (car, phone)
• Quality of a conversational system is often equated with TTS quality
[Figure: conversational loop. voice → Speech Recognition → text → Natural Language Understanding → ‘meaning’ → Dialog Manager → ‘meaning’ → Natural Language Generation → text → Speech Synthesis → voice]
How does a CTTS system work?
[Figure: CTTS pipeline. Front-End: text → Normalization → Text to Unit Conversion and Text to Prosody Targets (using prosody models). Back-End: Segment Selection (from the database) → Post-Search Modification → speech. Normalization example: "We visited Rodeo Dr." becomes "We visited Rodayo Drive."]
Language dependency and other considerations
• The front-end is mostly language dependent, relying on language rules, pronunciation dictionaries, etc.
• The back-end is mostly language independent, except for the speech database (a.k.a. “voice”)
• The voice needs to be recorded:
  • In the desired language and accent, e.g., “Canadian French”
  • With the desired speaker (“voice talent”):
    • male/female
    • low/high pitch
    • slow/fast speaking rate
  • In a professional recording studio with professional equipment
  • With a sampling rate above the target sampling rate (usually 22 kHz)
Text Normalization
• Language-independent text cleaning (HTML tags, etc.)
• Language-dependent normalization for dates, time, numbers, currency, phone numbers, addresses, abbreviations
• Examples:
  • St. Martin St. becomes Saint Martin Street
  • Dr. King Dr. becomes Doctor King Drive
  • 1 oz. becomes one ounce
  • 2 oz. becomes two ounces
  • $5 million becomes five million dollars
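The examples above can be reproduced with a few context rules. The following is a minimal sketch, not the IBM front-end: the rule tables `AMBIGUOUS` and `NUMBER_WORDS`, the function `normalize`, and the "title before a capitalized name, suffix otherwise" heuristic are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny rule tables (assumptions for illustration).
# Each ambiguous abbreviation maps to (reading as a title, reading as a suffix).
AMBIGUOUS = {"St.": ("Saint", "Street"), "Dr.": ("Doctor", "Drive")}
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five"}


def _ounces(match):
    # "1 oz." is singular, every other count is plural.
    n = match.group(1)
    unit = "ounce" if n == "1" else "ounces"
    return f"{NUMBER_WORDS.get(n, n)} {unit}"


def _dollars(match):
    # "$5 million" is read with the currency name moved to the end.
    n, scale = match.group(1), match.group(2)
    return f"{NUMBER_WORDS.get(n, n)} {scale} dollars"


def normalize(text):
    """Expand a handful of units, currency amounts and abbreviations."""
    text = re.sub(r"\b(\d+) oz\.", _ounces, text)
    text = re.sub(r"\$(\d+) (million|billion)\b", _dollars, text)
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if tok in AMBIGUOUS:
            title, suffix = AMBIGUOUS[tok]
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Assumed heuristic: before a capitalized name the abbreviation
            # is a title; at the end of the phrase it is a street suffix.
            is_title = nxt[:1].isupper() and nxt not in AMBIGUOUS
            out.append(title if is_title else suffix)
        else:
            out.append(tok)
    return " ".join(out)


print(normalize("Dr. King Dr."))
```

This heuristic is too weak for running text: in "We visited Rodeo Dr. We visited ...", the capitalized word that starts the next sentence would wrongly trigger the title reading, so a real system also needs sentence-boundary detection and larger dictionaries.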
Possible Concatenation Units
• Words
• Syllables
• Demi-syllables
• Diphones
• Augmented diphones
• Phone units
• Subphone units
Concatenation Units in the IBM CTTS system
• HMM state-sized segments (3 states per phone)
• Segments are classified according to their phonetic context:
  • Phonetic context is determined by a binary decision tree with questions on neighboring phones
  • Segments are labeled according to the leaves of the context-dependent decision tree
  • Typically 10-20 database occurrences per leaf label
• Text is first converted to phones using a pronunciation dictionary and then to leaves
[Figure: the HMM states S1, S2, S3 of a phone and the decision-tree leaves L1, L2, L3]
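A context-dependent decision tree of this kind can be sketched as follows. The phone sets, the two questions and the tree shape are hypothetical; the leaf names L1 to L3 are borrowed from the figure, and the real trees are trained from data rather than written by hand.

```python
from dataclasses import dataclass
from typing import Callable, Union

# Illustrative phone classes (assumptions, not the system's phone set).
VOWELS = {"AA", "AE", "AH", "EH", "IY", "OW", "UW"}
NASALS = {"M", "N", "NG"}


@dataclass
class Leaf:
    label: str


@dataclass
class Node:
    # A yes/no question about the (left, right) neighboring phones.
    question: Callable[[str, str], bool]
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def find_leaf(tree, left, right):
    """Walk the binary tree until a leaf is reached and return its label."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(left, right) else tree.no
    return tree.label


# Hypothetical tree for one HMM state of one phone.
TREE = Node(
    question=lambda left, right: left in NASALS,        # is the left phone a nasal?
    yes=Leaf("L1"),
    no=Node(
        question=lambda left, right: right in VOWELS,   # is the right phone a vowel?
        yes=Leaf("L2"),
        no=Leaf("L3"),
    ),
)

print(find_leaf(TREE, "M", "T"))
```

In this sketch one such tree would exist for each of the three states of each phone, so a phone sequence maps to three leaf labels per phone.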
Prosody modeling
• Prosody is critical for obtaining the right pronunciation and intonation
  • Wrong prosody can cause speech to sound unnatural or even unintelligible
• Prosody targets typically include:
  • Pitch
  • Phone durations
  • Energy
• Prosody parameters can be trained to match the target speaker's prosody
How can prosody affect naturalness?
• Expressiveness is a very important factor in speech naturalness. Controlling prosody can generate expression
• Neutral prosody:
• Expressive prosody:
Segment Selection and Post-Search Modification
• Each segment is selected from all the database candidates labeled with the target leaf label
• Dynamic programming is used to optimize the series of selected segments by minimizing a cost function
• Cost function weights:
  • Proximity to prosody targets (pitch, duration)
  • Continuity between consecutive segments chosen
    • Spectrum continuity
    • Pitch continuity
• Post-search modifications are carried out to modify the pitch, duration and energy to match the target prosody
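The dynamic-programming search can be illustrated with a small Viterbi-style sketch. It is a simplification under stated assumptions: the `Segment` fields, the absolute-difference costs and the weights are invented for the example, and spectrum continuity is left out, so only pitch continuity contributes to the join cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    seg_id: int
    pitch: float      # Hz
    duration: float   # ms


def target_cost(seg, target_pitch, target_dur, w_pitch=1.0, w_dur=0.5):
    # Proximity to the prosody targets (weights are arbitrary assumptions).
    return w_pitch * abs(seg.pitch - target_pitch) + w_dur * abs(seg.duration - target_dur)


def join_cost(prev, cur, w_cont=1.0):
    # Pitch continuity only; a spectral distance would be added here.
    return w_cont * abs(prev.pitch - cur.pitch)


def select_segments(candidates, targets):
    """candidates[t]: segments sharing leaf t; targets[t]: (pitch, duration)."""
    # best[t][j] = (cost of the cheapest path ending in candidates[t][j], backpointer)
    best = [[(target_cost(s, *targets[0]), -1) for s in candidates[0]]]
    for t in range(1, len(candidates)):
        row = []
        for s in candidates[t]:
            cost, back = min(
                (best[t - 1][k][0] + join_cost(p, s), k)
                for k, p in enumerate(candidates[t - 1])
            )
            row.append((cost + target_cost(s, *targets[t]), back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][j])
        j = best[t][j][1]
    return path[::-1]


candidates = [
    [Segment(0, 100.0, 50.0), Segment(1, 120.0, 50.0)],
    [Segment(2, 121.0, 50.0), Segment(3, 140.0, 50.0)],
]
targets = [(105.0, 50.0), (125.0, 50.0)]
print([s.seg_id for s in select_segments(candidates, targets)])
```

In this toy input, segment 0 is closest to the first target on its own, but the search picks segment 1 because it joins more smoothly with segment 2. This is the trade-off the cost function encodes.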
IBM high-quality CTTS with super-voices
• Building CTTS voices includes:
  • Voice recording using a predefined script
  • Limited manual work for “cleaning”
  • Intensive automatic processing
• IBM super-voices:
  • Very large recording script:
    • Usually 10,000 sentences are read by the speaker
    • 15 hours of audio, 11 hours of speech excluding silence
  • Script reflects typical scenarios
  • Professional recording studio and professional speakers
  • Three-stage audition process for final voices
Footprint and environments
• The size of the voice dataset is a crucial parameter for quality and naturalness
• The CTTS system can operate in various environments:
  • Server: typical footprint of 500-1000 MB
  • Desktop: typical footprint of 50-100 MB
  • Embedded: typical footprint of 5-10 MB
• The embedded concatenative text-to-speech (eCTTS) challenge: can we reduce the size of the voice by two orders of magnitude without severely degrading the quality?
Why eCTTS?
• Server-based CTTS requires a connection (wired or wireless) to the server, which is not always available
• Device manufacturers and car manufacturers usually prefer embedded applications running locally
• Even with the growing amount of resources available on embedded devices and in-car systems, small-footprint eCTTS is required:
  • Memory and processing power are important factors for the price of embedded devices
  • Typically, the system includes many other components
  • Sometimes several languages should be supported
eCTTS in the automotive market
• IBM’s eCTTS is part of the Honda speech interface in 2005 high-end cars
• Embedded ViaVoice includes embedded TTS and speech recognition, providing a full conversational system
• Main usage is for navigation applications
How does eCTTS work?
[Figure: eCTTS pipeline. text → TTS Front End → leaves → Segment Selection (feature vectors from the speech dataset) → Segment adjustment & Concatenation (pitch, energy, duration) → features → Feature Reconstruction → speech]
How is the x100 size reduction achieved?
• Reducing the number of segments by segment preselection
• Reusing the same data for several purposes
• Using a more efficient speech model
• Data compression
[Figure: voice dataset size reduction]
Segment Preselection
• The process:
  1. A voice dataset is built with all the segments in place (typically 1M segments)
  2. A large number (100K) of sentences are synthesized, and statistics on the selected segments are collected
  3. The fraction of segments that were selected most frequently is kept
• Typically 7-10% are chosen, resulting in ~100K segments and a coverage of 70-80% of the selections made during the statistics collection
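The counting step of this process can be sketched in a few lines. The function name `preselect`, its parameters and the toy selection log are assumptions for illustration; in the real process the log would come from synthesizing the 100K sentences with the full voice.

```python
from collections import Counter


def preselect(selection_log, total_segments, keep_fraction):
    """Keep the most frequently selected segments.

    selection_log: segment ids chosen during the statistics run (one entry
    per selection). Returns the kept ids and the fraction of logged
    selections they cover.
    """
    counts = Counter(selection_log)
    n_keep = max(1, round(total_segments * keep_fraction))
    kept = [seg for seg, _ in counts.most_common(n_keep)]
    covered = sum(counts[seg] for seg in kept)
    return set(kept), covered / len(selection_log)


# Toy run: 10 segments in the voice, keep 20% of them.
log = [1, 1, 1, 2, 2, 3, 4, 1, 2, 5]
kept, coverage = preselect(log, total_segments=10, keep_fraction=0.2)
print(sorted(kept), coverage)
```

Here two of the ten segments account for 7 of the 10 logged selections. The slide reports the same kind of skew at scale: 7-10% of the segments cover 70-80% of the selections.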
Speech Model
• Frequency-domain sinusoidal model, with amplitude and phase computed for every pitch harmonic (voiced frames)
• Accurate representation of the spectral envelope, which is used in segment selection, pitch modification and reconstruction
[Figure: spectral envelope plotted against frequency, sampled at multiples of the pitch; harmonic i carries the complex amplitude A_i e^{jφ_i}]
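For a voiced frame, a harmonic sinusoidal model of this kind represents the signal as a sum of cosines at multiples of the pitch, x[n] = Σ_i A_i cos(2π i f0 n / fs + φ_i). The sketch below only illustrates that formula: the function name, the 22050 Hz sampling rate and the frame length are assumptions, and it says nothing about how the actual system stores or reconstructs its frames.

```python
import math


def synthesize_voiced_frame(f0, amps, phases, fs=22050, n_samples=256):
    """Sum of harmonics: amps[i], phases[i] belong to harmonic (i + 1) * f0."""
    frame = []
    for n in range(n_samples):
        t = n / fs
        sample = sum(
            a * math.cos(2.0 * math.pi * (i + 1) * f0 * t + p)
            for i, (a, p) in enumerate(zip(amps, phases))
        )
        frame.append(sample)
    return frame


# Two harmonics of a 110 Hz pitch, both starting at zero phase.
frame = synthesize_voiced_frame(110.0, amps=[0.5, 0.25], phases=[0.0, 0.0])
print(len(frame), frame[0])
```

Because the frame is described by one amplitude and one phase per harmonic, changing `f0` while resampling the amplitudes from the spectral envelope is one way such a model can support pitch modification.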
Demonstration
Language (one audio example * per voice):
• US English
• UK English
• Italian
• German
* All voices are 22 kHz/10 MB, except the German male voice, which is 11 kHz/8 MB