Personalized Singing Synthesis - NEDO
![Page 1: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/1.jpg)
Personalized Singing Synthesis - Science Meeting Arts
Haizhou Li
Institute for Infocomm Research, A*STAR, Singapore
International Symposium on Next-Generation Artificial Intelligence
3 March 2016, Tokyo
Acknowledgement: Yvonne Lee, Paul Chan, Dong Minghui
![Page 2: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/2.jpg)
Agenda: Personalized Singing Synthesis
1. Speech
2. Singing
3. Music
4. Speech to Singing
![Page 3: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/3.jpg)
Speech, Singing and Music
• Speech is the most natural form of human communication.
• Singing augments regular speech through the use of tonality and rhythm.
• Music is an artistic form of expression that combines instrumental sounds and vocals.
![Page 4: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/4.jpg)
Speech
![Page 5: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/5.jpg)
Elements in Human Voice
Human Language Technology: Speech
• Timbre: who you are
• Content: what you want to say
• Prosody: expression of affective state/emotion
![Page 6: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/6.jpg)
Speech Theory
• Speech and singing can be modelled as two processes: source generation and filtering.
• Air flow passes through the vocal folds to form the sound source (periodic pulses and noise).
• The source is then filtered by the vocal tract to produce the sounds.
[Figure: Air Flow (Source) → Vocal Tract (Filter) → Speech]
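The source-filter view above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the source is a pulse train at the pitch frequency with optional noise, and the vocal tract is approximated by a cascade of two-pole resonators. The function names, the 16 kHz sample rate, and the formant values for an /a/-like vowel are all assumptions chosen for the example.

```python
import numpy as np

def glottal_source(f0_hz, duration_s, sample_rate=16000, noise_level=0.0, seed=0):
    """Source: periodic pulse train at the pitch frequency, plus optional noise."""
    n = int(duration_s * sample_rate)
    source = np.zeros(n)
    period = int(round(sample_rate / f0_hz))  # samples between pulses
    source[::period] = 1.0
    if noise_level > 0:
        rng = np.random.default_rng(seed)
        source += noise_level * rng.standard_normal(n)
    return source

def resonator(signal, centre_hz, bandwidth_hz, sample_rate=16000):
    """Filter: one two-pole resonance, standing in for a single formant."""
    r = np.exp(-np.pi * bandwidth_hz / sample_rate)
    theta = 2 * np.pi * centre_hz / sample_rate
    a1, a2 = -2 * r * np.cos(theta), r * r
    out = np.zeros(len(signal))
    y1 = y2 = 0.0  # previous two output samples
    for i, x in enumerate(signal):
        y = x - a1 * y1 - a2 * y2
        out[i] = y
        y1, y2 = y, y1
    return out

def vocal_tract(source, formants, sample_rate=16000):
    """Cascade of resonators; output is normalised to a peak of 1."""
    out = source
    for centre_hz, bandwidth_hz in formants:
        out = resonator(out, centre_hz, bandwidth_hz, sample_rate)
    return out / np.max(np.abs(out))

# Illustrative (assumed) formant centres and bandwidths for an /a/-like vowel.
vowel = vocal_tract(glottal_source(120, 0.5), [(700, 80), (1200, 90), (2600, 120)])
```

Changing only `f0_hz` changes the pitch while the vowel quality stays the same; changing only the formant list changes the vowel while the pitch stays the same. That separation is what makes the model useful for conversion.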
![Page 7: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/7.jpg)
First Set of Vowel Tubes
• Kratzenstein’s acoustic resonators
– Apparatus created in St. Petersburg (1779)
– Figure shown from Schroeter (1993)
• Vowel tubes (SF Science Museum)
![Page 8: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/8.jpg)
Electronic Synthesizer: VODER
Homer Dudley (1896-1987), Bell Labs VODER, 1939 World's Fair in New York City
Source: 120 Years of Electronic Music: The history of electronic music from 1800 to 2015, http://120years.net/the-voder-vocoderhomer-dudleyusa1940/
![Page 9: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/9.jpg)
Continuing Evolution (1959-2013)
• Haskins, 1959
• KTH – Stockholm, 1962
• Bell Labs, 1973
• MIT, 1976
• MITalk, 1979
• Speak ‘N Spell, 1980
• Bell Labs, 1985
• DECtalk (voice morphing), 1987
• I2R Abacus Engine, 2013
![Page 10: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/10.jpg)
Elements in Human Voice
Human Language Technology: Speech
• Prosody
• Timbre
• Content
![Page 11: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/11.jpg)
Analysis and Synthesis
[Figure: speech → Analysis → Content, Prosody, Timbre → Synthesis → speech]
![Page 12: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/12.jpg)
Singing
![Page 13: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/13.jpg)
Singing – a part of our life
• Singing is emotional musical vocalization. It is an expression of feelings that has the power to alter the mood of both the singer and the listener.
• Singing requires expert knowledge and technique. Vocalists acquire singing skills through extensive training.
• By singing, we
– Reduce stress and improve mood
– Lower blood pressure
– Help improve sleep
– Reduce perceived pain
– Motivate and empower …
![Page 14: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/14.jpg)
Singing Theory – Vibrato
• Almost all singers develop vibrato during voice training. Vibrato is a periodic modulation of the pitch frequency. It reduces the demand for accuracy in pitch frequency, and the singer can use it artistically for expressive purposes [Sundberg87].
[Sundberg87] J. Sundberg, The Science of the Singing Voice, Northern Illinois University Press, 1987.
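As a rough illustration of "periodic modulation of the pitch frequency", the sketch below generates an F0 contour for one sustained note with sinusoidal vibrato. The sinusoidal shape, the 5.5 Hz rate, the 50-cent depth, and the 200 frames-per-second contour rate are assumptions for the example; real vibrato varies by singer and context.

```python
import numpy as np

def vibrato_f0(note_hz, duration_s, rate_hz=5.5, depth_cents=50.0, frame_rate=200):
    """F0 contour of a held note with sinusoidal vibrato.

    The modulation is applied in cents (1200 cents = 1 octave), so the
    contour swings symmetrically around the note on a musical scale.
    """
    t = np.arange(int(duration_s * frame_rate)) / frame_rate
    cents = depth_cents * np.sin(2 * np.pi * rate_hz * t)
    return t, note_hz * 2.0 ** (cents / 1200.0)

# A4 held for two seconds.
t, f0 = vibrato_f0(440.0, 2.0)
```

Because the contour keeps crossing the target note, the average pitch a listener hears stays at the note even though the instantaneous frequency is rarely exactly on it.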
![Page 15: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/15.jpg)
Singing Theory – Singing Formant (Tenor)
• By varying the vocal tract to achieve different voice spectra, singers can produce different voice timbres and styles and achieve efficient sound transmission.
• Singing formant: by adjusting the larynx and the vocal tract near the glottis, the singer creates a singing formant, which lets the voice project farther [Wolfe09].
[Wolfe09] J. Wolfe, M. Garnier, and J. Smith, Voice Acoustics: An Introduction to the Science of Speech and Singing, 2009.
[Sundberg77] J. Sundberg, The Acoustics of the Singing Voice, Scientific American, 1977.
![Page 16: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/16.jpg)
Singing Theory – Resonance Tuning (Soprano)
[Figure: the syllables "La", "Lore", "Loo", "Ler", "Lee" sung at low pitch, at high pitch, and at high pitch with the order changed; sources: [Wolfe12], [Joliveau04], [Wolfe09]]
[Joliveau04] E. Joliveau, J. Smith, and J. Wolfe, “Tuning of vocal tract resonance by sopranos”, Nature, vol. 427, p. 116, Jan. 2004.
[Wolfe09] J. Wolfe, M. Garnier, and J. Smith, Voice Acoustics: An Introduction to the Science of Speech and Singing, 2009.
[Wolfe12] J. Wolfe, Sopranos: Resonance Tuning and Vowel Changes, 2012.
![Page 17: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/17.jpg)
Music
![Page 18: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/18.jpg)
Music Theory
• A piece of music is described by a sequence of notes, which specify the pitch and the relative durations.
• In Western music, 12 notes of fixed frequencies are used per octave. Adjacent notes differ by a semitone (a frequency ratio of $\sqrt[12]{2}$). An octave spans 12 semitones.

Note values (each written as a pitched note or a rest note): whole, half, quarter, eighth, sixteenth, thirty-second, sixty-fourth

| name | C4 | C4# | D4 | D4# | E4 | F4 | F4# | G4 | G4# | A4 | A4# | B4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| freq. (Hz) | 262 | 277 | 294 | 311 | 330 | 349 | 370 | 392 | 415 | 440 | 466 | 494 |
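The frequencies in the table follow from the semitone ratio. As a small check, the sketch below rebuilds the octave from A4 = 440 Hz by stepping up or down in factors of 2^(1/12). The function and variable names are made up for this example.

```python
NOTE_NAMES = ["C4", "C4#", "D4", "D4#", "E4", "F4",
              "F4#", "G4", "G4#", "A4", "A4#", "B4"]

def note_frequency(semitones_from_a4, a4_hz=440.0):
    """Equal-tempered frequency a given number of semitones away from A4."""
    return a4_hz * 2.0 ** (semitones_from_a4 / 12.0)

# A4 is the 10th note of the octave (index 9), so C4 is 9 semitones below it.
table = {name: round(note_frequency(i - 9)) for i, name in enumerate(NOTE_NAMES)}
```

Twelve steps of the ratio multiply out to exactly 2, which is why the next C (C5) is double the frequency of C4.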
![Page 19: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/19.jpg)
Speech to Singing Synthesis
![Page 20: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/20.jpg)
Singing Synthesis
Are these possible?
– Projecting vocal techniques in singing
– Personalizing a voice [Kenmochi10]
– Automating the accompaniment
[Figure: Singing, with elements Prosody, Timbre, Content, Synchronization]
[Kenmochi10] H. Kenmochi, “VOCALOID and Hatsune Miku phenomenon in Japan,” in Proc. InterSinging, pp. 1-4, 2010.
![Page 21: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/21.jpg)
Analysis and Synthesis
[Figure: block diagram with labels Speech, Singing, Analysis, Content, Prosody (F0), Timbre, Synthesis, Synthesized Singing, Music Accompaniment]
![Page 22: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/22.jpg)
Personalized Singing Synthesis
• Input: speech or singing
• Output: time-aligned vocal and music
• Prosody and timbre are changed from those of speech to those of a singing vocal
Siu Wa Lee, Ling Cen, Haizhou Li, Yaozhu Paul Chan, Minghui Dong, Method and system for template-based personalized singing synthesis, US Patent: 20150025892 A1
![Page 23: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/23.jpg)
Example of Speech to Singing Synthesis
[Figure: pitch-versus-time sketches of Speech; Singing; Syncopated Singing (horizontal nudge, time warping); Transposed Singing (shift to a higher pitch, shift to a lower pitch); Harmonized Singing (sum of melody and accompaniment vocals)]
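The pitch-contour operations named in this example can be sketched on a frame-wise F0 array. This is an illustration only: the helper names are hypothetical, the contour is a toy melody, and a value of 0 Hz is assumed to mean an unvoiced frame.

```python
import numpy as np

def transpose(f0_hz, semitones):
    """Shift a contour up (positive) or down (negative) by whole semitones."""
    return f0_hz * 2.0 ** (semitones / 12.0)

def nudge(f0_hz, shift_frames):
    """Delay (positive) or advance (negative) a contour in time.

    Frames that slide in from outside the contour are marked unvoiced (0 Hz).
    """
    out = np.zeros_like(f0_hz)
    if shift_frames > 0:
        out[shift_frames:] = f0_hz[:-shift_frames]
    elif shift_frames < 0:
        out[:shift_frames] = f0_hz[-shift_frames:]
    else:
        out[:] = f0_hz
    return out

melody = np.array([262.0, 262.0, 294.0, 294.0, 330.0, 330.0])  # C4, D4, E4
octave_up = transpose(melody, 12)     # transposed singing, higher pitch
late_entry = nudge(melody, 2)         # syncopated singing, horizontal nudge
harmony_line = transpose(melody, 4)   # a second vocal line four semitones up
```

Harmonized singing would then be produced by synthesizing the melody and the harmony line separately and summing the two waveforms, which is outside the scope of this sketch.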
![Page 24: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/24.jpg)
Personalized Singing Synthesis – voice alignment
[Figure: spectrograms, time (s) versus frequency (Hz, 0 to 8000), of the template singing voice (0 to 2 s), the speaking voice under conversion (0 to 1 s), and the converted singing voice (0 to 2 s); pitch-versus-time sketches of Singing and Speech]
Siu Wa Lee, Ling Cen, Haizhou Li, Yaozhu Paul Chan, Minghui Dong, Method and system for template-based personalized singing synthesis, US Patent: 20150025892 A1
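The slide does not say which alignment algorithm is used. One standard way to align a short spoken utterance with a longer sung template is dynamic time warping (DTW), sketched below as a stand-in. The function name and the one-dimensional toy features are assumptions; a real system would compare spectral feature vectors per frame.

```python
import numpy as np

def dtw_path(a, b):
    """Align two feature sequences (frames x dims) by dynamic time warping.

    Returns the total alignment cost and the list of (index_in_a, index_in_b)
    pairs along the cheapest monotonic path.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end to the start along the cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((cost[i - 1, j - 1], i - 1, j - 1),
                      (cost[i - 1, j], i - 1, j),
                      (cost[i, j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    return cost[n, m], path[::-1]

# Toy example: three spoken frames stretched over six sung frames.
speech = np.array([[0.0], [1.0], [2.0]])
singing = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]])
total_cost, path = dtw_path(speech, singing)
```

In this toy case each spoken frame maps to two sung frames, which mirrors the figure: the 1-second speaking voice is stretched to the 2-second template.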
![Page 25: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/25.jpg)
Generalized Pitch Modeling
• Generalized pitch modeling
– Learns and generates pitch fluctuations note by note [Lee12]
• Various types of fluctuation are implicitly modeled using the same representation
[Figure labels: Content Independent Method; Pitch Modeling]
[Lee12] S. W. Lee, S. T. Ang, M. Dong, and H. Li, “Generalized F0 modeling with absolute and relative pitch features for singing voice synthesis,” in Proc. ICASSP, pp. 429-432, 2012.
![Page 26: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/26.jpg)
Singing with Vibrato
• Context-independent method
– A fixed formula assigns pitch fluctuation patterns
• However, pitch fluctuations are context dependent: for example, overshoot [1] in high-frequency notes differs from that in low-frequency notes, and vibrato occurs more often in long notes.
[1] Overshoot is a deflection exceeding the target note frequency after a note change.
[2] S. W. Y. Lee, M. H. Dong, “Singing voice synthesis: Singer-dependent vibrato modeling and coherent processing of spectral envelope,” INTERSPEECH 2011.
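To make "a fixed formula" concrete, the sketch below shows one possible context-independent rule: the stepwise note targets are passed through an underdamped second-order system, which overshoots after every note change. The slide does not specify its formula, so the second-order model, the 6 Hz natural frequency, the 0.5 damping ratio, and the function name are all assumptions.

```python
import numpy as np

def second_order_f0(target_semitones, frame_rate=200, natural_hz=6.0, damping=0.5):
    """Smooth a stepwise note target with an underdamped second-order system.

    Works on a semitone scale. The same natural frequency and damping are
    used for every note, whatever its pitch or length.
    """
    omega = 2 * np.pi * natural_hz
    dt = 1.0 / frame_rate
    y, v = float(target_semitones[0]), 0.0  # contour value and its velocity
    out = np.empty(len(target_semitones))
    for k, u in enumerate(target_semitones):
        acc = omega ** 2 * (u - y) - 2 * damping * omega * v
        v += acc * dt  # semi-implicit Euler step
        y += v * dt
        out[k] = y
    return out

# Half a second on one note, then a jump of two semitones held for one second.
target = np.concatenate([np.zeros(100), np.full(200, 2.0)])
contour = second_order_f0(target)
```

Because the parameters are fixed, every two-semitone jump produces the same overshoot whether the note is high or low, long or short. That is exactly the limitation the slide points out, and the motivation for learning the fluctuations from data instead.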
![Page 27: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/27.jpg)
Personalized Singing Voice Synthesis
(NDP 2013 Mobile App)
![Page 28: Personalized Singing Synthesis](https://reader030.fdocuments.in/reader030/viewer/2022041000/5ea09b33ee9fda616f2aae4f/html5/thumbnails/28.jpg)
• Thank you!