Neural Voice Cloning with a Few Samples
Sercan O. Arik, Jitong Chen, Kainan Peng*, Wei Ping, Yanqi Zhou
Motivations
• Text-to-speech (TTS) models can be conditioned on text and speaker identity.
  • Text: linguistic information, content of the generated speech.
  • Speaker identity: speaker information (accent, pitch, speech rate…).
• Limitations:
  • Can only generate speech for speakers observed during training.
  • Require lots of speech samples per speaker (e.g., Deep Voice 2).
Voice Cloning
• Voice cloning: synthesize the voices of new speakers from a few speech samples (few-shot generative model).
• Applications: personalized speech interfaces, content creation, assistive technology…
• Challenges:
  • Generalization: learn the voice of a new speaker.
  • Efficiency: extract the speaker characteristics from a few speech samples.
  • Computational cost: cloning with low latency and small footprint.
• Two approaches:
  • Speaker adaptation.
  • Speaker encoding.
Speaker Adaptation
• Fine-tune a pre-trained multi-speaker model for a new speaker.
• Training data: a few text and audio pairs.
• Two options for speaker adaptation:
  • Fine-tune the whole model.
  • Fine-tune the speaker embedding only.
Speaker Adaptation Analysis

| Speaker adaptation approach | Embedding-only | Whole-model |
| --- | --- | --- |
| Cloning time | 8 h | 5 min |
| # of parameters per speaker | 128 | 25 million |
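
To make the footprint side of this trade-off concrete, here is a minimal sketch that turns the per-speaker parameter counts from the table into storage sizes. The function names and the 4-bytes-per-parameter (float32) storage are assumptions made for illustration; the slides do not say how the weights are stored.

```python
# Per-speaker parameter counts taken from the table above.
PARAMS_PER_SPEAKER = {
    "embedding_only": 128,        # only the new speaker embedding is learned
    "whole_model": 25_000_000,    # every weight is fine-tuned for the speaker
}

# Assumption: parameters are stored as 32-bit floats (4 bytes each).
BYTES_PER_PARAM = 4


def per_speaker_parameters(mode: str) -> int:
    """Return the number of parameters that must be kept per cloned speaker."""
    if mode not in PARAMS_PER_SPEAKER:
        raise ValueError(f"unknown adaptation mode: {mode}")
    return PARAMS_PER_SPEAKER[mode]


def per_speaker_bytes(mode: str) -> int:
    """Return the storage needed per cloned speaker under the float32 assumption."""
    return per_speaker_parameters(mode) * BYTES_PER_PARAM


embedding_bytes = per_speaker_bytes("embedding_only")  # 512 bytes
whole_model_bytes = per_speaker_bytes("whole_model")   # 100,000,000 bytes
```

Under these assumptions, embedding-only adaptation stores about half a kilobyte per speaker, while whole-model adaptation stores roughly 100 MB per speaker, in exchange for a much shorter cloning time.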
Speaker Encoding
• Directly predict a new speaker embedding for a multi-speaker model.
• Train a speaker encoder with audio and speaker embedding pairs.
• Cloning time: a few seconds, more favorable for low-resource deployment.
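
The idea of predicting an embedding directly from audio can be sketched in a few lines. This is a toy stand-in, not the encoder from the talk: the mean pooling, the single linear projection, and the L1 training loss are all assumptions chosen for brevity, and the feature frames are hypothetical placeholders for real audio features.

```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a list of equal-length vectors element by element."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def encode_speaker(samples: List[List[Vector]], weights: List[Vector]) -> Vector:
    """Predict a speaker embedding from a few audio samples.

    Each sample is a list of feature frames. Frames are pooled over time,
    the per-sample summaries are averaged across samples, and the result is
    projected by `weights` (one row per embedding dimension).
    """
    per_sample = [mean_pool(frames) for frames in samples]
    summary = mean_pool(per_sample)
    return [sum(w * x for w, x in zip(row, summary)) for row in weights]


def l1_loss(predicted: Vector, target: Vector) -> float:
    """Mean absolute error between a predicted and a target embedding."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# Two hypothetical samples with 2-dimensional feature frames.
samples = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # projects 2 features to 3 dims
embedding = encode_speaker(samples, weights)     # [3.5, 4.5, 8.0]
loss = l1_loss(embedding, [3.0, 5.0, 8.0])       # target from the multi-speaker model
```

Cloning with such an encoder is a single forward pass, with no fine-tuning, which is why cloning time drops to seconds.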
Results
• Vocoder: classical Griffin-Lim algorithm.
• Demo website: http://audiodemos.github.io

| Mean Opinion Score (MOS) | Speaker adaptation (embedding-only) | Speaker adaptation (whole-model) | Speaker encoding |
| --- | --- | --- | --- |
| Naturalness (5-scale) | 2.67 | 3.16 | 2.99 |
| Similarity (4-scale) | 2.95 | 3.16 | 2.85 |
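
For readers unfamiliar with the metric, a mean opinion score is the arithmetic mean of listener ratings on a fixed scale. The sketch below shows only that computation. The ratings are made up for illustration and are not data from the evaluation, and the assumption that both scales start at 1 is not stated on the slide.

```python
from typing import Sequence


def mean_opinion_score(ratings: Sequence[int], scale_max: int) -> float:
    """Average listener ratings, assuming a scale from 1 to `scale_max`."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > scale_max for r in ratings):
        raise ValueError("rating outside the 1..scale_max range")
    return sum(ratings) / len(ratings)


# Hypothetical ratings from four listeners on a 5-point naturalness scale.
example_mos = mean_opinion_score([3, 4, 2, 3], scale_max=5)  # 3.0
```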
Voice Morphing via Embedding Manipulation
• BritishMale + AveragedFemale - AveragedMale = BritishFemale
• BritishMale + AveragedAmerican - AveragedBritish = AmericanMale
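
The arithmetic in these two bullets can be sketched with plain vectors. The 2-dimensional embeddings below are invented toy values (first dimension loosely standing for gender, second for accent); real speaker embeddings are learned and have no such labeled axes.

```python
from typing import List, Sequence

Vector = List[float]


def average(embeddings: Sequence[Vector]) -> Vector:
    """Element-wise mean of a group of speaker embeddings."""
    n = len(embeddings)
    return [sum(dim) / n for dim in zip(*embeddings)]


def morph(source: Vector, add: Vector, subtract: Vector) -> Vector:
    """Compute source + add - subtract, element by element."""
    return [s + a - b for s, a, b in zip(source, add, subtract)]


# Toy embeddings: hypothetical values, not taken from any trained model.
british_male = [1.0, 1.0]
female_speakers = [[-1.0, 1.0], [-1.0, -1.0]]
male_speakers = [[1.0, 1.0], [1.0, -1.0]]

# BritishMale + AveragedFemale - AveragedMale
british_female = morph(british_male, average(female_speakers), average(male_speakers))
# In this toy space the result is [-1.0, 1.0]: gender flipped, accent kept.
```

The morphed embedding would then be fed to the multi-speaker model in place of a learned speaker embedding.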
Thank you!
Welcome to our poster, and listen to the samples!
Today, Session B, #91