SRINIVAS DESAI,
B. YEGNANARAYANA,
KISHORE PRAHALLAD
A Framework for Cross-Lingual Voice Conversion
using Artificial Neural Networks
International Institute of Information Technology, Hyderabad, India
Voice Conversion Framework
Conversion of speaker A’s speech into speaker B’s voice.
Conversion is achieved by transforming spectral and excitation parameters.
Spectral parameters: MFCC, LPCC, formants, etc.
Excitation parameters: F0, residual, etc.
[Figure: voice conversion block taking speaker A’s speech as input and producing speaker B’s voice]
Modes of VC
Intra-Lingual Voice Conversion (ILVC)
  Parallel data: The source and the target speaker record the same set of utterances.
  Non-parallel data: The source and the target speaker record different sets of utterances.
Cross-Lingual Voice Conversion (CLVC)
  The source speaker and the target speaker record utterances in two different languages.
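VC with parallel training data
[Figure: block diagram of VC with parallel data. TRAINING: features are extracted from the source and target speakers’ parallel recordings, aligned, and used to learn a mapping function. TESTING: features are extracted from the input speech, converted with the mapping function, and synthesized as the target speaker’s voice.]
Alignment
[Figure: plot of speech files after alignment]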
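The slides do not name the alignment method. As a minimal sketch, assuming dynamic time warping (DTW) over per-frame feature vectors with Euclidean distance, two parallel utterances could be aligned as follows. The names `frame_distance` and `dtw_align` and the toy frames are hypothetical, not the authors' code.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(source, target):
    """Align two frame sequences with DTW.

    Returns (total_cost, path), where path is a list of
    (source_index, target_index) pairs from start to end.
    """
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning the first i source
    # frames with the first j target frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy one-dimensional "features": the target repeats one frame.
src = [[0.0], [1.0], [2.0]]
tgt = [[0.0], [1.0], [1.0], [2.0]]
total, pairs = dtw_align(src, tgt)
```

Each pair in `pairs` gives one aligned (source frame, target frame) training example for the mapping function.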
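VC with non-parallel training data
[Figure: block diagram of VC with non-parallel data. TRAINING: features are extracted from the source and target speakers’ non-parallel recordings and clustered to learn a mapping function. TESTING: features are extracted from the input speech, converted with the mapping function, and synthesized as the target speaker’s voice.]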
Limitations
Both approaches require parallel or pseudo-parallel data. Hence, training data from both speakers is always needed.
A model trained on such data can transform speech only between the speaker pair it was trained on. Hence, speech from an arbitrary speaker cannot be transformed.
Capturing speaker-specific characteristics (Hypothesis)
[Figure: block diagram. TRAINING: target speaker data -> formants & B.Ws -> VTLN -> ANN, trained to output the target speaker’s MCEPs. TESTING: source speaker data -> formants & B.Ws -> VTLN -> ANN.]
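Vocal Tract Length Normalization (VTLN)
$$
f'_t =
\begin{cases}
k\,f_t, & 0 \le f_t \le F_0 \\
k\,F_0 + \dfrac{f_N - k\,F_0}{f_N - F_0}\,(f_t - F_0), & F_0 < f_t \le f_N
\end{cases}
$$
$$
k = 1 - 0.002\,\left(F_0^{i} - f_{\mathrm{mean}}\right)
$$
$f_t$ - Formant / BW Frequency
$F_0^{i}$ - Pitch value for frame $i$
$f_N$ - Sampling Frequency
[Figure: graph of LP spectrum, before and after VTLN]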
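As a rough illustration of pitch-based VTLN, the sketch below assumes a warping factor that decreases linearly with the frame's pitch relative to the mean pitch (using the 0.002 slope shown on the slide) and a piecewise-linear warp that leaves the upper band edge fixed. The function names, the breakpoint value, and the sign convention are assumptions made for illustration, not the authors' implementation.

```python
def warping_factor(f0_frame, f0_mean, slope=0.002):
    """Per-frame warping factor k.

    Assumption: k falls below 1 when the frame's pitch is above the mean,
    so that formants of higher-pitched speech are scaled down.
    """
    return 1.0 - slope * (f0_frame - f0_mean)


def warp_frequency(f, k, breakpoint_hz, f_max):
    """Piecewise-linear warp of one formant or bandwidth frequency.

    Below the breakpoint the frequency is scaled by k. Above it, a second
    line segment joins (breakpoint, k * breakpoint) to (f_max, f_max), so
    the upper band edge maps to itself.
    """
    if f <= breakpoint_hz:
        return k * f
    slope = (f_max - k * breakpoint_hz) / (f_max - breakpoint_hz)
    return k * breakpoint_hz + slope * (f - breakpoint_hz)


# Hypothetical values: a 200 Hz frame against a 150 Hz mean pitch,
# with an illustrative 4 kHz breakpoint and an 8 kHz upper edge.
k = warping_factor(200.0, 150.0)
formants = [700.0, 1200.0, 2600.0, 5000.0]
warped = [warp_frequency(f, k, 4000.0, 8000.0) for f in formants]
```

The two segments meet at the breakpoint, so the warp is continuous and monotonic for any positive `k` with `k * breakpoint_hz < f_max`.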
Artificial Neural Networks (ANN)
An ANN consists of interconnected processing nodes.
Each node represents a model of an artificial neuron.
Each interconnection between nodes has a weight associated with it.
Different topologies perform different pattern recognition tasks:
  Feedforward networks for pattern mapping
  Feedback networks for pattern association
This work uses feedforward networks to map the source speaker’s spectral features onto the target speaker’s spectral space.
[Figure: feedforward network mapping input X to output Y]
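To make the pattern-mapping idea concrete, here is a minimal sketch of a feedforward pass: each layer computes weighted sums of its inputs plus a bias and optionally applies a nonlinearity. The tanh hidden layer, the linear output layer, and the tiny hand-picked weights are assumptions for illustration only. Training (for example by backpropagation) is not shown.

```python
import math


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer.

    weights holds one row per output node; each row has one weight per input.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(activation(total) if activation else total)
    return outputs


def forward(x, layers):
    """Pass input vector x through a list of (weights, biases, activation)."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x


# Hypothetical 2-2-1 network: tanh hidden layer, linear output layer.
network = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], math.tanh),
    ([[1.0, 1.0]], [0.5], None),
]
y = forward([1.0, -1.0], network)
```

In the voice conversion setting, `x` would be one frame of source-side features and the output one frame of target-side spectral features.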
Hypothesis Testing
Three types of experiments:
  Use of parallel data (ILVC): formant-related features from the source speaker and MCEPs from the target speaker.
  Use of non-parallel data (ILVC): both the formant-related features and the MCEPs from the target speaker.
  CLVC: both the formant-related features and the MCEPs from the target speaker.
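Evaluation
• Objective
– Mel-Cepstral Distortion (MCD)
$$
\mathrm{MCD} = \frac{10}{\ln 10}\sqrt{2\sum_{i=1}^{25}\left(mc_i^{t} - mc_i^{e}\right)^{2}}
$$
where $mc_i^{t}$ and $mc_i^{e}$ are the $i$-th target and estimated mel-cepstral coefficients.
• Subjective
– Mean Opinion Score (5: excellent, 4: good, 3: fair, 2: poor, 1: bad)
– Similarity Scores (5: same speaker, 1: different speakers)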
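The MCD measure can be sketched for a single frame as below, taking the target and estimated mel-cepstral vectors as plain lists. The function name is hypothetical, and reporting the mean over all frames of an utterance is an assumption about how per-frame values are combined.

```python
import math


def mel_cepstral_distortion(target, estimated):
    """MCD in dB between two mel-cepstral vectors of the same length."""
    if len(target) != len(estimated):
        raise ValueError("vectors must have the same number of coefficients")
    squared_error = sum((t - e) ** 2 for t, e in zip(target, estimated))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared_error)


def mean_mcd(target_frames, estimated_frames):
    """Average per-frame MCD over aligned frame sequences (assumed convention)."""
    values = [
        mel_cepstral_distortion(t, e)
        for t, e in zip(target_frames, estimated_frames)
    ]
    return sum(values) / len(values)


# Toy 25-dimensional frames differing by 1.0 in a single coefficient.
target_frame = [1.0] + [0.0] * 24
estimated_frame = [0.0] * 25
mcd = mel_cepstral_distortion(target_frame, estimated_frame)
```

Lower values mean the converted spectrum is closer to the target speaker's spectrum.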
Database
ILVC: CMU ARCTIC databases
  SLT, CLB (US Female)
  BDL, RMS (US Male)
  JMK (Canadian Male)
  AWB (Scottish Male)
  KSP (Indian Male)
CLVC
  NK (Telugu Female)
  PRA (Hindi Female)
ILVC with parallel training data
No. | Features | ANN architecture | MCD [dB]
1 | 4 F | 4L 50N 12L 50N 25L | 9.786
2 | 4 F + 4 B | 8L 16N 4L 16N 25L | 9.557
3 | 4 F + 4 B + UVN | 8L 16N 4L 16N 25L | 6.639
4 | 4 F + 4 B + Δ + ΔΔ + UVN | 24L 50N 50N 25L | 6.352
5 | F0 + 4 F + 4 B + UVN | 9L 18N 3L 18N 25L | 6.713
6 | F0 + 4 F + 4 B + Δ + ΔΔ + UVN | 27L 50N 50N 25L | 6.375
7 | F0 + Prob. of voicing + 4 F + 4 B + Δ + ΔΔ + UVN | 30L 50N 50N 25L | 6.105
8 | F0 + Prob. of voicing + 6 F + 6 B + Δ + ΔΔ + UVN | 42L 75N 75N 25L | 5.992
9 | (F0 + Prob. of voicing + 6 F + 6 B + Δ + ΔΔ + UVN) + (3L3R MCEP to MCEP error correction) | (42L 75N 75N 25L) + (175L 525N 525N 175L) | 5.615
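The architecture strings in the table can be read mechanically. The sketch below assumes each token is a layer size followed by a letter for the layer type (taken here to mean L for linear and N for nonlinear units, which the slides do not spell out) and counts the weights of a fully connected network. The helper names and the inclusion of bias terms are assumptions.

```python
import re


def parse_architecture(spec):
    """Turn a string such as '42L 75N 75N 25L' into [(42, 'L'), (75, 'N'), ...]."""
    layers = []
    for token in spec.split():
        match = re.fullmatch(r"(\d+)([LN])", token)
        if match is None:
            raise ValueError(f"unrecognized layer token: {token!r}")
        layers.append((int(match.group(1)), match.group(2)))
    return layers


def count_parameters(layers, with_bias=True):
    """Number of trainable values in a fully connected network of these layers."""
    total = 0
    for (n_in, _), (n_out, _) in zip(layers, layers[1:]):
        total += n_in * n_out
        if with_bias:
            total += n_out  # one bias per node in the receiving layer (assumed)
    return total


layers = parse_architecture("42L 75N 75N 25L")
n_params = count_parameters(layers)
```

Under these assumptions, the network in row 8 has 42 inputs, two hidden layers of 75 nodes, and 25 outputs, matching the 25 mel-cepstral coefficients used in the MCD measure.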
ILVC with non-parallel training data

Speaker pair | MCD [dB]
SLT to SLT | 3.966
BDL to SLT | 6.153
RMS to SLT | 6.650
CLB to SLT | 5.405
JMK to SLT | 6.754
AWB to SLT | 6.758
KSP to SLT | 7.142

Speaker pair | MCD [dB]
BDL to BDL | 4.263
SLT to BDL | 6.887
RMS to BDL | 6.565
CLB to BDL | 6.444
JMK to BDL | 7.023
AWB to BDL | 7.017
KSP to BDL | 7.444

Target Speaker | MOS | Similarity Score
BDL | 2.926 | 2.715
SLT | 2.731 | 2.47
CLVC
Source Speaker | Target Speaker | MOS | Similarity Score
NK (Telugu) | BDL (English) | 2.88 | 2.77
PRA (Hindi) | BDL (English) | 2.62 | 2.15
Conclusion
The proposed algorithm could be used to capture speaker-specific characteristics.
Hence, it can be used in both ILVC and CLVC tasks.
Thank You