SRINIVAS DESAI, B. YEGNANARAYANA, KISHORE PRAHALLAD A Framework for Cross-Lingual Voice Conversion...

SRINIVAS DESAI,

B. YEGNANARAYANA,

KISHORE PRAHALLAD

A Framework for Cross-Lingual Voice Conversion

using Artificial Neural Networks


International Institute of Information Technology, Hyderabad, India

Voice Conversion Framework

Conversion of speech of speaker A into speaker B’s voice.

Conversion achieved through transformation of spectral and excitation parameters.

Spectral parameters: MFCC, LPCC, formants, etc.

Excitation parameters: F0, residual, etc.

[Figure: voice conversion block transforming Speaker A's speech into Speaker B's voice]

Modes of VC

Intra-Lingual Voice Conversion (ILVC)

Parallel data: The source and the target speaker record the same set of utterances.

Non-parallel data: The source and the target speaker record different sets of utterances.

Cross-Lingual Voice Conversion (CLVC)

The source speaker and the target speaker record utterances in two different languages.

VC with parallel training data

[Figure: block diagram of VC with parallel data. TRAINING: parallel data from the source and target speakers -> feature extraction -> alignment -> mapping function. TESTING: source speaker -> feature extraction -> conversion -> synthesis -> target speaker.]

Alignment

o Plot of speech files after alignment
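
The slides do not say which alignment method is used. A common choice for parallel utterances is dynamic time warping (DTW), so the sketch below assumes DTW with a Euclidean frame distance. The function name `dtw_align` and the toy features are hypothetical and are not taken from the presentation.

```python
import numpy as np

def dtw_align(src, tgt):
    """Align two feature sequences (frames x dims) with plain DTW.

    Assumption: Euclidean frame distance and the basic step pattern
    (diagonal, vertical, horizontal). Returns a list of (i, j) frame pairs.
    """
    src = np.asarray(src, dtype=float)
    tgt = np.asarray(tgt, dtype=float)
    n, m = len(src), len(tgt)
    # Pairwise Euclidean distance between every source and target frame.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)

    # Accumulated cost with a padded border of infinities.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )

    # Backtrack from the last frame pair to the first.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path

# Toy one-dimensional "features": the target repeats the middle frame.
source_feats = np.array([[0.0], [1.0], [2.0]])
target_feats = np.array([[0.0], [1.0], [1.0], [2.0]])
pairs = dtw_align(source_feats, target_feats)
```

Each returned pair gives one aligned (source frame, target frame) training example for the mapping function.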

VC with non-parallel training data

[Figure: block diagram of VC with non-parallel data. TRAINING: non-parallel data from the source and target speakers -> feature extraction -> clustering -> mapping function. TESTING: source speaker -> feature extraction -> conversion -> synthesis -> target speaker.]

Limitations

Requires parallel/pseudo-parallel data. Hence, training data from both the speakers is always needed.

A model trained on such data can be used to transform speech between the trained speaker pairs only. Hence, an arbitrary speaker’s speech cannot be transformed.

Capturing speaker-specific characteristics (Hypothesis)

[Figure: block diagram of the hypothesis. TRAINING: target speaker data -> formants & bandwidths -> VTLN -> ANN -> MCEP. TESTING: source speaker data -> formants & bandwidths -> VTLN -> ANN.]

Vocal Tract Length Normalization (VTLN)

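$$f'_t = \begin{cases} k f_t & : 0 \le f_t \le F_0 \\ k F_0 + \dfrac{f_N - k F_0}{f_N - F_0}\,(f_t - F_0) & : F_0 < f_t \le f_N \end{cases}$$

$$k = 1 - 0.002\,(F_0^i - f_{mean})$$

$f_t$ - Formant / BW frequency

$F_0^i$ - Pitch value for frame i

$f_N$ - Sampling frequency

[Figure: graph of LP spectrum, before and after VTLN]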
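
A minimal sketch of a pitch-driven, piecewise-linear frequency warp of the kind this slide describes. Several details are assumptions: the warp is linear with slope k up to a breakpoint and then linear up to the upper frequency limit, `f0_mean` is taken to be a mean pitch value, and the sign of the 0.002 term is assumed so that higher pitch compresses the frequency axis. The names `vtln_warp`, `f_break` and `f_upper` are hypothetical.

```python
import numpy as np

def vtln_warp(f_t, f0_frame, f0_mean, f_break, f_upper):
    """Warp formant/bandwidth frequencies with a piecewise-linear function.

    Assumptions (not confirmed by the slides):
      - k = 1 - 0.002 * (f0_frame - f0_mean)
      - frequencies up to f_break are scaled by k
      - above f_break the warp is linear and maps f_upper onto itself
    """
    f_t = np.asarray(f_t, dtype=float)
    k = 1.0 - 0.002 * (f0_frame - f0_mean)
    lower = k * f_t
    slope = (f_upper - k * f_break) / (f_upper - f_break)
    upper = k * f_break + slope * (f_t - f_break)
    return np.where(f_t <= f_break, lower, upper)

# Hypothetical values: four formants of one frame, 170 Hz pitch, 120 Hz mean pitch.
formants = np.array([700.0, 1200.0, 2500.0, 3500.0])
warped = vtln_warp(formants, f0_frame=170.0, f0_mean=120.0,
                   f_break=6000.0, f_upper=8000.0)
```

With these values k is 0.9, so every formant below the breakpoint is scaled down by 10 percent. The second segment keeps the warp continuous at the breakpoint and leaves the upper limit unchanged.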

Artificial Neural Networks (ANN)

An ANN consists of interconnected processing nodes.

Each node represents a model of an artificial neuron.

Each interconnection between nodes has a weight associated with it.

Different topologies perform different pattern recognition tasks:

- Feedforward networks for pattern mapping

- Feedback networks for pattern association

This work uses feedforward networks for mapping the source speaker’s spectral features onto the target speaker’s spectral space.

[Figure: feedforward network mapping input X to output Y]
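
A minimal sketch of the forward pass of a feedforward network used for pattern mapping. The layer sizes 42, 75, 75, 25 are borrowed from one architecture listed later in the slides. The tanh hidden units, linear output layer, random initialisation, and the function names `init_network` and `forward` are assumptions made for illustration.

```python
import numpy as np

def init_network(layer_sizes, seed=0):
    """Create (weights, bias) pairs for consecutive layers.

    Assumption: small random Gaussian weights and zero biases.
    """
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

def forward(params, x):
    """Map input frames x (frames x input_dim) to output frames.

    Assumption: tanh on hidden layers, linear output layer.
    """
    h = np.asarray(x, dtype=float)
    for idx, (weights, bias) in enumerate(params):
        h = h @ weights + bias
        if idx < len(params) - 1:
            h = np.tanh(h)
    return h

# Ten frames of 42-dimensional input mapped to 25-dimensional output.
network = init_network([42, 75, 75, 25])
inputs = np.random.default_rng(1).normal(size=(10, 42))
outputs = forward(network, inputs)
```

Training would adjust the weights to minimise the error between `outputs` and the target speaker's features, for example by backpropagation, which is omitted here.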

Hypothesis Testing

Three types of experiments:

Use of parallel data (ILVC): Formant-related features from the source speaker and MCEPs from the target speaker.

Use of non-parallel data (ILVC): Both the formant-related features and MCEPs from the target speaker.

CLVC: Both the formant-related features and MCEPs from the target speaker.

Evaluation

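• Objective

- Mel-Cepstral Distortion (MCD)

$$\mathrm{MCD} = \frac{10}{\ln 10}\sqrt{2\sum_{i=1}^{25}\left(mc_i^{t} - mc_i^{e}\right)^2}$$

where $mc_i^{t}$ and $mc_i^{e}$ are the i-th target and estimated mel-cepstral coefficients.

• Subjective

- Mean Opinion Score (5: excellent, 4: good, 3: fair, 2: poor, 1: bad)

- Similarity score (5: same speakers, 1: different speakers)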
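
A minimal sketch of the Mel-Cepstral Distortion computation for time-aligned frames of 25 mel-cepstral coefficients. Averaging the per-frame distortion over all frames is an assumption, since the slides give only the per-frame formula. The function name `mel_cepstral_distortion` is hypothetical.

```python
import numpy as np

def mel_cepstral_distortion(mc_target, mc_estimated):
    """Return the mean MCD in dB over aligned frames.

    Inputs are arrays of shape (frames, coefficients) with matching shapes.
    Assumption: the utterance-level score is the mean of per-frame values.
    """
    mc_target = np.asarray(mc_target, dtype=float)
    mc_estimated = np.asarray(mc_estimated, dtype=float)
    diff = mc_target - mc_estimated
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=-1))
    return float(np.mean(per_frame))

# Toy data: 100 frames, 25 coefficients, estimate offset by a small error.
rng = np.random.default_rng(0)
target = rng.normal(size=(100, 25))
estimate = target + 0.1 * rng.normal(size=(100, 25))
mcd_db = mel_cepstral_distortion(target, estimate)
```

Lower values mean the converted spectra are closer to the target speaker's spectra. Identical inputs give 0 dB.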

Database

ILVC: CMU ARCTIC databases

SLT, CLB (US Female); BDL, RMS (US Male); JMK (Canadian Male); AWB (Scottish Male); KSP (Indian Male)

CLVC: NK (Telugu Female), PRA (Hindi Female)

ILVC with parallel training data

No. | Features | ANN architecture | MCD [dB]
1 | 4 F | 4L 50N 12L 50N 25L | 9.786
2 | 4 F + 4 B | 8L 16N 4L 16N 25L | 9.557
3 | 4 F + 4 B + UVN | 8L 16N 4L 16N 25L | 6.639
4 | 4 F + 4 B + Δ + ΔΔ + UVN | 24L 50N 50N 25L | 6.352
5 | F0 + 4 F + 4 B + UVN | 9L 18N 3L 18N 25L | 6.713
6 | F0 + 4 F + 4 B + Δ + ΔΔ + UVN | 27L 50N 50N 25L | 6.375
7 | F0 + Prob. of voicing + 4 F + 4 B + Δ + ΔΔ + UVN | 30L 50N 50N 25L | 6.105
8 | F0 + Prob. of voicing + 6 F + 6 B + Δ + ΔΔ + UVN | 42L 75N 75N 25L | 5.992
9 | (F0 + Prob. of voicing + 6 F + 6 B + Δ + ΔΔ + UVN) + (3L3R MCEP to MCEP error correction) | (42L 75N 75N 25L) + (175L 525N 525N 175L) | 5.615
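
The input sizes in the table follow from the feature lists: 14 static features (F0, probability of voicing, 6 formants, 6 bandwidths) with Δ and ΔΔ give 42 inputs, and 25 MCEPs stacked over 7 frames give 175. The sketch below reproduces this arithmetic. It assumes that "3L3R" means three left and three right context frames and that deltas are simple numerical differences, neither of which the slides spell out. The function names are hypothetical.

```python
import numpy as np

def add_deltas(static):
    """Append delta and delta-delta features to static features.

    Assumption: deltas are central differences computed by np.gradient.
    """
    delta = np.gradient(static, axis=0)
    delta_delta = np.gradient(delta, axis=0)
    return np.hstack([static, delta, delta_delta])

def stack_context(frames, left=3, right=3):
    """Concatenate each frame with its left and right neighbours.

    Assumption: utterance edges are padded by repeating the edge frame.
    """
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    n = len(frames)
    return np.hstack([padded[k:k + n] for k in range(left + right + 1)])

rng = np.random.default_rng(0)
static_feats = rng.normal(size=(20, 14))   # F0, voicing, 6 F, 6 B per frame
mcep = rng.normal(size=(20, 25))           # 25 MCEPs per frame

ann_input = add_deltas(static_feats)       # 14 * 3 = 42 columns
correction_input = stack_context(mcep)     # 25 * 7 = 175 columns
```

The same arithmetic gives 24, 27 and 30 inputs for rows 4, 6 and 7 of the table.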

ILVC with non-parallel training data

Speaker pair | MCD [dB]
SLT to SLT | 3.966
BDL to SLT | 6.153
RMS to SLT | 6.650
CLB to SLT | 5.405
JMK to SLT | 6.754
AWB to SLT | 6.758
KSP to SLT | 7.142

Speaker pair | MCD [dB]
BDL to BDL | 4.263
SLT to BDL | 6.887
RMS to BDL | 6.565
CLB to BDL | 6.444
JMK to BDL | 7.023
AWB to BDL | 7.017
KSP to BDL | 7.444

Target speaker | MOS | Similarity score
BDL | 2.926 | 2.715
SLT | 2.731 | 2.47

CLVC

Source speaker | Target speaker | MOS | Similarity score
NK (Telugu) | BDL (English) | 2.88 | 2.77
PRA (Hindi) | BDL (English) | 2.62 | 2.15

Conclusion

The proposed algorithm can be used to capture speaker-specific characteristics.

Hence, it can be used in both ILVC and CLVC tasks.

Thank You
