SRINIVAS DESAI, B. YEGNANARAYANA, KISHORE PRAHALLAD A Framework for Cross-Lingual Voice Conversion...


SRINIVAS DESAI, B. YEGNANARAYANA, KISHORE PRAHALLAD

A Framework for Cross-Lingual Voice Conversion using Artificial Neural Networks

International Institute of Information Technology, Hyderabad, India


Voice Conversion Framework

• Voice conversion transforms the speech of speaker A so that it sounds like speaker B’s voice.
• Conversion is achieved by transforming spectral and excitation parameters.
• Spectral parameters: MFCC, LPCC, formants, etc.
• Excitation parameters: F0, residual, etc.

[Figure: voice conversion block mapping Speaker A to Speaker B]


Modes of VC

• Intra-Lingual Voice Conversion (ILVC)
  – Parallel data: the source and the target speaker record the same set of utterances.
  – Non-parallel data: the source and the target speaker record different sets of utterances.
• Cross-Lingual Voice Conversion (CLVC)
  – The source speaker and the target speaker record utterances in two different languages.


VC with parallel training data

[Figure: block diagram. TRAINING: parallel data (source speaker, target speaker), feature extraction for each speaker, alignment, mapping function. TESTING: feature extraction, conversion, synthesis, target speaker.]


Alignment

[Figure: plot of speech files after alignment]
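The slides do not name the alignment algorithm. Dynamic time warping (DTW) is a common choice for parallel utterances and is assumed here purely for illustration. The function name `dtw_align` and the Euclidean frame distance are likewise assumptions. The sketch pairs source frames with target frames so that a mapping function could be trained on the resulting frame pairs.

```python
import numpy as np

def dtw_align(src, tgt):
    """Align two feature sequences (frames x dims) with plain DTW.

    Returns (source_index, target_index) pairs along the minimum-cost
    warping path. Illustrative only; the slides do not specify the method.
    """
    src = np.asarray(src, dtype=float)
    tgt = np.asarray(tgt, dtype=float)
    n, m = len(src), len(tgt)
    # Euclidean distance between every source frame and every target frame.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    # cost[i, j] = cheapest cumulative cost of aligning src[:i] with tgt[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1],  # advance both sequences
                cost[i - 1, j],      # advance the source only
                cost[i, j - 1],      # advance the target only
            )
    # Backtrack from the last frame pair to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Calling `dtw_align(source_features, target_features)` on two utterances of the same sentence yields index pairs that can be used to build (source frame, target frame) training examples.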


VC with non-parallel training data

[Figure: block diagram. TRAINING: non-parallel data (source speaker, target speaker), feature extraction and clustering for each speaker, mapping function. TESTING: feature extraction, conversion, synthesis, target speaker.]


Limitations

• Both approaches require parallel or pseudo-parallel data. Hence, training data from both speakers is always needed.
• A model trained on such data can transform speech only between the speaker pair it was trained on. Hence, the speech of an arbitrary speaker cannot be transformed.


Capturing speaker-specific characteristics (Hypothesis)

[Figure: block diagram. TRAINING: target speaker data → formants & bandwidths → VTLN → ANN → MCEP. TESTING: source speaker data → formants & bandwidths → VTLN → ANN.]


Vocal Tract Length Normalization (VTLN)

[Equation: piecewise frequency warping that maps f_t to f'_t using a warping factor k; k is computed per frame from F0_i and f_mean with the constants 1 and 0.002]

where
f_t: formant / bandwidth frequency
F0_i: pitch value for frame i
f_N: sampling frequency

[Figure: LP spectrum before and after VTLN]


Artificial Neural Networks (ANN)

• An ANN consists of interconnected processing nodes.
  – Each node represents a model of an artificial neuron.
  – Each interconnection between nodes has a weight associated with it.
• Different topologies perform different pattern recognition tasks.
  – Feedforward networks for pattern mapping
  – Feedback networks for pattern association
• This work uses feedforward networks for mapping the source speaker’s spectral features onto the target speaker’s spectral space.

[Figure: feedforward network with input layer X (nodes 1, 2, ..., N), hidden layers (nodes 1, ..., M) and output layer Y (nodes 1, 2, ..., N)]
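As a rough illustration of pattern mapping with a feedforward network, the sketch below runs a forward pass that maps input feature frames X to output frames Y. The layer sizes, the tanh hidden activation, the random initialisation, and the names `init_mlp` and `forward` are assumptions made for this example. The slide itself only states that a feedforward network is used.

```python
import numpy as np

def init_mlp(layer_sizes, seed=0):
    """Create (weight, bias) pairs for consecutive layers (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

def forward(params, x):
    """Map input frames x (frames x n_in) to outputs (frames x n_out)."""
    h = np.asarray(x, dtype=float)
    for k, (w, b) in enumerate(params):
        h = h @ w + b
        # Hidden layers are nonlinear (tanh assumed); the output layer stays linear.
        if k < len(params) - 1:
            h = np.tanh(h)
    return h

# Example: 8 input features per frame mapped to 25 output coefficients.
params = init_mlp([8, 16, 16, 25])
y = forward(params, np.zeros((3, 8)))
print(y.shape)  # (3, 25)
```

Training such a network would adjust the weights so that `forward(params, source_features)` approaches the target speaker's features; the training procedure is not shown because the slides do not describe it.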


Hypothesis Testing

Three types of experiments:

• Use of parallel data (ILVC): formant-related features from the source speaker and MCEPs from the target speaker.
• Use of non-parallel data (ILVC): both the formant-related features and the MCEPs from the target speaker.
• CLVC: both the formant-related features and the MCEPs from the target speaker.


Evaluation

• Objective: Mel-Cepstral Distortion (MCD)

  $$\mathrm{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{i=1}^{25} \left( mc_i^{t} - mc_i^{e} \right)^2}$$

• Subjective
  – Mean Opinion Score (5: excellent, 4: good, 3: fair, 2: poor, 1: bad)
  – Similarity score (5: same speaker, 1: different speakers)
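A minimal sketch of the objective measure on this slide, assuming the form MCD = (10 / ln 10) · sqrt(2 · Σ (mc_i^t − mc_i^e)²) over the 25 coefficients of a frame, with t and e read as the target and the converted (estimated) mel-cepstra. Averaging over frames and the function name `mel_cepstral_distortion` are assumptions for illustration.

```python
import numpy as np

def mel_cepstral_distortion(mc_target, mc_estimated):
    """Mean MCD in dB over time-aligned frames (frames x coefficients)."""
    t = np.atleast_2d(np.asarray(mc_target, dtype=float))
    e = np.atleast_2d(np.asarray(mc_estimated, dtype=float))
    if t.shape != e.shape:
        raise ValueError("target and estimated MCEPs must have the same shape")
    # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum_i (mc_t - mc_e)^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum((t - e) ** 2, axis=1))
    # Averaging across frames is an assumption; the slide shows one frame's formula.
    return float(per_frame.mean())
```

Identical inputs give 0 dB, and lower values indicate that the converted spectra are closer to the target.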


Database

• ILVC: CMU ARCTIC databases
  – SLT, CLB (US female)
  – BDL, RMS (US male)
  – JMK (Canadian male)
  – AWB (Scottish male)
  – KSP (Indian male)
• CLVC
  – NK (Telugu female)
  – PRA (Hindi female)


ILVC with parallel training data

No.  Features                                             ANN architecture            MCD [dB]
1    4 F                                                  4L 50N 12L 50N 25L          9.786
2    4 F + 4 B                                            8L 16N 4L 16N 25L           9.557
3    4 F + 4 B + UVN                                      8L 16N 4L 16N 25L           6.639
4    4 F + 4 B + Δ + ΔΔ + UVN                             24L 50N 50N 25L             6.352
5    F0 + 4 F + 4 B + UVN                                 9L 18N 3L 18N 25L           6.713
6    F0 + 4 F + 4 B + Δ + ΔΔ + UVN                        27L 50N 50N 25L             6.375
7    F0 + Prob. of voicing + 4 F + 4 B + Δ + ΔΔ + UVN     30L 50N 50N 25L             6.105
8    F0 + Prob. of voicing + 6 F + 6 B + Δ + ΔΔ + UVN     42L 75N 75N 25L             5.992
9    (F0 + Prob. of voicing + 6 F + 6 B + Δ + ΔΔ + UVN)   (42L 75N 75N 25L)           5.615
     + (3L3R MCEP to MCEP error correction)               + (175L 525N 525N 175L)
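The input sizes in this table can be checked with a little arithmetic. The sketch below assumes that appending Δ and ΔΔ triples the static feature count, that "3L3R" means stacking 3 left and 3 right context frames around the current 25-dimensional MCEP frame, and that the first and last numbers of an architecture string are the input and output sizes. These readings and the helper names are assumptions, not statements from the slide.

```python
def with_deltas(n_static):
    """Static features plus delta and delta-delta (assumed to triple the size)."""
    return 3 * n_static

def with_context(n_per_frame, left, right):
    """Stack `left` and `right` neighbouring frames around the current frame."""
    return n_per_frame * (left + 1 + right)

def layer_sizes(architecture):
    """Parse e.g. '42L 75N 75N 25L' into [42, 75, 75, 25] (unit-type letter dropped)."""
    return [int(token[:-1]) for token in architecture.split()]

# Row 4: 4 F + 4 B with deltas gives 24 inputs.
assert with_deltas(4 + 4) == layer_sizes("24L 50N 50N 25L")[0]
# Row 6: F0 + 4 F + 4 B with deltas gives 27 inputs.
assert with_deltas(1 + 4 + 4) == layer_sizes("27L 50N 50N 25L")[0]
# Row 7: F0 + prob. of voicing + 4 F + 4 B with deltas gives 30 inputs.
assert with_deltas(1 + 1 + 4 + 4) == layer_sizes("30L 50N 50N 25L")[0]
# Row 8: F0 + prob. of voicing + 6 F + 6 B with deltas gives 42 inputs.
assert with_deltas(1 + 1 + 6 + 6) == layer_sizes("42L 75N 75N 25L")[0]
# Row 9 error-correction network: 7 frames x 25 MCEPs gives 175 inputs and outputs.
sizes = layer_sizes("175L 525N 525N 175L")
assert with_context(25, left=3, right=3) == sizes[0] == sizes[-1]
```

Under these assumptions every input size in rows 4 and 6 to 9 matches its feature description, and each mapping network ends in the 25 MCEP coefficients used by the MCD measure.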


ILVC with non-parallel training data

Speaker pair   MCD [dB]
SLT to SLT     3.966
BDL to SLT     6.153
RMS to SLT     6.650
CLB to SLT     5.405
JMK to SLT     6.754
AWB to SLT     6.758
KSP to SLT     7.142

Speaker pair   MCD [dB]
BDL to BDL     4.263
SLT to BDL     6.887
RMS to BDL     6.565
CLB to BDL     6.444
JMK to BDL     7.023
AWB to BDL     7.017
KSP to BDL     7.444

Target speaker   MOS     Similarity score
BDL              2.926   2.715
SLT              2.731   2.47


CLVC

Source speaker   Target speaker   MOS    Similarity score
NK (Telugu)      BDL (English)    2.88   2.77
PRA (Hindi)      BDL (English)    2.62   2.15


Conclusion

• The proposed algorithm can be used to capture speaker-specific characteristics.
• Hence, it can be used in both ILVC and CLVC tasks.


Thank You
