
SRINIVAS DESAI, B. YEGNANARAYANA, KISHORE PRAHALLAD

A Framework for Cross-Lingual Voice Conversion using Artificial Neural Networks

International Institute of Information Technology, Hyderabad, India

Voice Conversion Framework

Conversion of the speech of speaker A into speaker B’s voice.

Conversion is achieved through transformation of spectral and excitation parameters.

Spectral parameters: MFCC, LPCC, formants, etc.

Excitation parameters: F0, residual, etc.

[Block diagram: Speaker A → Voice Conversion → Speaker B]

Modes of VC

Intra-Lingual Voice Conversion (ILVC)

Parallel data: the source and the target speaker record the same set of utterances.

Non-parallel data: the source and the target speaker record different sets of utterances.

Cross-Lingual Voice Conversion (CLVC)

The source speaker and the target speaker record utterances in two different languages.

VC with parallel training data

[Block diagram. Training: parallel data from the source and target speakers → feature extraction for each speaker → alignment → mapping function. Testing: source speaker features → conversion → synthesis in the target speaker’s voice.]

Alignment

[Plot of the source and target speech files after alignment]
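The alignment step pairs frames of the source and target utterances before the mapping function is trained. The slides do not name the algorithm, but dynamic time warping over per-frame feature distances is the usual choice; the following is a minimal NumPy sketch under that assumption, with illustrative names.

```python
import numpy as np

def dtw_align(src, tgt):
    """Align two feature sequences (frames x dims) with dynamic time warping.

    Returns index pairs (i, j) pairing source frame i with target frame j,
    which can be used to build parallel training examples for the mapping.
    """
    src = np.asarray(src, dtype=float)
    tgt = np.asarray(tgt, dtype=float)
    n, m = len(src), len(tgt)
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)

    # Accumulated cost with the standard (diagonal, up, left) recursion.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])

    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```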

VC with non-parallel training data

[Block diagram. Training: non-parallel data from the source and target speakers → feature extraction for each speaker → clustering → mapping function. Testing: source speaker features → conversion → synthesis.]
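The slides leave the clustering step unspecified. As one plausible instantiation, the sketch below builds pseudo-parallel training pairs by fitting a k-means codebook to each speaker’s features and pairing centroids by nearest neighbour; this is illustrative, not necessarily the authors’ exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_parallel_pairs(src_feats, tgt_feats, n_clusters=64, seed=0):
    """Cluster each speaker's frames and pair centroids across speakers.

    src_feats, tgt_feats: arrays of shape (frames, dims) taken from
    non-parallel recordings. Returns (src_centroids, tgt_centroids)
    matched row-for-row, usable as pseudo-parallel training data.
    """
    src_km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(src_feats)
    tgt_km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(tgt_feats)

    src_c, tgt_c = src_km.cluster_centers_, tgt_km.cluster_centers_
    # Pair each source centroid with its nearest target centroid.
    d = np.linalg.norm(src_c[:, None, :] - tgt_c[None, :, :], axis=-1)
    return src_c, tgt_c[d.argmin(axis=1)]
```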

Limitations

Requires parallel or pseudo-parallel data; hence, training data from both speakers is always needed.

A model trained on such data can transform speech only between the trained speaker pair; hence, an arbitrary speaker’s speech cannot be transformed.

Capturing speaker-specific characteristics (Hypothesis)

The hypothesis is that an ANN trained on the target speaker’s data alone, mapping VTLN-normalized formants and bandwidths to the target speaker’s MCEPs, captures speaker-specific characteristics; any source speaker’s normalized features can then be converted without parallel data.

[Block diagram. Training: target speaker data → formants & bandwidths → VTLN → ANN → MCEPs. Testing: source speaker data → formants & bandwidths → VTLN → trained ANN → converted MCEPs.]

Vocal Tract Length Normalization (VTLN)

$$
f_t' =
\begin{cases}
k\,f_t, & 0 \le f_t \le F \\[4pt]
f_N - (f_N - k\,F)\,\dfrac{f_N - f_t}{f_N - F}, & F < f_t \le f_N
\end{cases}
\qquad
k = 1 + 0.002\,(F0_i - f_{mean})
$$

[Graph of the LP spectrum, before and after VTLN]

f_t - formant / bandwidth frequency
F0_i - pitch value for frame i
f_N - sampling frequency
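A small sketch of the pitch-dependent, piecewise-linear warp given above. The warp factor follows k = 1 + 0.002 (F0_i − f_mean); the cutoff frequency f_cut (F in the equation) and all names here are illustrative.

```python
import numpy as np

def warp_factor(f0_i, f0_mean):
    # Pitch-dependent warp factor: k = 1 + 0.002 * (F0_i - f_mean).
    return 1.0 + 0.002 * (f0_i - f0_mean)

def vtln_warp(freqs, k, f_cut, f_n):
    """Piecewise-linear VTLN of formant/bandwidth frequencies.

    Frequencies below the cutoff f_cut are scaled by k; above it, a
    linear segment maps [f_cut, f_n] onto [k * f_cut, f_n], so the
    sampling frequency f_n maps onto itself and the warp is continuous.
    """
    freqs = np.asarray(freqs, dtype=float)
    low = k * freqs
    high = f_n - (f_n - k * f_cut) * (f_n - freqs) / (f_n - f_cut)
    return np.where(freqs <= f_cut, low, high)
```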

Artificial Neural Networks (ANN)

An ANN consists of interconnected processing nodes. Each node is a model of an artificial neuron, and each interconnection between nodes has an associated weight. Different topologies perform different pattern recognition tasks: feedforward networks for pattern mapping, feedback networks for pattern association.

This work uses feedforward networks for mapping the source speaker’s spectral features onto the target speaker’s spectral space.

[Diagram of a feedforward network mapping an N-dimensional input X through hidden layers of M nodes to an N-dimensional output Y]
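As a concrete illustration of such a feedforward mapping network, here is a minimal NumPy sketch assuming tanh hidden units (“N”) and linear input/output units (“L”), in the spirit of architectures such as 24L 50N 50N 25L from the results table; the plain gradient-descent training is illustrative, not the authors’ exact setup.

```python
import numpy as np

class MappingNet:
    """Feedforward net mapping source features X to target features Y."""

    def __init__(self, sizes, seed=0):
        rng = np.random.default_rng(seed)
        self.W = [rng.normal(0.0, 0.1, (a, b)) for a, b in zip(sizes, sizes[1:])]
        self.b = [np.zeros(b) for b in sizes[1:]]

    def forward(self, x):
        # Returns the activations of every layer; acts[-1] is the output.
        acts = [x]
        for i, (W, b) in enumerate(zip(self.W, self.b)):
            z = acts[-1] @ W + b
            # tanh on hidden layers ("N" units), linear output layer ("L" units).
            acts.append(np.tanh(z) if i < len(self.W) - 1 else z)
        return acts

    def train_step(self, x, y, lr=1e-3):
        # One gradient-descent step on mean squared error.
        acts = self.forward(x)
        delta = (acts[-1] - y) / len(x)
        for i in reversed(range(len(self.W))):
            grad_W = acts[i].T @ delta
            grad_b = delta.sum(axis=0)
            if i > 0:
                # Backpropagate through the tanh of the previous hidden layer.
                delta = (delta @ self.W[i].T) * (1.0 - acts[i] ** 2)
            self.W[i] -= lr * grad_W
            self.b[i] -= lr * grad_b
```

With this sketch, MappingNet([24, 50, 50, 25]) would correspond to the 24L 50N 50N 25L entry in the parallel-data results below.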

Hypothesis Testing

Three types of experiments:

Use of parallel data (ILVC): formant-related features from the source speaker and MCEPs from the target speaker.

Use of non-parallel data (ILVC): both the formant-related features and the MCEPs from the target speaker.

CLVC: both the formant-related features and the MCEPs from the target speaker.

Evaluation

Objective: Mel-Cepstral Distortion (MCD)

$$
\mathrm{MCD} = \frac{10}{\ln 10}\,\sqrt{2 \sum_{i=1}^{25} \left(mc_i^{t} - mc_i^{e}\right)^2}
$$

where mc_i^t and mc_i^e are the target and the estimated mel-cepstral coefficients.

Subjective: Mean Opinion Score (5: excellent, 4: good, 3: fair, 2: poor, 1: bad)

Subjective: Similarity scores (5: same speakers, 1: different speakers)
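A direct transcription of the MCD formula above, assuming frame-aligned 25-dimensional mel-cepstra (coefficients 1 to 25, excluding the 0th energy term, as is conventional).

```python
import numpy as np

def mel_cepstral_distortion(mc_target, mc_estimated):
    """Average MCD in dB over frame-aligned mel-cepstra.

    Both inputs have shape (frames, 25): coefficients mc_1 .. mc_25,
    following the summation limits in the formula above.
    """
    diff = np.asarray(mc_target) - np.asarray(mc_estimated)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=-1))
    return float(per_frame.mean())
```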

Database

ILVC: CMU ARCTIC databases
SLT, CLB (US female); BDL, RMS (US male); JMK (Canadian male); AWB (Scottish male); KSP (Indian male).

CLVC: NK (Telugu female), PRA (Hindi female)

ILVC with parallel training data

No. | Features | ANN architecture | MCD [dB]
1 | 4 F | 4L 50N 12L 50N 25L | 9.786
2 | 4 F + 4 B | 8L 16N 4L 16N 25L | 9.557
3 | 4 F + 4 B + UVN | 8L 16N 4L 16N 25L | 6.639
4 | 4 F + 4 B + Δ + ΔΔ + UVN | 24L 50N 50N 25L | 6.352
5 | F0 + 4 F + 4 B + UVN | 9L 18N 3L 18N 25L | 6.713
6 | F0 + 4 F + 4 B + Δ + ΔΔ + UVN | 27L 50N 50N 25L | 6.375
7 | F0 + Prob. of voicing + 4 F + 4 B + Δ + ΔΔ + UVN | 30L 50N 50N 25L | 6.105
8 | F0 + Prob. of voicing + 6 F + 6 B + Δ + ΔΔ + UVN | 42L 75N 75N 25L | 5.992
9 | (F0 + Prob. of voicing + 6 F + 6 B + Δ + ΔΔ + UVN) + (3L3R MCEP-to-MCEP error correction) | (42L 75N 75N 25L) + (175L 525N 525N 175L) | 5.615

ILVC with non-parallel training data

Speaker pairs | MCD [dB]
SLT to SLT | 3.966
BDL to SLT | 6.153
RMS to SLT | 6.650
CLB to SLT | 5.405
JMK to SLT | 6.754
AWB to SLT | 6.758
KSP to SLT | 7.142

Speaker pairs | MCD [dB]
BDL to BDL | 4.263
SLT to BDL | 6.887
RMS to BDL | 6.565
CLB to BDL | 6.444
JMK to BDL | 7.023
AWB to BDL | 7.017
KSP to BDL | 7.444

Target Speaker | MOS | Similarity Score
BDL | 2.926 | 2.715
SLT | 2.731 | 2.47

CLVC

Source Speaker | Target Speaker | MOS | Similarity Score
NK (Telugu) | BDL (English) | 2.88 | 2.77
PRA (Hindi) | BDL (English) | 2.62 | 2.15

Conclusion

The proposed algorithm captures speaker-specific characteristics and can therefore be used in both ILVC and CLVC tasks.

Thank You
